* [PATCH v5 00/11] enable bs > ps in XFS
@ 2024-05-03  9:53 Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 01/11] readahead: rework loop in page_cache_ra_unbounded() Luis Chamberlain
                   ` (10 more replies)
  0 siblings, 11 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

[ I was asked by Pankaj to post this v5 as he's out right now. ]

This is the fifth version of the series that enables block size > page size
(Large Block Size, LBS) in XFS. The context and motivation can be seen in the
cover letter of RFC v1 [0]. We also recorded a talk about this effort at LPC
[1] for those who would like more background.

The major change in this v5 is that truncation to the minimum folio order is
now included and has been tested. The main issue observed was root caused, and
Matthew identified a fix for it in xarray; that fix is now queued up in
mm-hotfixes-unstable [2].

A lot of emphasis has been put on testing using kdevops, starting with an XFS
baseline [3]. The testing has been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for regressions on
existing profiles due to the page cache changes.

No regressions were found with these patches added on top.

Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes. To
compare against existing support, an ARM VM with a 64k base page size (without
our patches) was used as a reference to distinguish actual failures caused by
LBS support on a 4k base page size system.

There are some tests that assume block size < page size and need to be fixed.
We have a tree with fixes for xfstests [4]; most of the changes have been
posted already, and only a few minor changes still need to be posted. Part of
these changes has already been upstreamed to fstests, and new tests have also
been written and are out for review, namely for mmap zeroing-around corner
cases, compaction and fsstress races on mm, and stress testing folio
truncation on file-mapped folios.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k block
size against pmem and NVMe with buffered IO and Direct IO, comparing vanilla
against these patches applied, and detected no regressions.

We also wrote an eBPF tool called blkalgn [5] to check whether IO sent to the
device is aligned and at least the filesystem block size in length.

For those who want this in a git tree we have this up on a kdevops
20240503-large-block-minorder branch [6].

[0] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[1] https://www.youtube.com/watch?v=ar72r5Xf7x4
[2] https://lkml.kernel.org/r/20240501153120.4094530-1-willy@infradead.org
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung tasks
and 2 memory management asserts.
[4] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[5] https://github.com/iovisor/bcc/pull/4813
[6] https://github.com/linux-kdevops/linux/tree/20240503-large-block-minorder

Changes since v4:
- Added new Reviewed-by tags
- Truncation is now enabled; this depends on Matthew Wilcox's xarray fix
  being merged, and it is already on its way
- filemap_map_pages() simplification as suggested by Matthew Wilcox
- minor variable forward declaration ordering as suggested by Matthew Wilcox

Changes since v3:
- Cap the PTE range to i_size for LBS configuration in folio_map_range()
- Added Dave Chinner's kvmalloc xattr patch
- Moved Hannes' patches before adding the minorder patches to avoid confusion.
- Added mapping_set_folio_order_range().
- Return EINVAL instead of EAGAIN in split_huge_page_to_list_to_order()

Changes since v2:
- Simplified the filemap and readahead changes. (Thanks willy)
- Removed DEFINE_READAHEAD_ALIGN.
- Added minorder support to readahead_expand().

Changes since v1:
- Round up to nearest min nr pages in ra_init
- Calculate index in filemap_create instead of doing it in filemap_get_pages
- Remove unnecessary BUG_ONs in the delete path
- Use check_shl_overflow instead of check_mul_overflow
- Cast to uint32_t instead of unsigned long in xfs_stat_blksize

Changes since RFC v2:
- Move order 1 patch above the 1st patch
- Remove order == 1 conditional in `fs: Allow fine-grained control of
folio sizes`. This fixed generic/630 that was reported in the previous version.
- Hide the max order and expose `mapping_set_folio_min_order` instead.
- Add new helper mapping_start_index_align and DEFINE_READAHEAD_ALIGN
- don't call `page_cache_ra_order` with min order in do_sync_mmap_readahead
- simplify ondemand readahead with only aligning the start index at the end
- Don't cap ra_pages based on bdi->io_pages
- use `check_mul_overflow` while calculating bytes in validate_fsb
- Remove config lbs option
- Add a warning while mounting a LBS kernel
- Add Acked-by and Reviewed-by from Hannes and Darrick.

Changes since RFC v1:
- Added willy's patch to enable order-1 folios.
- Unified the common page cache effort from Hannes' LBS work.
- Added a new helper min_nrpages and added CONFIG_THP for enabling
  mapping_large_folio_support
- Don't split a folio if it has minorder set. Remove the old code where we
  set extra pins if it has that requirement.
- Split the code in XFS between the validation of mapping count. Put the
  icache code changes with enabling bs > ps.
- Added CONFIG_XFS_LBS option
- align the index in do_read_cache_folio()
- Removed truncate changes
- Fixed generic/091 with iomap changes to iomap_dio_zero function.
- Took care of folio truncation scenario in page_cache_ra_unbounded()
  that happens after read_pages if a folio was found.
- Squashed and moved commits around
- Rebased on top of v6.8-rc4

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Hannes Reinecke (1):
  readahead: rework loop in page_cache_ra_unbounded()

Luis Chamberlain (2):
  filemap: allocate mapping_min_order folios in the page cache
  mm: split a folio in minimum folio order chunks

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (6):
  readahead: allocate folios with mapping_min_order in readahead
  filemap: cap PTE range to be created to allowed zero fill in
    folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/direct-io.c          |  13 +++-
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |  10 ++-
 fs/xfs/xfs_super.c            |  10 +--
 include/linux/huge_mm.h       |  12 ++--
 include/linux/pagemap.h       | 116 ++++++++++++++++++++++++++++------
 mm/filemap.c                  |  31 ++++++---
 mm/huge_memory.c              |  50 ++++++++++++++-
 mm/readahead.c                |  94 +++++++++++++++++++++------
 13 files changed, 290 insertions(+), 77 deletions(-)

-- 
2.43.0



* [PATCH v5 01/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 02/11] fs: Allow fine-grained control of folio sizes Luis Chamberlain
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Hannes Reinecke <hare@suse.de>

Rework the loop in page_cache_ra_unbounded() to advance with
the number of pages in a folio instead of just one page at a time.

Note that the index is still incremented by 1 if filemap_add_folio() fails,
because the folio we tried to add is a single page (order 0).
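
As a toy model (user space, illustration only; the names mirror the kernel
code) of the iterator invariant this rework maintains, namely that i is
always ractl->_index + ractl->_nr_pages - index:

#include <assert.h>

struct ractl { unsigned long _index, _nr_pages; };

int main(void)
{
	unsigned long index = 100;	/* readahead start */
	struct ractl r = { ._index = 100, ._nr_pages = 0 };
	unsigned long i = 0;

	/* an order-2 (4-page) folio was added to the batch */
	r._nr_pages += 4;
	i += 4;
	assert(i == r._index + r._nr_pages - index);

	/*
	 * Hit an existing 4-page folio at index 104: read_pages()
	 * flushes the batch (_index += _nr_pages, _nr_pages = 0) and
	 * we then step over the folio that was found.
	 */
	r._index += r._nr_pages;
	r._nr_pages = 0;
	r._index += 4;	/* ractl->_index += folio_nr_pages(folio) */
	i = r._index + r._nr_pages - index;
	assert(i == 8);
	return 0;
}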

Signed-off-by: Hannes Reinecke <hare@suse.de>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
 mm/readahead.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..2361634a84fd 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -208,7 +208,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -226,7 +226,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
@@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			folio_put(folio);
 			read_pages(ractl);
 			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 
 	/*
-- 
2.43.0



* [PATCH v5 02/11] fs: Allow fine-grained control of folio sizes
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 01/11] readahead: rework loop in page_cache_ra_unbounded() Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 03/11] filemap: allocate mapping_min_order folios in the page cache Luis Chamberlain
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Some filesystems want to be able to ensure that folios that are added to
the page cache are at least a certain size.
Add mapping_set_folio_min_order() to allow this level of control.
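
As an aside (not part of the patch), the min and max orders are packed into
mapping->flags; a minimal user-space sketch of that encoding, with the
shifts and masks mirrored from the hunk below:

#include <assert.h>
#include <stdio.h>

#define FOLIO_ORDER_MIN_SHIFT	8
#define FOLIO_ORDER_MAX_SHIFT	13
#define FOLIO_ORDER_MIN_MASK	0x00001f00UL
#define FOLIO_ORDER_MAX_MASK	0x0003e000UL

static unsigned long set_order_range(unsigned long flags,
				     unsigned int min, unsigned int max)
{
	flags &= ~(FOLIO_ORDER_MIN_MASK | FOLIO_ORDER_MAX_MASK);
	return flags | ((unsigned long)min << FOLIO_ORDER_MIN_SHIFT) |
		       ((unsigned long)max << FOLIO_ORDER_MAX_SHIFT);
}

int main(void)
{
	/* e.g. 16k blocks on 4k pages: min order 2; max order 9 (PMD) */
	unsigned long flags = set_order_range(0, 2, 9);

	assert(((flags & FOLIO_ORDER_MIN_MASK) >> FOLIO_ORDER_MIN_SHIFT) == 2);
	assert(((flags & FOLIO_ORDER_MAX_MASK) >> FOLIO_ORDER_MAX_SHIFT) == 9);
	printf("flags=%#lx\n", flags);	/* 0x12200 */
	return 0;
}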

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/pagemap.h | 116 +++++++++++++++++++++++++++++++++-------
 1 file changed, 96 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..2e5612de1749 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,13 +202,18 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13, /* Bit 8-17 are used for FOLIO_ORDER */
+	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -344,9 +349,63 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
+/*
+ * mapping_set_folio_order_range() - Set the folio order range
+ * @mapping: The address_space.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file.  This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+						 unsigned int min_order,
+						 unsigned int max_order)
+{
+	if (min_order > MAX_PAGECACHE_ORDER)
+		min_order = MAX_PAGECACHE_ORDER;
+
+	if (max_order > MAX_PAGECACHE_ORDER)
+		max_order = MAX_PAGECACHE_ORDER;
+
+	max_order = max(max_order, min_order);
+	/*
+	 * TODO: max_order is not yet supported in filemap.
+	 */
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			 (min_order << AS_FOLIO_ORDER_MIN) |
+			 (max_order << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address_space.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -357,7 +416,37 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}
+
+static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_start_index() - Align starting index based on the min
+ * folio order of the page cache.
+ * @mapping: The address_space.
+ *
+ * Ensure the index used is aligned to the minimum folio order when adding
+ * new folios to the page cache by rounding down to the nearest minimum
+ * folio number of pages.
+ */
+static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
+						pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
 }
 
 /*
@@ -367,7 +456,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
 	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	       (mapping_max_folio_order(mapping) > 0);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
@@ -528,19 +617,6 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
-- 
2.43.0



* [PATCH v5 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 01/11] readahead: rework loop in page_cache_ra_unbounded() Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 02/11] fs: Allow fine-grained control of folio sizes Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 04/11] readahead: allocate folios with mapping_min_order in readahead Luis Chamberlain
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() tried to allocate higher order
folios when fgp_flags had a higher order hint set, but it would fall back to
an order-0 folio if the higher order memory allocation failed.

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
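
To make the alignment rule concrete, here is a small user-space sketch
(illustration only; a 16k block size on 4k pages, i.e. min_order == 2, is
assumed) of the rounding done by mapping_align_start_index() and of the
pos-to-index conversion used in filemap_create_folio() below:

#include <assert.h>

static unsigned long align_start_index(unsigned long index,
				       unsigned int min_order)
{
	unsigned long nrpages = 1UL << min_order;

	return index & ~(nrpages - 1);	/* round_down(index, nrpages) */
}

int main(void)
{
	unsigned int page_shift = 12, min_order = 2;
	unsigned long pos = 0x5000;	/* byte offset 20k */
	unsigned long index;

	assert(align_start_index(7, min_order) == 4);
	assert(align_start_index(8, min_order) == 8);

	/* index in PAGE units, aligned to min_order number of pages */
	index = (pos >> (page_shift + min_order)) << min_order;
	assert(index == 4);	/* 20k falls in the folio at 16k */
	return 0;
}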

Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/filemap.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 30de18c4fd28..f0c0cfbbd134 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -858,6 +858,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
@@ -1895,8 +1897,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_start_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1936,7 +1940,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2425,13 +2429,16 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -2449,6 +2456,8 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	/* index in PAGE units but aligned to min_order number of pages. */
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2509,8 +2518,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3708,9 +3716,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
+		index = mapping_align_start_index(mapping, index);
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-- 
2.43.0



* [PATCH v5 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (2 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 03/11] filemap: allocate mapping_min_order folios in the page cache Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03 14:32   ` Hannes Reinecke
  2024-05-03  9:53 ` [PATCH v5 05/11] mm: split a folio in minimum folio order chunks Luis Chamberlain
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios) when
no folio was found at an index. Allocate mapping_min_order folios instead, as
we need to guarantee the minimum order if it is set.
When read_pages() is triggered and a page is already present, check for
truncation and move ractl->_index by mapping_min_nrpages if that folio was
truncated. This is done to ensure we keep the alignment requirement while
adding a folio to the page cache.

page_cache_ra_order() tries to allocate folios into the page cache with a
higher order if the index aligns with that order. Modify it so that the
order does not go below the mapping_min_order requirement of the page
cache. This function will do the right thing even if the new_order passed
is less than the mapping_min_order.
When adding new folios to the page cache we must also ensure the index
used is aligned to the mapping_min_order as the page cache requires the
index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of the
readahead, so this function can assume ractl->_index is aligned to
min_order.
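
A worked example (user space, illustration only) of the mark alignment done
below: the folio that receives the readahead flag must itself start on a
min_nrpages boundary, so the async boundary is rounded up:

#include <assert.h>

#define round_up(x, y)	((((x) + (y) - 1) / (y)) * (y))

int main(void)
{
	unsigned long index = 0, nr_to_read = 16, lookahead_size = 6;
	unsigned int min_nrpages = 4;	/* e.g. 16k blocks on 4k pages */
	unsigned long ra_folio_index, mark;

	ra_folio_index = round_up(index + nr_to_read - lookahead_size,
				  min_nrpages);
	mark = ra_folio_index - index;

	/* 0 + 16 - 6 = 10, rounded up to the 4-page boundary at 12 */
	assert(mark == 12);
	return 0;
}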

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 85 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 71 insertions(+), 14 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 2361634a84fd..646a90ada59a 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long ra_folio_index, index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i = 0;
+	unsigned long mark, i = 0;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,6 +224,22 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_start_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	if (index != readahead_index(ractl)) {
+		nr_to_read += readahead_index(ractl) - index;
+		ractl->_index = index;
+	}
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -230,6 +247,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
+
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -239,23 +258,35 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_pages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages += folio_nr_pages(folio);
@@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -503,9 +540,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		new_order += 2;
 		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+		new_order = max(new_order, min_order);
 	}
 
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_start_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -513,7 +559,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -776,8 +822,15 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -787,9 +840,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -799,7 +854,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -814,9 +869,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -826,10 +883,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }
-- 
2.43.0



* [PATCH v5 05/11] mm: split a folio in minimum folio order chunks
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (3 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 04/11] readahead: allocate folios with mapping_min_order in readahead Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03 14:53   ` Zi Yan
  2024-05-15 15:32   ` Matthew Wilcox
  2024-05-03  9:53 ` [PATCH v5 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Luis Chamberlain
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

split_folio() and split_folio_to_list() assume order 0; to support minorder
we must expand these to check the folio mapping order and use that.

Set new_order to be at least the minimum folio order if it is set in
split_huge_page_to_list_to_order() so that we can maintain the minimum folio
order requirement in the page cache.

Update the debugfs write files used for testing to ensure the order
is respected as well. We simply enforce the min order when a file
mapping is used.
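
A minimal sketch (not kernel code) of the clamping rule this patch enforces
for file-backed folios:

#include <assert.h>

static unsigned int clamp_split_order(unsigned int new_order,
				      unsigned int min_order)
{
	/* a file-backed folio may never be split below min_order */
	return new_order < min_order ? min_order : new_order;
}

int main(void)
{
	/* min_order 2 (e.g. 16k blocks on 4k pages): raise the request */
	assert(clamp_split_order(0, 2) == 2);
	/* requests at or above min_order pass through unchanged */
	assert(clamp_split_order(4, 2) == 4);
	return 0;
}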

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/huge_mm.h | 12 ++++++----
 mm/huge_memory.c        | 50 ++++++++++++++++++++++++++++++++++++++---
 2 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index de0c89105076..06748a8fa43b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -87,6 +87,8 @@ extern struct kobj_attribute shmem_enabled_attr;
 #define thp_vma_allowable_order(vma, vm_flags, smaps, in_pf, enforce_sysfs, order) \
 	(!!thp_vma_allowable_orders(vma, vm_flags, smaps, in_pf, enforce_sysfs, BIT(order)))
 
+#define split_folio(f) split_folio_to_list(f, NULL)
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define HPAGE_PMD_SHIFT PMD_SHIFT
 #define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
@@ -267,9 +269,10 @@ void folio_prep_large_rmappable(struct folio *folio);
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_folio_to_list(struct folio *folio, struct list_head *list);
 static inline int split_huge_page(struct page *page)
 {
-	return split_huge_page_to_list_to_order(page, NULL, 0);
+	return split_folio(page_folio(page));
 }
 void deferred_split_folio(struct folio *folio);
 
@@ -432,6 +435,10 @@ static inline int split_huge_page(struct page *page)
 {
 	return 0;
 }
+static inline int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	return 0;
+}
 static inline void deferred_split_folio(struct folio *folio) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
@@ -532,9 +539,6 @@ static inline int split_folio_to_order(struct folio *folio, int new_order)
 	return split_folio_to_list_to_order(folio, NULL, new_order);
 }
 
-#define split_folio_to_list(f, l) split_folio_to_list_to_order(f, l, 0)
-#define split_folio(f) split_folio_to_order(f, 0)
-
 /*
  * archs that select ARCH_WANTS_THP_SWAP but don't support THP_SWP due to
  * limitations in the implementation like arm64 MTE can override this to
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 89f58c7603b2..c0cc8f32fe42 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3035,6 +3035,9 @@ bool can_split_folio(struct folio *folio, int *pextra_pins)
  * Returns 0 if the hugepage is split successfully.
  * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
  * us.
+ *
+ * Callers should ensure that the order respects the address space mapping
+ * min-order if one is set.
  */
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 				     unsigned int new_order)
@@ -3107,6 +3110,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		mapping = NULL;
 		anon_vma_lock_write(anon_vma);
 	} else {
+		unsigned int min_order;
 		gfp_t gfp;
 
 		mapping = folio->mapping;
@@ -3117,6 +3121,14 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		min_order = mapping_min_folio_order(folio->mapping);
+		if (new_order < min_order) {
+			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
+				     min_order);
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
@@ -3227,6 +3239,21 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	return ret;
 }
 
+int split_folio_to_list(struct folio *folio, struct list_head *list)
+{
+	unsigned int min_order = 0;
+
+	if (!folio_test_anon(folio)) {
+		if (!folio->mapping) {
+			count_vm_event(THP_SPLIT_PAGE_FAILED);
+			return -EBUSY;
+		}
+		min_order = mapping_min_folio_order(folio->mapping);
+	}
+
+	return split_huge_page_to_list_to_order(&folio->page, list, min_order);
+}
+
 void folio_undo_large_rmappable(struct folio *folio)
 {
 	struct deferred_split *ds_queue;
@@ -3466,6 +3493,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		struct vm_area_struct *vma = vma_lookup(mm, addr);
 		struct page *page;
 		struct folio *folio;
+		unsigned int target_order = new_order;
 
 		if (!vma)
 			break;
@@ -3502,7 +3530,18 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!folio_test_anon(folio)) {
+			unsigned int min_order;
+
+			if (!folio->mapping)
+				goto next;
+
+			min_order = mapping_min_folio_order(folio->mapping);
+			if (new_order < min_order)
+				target_order = min_order;
+		}
+
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
 		folio_unlock(folio);
@@ -3545,14 +3584,19 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 
 	for (index = off_start; index < off_end; index += nr_pages) {
 		struct folio *folio = filemap_get_folio(mapping, index);
+		unsigned int min_order, target_order = new_order;
 
 		nr_pages = 1;
 		if (IS_ERR(folio))
 			continue;
 
-		if (!folio_test_large(folio))
+		if (!folio->mapping || !folio_test_large(folio))
 			goto next;
 
+		min_order = mapping_min_folio_order(mapping);
+		if (new_order < min_order)
+			target_order = min_order;
+
 		total++;
 		nr_pages = folio_nr_pages(folio);
 
@@ -3562,7 +3606,7 @@ static int split_huge_pages_in_file(const char *file_path, pgoff_t off_start,
 		if (!folio_trylock(folio))
 			goto next;
 
-		if (!split_folio_to_order(folio, new_order))
+		if (!split_folio_to_order(folio, target_order))
 			split++;
 
 		folio_unlock(folio);
-- 
2.43.0



* [PATCH v5 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (4 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 05/11] mm: split a folio in minimum folio order chunks Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Luis Chamberlain
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Pankaj Raghav <p.raghav@samsung.com>

Usually the page cache does not extend beyond the size of the inode;
therefore, no PTEs are created for folios that extend beyond that size.

But with LBS support, we might extend the page cache beyond the size of the
inode, as we need to guarantee folios of minimum order. Cap the PTE range to
be created for the page cache up to the maximum allowed zero-fill file end,
which is aligned to PAGE_SIZE.

An fstests test has been created to trigger this edge case [0].

[0] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/
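
A worked example (user-space illustration, 4k pages assumed) of the cap
computed below:

#include <assert.h>

#define PAGE_SIZE	4096UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

int main(void)
{
	unsigned long i_size = 5000;	/* file is just over one page */
	unsigned long end_pgoff = 7;	/* fault covers a 32k (8-page) folio */
	unsigned long file_end = DIV_ROUND_UP(i_size, PAGE_SIZE) - 1;

	if (end_pgoff > file_end)
		end_pgoff = file_end;
	assert(end_pgoff == 1);	/* PTEs only for pages 0 and 1 */
	return 0;
}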

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/filemap.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f0c0cfbbd134..a727d8a4bac0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3574,7 +3574,7 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	struct vm_area_struct *vma = vmf->vma;
 	struct file *file = vma->vm_file;
 	struct address_space *mapping = file->f_mapping;
-	pgoff_t last_pgoff = start_pgoff;
+	pgoff_t file_end, last_pgoff = start_pgoff;
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
@@ -3598,6 +3598,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 		folio_put(folio);
 		goto out;
 	}
+
+	file_end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE) - 1;
+	if (end_pgoff > file_end)
+		end_pgoff = file_end;
+
 	do {
 		unsigned long end;
 
-- 
2.43.0



* [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (5 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-07 14:58   ` [RFC] iomap: use huge zero folio in iomap_dio_zero Pankaj Raghav (Samsung)
  2024-05-07 16:00   ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Matthew Wilcox
  2024-05-03  9:53 ` [PATCH v5 08/11] xfs: use kvmalloc for xattr buffers Luis Chamberlain
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad an fs block with zeroes if the direct IO size is
less than the fs block size. iomap_dio_zero() has an implicit assumption that
fs block size < page size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page next
to the zero page (as len > PAGE_SIZE) to the underlying block device, causing
FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
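
A small user-space illustration of the fix's arithmetic: the zero-fill is
built from page-sized references to the zero page rather than a single
vector with len > PAGE_SIZE:

#include <assert.h>

#define PAGE_SIZE	4096U

int main(void)
{
	unsigned int len = 64 * 1024;	/* e.g. a 64k fs block */
	unsigned int segments = 0;

	while (len) {
		unsigned int io_len = len < PAGE_SIZE ? len : PAGE_SIZE;

		segments++;	/* one bio vec pointing at ZERO_PAGE(0) */
		len -= io_len;
	}
	assert(segments == 16);
	return 0;
}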

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5f481068de5b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.43.0



* [PATCH v5 08/11] xfs: use kvmalloc for xattr buffers
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (6 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 09/11] xfs: expose block size in stat Luis Chamberlain
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof,
	Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:

static inline
struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
>>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...

Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().

This is not an issue that production systems will see, as filesystems with
block size > page size cannot currently be mounted by the kernel; Pankaj is
developing this functionality right now.

Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index ac904cc1a97b..969abc6efd70 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1059,10 +1059,7 @@ xfs_attr3_leaf_to_shortform(
 
 	trace_xfs_attr_leaf_to_sf(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
-	if (!tmpbuffer)
-		return -ENOMEM;
-
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 
 	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1125,7 +1122,7 @@ xfs_attr3_leaf_to_shortform(
 	error = 0;
 
 out:
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 	return error;
 }
 
@@ -1533,7 +1530,7 @@ xfs_attr3_leaf_compact(
 
 	trace_xfs_attr_leaf_compact(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 	memset(bp->b_addr, 0, args->geo->blksize);
 	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1571,7 +1568,7 @@ xfs_attr3_leaf_compact(
 	 */
 	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
 
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 }
 
 /*
@@ -2250,7 +2247,7 @@ xfs_attr3_leaf_unbalance(
 		struct xfs_attr_leafblock *tmp_leaf;
 		struct xfs_attr3_icleaf_hdr tmphdr;
 
-		tmp_leaf = kzalloc(state->args->geo->blksize,
+		tmp_leaf = kvzalloc(state->args->geo->blksize,
 				GFP_KERNEL | __GFP_NOFAIL);
 
 		/*
@@ -2291,7 +2288,7 @@ xfs_attr3_leaf_unbalance(
 		}
 		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
 		savehdr = tmphdr; /* struct copy */
-		kfree(tmp_leaf);
+		kvfree(tmp_leaf);
 	}
 
 	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
-- 
2.43.0



* [PATCH v5 09/11] xfs: expose block size in stat
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (7 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 08/11] xfs: use kvmalloc for xattr buffers Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Luis Chamberlain
  2024-05-03  9:53 ` [PATCH v5 11/11] xfs: enable block size larger than page size support Luis Chamberlain
  10 siblings, 0 replies; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
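
The change is observable from user space through stat(2); a minimal sketch
(illustrative only):

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	struct stat st;

	if (argc < 2 || stat(argv[1], &st))
		return 1;
	/* on a 16k-block XFS with 4k pages this reports 16384 */
	printf("preferred IO size: %ld\n", (long)st.st_blksize);
	return 0;
}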

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..77b198a33aa1 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -543,7 +543,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.43.0



* [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (8 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 09/11] xfs: expose block size in stat Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-07  8:40   ` John Garry
  2024-05-03  9:53 ` [PATCH v5 11/11] xfs: enable block size larger than page size support Luis Chamberlain
  10 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always greater than or equal to the
blocklog, make the calculation generic so that the page cache count can be
calculated correctly for LBS.
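
A user-space sketch of the generic bound (illustration only;
check_shl_overflow() is modelled with an explicit pre-shift test):

#include <assert.h>
#include <stdint.h>

static int validate_fsb_count(uint64_t nblocks, unsigned int blocklog,
			      unsigned int page_shift, uint64_t index_max)
{
	uint64_t max_bytes;

	if (nblocks > (UINT64_MAX >> blocklog))
		return -1;	/* the shift itself would overflow */
	max_bytes = nblocks << blocklog;
	if ((max_bytes >> page_shift) > index_max)
		return -1;	/* beyond the page cache index limit */
	return 0;
}

int main(void)
{
	/* 64k blocks (blocklog 16) on a 4k-page (shift 12) system */
	assert(validate_fsb_count(1ULL << 20, 16, 12, 0xffffffffULL) == 0);
	assert(validate_fsb_count(1ULL << 50, 16, 12, 0xffffffffULL) == -1);
	return 0;
}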

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_mount.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index df370eb5dc15..56d71282972a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
 {
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
+	uint64_t max_index;
+	uint64_t max_bytes;
+
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
 
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	max_index = max_bytes >> PAGE_SHIFT;
+
+	if (max_index > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.43.0



* [PATCH v5 11/11] xfs: enable block size larger than page size support
  2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
                   ` (9 preceding siblings ...)
  2024-05-03  9:53 ` [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Luis Chamberlain
@ 2024-05-03  9:53 ` Luis Chamberlain
  2024-05-07  0:05   ` Dave Chinner
  10 siblings, 1 reply; 51+ messages in thread
From: Luis Chamberlain @ 2024-05-03  9:53 UTC (permalink / raw)
  To: akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel, mcgrof

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to enforce a minimum order when
allocating folios, which is a prerequisite for adding support for
block size > page size.
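
For reference, a tiny sketch (illustration only) of the minimum folio order
computed by xfs_ialloc_setup_geometry() below:

#include <assert.h>

static unsigned int min_folio_order(unsigned int blocklog,
				    unsigned int page_shift)
{
	return blocklog > page_shift ? blocklog - page_shift : 0;
}

int main(void)
{
	assert(min_folio_order(16, 12) == 4);	/* 64k blocks, 4k pages */
	assert(min_folio_order(12, 12) == 0);	/* 4k blocks, 4k pages */
	assert(min_folio_order(12, 16) == 0);	/* 4k blocks, 64k pages */
	return 0;
}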

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 10 ++--------
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e5ac3e5430c4..60005feb0015 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2975,6 +2975,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index dfd61fa8332e..7d3abd182322 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -229,6 +229,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 74f1812b03cb..a2629e00de41 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -324,7 +325,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 56d71282972a..a451302aa258 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,7 +131,6 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 	uint64_t max_index;
 	uint64_t max_bytes;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index bce020374c5e..db3b82c2c381 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1623,16 +1623,10 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
 		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
-- 
2.43.0



* Re: [PATCH v5 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-05-03  9:53 ` [PATCH v5 04/11] readahead: allocate folios with mapping_min_order in readahead Luis Chamberlain
@ 2024-05-03 14:32   ` Hannes Reinecke
  0 siblings, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2024-05-03 14:32 UTC (permalink / raw)
  To: Luis Chamberlain, akpm, willy, djwong, brauner, david, chandan.babu
  Cc: ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel

On 5/3/24 11:53, Luis Chamberlain wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> page_cache_ra_unbounded() was allocating single pages (order-0 folios)
> if no folio was found at an index. Allocate mapping_min_order folios
> as we need to guarantee the minimum order if it is set.
> When read_pages() is triggered and if a page is already present, check
> for truncation and move the ractl->_index by mapping_min_nrpages if that
> folio was truncated. This is done to ensure we keep the alignment
> requirement while adding a folio to the page cache.
> 
> page_cache_ra_order() tries to allocate folio to the page cache with a
> higher order if the index aligns with that order. Modify it so that the
> order does not go below the mapping_min_order requirement of the page
> cache. This function will do the right thing even if the new_order passed
> is less than the mapping_min_order.
> When adding new folios to the page cache we must also ensure the index
> used is aligned to the mapping_min_order as the page cache requires the
> index to be aligned to the order of the folio.
> 
> readahead_expand() is called from readahead aops to extend the range of
> the readahead so this function can assume ractl->_index to be aligned with
> min_order.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   mm/readahead.c | 85 +++++++++++++++++++++++++++++++++++++++++---------
>   1 file changed, 71 insertions(+), 14 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich
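
As a sketch of the alignment rule the commit message above describes
(illustrative only; mapping_min_folio_order() is the helper this series
adds, and the actual patch code differs):

	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
	unsigned long min_nrpages = 1UL << min_order;

	/* the page cache requires the index to be aligned to the folio order */
	ractl->_index = round_down(readahead_index(ractl), min_nrpages);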


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 05/11] mm: split a folio in minimum folio order chunks
  2024-05-03  9:53 ` [PATCH v5 05/11] mm: split a folio in minimum folio order chunks Luis Chamberlain
@ 2024-05-03 14:53   ` Zi Yan
  2024-05-15 15:32   ` Matthew Wilcox
  1 sibling, 0 replies; 51+ messages in thread
From: Zi Yan @ 2024-05-03 14:53 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: akpm, willy, djwong, brauner, david, chandan.babu, hare,
	ritesh.list, john.g.garry, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, kernel

[-- Attachment #1: Type: text/plain, Size: 905 bytes --]

On 3 May 2024, at 5:53, Luis Chamberlain wrote:

> split_folio() and split_folio_to_list() assume order 0; to support
> minorder we must expand these to check the folio mapping order and use
> that.
>
> Set new_order to be at least minimum folio order if it is set in
> split_huge_page_to_list() so that we can maintain minimum folio order
> requirement in the page cache.
>
> Update the debugfs write files used for testing to ensure the order
> is respected as well. We simply enforce the min order when a file
> mapping is used.
>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  include/linux/huge_mm.h | 12 ++++++----
>  mm/huge_memory.c        | 50 ++++++++++++++++++++++++++++++++++++++---
>  2 files changed, 55 insertions(+), 7 deletions(-)
>
It makes sense to me. Reviewed-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]
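
The clamp the commit message describes amounts to something like this
(a sketch; min_order would come from the file mapping via the series'
mapping_min_folio_order() helper):

	unsigned int min_order = mapping_min_folio_order(folio->mapping);

	new_order = max(new_order, min_order);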

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 11/11] xfs: enable block size larger than page size support
  2024-05-03  9:53 ` [PATCH v5 11/11] xfs: enable block size larger than page size support Luis Chamberlain
@ 2024-05-07  0:05   ` Dave Chinner
  0 siblings, 0 replies; 51+ messages in thread
From: Dave Chinner @ 2024-05-07  0:05 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: akpm, willy, djwong, brauner, chandan.babu, hare, ritesh.list,
	john.g.garry, ziy, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, kernel

On Fri, May 03, 2024 at 02:53:53AM -0700, Luis Chamberlain wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Page cache now has the ability to have a minimum order when allocating
> a folio, which is a prerequisite for adding support for block size > page
> size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

.....

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index bce020374c5e..db3b82c2c381 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1623,16 +1623,10 @@ xfs_fs_fill_super(
>  		goto out_free_sb;
>  	}
>  
> -	/*
> -	 * Until this is fixed only page-sized or smaller data blocks work.
> -	 */
>  	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
>  		xfs_warn(mp,
> -		"File system with blocksize %d bytes. "
> -		"Only pagesize (%ld) or less will currently work.",
> -				mp->m_sb.sb_blocksize, PAGE_SIZE);
> -		error = -ENOSYS;
> -		goto out_free_sb;
> +"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
> +			mp->m_sb.sb_blocksize);
>  	}

We really don't want to have to test and support this on V4
filesystems as they are deprecated, so can you make this conditional
on being a V5 filesystem?

i.e:
	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
		if (!xfs_has_crc(mp)) {
			xfs_warn(mp,
"V4 File system with blocksize %d bytes. Only pagesize (%ld) is supported.",
				mp->m_sb.sb_blocksize, PAGE_SIZE);
			error = -ENOSYS;
			goto out_free_sb;
		}

		xfs_warn(mp,
"EXPERIMENTAL: V5 Filesystem with Large Block Size (%d bytes) enabled.",
			mp->m_sb.sb_blocksize);
	}

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-05-03  9:53 ` [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Luis Chamberlain
@ 2024-05-07  8:40   ` John Garry
  2024-05-07 21:13     ` Darrick J. Wong
  0 siblings, 1 reply; 51+ messages in thread
From: John Garry @ 2024-05-07  8:40 UTC (permalink / raw)
  To: Luis Chamberlain, akpm, willy, djwong, brauner, david, chandan.babu
  Cc: hare, ritesh.list, ziy, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, kernel

On 03/05/2024 10:53, Luis Chamberlain wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> make the calculation generic so that page cache count can be calculated
> correctly for LBS.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> ---
>   fs/xfs/xfs_mount.c | 9 ++++++++-
>   1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index df370eb5dc15..56d71282972a 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
>   {
>   	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>   	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> +	uint64_t max_index;
> +	uint64_t max_bytes;

nit: any other XFS code which I have seen puts the declarations before 
any ASSERT() calls

> +
> +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> +		return -EFBIG;
>   
>   	/* Limited by ULONG_MAX of page cache index */
> -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> +	max_index = max_bytes >> PAGE_SHIFT;
> +
> +	if (max_index > ULONG_MAX)
>   		return -EFBIG;
>   	return 0;
>   }


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-03  9:53 ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Luis Chamberlain
@ 2024-05-07 14:58   ` Pankaj Raghav (Samsung)
  2024-05-07 15:11     ` Zi Yan
                       ` (2 more replies)
  2024-05-07 16:00   ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Matthew Wilcox
  1 sibling, 3 replies; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-07 14:58 UTC (permalink / raw)
  To: hch, willy
  Cc: mcgrof, akpm, brauner, chandan.babu, david, djwong, gost.dev,
	hare, john.g.garry, kernel, linux-block, linux-fsdevel, linux-mm,
	linux-xfs, p.raghav, ritesh.list, ziy

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
I rebased on top of mm-unstable to get mm_get_huge_zero_folio().

@Christoph is this in line with what you had in mind?

 fs/iomap/direct-io.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 5f481068de5b..7f584f9ff2c5 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,11 +236,18 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 		loff_t pos, unsigned len)
 {
 	struct inode *inode = file_inode(dio->iocb->ki_filp);
-	struct page *page = ZERO_PAGE(0);
+	struct folio *zero_page_folio = page_folio(ZERO_PAGE(0));
+	struct folio *folio = zero_page_folio;
 	struct bio *bio;
 
 	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
 
+	if (len > PAGE_SIZE) {
+		folio = mm_get_huge_zero_folio(current->mm);
+		if (!folio)
+			folio = zero_page_folio;
+	}
+
 	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
 				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
@@ -251,10 +258,10 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
 	while (len) {
-		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+		unsigned int size = min(len, folio_size(folio));
 
-		__bio_add_page(bio, page, io_len, 0);
-		len -= io_len;
+		bio_add_folio_nofail(bio, folio, size, 0);
+		len -= size;
 	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-07 14:58   ` [RFC] iomap: use huge zero folio in iomap_dio_zero Pankaj Raghav (Samsung)
@ 2024-05-07 15:11     ` Zi Yan
  2024-05-07 16:11     ` Christoph Hellwig
  2024-05-15  0:50     ` Matthew Wilcox
  2 siblings, 0 replies; 51+ messages in thread
From: Zi Yan @ 2024-05-07 15:11 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: hch, willy, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list

[-- Attachment #1: Type: text/plain, Size: 2318 bytes --]

On 7 May 2024, at 10:58, Pankaj Raghav (Samsung) wrote:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.
>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
> I rebased on top of mm-unstable to get mm_get_huge_zero_folio().
>
> @Christoph is this in line with what you had in mind?
>
>  fs/iomap/direct-io.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 5f481068de5b..7f584f9ff2c5 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,11 +236,18 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  		loff_t pos, unsigned len)
>  {
>  	struct inode *inode = file_inode(dio->iocb->ki_filp);
> -	struct page *page = ZERO_PAGE(0);
> +	struct folio *zero_page_folio = page_folio(ZERO_PAGE(0));
> +	struct folio *folio = zero_page_folio;
>  	struct bio *bio;
>
>  	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
>
> +	if (len > PAGE_SIZE) {
> +		folio = mm_get_huge_zero_folio(current->mm);

The huge zero folio is order-9, but it seems here you only require len to be
bigger than PAGE_SIZE. I am not sure if it works when len is smaller than
HPAGE_PMD_SIZE.

> +		if (!folio)
> +			folio = zero_page_folio;
> +	}
> +
>  	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>  				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> @@ -251,10 +258,10 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>
>  	while (len) {
> -		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +		unsigned int size = min(len, folio_size(folio));
>
> -		__bio_add_page(bio, page, io_len, 0);
> -		len -= io_len;
> +		bio_add_folio_nofail(bio, folio, size, 0);
> +		len -= size;

Maybe use huge zero folio when len > HPAGE_PMD_SIZE and use ZERO_PAGE(0)
for len % HPAGE_PMD_SIZE.

>  	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
> -- 
> 2.34.1


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]
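
Zi Yan's split would look roughly like this (hypothetical; huge_zero_folio
is the PMD-sized zero folio obtained above, with ZERO_PAGE(0) covering the
sub-PMD tail):

	while (len) {
		struct folio *src = len >= HPAGE_PMD_SIZE ?
				huge_zero_folio : page_folio(ZERO_PAGE(0));
		unsigned int size = min_t(unsigned int, len, folio_size(src));

		bio_add_folio_nofail(bio, src, size, 0);
		len -= size;
	}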

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-03  9:53 ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Luis Chamberlain
  2024-05-07 14:58   ` [RFC] iomap: use huge zero folio in iomap_dio_zero Pankaj Raghav (Samsung)
@ 2024-05-07 16:00   ` Matthew Wilcox
  2024-05-07 16:10     ` Christoph Hellwig
  2024-05-08 11:20     ` Pankaj Raghav (Samsung)
  1 sibling, 2 replies; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-07 16:00 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: akpm, djwong, brauner, david, chandan.babu, hare, ritesh.list,
	john.g.garry, ziy, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, kernel

On Fri, May 03, 2024 at 02:53:49AM -0700, Luis Chamberlain wrote:
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);

If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
will fail silently.  I hate this interface.

You should be doing something like ...

	while (len) {
		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);

		while (!bio || bio_add_page() < io_len) {
			if (bio)
				iomap_dio_submit_bio(iter, dio, bio, pos);
			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
		 	fscrypt_set_bio_crypt_ctx(bio, inode,
					pos >> inode->i_blkbits, GFP_KERNEL);
		}
	}

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-07 16:00   ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Matthew Wilcox
@ 2024-05-07 16:10     ` Christoph Hellwig
  2024-05-07 16:11       ` Matthew Wilcox
  2024-05-08 11:20     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-07 16:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, akpm, djwong, brauner, david, chandan.babu,
	hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel

On Tue, May 07, 2024 at 05:00:28PM +0100, Matthew Wilcox wrote:
> If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
> will fail silently.  I hate this interface.

No, it won't.  You can pass an arbitrary len to it.

> 
> You should be doing something like ...
> 
> 	while (len) {
> 		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> 
> 		while (!bio || bio_add_page() < io_len) {
> 			if (bio)
> 				iomap_dio_submit_bio(iter, dio, bio, pos);
> 			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> 					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 		 	fscrypt_set_bio_crypt_ctx(bio, inode,
> 					pos >> inode->i_blkbits, GFP_KERNEL);
> 		}
> 	}

Well, no.  The right way is:

	bio = iomap_dio_alloc_bio(iter, dio, 1,
			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
	__bio_add_page(bio, page, len, 0);

	fscrypt_set_bio_crypt_ctx(bio, inode,
			pos >> inode->i_blkbits, GFP_KERNEL);

(or even better the folio version)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-07 14:58   ` [RFC] iomap: use huge zero folio in iomap_dio_zero Pankaj Raghav (Samsung)
  2024-05-07 15:11     ` Zi Yan
@ 2024-05-07 16:11     ` Christoph Hellwig
  2024-05-08 11:39       ` Pankaj Raghav (Samsung)
  2024-05-15  0:50     ` Matthew Wilcox
  2 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-07 16:11 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: hch, willy, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> +	if (len > PAGE_SIZE) {
> +		folio = mm_get_huge_zero_folio(current->mm);

I don't think the mm_struct based interfaces work well here, as I/O
completions don't come in through the same mm.  You'll want to use
lower level interfaces like get_huge_zero_page and use them at
mount time.

> +		if (!folio)
> +			folio = zero_page_folio;

And then don't bother with a fallback.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-07 16:10     ` Christoph Hellwig
@ 2024-05-07 16:11       ` Matthew Wilcox
  2024-05-07 16:13         ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-07 16:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, akpm, djwong, brauner, david, chandan.babu,
	hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel

On Tue, May 07, 2024 at 09:10:15AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 05:00:28PM +0100, Matthew Wilcox wrote:
> > If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
> > will fail silently.  I hate this interface.
> 
> No, it won't.  You can pass an arbitrary len to it.
> 
> > 
> > You should be doing something like ...
> > 
> > 	while (len) {
> > 		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> > 
> > 		while (!bio || bio_add_page() < io_len) {
> > 			if (bio)
> > 				iomap_dio_submit_bio(iter, dio, bio, pos);
> > 			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > 					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > 		 	fscrypt_set_bio_crypt_ctx(bio, inode,
> > 					pos >> inode->i_blkbits, GFP_KERNEL);
> > 		}
> > 	}
> 
> Well, no.  The right way is:
> 
> 	bio = iomap_dio_alloc_bio(iter, dio, 1,
> 			REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 	__bio_add_page(bio, page, len, 0);

no?  len can be > PAGE_SIZE.  and that can be true in the folio version
too, because we won't necessarily be able to allocate the THP.

> 	fscrypt_set_bio_crypt_ctx(bio, inode,
> 			pos >> inode->i_blkbits, GFP_KERNEL);
> 
> (or even better the folio version)

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-07 16:11       ` Matthew Wilcox
@ 2024-05-07 16:13         ` Christoph Hellwig
  2024-05-08  4:24           ` Matthew Wilcox
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-07 16:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Luis Chamberlain, akpm, djwong, brauner,
	david, chandan.babu, hare, ritesh.list, john.g.garry, ziy,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, kernel

On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > 	__bio_add_page(bio, page, len, 0);
> 
> no?  len can be > PAGE_SIZE.

Yes. So what?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-05-07  8:40   ` John Garry
@ 2024-05-07 21:13     ` Darrick J. Wong
  2024-05-08 11:28       ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2024-05-07 21:13 UTC (permalink / raw)
  To: John Garry
  Cc: Luis Chamberlain, akpm, willy, brauner, david, chandan.babu,
	hare, ritesh.list, ziy, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, kernel

On Tue, May 07, 2024 at 09:40:58AM +0100, John Garry wrote:
> On 03/05/2024 10:53, Luis Chamberlain wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> > make the calculation generic so that page cache count can be calculated
> > correctly for LBS.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >   fs/xfs/xfs_mount.c | 9 ++++++++-
> >   1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index df370eb5dc15..56d71282972a 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
> >   {
> >   	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> >   	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> > +	uint64_t max_index;
> > +	uint64_t max_bytes;

Extra nit: the  ^ indentation of the names should have tabs, like the
other xfs functions.

--D

> nit: any other XFS code which I have seen puts the declarations before any
> ASSERT() calls
> 
> > +
> > +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> > +		return -EFBIG;
> >   	/* Limited by ULONG_MAX of page cache index */
> > -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> > +	max_index = max_bytes >> PAGE_SHIFT;
> > +
> > +	if (max_index > ULONG_MAX)
> >   		return -EFBIG;
> >   	return 0;
> >   }
> 
> 
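
Taking both nits together, the function prologue would end up looking
something like this (shape only, not the final patch):

	int
	xfs_sb_validate_fsb_count(
		xfs_sb_t	*sbp,
		uint64_t	nblocks)
	{
		uint64_t	max_index;
		uint64_t	max_bytes;

		ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
		ASSERT(sbp->sb_blocklog >= BBSHIFT);
		...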

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-07 16:13         ` Christoph Hellwig
@ 2024-05-08  4:24           ` Matthew Wilcox
  2024-05-08 11:22             ` Pankaj Raghav (Samsung)
  2024-05-08 11:36             ` Christoph Hellwig
  0 siblings, 2 replies; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-08  4:24 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Luis Chamberlain, akpm, djwong, brauner, david, chandan.babu,
	hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav, kernel

On Tue, May 07, 2024 at 09:13:17AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > > 	__bio_add_page(bio, page, len, 0);
> > 
> > no?  len can be > PAGE_SIZE.
> 
> Yes. So what?

the zero_page is only PAGE_SIZE bytes long.  so you'd be writing
from the page that's after the zero page, whatever contents that has.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-07 16:00   ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Matthew Wilcox
  2024-05-07 16:10     ` Christoph Hellwig
@ 2024-05-08 11:20     ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-08 11:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, akpm, djwong, brauner, david, chandan.babu,
	hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav

On Tue, May 07, 2024 at 05:00:28PM +0100, Matthew Wilcox wrote:
> On Fri, May 03, 2024 at 02:53:49AM -0700, Luis Chamberlain wrote:
> > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> >  				  GFP_KERNEL);
> > +
> >  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> >  	bio->bi_private = dio;
> >  	bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > -	__bio_add_page(bio, page, len, 0);
> > +	while (len) {
> > +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> > +
> > +		__bio_add_page(bio, page, io_len, 0);
> > +		len -= io_len;
> > +	}
> >  	iomap_dio_submit_bio(iter, dio, bio, pos);
> 
> If the len is more than PAGE_SIZE * BIO_MAX_VECS, __bio_add_page()
> will fail silently.  I hate this interface.

I added a WARN_ON_ONCE() at the start of the function so that it does
not silently fail:
	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));

This function is only used for sub-block zeroing, and I don't think we will
cross a 1MB block size in the foreseeable future; even if we do, this warning
will tell us so that it can be changed.

> 
> You should be doing something like ...
> 
> 	while (len) {
> 		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> 
> 		while (!bio || bio_add_page() < io_len) {
> 			if (bio)
> 				iomap_dio_submit_bio(iter, dio, bio, pos);
> 			bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> 					REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 		 	fscrypt_set_bio_crypt_ctx(bio, inode,
> 					pos >> inode->i_blkbits, GFP_KERNEL);
> 		}
> 	}

-- 
Pankaj Raghav

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-08  4:24           ` Matthew Wilcox
@ 2024-05-08 11:22             ` Pankaj Raghav (Samsung)
  2024-05-08 11:36             ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-08 11:22 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Luis Chamberlain, akpm, djwong, brauner,
	david, chandan.babu, hare, ritesh.list, john.g.garry, ziy,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav

On Wed, May 08, 2024 at 05:24:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 09:13:17AM -0700, Christoph Hellwig wrote:
> > On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > > > 	__bio_add_page(bio, page, len, 0);
> > > 
> > > no?  len can be > PAGE_SIZE.
> > 
> > Yes. So what?
> 
> the zero_page is only PAGE_SIZE bytes long.  so you'd be writing
> from the page that's after the zero page, whatever contents that has.

This is the exact reason this patch was added. We were writing garbage
values past PAGE_SIZE for LBS, which led to FS corruption. This is not
an issue where FS block size <= page size.
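
Concretely (illustrative numbers: a 16k block on a 4k page system), the
old code would do:

	__bio_add_page(bio, ZERO_PAGE(0), 16384, 0);

which builds a single bvec spanning 12k past the end of the 4k zero page,
so whatever sits in the three pages of memory after ZERO_PAGE(0) got
written out to disk.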

-- 
Pankaj Raghav

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-05-07 21:13     ` Darrick J. Wong
@ 2024-05-08 11:28       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-08 11:28 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: John Garry, Luis Chamberlain, akpm, willy, brauner, david,
	chandan.babu, hare, ritesh.list, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav

On Tue, May 07, 2024 at 02:13:10PM -0700, Darrick J. Wong wrote:
> On Tue, May 07, 2024 at 09:40:58AM +0100, John Garry wrote:
> > On 03/05/2024 10:53, Luis Chamberlain wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> > > make the calculation generic so that page cache count can be calculated
> > > correctly for LBS.
> > > 
> > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >   fs/xfs/xfs_mount.c | 9 ++++++++-
> > >   1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index df370eb5dc15..56d71282972a 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
> > >   {
> > >   	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> > >   	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> > > +	uint64_t max_index;
> > > +	uint64_t max_bytes;
> 
> Extra nit: the  ^ indentation of the names should have tabs, like the
> other xfs functions.
Thanks John and Darrick, I will fold these into the next series.

> 
> --D
> 
> > nit: any other XFS code which I have seen puts the declarations before any
> > ASSERT() calls
> > 
> > > +
> > > +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> > > +		return -EFBIG;
> > >   	/* Limited by ULONG_MAX of page cache index */
> > > -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> > > +	max_index = max_bytes >> PAGE_SHIFT;
> > > +
> > > +	if (max_index > ULONG_MAX)
> > >   		return -EFBIG;
> > >   	return 0;
> > >   }
> > 
> > 

-- 
Pankaj Raghav

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-05-08  4:24           ` Matthew Wilcox
  2024-05-08 11:22             ` Pankaj Raghav (Samsung)
@ 2024-05-08 11:36             ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-08 11:36 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Luis Chamberlain, akpm, djwong, brauner,
	david, chandan.babu, hare, ritesh.list, john.g.garry, ziy,
	linux-fsdevel, linux-xfs, linux-mm, linux-block, gost.dev,
	p.raghav, kernel

On Wed, May 08, 2024 at 05:24:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 09:13:17AM -0700, Christoph Hellwig wrote:
> > On Tue, May 07, 2024 at 05:11:58PM +0100, Matthew Wilcox wrote:
> > > > 	__bio_add_page(bio, page, len, 0);
> > > 
> > > no?  len can be > PAGE_SIZE.
> > 
> > Yes. So what?
> 
> the zero_page is only PAGE_SIZE bytes long.  so you'd be writing
> from the page that's after the zero page, whatever contents that has.

Except that the whole point of the exercise is to use the huge folio
so that we don't run past the end of the zero page.  Yes, if we use
ZERO_PAGE we need to chunk things up and use bio_add_page instead of
__bio_add_page, check the return value and potentially deal with multiple
bios.  I'd rather avoid that, though.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-07 16:11     ` Christoph Hellwig
@ 2024-05-08 11:39       ` Pankaj Raghav (Samsung)
  2024-05-08 11:43         ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-08 11:39 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hch, willy, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Tue, May 07, 2024 at 09:11:51AM -0700, Christoph Hellwig wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > +	if (len > PAGE_SIZE) {
> > +		folio = mm_get_huge_zero_folio(current->mm);
> 
> I don't think the mm_struct based interfaces work well here, as I/O
> completions don't come in through the same mm.  You'll want to use
> lower level interfaces like get_huge_zero_page and use them at
> mount time.

At the moment, we can get a reference to the huge zero folio only through
the mm interface. 

Even if we change the lower-level interface to return a THP, it can still fail
at mount time and we will need the fallback, right?

> 
> > +		if (!folio)
> > +			folio = zero_page_folio;
> 
> And then don't bother with a fallback.
> 

-- 
Pankaj Raghav

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-08 11:39       ` Pankaj Raghav (Samsung)
@ 2024-05-08 11:43         ` Christoph Hellwig
  2024-05-09 12:31           ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-08 11:43 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, hch, willy, mcgrof, akpm, brauner,
	chandan.babu, david, djwong, gost.dev, hare, john.g.garry,
	linux-block, linux-fsdevel, linux-mm, linux-xfs, p.raghav,
	ritesh.list, ziy

On Wed, May 08, 2024 at 11:39:49AM +0000, Pankaj Raghav (Samsung) wrote:
> At the moment, we can get a reference to the huge zero folio only through
> the mm interface. 
> 
> Even if we change the lower-level interface to return a THP, it can still fail
> at mount time and we will need the fallback, right?

Well, that's why I suggest doing it at mount time.  Asking for it deep
down in the write code is certainly going to be a bit problematic.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-08 11:43         ` Christoph Hellwig
@ 2024-05-09 12:31           ` Pankaj Raghav (Samsung)
  2024-05-09 12:46             ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-09 12:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hch, willy, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Wed, May 08, 2024 at 04:43:54AM -0700, Christoph Hellwig wrote:
> On Wed, May 08, 2024 at 11:39:49AM +0000, Pankaj Raghav (Samsung) wrote:
> > At the moment, we can get a reference to the huge zero folio only through
> > the mm interface. 
> > 
> > Even if we change the lower-level interface to return a THP, it can still fail
> > at mount time and we will need the fallback, right?
> 
> Well, that's why I suggest doing it at mount time.  Asking for it deep
> down in the write code is certainly going to be a bit problematic.

Makes sense. But failing to mount because we can't get a huge zero folio
seems wrong as we still can't guarantee it even at mount time.

With the current infrastructure I don't see any way of getting a huge zero
folio that is guaranteed, so that we don't need any fallback.

Let me know what you think.
-- 
Pankaj Raghav

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 12:31           ` Pankaj Raghav (Samsung)
@ 2024-05-09 12:46             ` Christoph Hellwig
  2024-05-09 12:55               ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-09 12:46 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, hch, willy, mcgrof, akpm, brauner,
	chandan.babu, david, djwong, gost.dev, hare, john.g.garry,
	linux-block, linux-fsdevel, linux-mm, linux-xfs, p.raghav,
	ritesh.list, ziy

On Thu, May 09, 2024 at 12:31:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > Well, that's why I suggest doing it at mount time.  Asking for it deep
> > down in the write code is certainly going to be a bit problematic.
> 
> Makes sense. But failing to mount because we can't get a huge zero folio
> seems wrong as we still can't guarantee it even at mount time.
> 
> With the current infrastructure I don't see any way of getting a huge zero
> folio that is guaranteed, so that we don't need any fallback.

You export get_huge_zero_page, put_huge_zero_page (they might need a
rename and kerneldoc for the final version) and huge_zero_folio or a
wrapper to get it, and then call get_huge_zero_page from mount and
put_huge_zero_page from unmount, and just use huge_zero_folio, which is
guaranteed to exist once get_huge_zero_page has succeeded.
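
In outline (a hypothetical shape, reusing the existing mm/huge_memory.c
helpers once they are exported):

	/* at mount time: pin the huge zero folio, fail the mount otherwise */
	if (mp->m_sb.sb_blocksize > PAGE_SIZE && !get_huge_zero_page())
		return -ENOMEM;

	/* in the I/O path: huge_zero_folio is now guaranteed to exist */

	/* at unmount time: drop the reference */
	put_huge_zero_page();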


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 12:46             ` Christoph Hellwig
@ 2024-05-09 12:55               ` Pankaj Raghav (Samsung)
  2024-05-09 12:58                 ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-09 12:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hch, willy, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Thu, May 09, 2024 at 05:46:55AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 12:31:07PM +0000, Pankaj Raghav (Samsung) wrote:
> > > Well, that's why I suggest doing it at mount time.  Asking for it deep
> > > down in the write code is certainly going to be a bit problematic.
> > 
> > Makes sense. But failing to mount because we can't get a huge zero folio
> > seems wrong as we still can't guarantee it even at mount time.
> > 
> > With the current infrastructure I don't see any way of getting a huge zero
> > folio that is guaranteed, so that we don't need any fallback.
> 
> You export get_huge_zero_page, put_huge_zero_page (they might need a
> rename and kerneldoc for the final version) and huge_zero_folio or a
> wrapper to get it, and then call get_huge_zero_page from mount and

static bool get_huge_zero_page(void)
{
	struct folio *zero_folio;
retry:
	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
		return true;

	zero_folio = folio_alloc((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
			HPAGE_PMD_ORDER);
	if (!zero_folio) {

We might still fail here during mount. My question is: do we also fail
the mount if folio_alloc fails?

		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
		return false;
	}

...
> put_huge_zero_page from unmount, and just use huge_zero_folio, which is
> guaranteed to exist once get_huge_zero_page has succeeded.
> 

-- 
Pankaj Raghav

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 12:55               ` Pankaj Raghav (Samsung)
@ 2024-05-09 12:58                 ` Christoph Hellwig
  2024-05-09 14:32                   ` Darrick J. Wong
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-09 12:58 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, hch, willy, mcgrof, akpm, brauner,
	chandan.babu, david, djwong, gost.dev, hare, john.g.garry,
	linux-block, linux-fsdevel, linux-mm, linux-xfs, p.raghav,
	ritesh.list, ziy

On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> We might still fail here during mount. My question is: do we also fail
> the mount if folio_alloc fails?

Yes.  Like any other allocation that fails at mount time.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 12:58                 ` Christoph Hellwig
@ 2024-05-09 14:32                   ` Darrick J. Wong
  2024-05-09 15:05                     ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2024-05-09 14:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav (Samsung),
	hch, willy, mcgrof, akpm, brauner, chandan.babu, david, gost.dev,
	hare, john.g.garry, linux-block, linux-fsdevel, linux-mm,
	linux-xfs, p.raghav, ritesh.list, ziy

On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > We might still fail here during mount. My question is: do we also fail
> > the mount if folio_alloc fails?
> 
> Yes.  Like any other allocation that fails at mount time.

How hard is it to fallback to regular zero-page if you can't allocate
the zero-hugepage?  I think most sysadmins would rather mount with
reduced zeroing performance than get an ENOMEM.

--D

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 14:32                   ` Darrick J. Wong
@ 2024-05-09 15:05                     ` Christoph Hellwig
  2024-05-09 15:08                       ` Darrick J. Wong
  0 siblings, 1 reply; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-09 15:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Pankaj Raghav (Samsung),
	hch, willy, mcgrof, akpm, brauner, chandan.babu, david, gost.dev,
	hare, john.g.garry, linux-block, linux-fsdevel, linux-mm,
	linux-xfs, p.raghav, ritesh.list, ziy

On Thu, May 09, 2024 at 07:32:50AM -0700, Darrick J. Wong wrote:
> On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> > On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > > We might still fail here during mount. My question is: do we also fail
> > > the mount if folio_alloc fails?
> > 
> > Yes.  Like any other allocation that fails at mount time.
> 
> How hard is it to fallback to regular zero-page if you can't allocate
> the zero-hugepage?

We'd need the bio allocation and bio_add_page loop.  Not the end
of the world, but also a bit annoying.  If we do that we might as
well just do it unconditionally.

> I think most sysadmins would rather mount with
> reduced zeroing performance than get an ENOMEM.

If you don't have 2MB free for the zero huge folio, are you going to
do useful things with your large block size XFS file system which
only makes sense for giant storage sizes?

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 15:05                     ` Christoph Hellwig
@ 2024-05-09 15:08                       ` Darrick J. Wong
  2024-05-09 15:09                         ` Christoph Hellwig
  0 siblings, 1 reply; 51+ messages in thread
From: Darrick J. Wong @ 2024-05-09 15:08 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav (Samsung),
	hch, willy, mcgrof, akpm, brauner, chandan.babu, david, gost.dev,
	hare, john.g.garry, linux-block, linux-fsdevel, linux-mm,
	linux-xfs, p.raghav, ritesh.list, ziy

On Thu, May 09, 2024 at 08:05:24AM -0700, Christoph Hellwig wrote:
> On Thu, May 09, 2024 at 07:32:50AM -0700, Darrick J. Wong wrote:
> > On Thu, May 09, 2024 at 05:58:23AM -0700, Christoph Hellwig wrote:
> > > On Thu, May 09, 2024 at 12:55:14PM +0000, Pankaj Raghav (Samsung) wrote:
> > > > We might still fail here during mount. My question is: do we also fail
> > > > the mount if folio_alloc fails?
> > > 
> > > Yes.  Like any other allocation that fails at mount time.
> > 
> > How hard is it to fallback to regular zero-page if you can't allocate
> > the zero-hugepage?
> 
> We'd need the bio allocation and bio_add_page loop.  Not the end
> of the world, but also a bit annoying.  If we do that we might as
> well just do it unconditionally.
> 
> > I think most sysadmins would rather mount with
> > reduced zeroing performance than get an ENOMEM.
> 
> If you don't have 2MB free for the zero huge folio, are you going to
> do useful things with your large block size XFS file system which
> only makes sense for giant storage sizes?

Oh.  Right, this is for bs>ps.  For that case it makes sense to fail the
mount.  I was only thinking about bs<=ps with large folios, where it
doesn't.

(Would we use the zero-hugepage for large folios on a 4k fsblock fs?)

--D

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-09 15:08                       ` Darrick J. Wong
@ 2024-05-09 15:09                         ` Christoph Hellwig
  0 siblings, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-09 15:09 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Pankaj Raghav (Samsung),
	hch, willy, mcgrof, akpm, brauner, chandan.babu, david, gost.dev,
	hare, john.g.garry, linux-block, linux-fsdevel, linux-mm,
	linux-xfs, p.raghav, ritesh.list, ziy

On Thu, May 09, 2024 at 08:08:28AM -0700, Darrick J. Wong wrote:
> Oh.  Right, this is for bs>ps.  For that case it makes sense to fail the
> mount.  I was only thinking about bs<=ps with large folios, where it
> doesn't.
> 
> (Would we use the zero-hugepage for large folios on a 4k fsblock fs?)

The direct I/O zeroing code always deals with sub-blocksize amounts.
The buffered zeroing path can deal with larger amounts in a few
cases, but that goes through the page cache anyway.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-07 14:58   ` [RFC] iomap: use huge zero folio in iomap_dio_zero Pankaj Raghav (Samsung)
  2024-05-07 15:11     ` Zi Yan
  2024-05-07 16:11     ` Christoph Hellwig
@ 2024-05-15  0:50     ` Matthew Wilcox
  2024-05-15  2:34       ` Keith Busch
  2024-05-15 11:48       ` Christoph Hellwig
  2 siblings, 2 replies; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-15  0:50 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: hch, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.

So the block people say we're doing this all wrong.  We should be
issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
using the ZERO_PAGE if the hardware doesn't natively support
WRITE_ZEROES or a DISCARD that zeroes or ...


I suspect all these places should be checked to see if they can use
WRITE_ZEROES too:

fs/bcachefs/checksum.c:                         page_address(ZERO_PAGE(0)), page_len);
fs/bcachefs/io_write.c:         if (bv->bv_page != ZERO_PAGE(0))
fs/bcachefs/quota.c:            if (memcmp(mq, page_address(ZERO_PAGE(0)), sizeof(*mq))) {
fs/cramfs/inode.c:              return page_address(ZERO_PAGE(0));
fs/crypto/bio.c:                ret = bio_add_page(bio, ZERO_PAGE(0), bytes_this_page, 0);
fs/crypto/bio.c:                                                      ZERO_PAGE(0), pages[i],
fs/direct-io.c:         dio->pages[0] = ZERO_PAGE(0);
fs/direct-io.c: page = ZERO_PAGE(0);
fs/iomap/direct-io.c:   struct page *page = ZERO_PAGE(0);


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-15  0:50     ` Matthew Wilcox
@ 2024-05-15  2:34       ` Keith Busch
  2024-05-15  4:04         ` Matthew Wilcox
  2024-05-15 11:48       ` Christoph Hellwig
  1 sibling, 1 reply; 51+ messages in thread
From: Keith Busch @ 2024-05-15  2:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pankaj Raghav (Samsung),
	hch, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.
> 
> So the block people say we're doing this all wrong.  We should be
> issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> using the ZERO_PAGE if the hardware doesn't natively support
> WRITE_ZEROES or a DISCARD that zeroes or ...

Wait a second, I think you've gone too far if you're setting the bio op
to REQ_OP_WRITE_ZEROES. The block layer handles the difference only
through the blkdev_issue_zeroout() helper. If you actually submit a bio
with that op to a block device that doesn't support it, you'll just get
a BLK_STS_NOTSUPP error from submit_bio_noacct().
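
For reference, the helper that hides the WRITE_ZEROES/ZERO_PAGE fallback
has this signature (shown for context only; it is synchronous, which is
part of what makes it awkward for the async direct I/O path below):

	int blkdev_issue_zeroout(struct block_device *bdev, sector_t sector,
			sector_t nr_sects, gfp_t gfp_mask, unsigned flags);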

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-15  2:34       ` Keith Busch
@ 2024-05-15  4:04         ` Matthew Wilcox
  2024-05-15 15:59           ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-15  4:04 UTC (permalink / raw)
  To: Keith Busch
  Cc: Pankaj Raghav (Samsung),
	hch, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Tue, May 14, 2024 at 08:34:09PM -0600, Keith Busch wrote:
> On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> > On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > > block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.
> > 
> > So the block people say we're doing this all wrong.  We should be
> > issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> > using the ZERO_PAGE if the hardware doesn't natively support
> > WRITE_ZEROES or a DISCARD that zeroes or ...
> 
> Wait a second, I think you've gone too far if you're setting the bio op
> to REQ_OP_WRITE_ZEROES. The block layer handles the difference only
> through the blkdev_issue_zeroout() helper. If you actually submit a bio
> with that op to a block device that doesn't support it, you'll just get
> a BLK_STS_NOTSUPP error from submit_bio_noacct().

Ohh.  This is a bit awkward, because this is the iomap direct IO path.
I don't see an obvious way to get the semantics we want with the current
blkdev_issue_zeroout().  For reference, here's the current function:

static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
                loff_t pos, unsigned len)
{
        struct inode *inode = file_inode(dio->iocb->ki_filp);
        struct page *page = ZERO_PAGE(0);
        struct bio *bio;

        bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
        fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
                                  GFP_KERNEL);
        bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;

        __bio_add_page(bio, page, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
}

and then:

static void iomap_dio_submit_bio(const struct iomap_iter *iter,
                struct iomap_dio *dio, struct bio *bio, loff_t pos)
{
        struct kiocb *iocb = dio->iocb;

        atomic_inc(&dio->ref);

        /* Sync dio can't be polled reliably */
        if ((iocb->ki_flags & IOCB_HIPRI) && !is_sync_kiocb(iocb)) {
                bio_set_polled(bio, iocb);
                WRITE_ONCE(iocb->private, bio);
        }

        if (dio->dops && dio->dops->submit_io)
                dio->dops->submit_io(iter, bio, pos);
        else
                submit_bio(bio);
}

so unless submit_bio() can handle the fallback to "create a new bio
full of zeroes and resubmit it to the device" if the original fails,
we're a little mismatched.  I'm not really familiar with either part of
this code, so I don't have much in the way of bright ideas.  Perhaps
we go back to the "allocate a large folio at filesystem mount" plan.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-15  0:50     ` Matthew Wilcox
  2024-05-15  2:34       ` Keith Busch
@ 2024-05-15 11:48       ` Christoph Hellwig
  1 sibling, 0 replies; 51+ messages in thread
From: Christoph Hellwig @ 2024-05-15 11:48 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pankaj Raghav (Samsung),
	hch, mcgrof, akpm, brauner, chandan.babu, david, djwong,
	gost.dev, hare, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Wed, May 15, 2024 at 01:50:53AM +0100, Matthew Wilcox wrote:
> On Tue, May 07, 2024 at 04:58:12PM +0200, Pankaj Raghav (Samsung) wrote:
> > Instead of looping with ZERO_PAGE, use a huge zero folio to zero pad the
> > block. Fall back to ZERO_PAGE if mm_get_huge_zero_folio() fails.
> 
> So the block people say we're doing this all wrong.  We should be
> issuing a REQ_OP_WRITE_ZEROES bio, and the block layer will take care of
> using the ZERO_PAGE if the hardware doesn't natively support
> WRITE_ZEROES or a DISCARD that zeroes or ...

Not sure who "the block people" are, but while this sounds smart
it actually is a really bad idea.

Think about what we are doing here: we zero parts of a file system
block as part of a direct I/O write operation.  So the amount is
relatively small and it is part of a fast path I/O operation.  It
also will most likely land on the indirection entry on the device.

If you use a write zeroes command, it will go down a separate slow path
in the device instead of using the highly optimized write path and
slow the whole operation down.  Even worse, there is a chance that
it will increase write amplification because there are two separate
operations now instead of one merged one (either a block layer or
device merge).

And I'm not sure which "block layer person" still doesn't understand
that discards do not zero data, but maybe we'll need yet another
education campaign there.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 05/11] mm: split a folio in minimum folio order chunks
  2024-05-03  9:53 ` [PATCH v5 05/11] mm: split a folio in minimum folio order chunks Luis Chamberlain
  2024-05-03 14:53   ` Zi Yan
@ 2024-05-15 15:32   ` Matthew Wilcox
  2024-05-16 14:56     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-15 15:32 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: akpm, djwong, brauner, david, chandan.babu, hare, ritesh.list,
	john.g.garry, ziy, linux-fsdevel, linux-xfs, linux-mm,
	linux-block, gost.dev, p.raghav, kernel

On Fri, May 03, 2024 at 02:53:47AM -0700, Luis Chamberlain wrote:
> +int split_folio_to_list(struct folio *folio, struct list_head *list);

...

> +static inline int split_folio_to_list(struct page *page, struct list_head *list)
> +{

Type mismatch.  Surprised the build bots haven't whined yet.

>  
> +		min_order = mapping_min_folio_order(folio->mapping);
> +		if (new_order < min_order) {
> +			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
> +				     min_order);
> +			ret = -EINVAL;
> +			goto out;
> +		}

Wouldn't we prefer this as:

		if (VM_WARN_ONCE(new_order < min_order,
				"Cannot split mapped folio below min-order: %u",
				min_order)) {
			ret = -EINVAL;
			goto out;
		}


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-15  4:04         ` Matthew Wilcox
@ 2024-05-15 15:59           ` Pankaj Raghav (Samsung)
  2024-05-15 18:03             ` Matthew Wilcox
  0 siblings, 1 reply; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-15 15:59 UTC (permalink / raw)
  To: Matthew Wilcox, david, djwong, hch
  Cc: Keith Busch, mcgrof, akpm, brauner, chandan.babu, gost.dev, hare,
	john.g.garry, linux-block, linux-fsdevel, linux-mm, linux-xfs,
	p.raghav, ritesh.list, ziy

> so unless submit_bio() can handle the fallback to "create a new bio
> full of zeroes and resubmit it to the device" if the original fails,
> we're a little mismatched.  I'm not really familiar with either part of
> this code, so I don't have much in the way of bright ideas.  Perhaps
> we go back to the "allocate a large folio at filesystem mount" plan.

So one thing that became clear after yesterday's discussion was to
**not** use a PMD page for sub-block zeroing, as on some architectures
(such as ARM) we would be using a lot of memory to zero out a 64k FS block.

So Chinner proposed the idea of using the iomap_init function to allocate
a large zero folio that could be used in iomap_dio_zero().

The general agreement was that a 64k folio is enough for now. We can
always increase and optimize it in the future when required.

This is a rough prototype of what it might look like:

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..dad5734b2f75 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,12 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+extern struct folio *zero_fsb_folio;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..48235765df7a 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct folio *zero_fsb_folio;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
                struct iomap_folio_state *ifs)
@@ -1985,8 +1986,15 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
+
 static int __init iomap_init(void)
 {
+       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
+
+       if (!addr)
+               return -ENOMEM;
+
+       zero_fsb_folio = virt_to_folio(addr);
        return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
                           offsetof(struct iomap_ioend, io_bio),
                           BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..59a65c3ccf13 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
                loff_t pos, unsigned len)
 {
        struct inode *inode = file_inode(dio->iocb->ki_filp);
-       struct page *page = ZERO_PAGE(0);
        struct bio *bio;
 
-       bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+       /*
+        * The zero folio used is 64k.
+        */
+       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
+
+       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
        fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
                                  GFP_KERNEL);
+
        bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;
 
-       __bio_add_page(bio, page, len, 0);
+       bio_add_folio_nofail(bio, zero_fsb_folio, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
 }


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-15 15:59           ` Pankaj Raghav (Samsung)
@ 2024-05-15 18:03             ` Matthew Wilcox
  2024-05-16 15:02               ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-15 18:03 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: david, djwong, hch, Keith Busch, mcgrof, akpm, brauner,
	chandan.babu, gost.dev, hare, john.g.garry, linux-block,
	linux-fsdevel, linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
>  static int __init iomap_init(void)
>  {
> +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);

Don't use XFS coding style outside XFS.

kzalloc() does not guarantee page alignment much less alignment to
a folio.  It happens to work today, but that is an implementation
artefact.

> +
> +       if (!addr)
> +               return -ENOMEM;
> +
> +       zero_fsb_folio = virt_to_folio(addr);

We also don't guarantee that calling kzalloc() gives you a virtual
address that can be converted to a folio.  You need to allocate a folio
to be sure that you get a folio.

Of course, you don't actually need a folio.  You don't need any of the
folio metadata and can just use raw pages.

> +       /*
> +        * The zero folio used is 64k.
> +        */
> +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));

PAGE_SIZE is not necessarily 4KiB.

> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);

The point was that we now only need one biovec, not MAX.
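
i.e. something along these lines (a sketch of the one-biovec
allocation, matching what the follow-up version below does):

	bio = iomap_dio_alloc_bio(iter, dio, 1,
				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);

The zero range is added to the bio as a single contiguous chunk, so
one biovec is all it ever needs.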


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v5 05/11] mm: split a folio in minimum folio order chunks
  2024-05-15 15:32   ` Matthew Wilcox
@ 2024-05-16 14:56     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-16 14:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, akpm, djwong, brauner, david, chandan.babu,
	hare, ritesh.list, john.g.garry, ziy, linux-fsdevel, linux-xfs,
	linux-mm, linux-block, gost.dev, p.raghav

On Wed, May 15, 2024 at 04:32:36PM +0100, Matthew Wilcox wrote:
> On Fri, May 03, 2024 at 02:53:47AM -0700, Luis Chamberlain wrote:
> > +int split_folio_to_list(struct folio *folio, struct list_head *list);
> 
> ...
> 
> > +static inline int split_folio_to_list(struct page *page, struct list_head *list)
> > +{
> 
> Type mismatch.  Surprised the build bots didn't whine yet.

Good catch. As we always built with CONFIG_TRANSPARENT_HUGEPAGE
enabled, the !THP inline stub was never compiled, so we didn't detect
this issue.
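
The fix is just to give the !CONFIG_TRANSPARENT_HUGEPAGE stub the same
signature as the declaration; a minimal sketch (the stub's return
value is assumed to mirror the existing split_huge_page_to_list()
stub):

	int split_folio_to_list(struct folio *folio, struct list_head *list);

	#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
	static inline int split_folio_to_list(struct folio *folio,
					      struct list_head *list)
	{
		return 0;
	}
	#endif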

> 
> >  
> > +		min_order = mapping_min_folio_order(folio->mapping);
> > +		if (new_order < min_order) {
> > +			VM_WARN_ONCE(1, "Cannot split mapped folio below min-order: %u",
> > +				     min_order);
> > +			ret = -EINVAL;
> > +			goto out;
> > +		}
> 
> Wouldn't we prefer this as:
> 
> 		if (VM_WARN_ONCE(new_order < min_order,
> 				"Cannot split mapped folio below min-order: %u",
> 				min_order)) {
> 			ret = -EINVAL;
> 			goto out;
> 		}
> 
I don't think so:
#define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format)

So we get a build error as follows:
In file included from ./include/linux/mm.h:6,
                 from mm/huge_memory.c:8:
mm/huge_memory.c: In function ‘split_huge_page_to_list_to_order’:
./include/linux/mmdebug.h:93:39: error: void value not ignored as it ought to be
   93 | #define VM_WARN_ONCE(cond, format...) (void)WARN_ONCE(cond, format)
      |                                       ^
mm/huge_memory.c:3158:21: note: in expansion of macro ‘VM_WARN_ONCE’
 3158 |                 if (VM_WARN_ONCE(new_order < min_order,
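
Plain WARN_ONCE() does propagate the condition, so a variant along
these lines would build (just a sketch, trading the VM_ namespace for
a testable return value):

	if (WARN_ONCE(new_order < min_order,
		      "Cannot split mapped folio below min-order: %u",
		      min_order)) {
		ret = -EINVAL;
		goto out;
	}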


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-15 18:03             ` Matthew Wilcox
@ 2024-05-16 15:02               ` Pankaj Raghav (Samsung)
  2024-05-17 12:36                 ` Hannes Reinecke
  0 siblings, 1 reply; 51+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-05-16 15:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: david, djwong, hch, Keith Busch, mcgrof, akpm, brauner,
	chandan.babu, gost.dev, hare, john.g.garry, linux-block,
	linux-fsdevel, linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
> >  static int __init iomap_init(void)
> >  {
> > +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
> 
> Don't use XFS coding style outside XFS.
> 
> kzalloc() does not guarantee page alignment much less alignment to
> a folio.  It happens to work today, but that is an implementation
> artefact.
> 
> > +
> > +       if (!addr)
> > +               return -ENOMEM;
> > +
> > +       zero_fsb_folio = virt_to_folio(addr);
> 
> We also don't guarantee that calling kzalloc() gives you a virtual
> address that can be converted to a folio.  You need to allocate a folio
> to be sure that you get a folio.
> 
> Of course, you don't actually need a folio.  You don't need any of the
> folio metadata and can just use raw pages.
> 
> > +       /*
> > +        * The zero folio used is 64k.
> > +        */
> > +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
> 
> PAGE_SIZE is not necessarily 4KiB.
> 
> > +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> 
> The point was that we now only need one biovec, not MAX.
> 

Thanks for the comments. I think it all makes sense:

diff --git a/fs/internal.h b/fs/internal.h
index 7ca738904e34..e152b77a77e4 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
 int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
                get_block_t *get_block, const struct iomap *iomap);
 
+/*
+ * iomap/buffered-io.c
+ */
+
+#define ZERO_FSB_SIZE (65536)
+#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
+extern struct page *zero_fs_block;
+
 /*
  * char_dev.c
  */
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..36d2f7edd310 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -42,6 +42,7 @@ struct iomap_folio_state {
 };
 
 static struct bio_set iomap_ioend_bioset;
+struct page *zero_fs_block;
 
 static inline bool ifs_is_fully_uptodate(struct folio *folio,
                struct iomap_folio_state *ifs)
@@ -1985,8 +1986,13 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
 }
 EXPORT_SYMBOL_GPL(iomap_writepages);
 
+
 static int __init iomap_init(void)
 {
+       zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
+       if (!zero_fs_block)
+               return -ENOMEM;
+
        return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
                           offsetof(struct iomap_ioend, io_bio),
                           BIOSET_NEED_BVECS);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..50c2bca8a347 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -236,17 +236,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
                loff_t pos, unsigned len)
 {
        struct inode *inode = file_inode(dio->iocb->ki_filp);
-       struct page *page = ZERO_PAGE(0);
        struct bio *bio;
 
+       /*
+        * Max block size supported is 64k
+        */
+       WARN_ON_ONCE(len > ZERO_FSB_SIZE);
+
        bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
        fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
                                  GFP_KERNEL);
+
        bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
        bio->bi_private = dio;
        bio->bi_end_io = iomap_dio_bio_end_io;
 
-       __bio_add_page(bio, page, len, 0);
+       __bio_add_page(bio, zero_fs_block, len, 0);
        iomap_dio_submit_bio(iter, dio, bio, pos);
 }


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-16 15:02               ` Pankaj Raghav (Samsung)
@ 2024-05-17 12:36                 ` Hannes Reinecke
  2024-05-17 12:56                   ` Hannes Reinecke
  2024-05-17 13:30                   ` Matthew Wilcox
  0 siblings, 2 replies; 51+ messages in thread
From: Hannes Reinecke @ 2024-05-17 12:36 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Matthew Wilcox
  Cc: david, djwong, hch, Keith Busch, mcgrof, akpm, brauner,
	chandan.babu, gost.dev, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On 5/16/24 17:02, Pankaj Raghav (Samsung) wrote:
> On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
>> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
>>>   static int __init iomap_init(void)
>>>   {
>>> +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
>>
>> Don't use XFS coding style outside XFS.
>>
>> kzalloc() does not guarantee page alignment much less alignment to
>> a folio.  It happens to work today, but that is an implementation
>> artefact.
>>
>>> +
>>> +       if (!addr)
>>> +               return -ENOMEM;
>>> +
>>> +       zero_fsb_folio = virt_to_folio(addr);
>>
>> We also don't guarantee that calling kzalloc() gives you a virtual
>> address that can be converted to a folio.  You need to allocate a folio
>> to be sure that you get a folio.
>>
>> Of course, you don't actually need a folio.  You don't need any of the
>> folio metadata and can just use raw pages.
>>
>>> +       /*
>>> +        * The zero folio used is 64k.
>>> +        */
>>> +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
>>
>> PAGE_SIZE is not necessarily 4KiB.
>>
>>> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>>> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>>
>> The point was that we now only need one biovec, not MAX.
>>
> 
> Thanks for the comments. I think it all makes sense:
> 
> diff --git a/fs/internal.h b/fs/internal.h
> index 7ca738904e34..e152b77a77e4 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
>   int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
>                  get_block_t *get_block, const struct iomap *iomap);
>   
> +/*
> + * iomap/buffered-io.c
> + */
> +
> +#define ZERO_FSB_SIZE (65536)
> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> +extern struct page *zero_fs_block;
> +
>   /*
>    * char_dev.c
>    */
But why?
We already have a perfectly fine hugepage zero page in huge_memory.c. 
Shouldn't we rather export that one and use it?
(Actually I have some patches for doing so...)
We might allocate folios
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4e8e41c8b3c0..36d2f7edd310 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -42,6 +42,7 @@ struct iomap_folio_state {
>   };
>   
>   static struct bio_set iomap_ioend_bioset;
> +struct page *zero_fs_block;
>   
>   static inline bool ifs_is_fully_uptodate(struct folio *folio,
>                  struct iomap_folio_state *ifs)
> @@ -1985,8 +1986,13 @@ iomap_writepages(struct address_space *mapping, struct writeback_control *wbc,
>   }
>   EXPORT_SYMBOL_GPL(iomap_writepages);
>   
> +
>   static int __init iomap_init(void)
>   {
> +       zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO, ZERO_FSB_ORDER);
> +       if (!zero_fs_block)
> +               return -ENOMEM;
> +
>          return bioset_init(&iomap_ioend_bioset, 4 * (PAGE_SIZE / SECTOR_SIZE),
>                             offsetof(struct iomap_ioend, io_bio),
>                             BIOSET_NEED_BVECS);
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index f3b43d223a46..50c2bca8a347 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -236,17 +236,22 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>                  loff_t pos, unsigned len)
>   {
>          struct inode *inode = file_inode(dio->iocb->ki_filp);
> -       struct page *page = ZERO_PAGE(0);
>          struct bio *bio;
>   
> +       /*
> +        * Max block size supported is 64k
> +        */
> +       WARN_ON_ONCE(len > ZERO_FSB_SIZE);
> +
>          bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>          fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>                                    GFP_KERNEL);
> +
>          bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>          bio->bi_private = dio;
>          bio->bi_end_io = iomap_dio_bio_end_io;
>   
> -       __bio_add_page(bio, page, len, 0);
> +       __bio_add_page(bio, zero_fs_block, len, 0);
>          iomap_dio_submit_bio(iter, dio, bio, pos);
>   }
> 

-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-17 12:36                 ` Hannes Reinecke
@ 2024-05-17 12:56                   ` Hannes Reinecke
  2024-05-17 13:30                   ` Matthew Wilcox
  1 sibling, 0 replies; 51+ messages in thread
From: Hannes Reinecke @ 2024-05-17 12:56 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), Matthew Wilcox
  Cc: david, djwong, hch, Keith Busch, mcgrof, akpm, brauner,
	chandan.babu, gost.dev, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On 5/17/24 14:36, Hannes Reinecke wrote:
> On 5/16/24 17:02, Pankaj Raghav (Samsung) wrote:
>> On Wed, May 15, 2024 at 07:03:20PM +0100, Matthew Wilcox wrote:
>>> On Wed, May 15, 2024 at 03:59:43PM +0000, Pankaj Raghav (Samsung) wrote:
>>>>   static int __init iomap_init(void)
>>>>   {
>>>> +       void            *addr = kzalloc(16 * PAGE_SIZE, GFP_KERNEL);
>>>
>>> Don't use XFS coding style outside XFS.
>>>
>>> kzalloc() does not guarantee page alignment much less alignment to
>>> a folio.  It happens to work today, but that is an implementation
>>> artefact.
>>>
>>>> +
>>>> +       if (!addr)
>>>> +               return -ENOMEM;
>>>> +
>>>> +       zero_fsb_folio = virt_to_folio(addr);
>>>
>>> We also don't guarantee that calling kzalloc() gives you a virtual
>>> address that can be converted to a folio.  You need to allocate a folio
>>> to be sure that you get a folio.
>>>
>>> Of course, you don't actually need a folio.  You don't need any of the
>>> folio metadata and can just use raw pages.
>>>
>>>> +       /*
>>>> +        * The zero folio used is 64k.
>>>> +        */
>>>> +       WARN_ON_ONCE(len > (16 * PAGE_SIZE));
>>>
>>> PAGE_SIZE is not necessarily 4KiB.
>>>
>>>> +       bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
>>>> +                                 REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>>>
>>> The point was that we now only need one biovec, not MAX.
>>>
>>
>> Thanks for the comments. I think it all makes sense:
>>
>> diff --git a/fs/internal.h b/fs/internal.h
>> index 7ca738904e34..e152b77a77e4 100644
>> --- a/fs/internal.h
>> +++ b/fs/internal.h
>> @@ -35,6 +35,14 @@ static inline void bdev_cache_init(void)
>>   int __block_write_begin_int(struct folio *folio, loff_t pos, 
>> unsigned len,
>>                  get_block_t *get_block, const struct iomap *iomap);
>> +/*
>> + * iomap/buffered-io.c
>> + */
>> +
>> +#define ZERO_FSB_SIZE (65536)
>> +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
>> +extern struct page *zero_fs_block;
>> +
>>   /*
>>    * char_dev.c
>>    */
> But why?
> We already have a perfectly fine hugepage zero page in huge_memory.c. 
> Shouldn't we rather export that one and use it?
> (Actually I have some patches for doing so...)
> We might allocate folios

Bah. Hit 'enter' too soon.

We might allocate a zero folio as a fallback if the huge zero page is 
not available, but first we should try to use that.
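
Something along these lines, where get_exported_huge_zero_page() is a
placeholder for whatever export we end up with (hypothetical helper,
sketch only):

	static int __init iomap_init_zero_block(void)
	{
		/* hypothetical export of the THP huge zero page */
		zero_fs_block = get_exported_huge_zero_page();
		if (zero_fs_block)
			return 0;
		/* fall back to a private allocation, as in the patch above */
		zero_fs_block = alloc_pages(GFP_KERNEL | __GFP_ZERO,
					    ZERO_FSB_ORDER);
		return zero_fs_block ? 0 : -ENOMEM;
	}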

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [RFC] iomap: use huge zero folio in iomap_dio_zero
  2024-05-17 12:36                 ` Hannes Reinecke
  2024-05-17 12:56                   ` Hannes Reinecke
@ 2024-05-17 13:30                   ` Matthew Wilcox
  1 sibling, 0 replies; 51+ messages in thread
From: Matthew Wilcox @ 2024-05-17 13:30 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Pankaj Raghav (Samsung),
	david, djwong, hch, Keith Busch, mcgrof, akpm, brauner,
	chandan.babu, gost.dev, john.g.garry, linux-block, linux-fsdevel,
	linux-mm, linux-xfs, p.raghav, ritesh.list, ziy

On Fri, May 17, 2024 at 02:36:29PM +0200, Hannes Reinecke wrote:
> > +#define ZERO_FSB_SIZE (65536)
> > +#define ZERO_FSB_ORDER (get_order(ZERO_FSB_SIZE))
> > +extern struct page *zero_fs_block;
> > +
> >   /*
> >    * char_dev.c
> >    */
> But why?
> We already have a perfectly fine hugepage zero page in huge_memory.c.
> Shouldn't we rather export that one and use it?
> (Actually I have some patches for doing so...)

But we don't necessarily.  We only have it if
do_huge_pmd_anonymous_page() satisfies:

        if (!(vmf->flags & FAULT_FLAG_WRITE) &&
                        !mm_forbids_zeropage(vma->vm_mm) &&
                        transparent_hugepage_use_zero_page()) {

i.e. we've taken a page fault on a PMD hole in a VMA, that VMA permits
PMD mappings to exist, the page fault was for read, zero pages aren't
forbidden for this mm, and use of the huge zero page isn't disabled.

I'd like to see stats for how much the PMD-zero-page is actually used,
because I suspect it's not really used very much.


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2024-05-17 13:30 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-03  9:53 [PATCH v5 00/11] enable bs > ps in XFS Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 01/11] readahead: rework loop in page_cache_ra_unbounded() Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 02/11] fs: Allow fine-grained control of folio sizes Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 03/11] filemap: allocate mapping_min_order folios in the page cache Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 04/11] readahead: allocate folios with mapping_min_order in readahead Luis Chamberlain
2024-05-03 14:32   ` Hannes Reinecke
2024-05-03  9:53 ` [PATCH v5 05/11] mm: split a folio in minimum folio order chunks Luis Chamberlain
2024-05-03 14:53   ` Zi Yan
2024-05-15 15:32   ` Matthew Wilcox
2024-05-16 14:56     ` Pankaj Raghav (Samsung)
2024-05-03  9:53 ` [PATCH v5 06/11] filemap: cap PTE range to be created to allowed zero fill in folio_map_range() Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Luis Chamberlain
2024-05-07 14:58   ` [RFC] iomap: use huge zero folio in iomap_dio_zero Pankaj Raghav (Samsung)
2024-05-07 15:11     ` Zi Yan
2024-05-07 16:11     ` Christoph Hellwig
2024-05-08 11:39       ` Pankaj Raghav (Samsung)
2024-05-08 11:43         ` Christoph Hellwig
2024-05-09 12:31           ` Pankaj Raghav (Samsung)
2024-05-09 12:46             ` Christoph Hellwig
2024-05-09 12:55               ` Pankaj Raghav (Samsung)
2024-05-09 12:58                 ` Christoph Hellwig
2024-05-09 14:32                   ` Darrick J. Wong
2024-05-09 15:05                     ` Christoph Hellwig
2024-05-09 15:08                       ` Darrick J. Wong
2024-05-09 15:09                         ` Christoph Hellwig
2024-05-15  0:50     ` Matthew Wilcox
2024-05-15  2:34       ` Keith Busch
2024-05-15  4:04         ` Matthew Wilcox
2024-05-15 15:59           ` Pankaj Raghav (Samsung)
2024-05-15 18:03             ` Matthew Wilcox
2024-05-16 15:02               ` Pankaj Raghav (Samsung)
2024-05-17 12:36                 ` Hannes Reinecke
2024-05-17 12:56                   ` Hannes Reinecke
2024-05-17 13:30                   ` Matthew Wilcox
2024-05-15 11:48       ` Christoph Hellwig
2024-05-07 16:00   ` [PATCH v5 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Matthew Wilcox
2024-05-07 16:10     ` Christoph Hellwig
2024-05-07 16:11       ` Matthew Wilcox
2024-05-07 16:13         ` Christoph Hellwig
2024-05-08  4:24           ` Matthew Wilcox
2024-05-08 11:22             ` Pankaj Raghav (Samsung)
2024-05-08 11:36             ` Christoph Hellwig
2024-05-08 11:20     ` Pankaj Raghav (Samsung)
2024-05-03  9:53 ` [PATCH v5 08/11] xfs: use kvmalloc for xattr buffers Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 09/11] xfs: expose block size in stat Luis Chamberlain
2024-05-03  9:53 ` [PATCH v5 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Luis Chamberlain
2024-05-07  8:40   ` John Garry
2024-05-07 21:13     ` Darrick J. Wong
2024-05-08 11:28       ` Pankaj Raghav (Samsung)
2024-05-03  9:53 ` [PATCH v5 11/11] xfs: enable block size larger than page size support Luis Chamberlain
2024-05-07  0:05   ` Dave Chinner

This is a public inbox; see the mirroring instructions for how to
clone and mirror all data and code used for this inbox, as well as
URLs for NNTP newsgroup(s).