* [PATCH v4 00/11] enable bs > ps in XFS
@ 2024-04-25 11:37 Pankaj Raghav (Samsung)
  2024-04-25 11:37 ` [PATCH v4 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
                   ` (11 more replies)
  0 siblings, 12 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

This is the fourth version of the series that enables block size > page size
(Large Block Size, LBS) in XFS. The context and motivation can be found in
the cover letter of RFC v1 [1]. We also recorded a talk about this effort at
LPC [3] for anyone who would like more background.

This series does not split a folio during truncation, even though we have
an API to do so, due to some issues with writeback. While this is not a
blocker, folio splitting can be added as a future improvement once the base
patches are upstream (see patch 7).

A lot of emphasis has been put on testing using kdevops. The testing has
been split into regression and progression.

Regression testing:
In regression testing, we ran the whole test suite to check for
regressions on existing profiles due to the page cache changes.

No regression was found with the patches added on top.

Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes.
To compare against existing support, an ARM VM with a 64k base page size
(without our patches) was used as a reference, so that actual failures
caused by LBS support on a 4k base page size system could be identified.

There are some tests that assume block size < page size which need to be
fixed. I have a tree with fixes for xfstests here [6], which I will be
sending to the list soon. Part of this work has already been upstreamed
to fstests.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k
block size against pmem and NVMe, using buffered IO and Direct IO,
comparing vanilla against these patches applied, and detected no
regressions.

We also wrote an eBPF tool called blkalgn [7] to verify that IO sent to
the device is aligned and at least one filesystem block in length.

Git tree:
https://github.com/linux-kdevops/linux/tree/large-block-minorder-6.9-rc4

[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[2] https://lore.kernel.org/linux-xfs/20240213093713.1753368-1-kernel@pankajraghav.com/
[3] https://www.youtube.com/watch?v=ar72r5Xf7x4
[4] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and
reported that the 55 critical issues all fall into 5 common XFS asserts
or hung tasks and 2 memory management asserts.
[5] https://lore.kernel.org/linux-xfs/fe7fec1c-3b08-430f-9c95-ea76b237acf4@samsung.com/
[6] https://github.com/linux-kdevops/fstests/tree/lbs-fixes
[7] https://github.com/iovisor/bcc/pull/4813

Changes since v3:
- Cap the PTE range to i_size for LBS configuration in folio_map_range()
- Added Dave Chinner's kvmalloc xattr patch
- Moved Hannes' patch before the minorder patches to avoid confusion
- Added mapping_set_folio_order_range()
- Return EINVAL instead of EAGAIN in split_huge_page_to_list_to_order()

Changes since v2:
- Simplified the filemap and readahead changes. (Thanks willy)
- Removed DEFINE_READAHEAD_ALIGN.
- Added minorder support to readahead_expand().

Changes since v1:
- Round up to nearest min nr pages in ra_init
- Calculate the index in filemap_create instead of doing it in
  filemap_get_pages
- Remove unnecessary BUG_ONs in the delete path
- Use check_shl_overflow instead of check_mul_overflow
- Cast to uint32_t instead of unsigned long in xfs_stat_blksize

Changes since RFC v2:
- Move order 1 patch above the 1st patch
- Remove order == 1 conditional in `fs: Allow fine-grained control of
  folio sizes`. This fixed generic/630, which was reported in the
  previous version.
- Hide the max order and expose `mapping_set_folio_min_order` instead.
- Add new helpers mapping_start_index_align and DEFINE_READAHEAD_ALIGN
- Don't call `page_cache_ra_order` with min order in do_mmap_sync_readahead
- Simplify ondemand readahead by only aligning the start index at the end
- Don't cap ra_pages based on bdi->io_pages
- Use `check_mul_overflow` while calculating bytes in validate_fsb
- Remove the CONFIG_XFS_LBS option
- Add a warning when mounting an LBS filesystem
- Add Acked-by and Reviewed-by from Hannes and Darrick

Changes since RFC v1:
- Added willy's patch to enable order-1 folios.
- Unified common page cache effort from Hannes LBS work.
- Added a new helper min_nrpages and a CONFIG_TRANSPARENT_HUGEPAGE check for enabling mapping_large_folio_support
- Don't split a folio if it has minorder set. Removed the old code where we set extra pins if it has that requirement.
- Split the code in XFS between the validation of mapping count. Put the icache code changes with enabling bs > ps.
- Added CONFIG_XFS_LBS option
- Align the index in do_read_cache_folio()
- Removed truncate changes
- Fixed generic/091 with iomap changes to the iomap_dio_zero function.
- Took care of the folio truncation scenario in page_cache_ra_unbounded() that happens after read_pages if a folio was found.
- Squashed and moved commits around
- Rebased on top of v6.8-rc4

Dave Chinner (1):
  xfs: use kvmalloc for xattr buffers

Hannes Reinecke (1):
  readahead: rework loop in page_cache_ra_unbounded()

Luis Chamberlain (1):
  filemap: allocate mapping_min_order folios in the page cache

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (7):
  readahead: allocate folios with mapping_min_order in readahead
  mm: do not split a folio if it has minimum folio order requirement
  filemap: cap PTE range to be created to i_size in folio_map_range()
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/direct-io.c          |  13 +++-
 fs/xfs/libxfs/xfs_attr_leaf.c |  15 ++---
 fs/xfs/libxfs/xfs_ialloc.c    |   5 ++
 fs/xfs/libxfs/xfs_shared.h    |   3 +
 fs/xfs/xfs_icache.c           |   6 +-
 fs/xfs/xfs_iops.c             |   2 +-
 fs/xfs/xfs_mount.c            |  10 ++-
 fs/xfs/xfs_super.c            |  10 +--
 include/linux/pagemap.h       | 116 ++++++++++++++++++++++++++++------
 mm/filemap.c                  |  29 ++++++---
 mm/huge_memory.c              |   9 +++
 mm/readahead.c                |  94 +++++++++++++++++++++------
 12 files changed, 242 insertions(+), 70 deletions(-)


base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
-- 
2.34.1



* [PATCH v4 01/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-25 11:37 ` [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Hannes Reinecke <hare@suse.de>

Rework the loop in page_cache_ra_unbounded() to advance by the number
of pages in each folio instead of just one page at a time.

Note that the index is still advanced by 1 when filemap_add_folio()
fails, because the folio we attempted to add is a single page (order 0).
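
To make the batching concrete (a hedged example; the numbers are
illustrative): if the lookup at index + i finds an order-2 folio already
in the cache, the batch accumulated so far is submitted via
read_pages(), ractl->_index is advanced by folio_nr_pages(folio) == 4 to
skip past the existing folio, and

	i = ractl->_index + ractl->_nr_pages - index;

resumes the scan immediately after it, instead of re-probing each of its
four pages one at a time.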

Signed-off-by: Hannes Reinecke <hare@suse.de>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
 mm/readahead.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 130c0e7df99f..2361634a84fd 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -208,7 +208,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -226,7 +226,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
@@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			folio_put(folio);
 			read_pages(ractl);
 			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 
 	/*
-- 
2.34.1



* [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-04-25 11:37 ` [PATCH v4 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-25 18:07   ` Hannes Reinecke
  2024-04-26 15:09   ` Darrick J. Wong
  2024-04-25 11:37 ` [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Some filesystems want to be able to ensure that folios that are added to
the page cache are at least a certain size.
Add mapping_set_folio_min_order() to allow this level of control.
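
A minimal usage sketch (the filesystem and function name here are
hypothetical; only mapping_set_folio_min_order() comes from this patch):
a filesystem with 16k blocks on a 4k page system would request order-2
folios from its inode constructor:

	static void example_setup_inode(struct inode *inode)
	{
		/* hypothetical: 16k blocks on 4k pages, 14 - 12 = order 2 */
		mapping_set_folio_min_order(inode->i_mapping, 2);
	}

After this call, mapping_min_folio_order() returns 2 and
mapping_min_folio_nrpages() returns 4 for that mapping.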

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/pagemap.h | 116 +++++++++++++++++++++++++++++++++-------
 1 file changed, 96 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..2e5612de1749 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,13 +202,18 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13, /* Bits 8-17 are used for FOLIO_ORDER */
+	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -344,9 +349,63 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
+/*
+ * mapping_set_folio_order_range() - Set the folio order range
+ * @mapping: The address_space.
+ * @min_order: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max_order: Maximum folio order (between @min_order-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size (min) and maximum size (max) of folio the VFS
+ * can use to cache the contents of the file.  This should only be used
+ * if the filesystem needs special handling of folio sizes (ie there is
+ * something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_order_range(struct address_space *mapping,
+						 unsigned int min_order,
+						 unsigned int max_order)
+{
+	if (min_order > MAX_PAGECACHE_ORDER)
+		min_order = MAX_PAGECACHE_ORDER;
+
+	if (max_order > MAX_PAGECACHE_ORDER)
+		max_order = MAX_PAGECACHE_ORDER;
+
+	max_order = max(max_order, min_order);
+	/*
+	 * TODO: max_order is not yet supported in filemap.
+	 */
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			 (min_order << AS_FOLIO_ORDER_MIN) |
+			 (max_order << AS_FOLIO_ORDER_MAX);
+}
+
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address_space.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -357,7 +416,37 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}
+
+static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_start_index() - Align starting index based on the min
+ * folio order of the page cache.
+ * @mapping: The address_space.
+ *
+ * Ensure the index used is aligned to the minimum folio order when adding
+ * new folios to the page cache by rounding down to the nearest minimum
+ * folio number of pages.
+ */
+static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
+						pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
 }
 
 /*
@@ -367,7 +456,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
 	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	       (mapping_max_folio_order(mapping) > 0);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
@@ -528,19 +617,6 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
-- 
2.34.1



* [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-04-25 11:37 ` [PATCH v4 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
  2024-04-25 11:37 ` [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-25 19:04   ` Hannes Reinecke
  2024-04-26 15:12   ` Darrick J. Wong
  2024-04-25 11:37 ` [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Luis Chamberlain <mcgrof@kernel.org>

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would
default to an order-0 folio if the higher order memory allocation failed.

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
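
As a worked example of that alignment rule (hedged; the numbers are
illustrative): with a minimum order of 2, mapping_min_folio_nrpages() is
4, so

	/* round_down(index, 4): indices 0-3 map to 0, 4-7 to 4, 11 to 8 */
	index = mapping_align_start_index(mapping, index);

and a read landing on page index 5 allocates its order-2 folio at
index 4 instead.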

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/filemap.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 30de18c4fd28..f0c0cfbbd134 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -858,6 +858,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
@@ -1895,8 +1897,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_start_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1936,7 +1940,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2425,13 +2429,16 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -2449,6 +2456,8 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	/* index in PAGE units but aligned to min_order number of pages. */
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2509,8 +2518,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3708,9 +3716,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
+		index = mapping_align_start_index(mapping, index);
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-- 
2.34.1



* [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-25 18:53   ` Matthew Wilcox
  2024-04-25 11:37 ` [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios)
when no folio was found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.
When read_pages() is triggered and a page is already present, check for
truncation and move ractl->_index by mapping_min_nrpages if that folio
was truncated. This is done to keep the alignment requirement while
adding a folio to the page cache.

page_cache_ra_order() tries to allocate folios with a higher order if
the index aligns with that order. Modify it so that the order does not
go below the mapping_min_order requirement of the page cache. This
function will do the right thing even if the new_order passed is less
than mapping_min_order.
When adding new folios to the page cache we must also ensure the index
used is aligned to mapping_min_order, as the page cache requires the
index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of
the readahead, so this function can assume that ractl->_index is aligned
to min_order.
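
A worked example of the realignment (hedged; the numbers are
illustrative): with min_nrpages = 4, readahead_index(ractl) = 6,
nr_to_read = 8 and lookahead_size = 2, the start index rounds down to 4,
nr_to_read grows to 10 so the original range is still covered, and

	ra_folio_index = round_up(6 + 8 - 2, 4);	/* = 12 */
	mark = ra_folio_index - index;			/* = 8  */

so the readahead flag is set on the folio that begins the lookahead
region.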

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 85 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 71 insertions(+), 14 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 2361634a84fd..9c193185a836 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long index = readahead_index(ractl), ra_folio_index;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i = 0;
+	unsigned long i = 0, mark;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,6 +224,22 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_start_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	if (index != readahead_index(ractl)) {
+		nr_to_read += readahead_index(ractl) - index;
+		ractl->_index = index;
+	}
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -230,6 +247,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
+
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -239,23 +258,35 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_nrpages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages += folio_nr_pages(folio);
@@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -503,9 +540,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		new_order += 2;
 		new_order = min_t(unsigned int, MAX_PAGECACHE_ORDER, new_order);
 		new_order = min_t(unsigned int, new_order, ilog2(ra->size));
+		new_order = max(new_order, min_order);
 	}
 
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_start_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -513,7 +559,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -776,8 +822,15 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -787,9 +840,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -799,7 +854,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -814,9 +869,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -826,10 +883,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }
-- 
2.34.1



* [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-25 20:10   ` Matthew Wilcox
  2024-04-25 11:37 ` [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range() Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Splitting a large folio down to a base order is supported using the
split_huge_page_to_list_to_order() API. However, using that API for LBS
results in a NULL pointer dereference in the writeback path [1].

Refuse to split a folio if it has a minimum folio order requirement,
until we can start using the split_huge_page_to_list_to_order() API.
Splitting the folio can be added later as an optimization.

[1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
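
The caller-visible effect (a hedged sketch, not taken from the patch
itself): for a file-backed folio whose mapping has a non-zero minimum
order,

	err = split_huge_page(&folio->page);
	/* err is now -EINVAL and the folio keeps its current order */

so callers such as reclaim and truncation simply see an unsplittable
folio until proper split-to-order support is wired up.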

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/huge_memory.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..dadf1e68dbdc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3117,6 +3117,15 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 			goto out;
 		}
 
+		/*
+		 * Do not split if mapping has minimum folio order
+		 * requirement.
+		 */
+		if (mapping_min_folio_order(mapping)) {
+			ret = -EINVAL;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
-- 
2.34.1



* [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range()
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-25 20:24   ` Matthew Wilcox
  2024-04-25 11:37 ` [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Usually the page cache does not extend beyond the size of the inode;
therefore, no PTEs are created for folios that extend beyond that size.

But with LBS support, we might extend the page cache beyond the size of
the inode, as we need to guarantee folios of a minimum order. Cap the
PTE range to be created for the page cache at i_size.

An fstest has been created to trigger this edge case [1].

[1] https://lore.kernel.org/fstests/20240415081054.1782715-1-mcgrof@kernel.org/
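
A worked example of the cap (hedged; the sizes are illustrative): with
16k blocks on 4k pages and i_size = 5120 bytes, the folio at index 0
spans page indices 0-3, but

	i_size = DIV_ROUND_UP(5120, PAGE_SIZE) - 1;	/* = 1 */
	nr_pages = min3(end, end_pgoff, i_size) - xas.xa_index + 1;

caps the mapped range at page index 1, so no PTEs are created for the
two pages of the folio that lie entirely beyond EOF.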

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Co-Developed-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/filemap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index f0c0cfbbd134..259531dd297b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3600,12 +3600,15 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	}
 	do {
 		unsigned long end;
+		unsigned long i_size;
 
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
 		end = folio_next_index(folio) - 1;
-		nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
+		i_size = DIV_ROUND_UP(i_size_read(mapping->host),
+				      PAGE_SIZE) - 1;
+		nr_pages = min3(end, end_pgoff, i_size) - xas.xa_index + 1;
 
 		if (!folio_test_large(folio))
 			ret |= filemap_map_order0_folio(vmf,
-- 
2.34.1



* [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-26  6:22   ` Christoph Hellwig
  2024-04-25 11:37 ` [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad an fs block with zeroes if the direct IO size
is smaller than the fs block size. iomap_dio_zero() has an implicit
assumption that the fs block size is smaller than the page size. This is
true for most filesystems at the moment.

If the block size > page size, this will send the contents of the pages
next to the zero page (as len > PAGE_SIZE) to the underlying block
device, causing FS corruption.

iomap is generic infrastructure and should not make any assumptions
about the fs block size or the page size of the system.
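
A worked example (hedged; the sizes are illustrative): a 4k direct write
into a 16k block leaves len = 12k to zero. On a 4k page system the new
loop adds ZERO_PAGE(0) to the bio three times:

	while (len) {			/* len: 12k -> 8k -> 4k -> 0 */
		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);

		__bio_add_page(bio, page, io_len, 0);
		len -= io_len;
	}

whereas the old single __bio_add_page(bio, page, len, 0) would have
mapped 12k starting at the zero page, sending whatever memory follows it
to the device.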

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f3b43d223a46..5f481068de5b 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.34.1



* [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-26 15:18   ` Darrick J. Wong
  2024-04-25 11:37 ` [PATCH v4 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav, Dave Chinner, Pankaj Raghav

From: Dave Chinner <dchinner@redhat.com>

Pankaj Raghav reported that when filesystem block size is larger
than page size, the xattr code can use kmalloc() for high order
allocations. This triggers a useless warning in the allocator as it
is a __GFP_NOFAIL allocation here:

static inline
struct page *rmqueue(struct zone *preferred_zone,
                        struct zone *zone, unsigned int order,
                        gfp_t gfp_flags, unsigned int alloc_flags,
                        int migratetype)
{
        struct page *page;

        /*
         * We most definitely don't want callers attempting to
         * allocate greater than order-1 page units with __GFP_NOFAIL.
         */
>>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
...

Fix this by changing all these call sites to use kvmalloc(), which
will strip the NOFAIL from the kmalloc attempt and if that fails
will do a __GFP_NOFAIL vmalloc().

This is not an issue that production systems will see, as filesystems
with block size > page size cannot currently be mounted by the kernel;
Pankaj is developing this functionality right now.
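
For example (hedged; the size is illustrative), a 64k block size needs
an order-4 buffer on a 4k page system:

	/* trips the WARN_ON_ONCE above: order 4 > 1 with __GFP_NOFAIL */
	buf = kmalloc(65536, GFP_KERNEL | __GFP_NOFAIL);

	/* kvmalloc() drops NOFAIL for the kmalloc attempt, then falls
	 * back to a __GFP_NOFAIL vmalloc() if that attempt fails */
	buf = kvmalloc(65536, GFP_KERNEL | __GFP_NOFAIL);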

Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
index ac904cc1a97b..969abc6efd70 100644
--- a/fs/xfs/libxfs/xfs_attr_leaf.c
+++ b/fs/xfs/libxfs/xfs_attr_leaf.c
@@ -1059,10 +1059,7 @@ xfs_attr3_leaf_to_shortform(
 
 	trace_xfs_attr_leaf_to_sf(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
-	if (!tmpbuffer)
-		return -ENOMEM;
-
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 
 	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1125,7 +1122,7 @@ xfs_attr3_leaf_to_shortform(
 	error = 0;
 
 out:
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 	return error;
 }
 
@@ -1533,7 +1530,7 @@ xfs_attr3_leaf_compact(
 
 	trace_xfs_attr_leaf_compact(args);
 
-	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
+	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
 	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
 	memset(bp->b_addr, 0, args->geo->blksize);
 	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
@@ -1571,7 +1568,7 @@ xfs_attr3_leaf_compact(
 	 */
 	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
 
-	kfree(tmpbuffer);
+	kvfree(tmpbuffer);
 }
 
 /*
@@ -2250,7 +2247,7 @@ xfs_attr3_leaf_unbalance(
 		struct xfs_attr_leafblock *tmp_leaf;
 		struct xfs_attr3_icleaf_hdr tmphdr;
 
-		tmp_leaf = kzalloc(state->args->geo->blksize,
+		tmp_leaf = kvzalloc(state->args->geo->blksize,
 				GFP_KERNEL | __GFP_NOFAIL);
 
 		/*
@@ -2291,7 +2288,7 @@ xfs_attr3_leaf_unbalance(
 		}
 		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
 		savehdr = tmphdr; /* struct copy */
-		kfree(tmp_leaf);
+		kvfree(tmp_leaf);
 	}
 
 	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
-- 
2.34.1



* [PATCH v4 09/11] xfs: expose block size in stat
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-26 15:15   ` Darrick J. Wong
  2024-04-25 11:37 ` [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
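
The change is directly visible from userspace (a hedged sketch; the
mount point and file are hypothetical):

	struct stat st;

	stat("/mnt/xfs-16k/file", &st);
	/* on a 16k block size filesystem, st.st_blksize now reports
	 * 16384 instead of the 4k PAGE_SIZE */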

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..77b198a33aa1 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -543,7 +543,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.34.1



* [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-26 15:16   ` Darrick J. Wong
  2024-04-25 11:37 ` [PATCH v4 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  2024-04-27  4:42 ` [PATCH v4 00/11] enable bs > ps in XFS Ritesh Harjani
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always larger than the blocklog,
make the calculation generic so that the page cache count can be
calculated correctly for LBS.
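
A worked example of why the old expression breaks (hedged; the sizes are
illustrative): with 64k blocks, sb_blocklog = 16 exceeds PAGE_SHIFT =
12, so nblocks >> (PAGE_SHIFT - sb_blocklog) would shift by a negative
amount, which is undefined. The new calculation goes through bytes
instead:

	if (check_shl_overflow(nblocks, 16, &max_bytes))	/* bytes */
		return -EFBIG;
	max_index = max_bytes >> 12;				/* pages */

and the mount is rejected with -EFBIG if the resulting page cache index
would exceed ULONG_MAX.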

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_mount.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index df370eb5dc15..56d71282972a 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
 {
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
+	uint64_t max_index;
+	uint64_t max_bytes;
+
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
 
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	max_index = max_bytes >> PAGE_SHIFT;
+
+	if (max_index > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.34.1



* [PATCH v4 11/11] xfs: enable block size larger than page size support
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-04-25 11:37 ` Pankaj Raghav (Samsung)
  2024-04-26 15:18   ` Darrick J. Wong
  2024-04-27  4:42 ` [PATCH v4 00/11] enable bs > ps in XFS Ritesh Harjani
  11 siblings, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-25 11:37 UTC (permalink / raw)
  To: willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to have a minimum order when
allocating a folio, which is a prerequisite for adding support for
block size > page size.
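
For example (a hedged reading of the geometry setup below, assuming 16k
blocks on a 4k page system):

	/* sb_blocklog = 14, PAGE_SHIFT = 12 */
	igeo->min_folio_order = 14 - 12;	/* order 2, i.e. 16k folios */

xfs_inode_alloc() then passes this order to mapping_set_folio_min_order(),
so every folio cached for the inode covers at least one filesystem
block.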

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 10 ++--------
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index e5ac3e5430c4..60005feb0015 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2975,6 +2975,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index dfd61fa8332e..7d3abd182322 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -229,6 +229,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index 74f1812b03cb..a2629e00de41 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -89,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -324,7 +325,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 56d71282972a..a451302aa258 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,7 +131,6 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 	uint64_t max_index;
 	uint64_t max_bytes;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index bce020374c5e..db3b82c2c381 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1623,16 +1623,10 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
 		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
-- 
2.34.1



* Re: [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes
  2024-04-25 11:37 ` [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-04-25 18:07   ` Hannes Reinecke
  2024-04-26 15:09   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2024-04-25 18:07 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung),
	willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

On 4/25/24 13:37, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Some filesystems want to be able to ensure that folios that are added to
> the page cache are at least a certain size.
> Add mapping_set_folio_min_order() to allow this level of control.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   include/linux/pagemap.h | 116 +++++++++++++++++++++++++++++++++-------
>   1 file changed, 96 insertions(+), 20 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead
  2024-04-25 11:37 ` [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-04-25 18:53   ` Matthew Wilcox
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2024-04-25 18:53 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:39PM +0200, Pankaj Raghav (Samsung) wrote:
> +	unsigned long index = readahead_index(ractl), ra_folio_index;

This is confusing.  Uninitialised variables should go before initialised
ones.  So either:

	unsigned long ra_folio_index, index = readahead_index(ractl);
or
	unsigned long index = readahead_index(ractl);
	unsigned long ra_folio_index;

> +	unsigned long i = 0, mark;

ditto



* Re: [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-04-25 11:37 ` [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-04-25 19:04   ` Hannes Reinecke
  2024-04-26 15:12   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Hannes Reinecke @ 2024-04-25 19:04 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung),
	willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

On 4/25/24 13:37, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folio of order 0. __filemap_get_folio was trying to allocate higher
> order folios when fgp_flags had higher order hint set but it will default
> to order 0 folio if higher order memory allocation fails.
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. When adding new
> folios to the page cache we must also ensure the index used is aligned to
> the mapping_min_order as the page cache requires the index to be aligned
> to the order of the folio.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   mm/filemap.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich



* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-25 11:37 ` [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-04-25 20:10   ` Matthew Wilcox
  2024-04-26  0:47     ` Luis Chamberlain
  2024-04-26 15:49     ` Pankaj Raghav (Samsung)
  0 siblings, 2 replies; 47+ messages in thread
From: Matthew Wilcox @ 2024-04-25 20:10 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Splitting a larger folio with a base order is supported using
> split_huge_page_to_list_to_order() API. However, using that API for LBS
> is resulting in an NULL ptr dereference error in the writeback path [1].
> 
> Refuse to split a folio if it has minimum folio order requirement until
> we can start using split_huge_page_to_list_to_order() API. Splitting the
> folio can be added as a later optimization.
> 
> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df

Obviously this has to be tracked down and fixed before this patchset can
be merged ... I think I have some ideas.  Let me look a bit.  How
would I go about reproducing this?


* Re: [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range()
  2024-04-25 11:37 ` [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range() Pankaj Raghav (Samsung)
@ 2024-04-25 20:24   ` Matthew Wilcox
  2024-04-26 12:54     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Wilcox @ 2024-04-25 20:24 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:41PM +0200, Pankaj Raghav (Samsung) wrote:
>  	do {
>  		unsigned long end;
> +		unsigned long i_size;

Usually i_size is the name of a variable that contains an loff_t, not a
page count.  Not sure what to call this though.  Also, can't we move
this outside the loop?

	pgoff_t file_end = DIV_ROUND_UP(i_size_read(mapping->host),
					PAGE_SIZE) - 1;

	if (end_pgoff > file_end)
		end_pgoff = file_end;


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-25 20:10   ` Matthew Wilcox
@ 2024-04-26  0:47     ` Luis Chamberlain
  2024-04-26 23:46       ` Luis Chamberlain
  2024-04-26 15:49     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-26  0:47 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Splitting a larger folio with a base order is supported using the
> > split_huge_page_to_list_to_order() API. However, using that API for LBS
> > results in a NULL ptr dereference error in the writeback path [1].
> > 
> > Refuse to split a folio if it has a minimum folio order requirement until
> > we can start using the split_huge_page_to_list_to_order() API. Splitting
> > the folio can be added as a later optimization.
> > 
> > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> 
> Obviously this has to be tracked down and fixed before this patchset can
> be merged ... I think I have some ideas.  Let me look a bit.  How
> would I go about reproducing this?

Using kdevops this is easy:

make defconfig-lbs-xfs-small -j $(nproc)
make -j $(nproc)
make fstests
make linux
make fstests-baseline TESTS=generic/447 COUNT=10
tail -f guestfs/*-xfs-reflink-16k-4ks/console.log
or
sudo virsh list
sudo virsh console ${foo}-xfs-reflink-16k-4ks

Where $foo is the value of CONFIG_KDEVOPS_HOSTS_PREFIX in .config for
your kdevops run.

Otherwise, if you want to run things manually, the above uses an LBS branch
called large-block-minorder on kdevops [0] based on v6.9-rc5 with:

a) Fixes we know we need
b) this patch series minus this patch
c) A truncation enablement patch

Note that the above also uses an fstests git tree with the fstests changes
we have posted as fixes, plus some new tests which have been posted [1].
You will then want to run:

./check -s xfs_reflink_16k_4ks -I 10 generic/447

The configuration for xfs_reflink_16k_4ks follows:

cat /var/lib/xfstests/configs/min-xfs-reflink-16k-4ks.config

[default]
FSTYP=xfs
TEST_DIR=/media/test
SCRATCH_MNT=/media/scratch
RESULT_BASE=$PWD/results/$HOST/$(uname -r)
DUMP_CORRUPT_FS=1
CANON_DEVS=yes
RECREATE_TEST_DEV=true
SOAK_DURATION=9900

[xfs_reflink_16k_4ks]
TEST_DEV=/dev/loop16
SCRATCH_DEV_POOL="/dev/loop5 /dev/loop6 /dev/loop7 /dev/loop8 /dev/loop9 /dev/loop10 /dev/loop11 /dev/loop12"
MKFS_OPTIONS='-f -m reflink=1,rmapbt=1, -i sparse=1, -b size=16384, -s size=4k'
USE_EXTERNAL=no
LOGWRITES_DEV=/dev/loop15

I didn't have time to verify that the above kdevops commands work but... in
theory it's possible they may, because you know, May is right around the
corner, and May... the force be with us.

[0] https://github.com/linux-kdevops/linux/tree/large-block-minorder
[1] https://github.com/linux-kdevops/fstests

  Luis

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-25 11:37 ` [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-04-26  6:22   ` Christoph Hellwig
  2024-04-26 11:43     ` Pankaj Raghav (Samsung)
  2024-04-27  3:26     ` Matthew Wilcox
  0 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-04-26  6:22 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, djwong, brauner, david, chandan.babu, akpm, linux-fsdevel,
	hare, linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev,
	p.raghav

On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.

So what happened to the plan of making huge_zero_page a folio and having
it available for non-hugetlb setups?  Not only would this be cleaner
and more efficient, but it would actually work for the case where you'd
have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
seem impossible with 2MB folios.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-26  6:22   ` Christoph Hellwig
@ 2024-04-26 11:43     ` Pankaj Raghav (Samsung)
  2024-04-27  5:12       ` Christoph Hellwig
  2024-04-27  3:26     ` Matthew Wilcox
  1 sibling, 1 reply; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-26 11:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: willy, djwong, brauner, david, chandan.babu, akpm, linux-fsdevel,
	hare, linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev,
	p.raghav

On Thu, Apr 25, 2024 at 11:22:35PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page size. This is true for most filesystems at the moment.
> > 
> > If the block size > page size, this will send the contents of the page
> > next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> > 
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> 
> So what happened to the plan of making huge_zero_page a folio and having
> it available for non-hugetlb setups?  Not only would this be cleaner
> and more efficient, but it would actually work for the case where you'd
> have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> seem impossible with 2MB folios.

I mentioned to Darrick in one of the older series [1] that it was
proving to be a bit complicated (at least for me) to add that support.

Currently, we reserve the ZERO_PAGE during kernel startup (arch/x86/kernel/head_64.S).

Do we go about doing the same by reserving 1 PMD (512 PTEs with a 4k base
page size) at kernel startup, so that we always have a zeroed 2MB (on x86)
at our disposal for zeroing out?
Allocating it at runtime would defeat the purpose.

Let me know what you think.

In any case, I would like to pursue the huge_zero_page folio separately
from this series. Also, iomap_dio_zero() only pads a fs block with
zeroes, which should never be > 64k for XFS.

[1] https://lore.kernel.org/linux-fsdevel/5kodxnrvjq5dsjgjfeps6wte774c2sl75bn3fg3hh46q3wkwk5@2tru4htvqmrq/
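For illustration, here is a minimal sketch of one possible shape of the
padding loop (an assumption on my part, not the posted patch; bio and
len stand in for the state iomap_dio_zero() already has) that only ever
references ZERO_PAGE(0) even when len > PAGE_SIZE:

	/*
	 * Add the zero page one PAGE_SIZE chunk at a time, so a
	 * sub-block pad larger than a page never reads past the end
	 * of ZERO_PAGE(0).
	 */
	while (len) {
		unsigned int chunk = min_t(unsigned int, len, PAGE_SIZE);

		__bio_add_page(bio, ZERO_PAGE(0), chunk, 0);
		len -= chunk;
	}
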

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range()
  2024-04-25 20:24   ` Matthew Wilcox
@ 2024-04-26 12:54     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-26 12:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 09:24:33PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 25, 2024 at 01:37:41PM +0200, Pankaj Raghav (Samsung) wrote:
> >  	do {
> >  		unsigned long end;
> > +		unsigned long i_size;
> 
> Usually i_size is the name of a variable that contains an loff_t, not a
> page count.  Not sure what to call this though.  Also, can't we move
> this outside the loop?
You are right, this can move outside the loop as i_size is not going to
change. I will make this change. Thanks!
> 
> 	pgoff_t file_end = DIV_ROUND_UP(i_size_read(mapping->host),
> 					PAGE_SIZE) - 1;
> 
> 	if (end_pgoff > file_end)
> 		end_pgoff = file_end;

--
Pankaj

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes
  2024-04-25 11:37 ` [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
  2024-04-25 18:07   ` Hannes Reinecke
@ 2024-04-26 15:09   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:09 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:37PM +0200, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Some filesystems want to be able to ensure that folios that are added to
> the page cache are at least a certain size.
> Add mapping_set_folio_min_order() to allow this level of control.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Looks good to me,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  include/linux/pagemap.h | 116 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 96 insertions(+), 20 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..2e5612de1749 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -202,13 +202,18 @@ enum mapping_flags {
>  	AS_EXITING	= 4, 	/* final truncate in progress */
>  	/* writeback related tags are not used */
>  	AS_NO_WRITEBACK_TAGS = 5,
> -	AS_LARGE_FOLIO_SUPPORT = 6,
> -	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
> -	AS_STABLE_WRITES,	/* must wait for writeback before modifying
> +	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
> +	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
>  				   folio contents */
> -	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
> +	AS_FOLIO_ORDER_MIN = 8,
> +	AS_FOLIO_ORDER_MAX = 13, /* Bit 8-17 are used for FOLIO_ORDER */
> +	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
>  };
>  
> +#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
> +#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
> +#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
> +
>  /**
>   * mapping_set_error - record a writeback error in the address_space
>   * @mapping: the mapping in which an error should be set
> @@ -344,9 +349,63 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  	m->gfp_mask = mask;
>  }
>  
> +/*
> + * There are some parts of the kernel which assume that PMD entries
> + * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> + * limit the maximum allocation order to PMD size.  I'm not aware of any
> + * assumptions about maximum order if THP are disabled, but 8 seems like
> + * a good order (that's 1MB if you're using 4kB pages)
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#else
> +#define MAX_PAGECACHE_ORDER	8
> +#endif
> +
> +/*
> + * mapping_set_folio_order_range() - Set the folio order range
> + * @mapping: The address_space.
> + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + * @max: Maximum folio order (between @min-MAX_PAGECACHE_ORDER inclusive).
> + *
> + * The filesystem should call this function in its inode constructor to
> + * indicate which base size (min) and maximum size (max) of folio the VFS
> + * can use to cache the contents of the file.  This should only be used
> + * if the filesystem needs special handling of folio sizes (ie there is
> + * something the core cannot know).
> + * Do not tune it based on, eg, i_size.
> + *
> + * Context: This should not be called while the inode is active as it
> + * is non-atomic.
> + */
> +static inline void mapping_set_folio_order_range(struct address_space *mapping,
> +						 unsigned int min_order,
> +						 unsigned int max_order)
> +{
> +	if (min_order > MAX_PAGECACHE_ORDER)
> +		min_order = MAX_PAGECACHE_ORDER;
> +
> +	if (max_order > MAX_PAGECACHE_ORDER)
> +		max_order = MAX_PAGECACHE_ORDER;
> +
> +	max_order = max(max_order, min_order);
> +	/*
> +	 * TODO: max_order is not yet supported in filemap.
> +	 */
> +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> +			 (min_order << AS_FOLIO_ORDER_MIN) |
> +			 (max_order << AS_FOLIO_ORDER_MAX);
> +}
> +
> +static inline void mapping_set_folio_min_order(struct address_space *mapping,
> +					       unsigned int min)
> +{
> +	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
> +}
> +
>  /**
>   * mapping_set_large_folios() - Indicate the file supports large folios.
> - * @mapping: The file.
> + * @mapping: The address_space.
>   *
>   * The filesystem should call this function in its inode constructor to
>   * indicate that the VFS can use large folios to cache the contents of
> @@ -357,7 +416,37 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>   */
>  static inline void mapping_set_large_folios(struct address_space *mapping)
>  {
> -	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
> +}
> +
> +static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
> +{
> +	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
> +}
> +
> +static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
> +{
> +	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
> +}
> +
> +static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
> +{
> +	return 1UL << mapping_min_folio_order(mapping);
> +}
> +
> +/**
> + * mapping_align_start_index() - Align starting index based on the min
> + * folio order of the page cache.
> + * @mapping: The address_space.
> + *
> + * Ensure the index used is aligned to the minimum folio order when adding
> + * new folios to the page cache by rounding down to the nearest minimum
> + * folio number of pages.
> + */
> +static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
> +						pgoff_t index)
> +{
> +	return round_down(index, mapping_min_folio_nrpages(mapping));
>  }
>  
>  /*
> @@ -367,7 +456,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
>  static inline bool mapping_large_folio_support(struct address_space *mapping)
>  {
>  	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	       (mapping_max_folio_order(mapping) > 0);
>  }
>  
>  static inline int filemap_nr_thps(struct address_space *mapping)
> @@ -528,19 +617,6 @@ static inline void *detach_page_private(struct page *page)
>  	return folio_detach_private(page_folio(page));
>  }
>  
> -/*
> - * There are some parts of the kernel which assume that PMD entries
> - * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> - * limit the maximum allocation order to PMD size.  I'm not aware of any
> - * assumptions about maximum order if THP are disabled, but 8 seems like
> - * a good order (that's 1MB if you're using 4kB pages)
> - */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> -#else
> -#define MAX_PAGECACHE_ORDER	8
> -#endif
> -
>  #ifdef CONFIG_NUMA
>  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
>  #else
> -- 
> 2.34.1
> 
> 
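For reference, a quick usage sketch of how a filesystem would pick the
minimum folio order in its inode setup path (illustrative;
example_set_min_order() is a made-up name, but the computation mirrors
what patch 11 does for XFS):

	/*
	 * A 16k block size on a 4k page system gives min_order = 2,
	 * i.e. every folio cached for this inode spans at least 4 pages.
	 */
	static void example_set_min_order(struct inode *inode,
					  unsigned int blkbits)
	{
		unsigned int min_order = 0;

		if (blkbits > PAGE_SHIFT)
			min_order = blkbits - PAGE_SHIFT;

		mapping_set_folio_min_order(inode->i_mapping, min_order);
	}
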

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-04-25 11:37 ` [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
  2024-04-25 19:04   ` Hannes Reinecke
@ 2024-04-26 15:12   ` Darrick J. Wong
  2024-04-28 20:59     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:12 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:38PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folios of order 0. __filemap_get_folio() was trying to allocate higher
> order folios when fgp_flags had a higher order hint set, but it would
> fall back to an order-0 folio if the higher order memory allocation failed.
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. When adding new
> folios to the page cache we must also ensure the index used is aligned to
> the mapping_min_order as the page cache requires the index to be aligned
> to the order of the folio.

If we cannot find a folio of at least min_order size, what error is sent
back?

If the answer is "the same error that you get if we cannot allocate a
base page today (aka ENOMEM)", then I think I understand this enough to
say

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/filemap.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 30de18c4fd28..f0c0cfbbd134 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -858,6 +858,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
> +			folio);
>  	mapping_set_update(&xas, mapping);
>  
>  	if (!huge) {
> @@ -1895,8 +1897,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> -		unsigned order = FGF_GET_ORDER(fgp_flags);
> +		unsigned int min_order = mapping_min_folio_order(mapping);
> +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>  		int err;
> +		index = mapping_align_start_index(mapping, index);
>  
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
> @@ -1936,7 +1940,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  				break;
>  			folio_put(folio);
>  			folio = NULL;
> -		} while (order-- > 0);
> +		} while (order-- > min_order);
>  
>  		if (err == -EEXIST)
>  			goto repeat;
> @@ -2425,13 +2429,16 @@ static int filemap_update_page(struct kiocb *iocb,
>  }
>  
>  static int filemap_create_folio(struct file *file,
> -		struct address_space *mapping, pgoff_t index,
> +		struct address_space *mapping, loff_t pos,
>  		struct folio_batch *fbatch)
>  {
>  	struct folio *folio;
>  	int error;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	pgoff_t index;
>  
> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> +				    min_order);
>  	if (!folio)
>  		return -ENOMEM;
>  
> @@ -2449,6 +2456,8 @@ static int filemap_create_folio(struct file *file,
>  	 * well to keep locking rules simple.
>  	 */
>  	filemap_invalidate_lock_shared(mapping);
> +	/* index in PAGE units but aligned to min_order number of pages. */
> +	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
>  	error = filemap_add_folio(mapping, folio, index,
>  			mapping_gfp_constraint(mapping, GFP_KERNEL));
>  	if (error == -EEXIST)
> @@ -2509,8 +2518,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	if (!folio_batch_count(fbatch)) {
>  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>  			return -EAGAIN;
> -		err = filemap_create_folio(filp, mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> +		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
>  		if (err == AOP_TRUNCATED_PAGE)
>  			goto retry;
>  		return err;
> @@ -3708,9 +3716,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  repeat:
>  	folio = filemap_get_folio(mapping, index);
>  	if (IS_ERR(folio)) {
> -		folio = filemap_alloc_folio(gfp, 0);
> +		folio = filemap_alloc_folio(gfp,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			return ERR_PTR(-ENOMEM);
> +		index = mapping_align_start_index(mapping, index);
>  		err = filemap_add_folio(mapping, folio, index, gfp);
>  		if (unlikely(err)) {
>  			folio_put(folio);
> -- 
> 2.34.1
> 
> 
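A worked example of the new index math in filemap_create_folio() above
(illustrative numbers): with min_order = 2 (16k folios on 4k pages,
PAGE_SHIFT = 12) and a read at pos = 0x5000,

	index = (0x5000 >> (12 + 2)) << 2
	      = 1 << 2
	      = 4

i.e. the aligned index of the 16k folio covering bytes 0x4000-0x7fff,
whereas the old per-page computation would have produced the unaligned
index 0x5000 >> 12 = 5.
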

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 09/11] xfs: expose block size in stat
  2024-04-25 11:37 ` [PATCH v4 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-04-26 15:15   ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:15 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:44PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> For block size larger than page size, the unit of efficient IO is
> the block size, not the page size. Leaving stat() to report
> PAGE_SIZE as the block size causes test programs like fsx to issue
> illegal ranges for operations that require block size alignment
> (e.g. fallocate() insert range). Hence update the preferred IO size
> to reflect the block size in this case.
> 
> This change is based on a patch originally from Dave Chinner.[1]
> 
> [1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Seems reasonable to me...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_iops.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 66f8c47642e8..77b198a33aa1 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -543,7 +543,7 @@ xfs_stat_blksize(
>  			return 1U << mp->m_allocsize_log;
>  	}
>  
> -	return PAGE_SIZE;
> +	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
>  }
>  
>  STATIC int
> -- 
> 2.34.1
> 
> 
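A userspace check of the new behaviour could look like this
(illustrative sketch; on a 16k block size filesystem, st_blksize should
now report 16384 instead of the 4k page size):

	#include <stdio.h>
	#include <sys/stat.h>

	int main(int argc, char **argv)
	{
		struct stat st;

		if (argc < 2 || stat(argv[1], &st))
			return 1;
		/* Preferred IO size: max(PAGE_SIZE, fs block size)
		 * after this patch. */
		printf("st_blksize = %ld\n", (long)st.st_blksize);
		return 0;
	}
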

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-04-25 11:37 ` [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-04-26 15:16   ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:16 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:45PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> make the calculation generic so that the page cache count can be
> calculated correctly for LBS.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Looks ok,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_mount.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index df370eb5dc15..56d71282972a 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
>  {
>  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> +	uint64_t max_index;
> +	uint64_t max_bytes;
> +
> +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> +		return -EFBIG;
>  
>  	/* Limited by ULONG_MAX of page cache index */
> -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> +	max_index = max_bytes >> PAGE_SHIFT;
> +
> +	if (max_index > ULONG_MAX)
>  		return -EFBIG;
>  	return 0;
>  }
> -- 
> 2.34.1
> 
> 
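To spell out why the old expression breaks (illustrative numbers): for
16k blocks on a 4k page system, sb_blocklog = 14 and PAGE_SHIFT = 12,
so "nblocks >> (PAGE_SHIFT - sb_blocklog)" would shift by -2, which is
undefined. Computing the byte count first, with an overflow check as in
the hunk above, works for any blocklog:

	uint64_t max_bytes;

	/* nblocks << sb_blocklog, rejecting 64-bit overflow */
	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
		return -EFBIG;
	/* limited by ULONG_MAX of the page cache index */
	if ((max_bytes >> PAGE_SHIFT) > ULONG_MAX)
		return -EFBIG;
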

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 11/11] xfs: enable block size larger than page size support
  2024-04-25 11:37 ` [PATCH v4 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-04-26 15:18   ` Darrick J. Wong
  0 siblings, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:18 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 01:37:46PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> The page cache now has the ability to enforce a minimum order when
> allocating a folio, which is a prerequisite for supporting block size >
> page size.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Seems reasonable...
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
>  fs/xfs/libxfs/xfs_shared.h |  3 +++
>  fs/xfs/xfs_icache.c        |  6 ++++--
>  fs/xfs/xfs_mount.c         |  1 -
>  fs/xfs/xfs_super.c         | 10 ++--------
>  5 files changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index e5ac3e5430c4..60005feb0015 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -2975,6 +2975,11 @@ xfs_ialloc_setup_geometry(
>  		igeo->ialloc_align = mp->m_dalign;
>  	else
>  		igeo->ialloc_align = 0;
> +
> +	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> +		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> +	else
> +		igeo->min_folio_order = 0;
>  }
>  
>  /* Compute the location of the root directory inode that is laid out by mkfs. */
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index dfd61fa8332e..7d3abd182322 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -229,6 +229,9 @@ struct xfs_ino_geometry {
>  	/* precomputed value for di_flags2 */
>  	uint64_t	new_diflags2;
>  
> +	/* minimum folio order of a page cache allocation */
> +	unsigned int	min_folio_order;
> +
>  };
>  
>  #endif /* __XFS_SHARED_H__ */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 74f1812b03cb..a2629e00de41 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -89,7 +89,8 @@ xfs_inode_alloc(
>  	/* VFS doesn't initialise i_mode or i_state! */
>  	VFS_I(ip)->i_mode = 0;
>  	VFS_I(ip)->i_state = 0;
> -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> +	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
> +				    M_IGEO(mp)->min_folio_order);
>  
>  	XFS_STATS_INC(mp, vn_active);
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> @@ -324,7 +325,8 @@ xfs_reinit_inode(
>  	inode->i_rdev = dev;
>  	inode->i_uid = uid;
>  	inode->i_gid = gid;
> -	mapping_set_large_folios(inode->i_mapping);
> +	mapping_set_folio_min_order(inode->i_mapping,
> +				    M_IGEO(mp)->min_folio_order);
>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index 56d71282972a..a451302aa258 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -131,7 +131,6 @@ xfs_sb_validate_fsb_count(
>  	xfs_sb_t	*sbp,
>  	uint64_t	nblocks)
>  {
> -	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
>  	uint64_t max_index;
>  	uint64_t max_bytes;
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index bce020374c5e..db3b82c2c381 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1623,16 +1623,10 @@ xfs_fs_fill_super(
>  		goto out_free_sb;
>  	}
>  
> -	/*
> -	 * Until this is fixed only page-sized or smaller data blocks work.
> -	 */
>  	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
>  		xfs_warn(mp,
> -		"File system with blocksize %d bytes. "
> -		"Only pagesize (%ld) or less will currently work.",
> -				mp->m_sb.sb_blocksize, PAGE_SIZE);
> -		error = -ENOSYS;
> -		goto out_free_sb;
> +"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
> +			mp->m_sb.sb_blocksize);
>  	}
>  
>  	/* Ensure this filesystem fits in the page cache limits */
> -- 
> 2.34.1
> 
> 
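A quick worked example of the geometry computation above (illustrative
numbers): a 64k block size filesystem on a 4k page system gives

	min_folio_order = sb_blocklog - PAGE_SHIFT = 16 - 12 = 4

so every folio cached for these inodes spans 2^4 = 16 pages (64k),
while on a 64k base page system min_folio_order stays 0.
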

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers
  2024-04-25 11:37 ` [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
@ 2024-04-26 15:18   ` Darrick J. Wong
  2024-04-28 21:06     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-26 15:18 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav,
	Dave Chinner

On Thu, Apr 25, 2024 at 01:37:43PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Pankaj Raghav reported that when the filesystem block size is larger
> than page size, the xattr code can use kmalloc() for high order
> allocations. This triggers a useless warning in the allocator as it
> is a __GFP_NOFAIL allocation here:
> 
> static inline
> struct page *rmqueue(struct zone *preferred_zone,
>                         struct zone *zone, unsigned int order,
>                         gfp_t gfp_flags, unsigned int alloc_flags,
>                         int migratetype)
> {
>         struct page *page;
> 
>         /*
>          * We most definitely don't want callers attempting to
>          * allocate greater than order-1 page units with __GFP_NOFAIL.
>          */
> >>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> ...
> 
> Fix this by changing all these call sites to use kvmalloc(), which
> will strip the NOFAIL from the kmalloc attempt and if that fails
> will do a __GFP_NOFAIL vmalloc().
> 
> This is not an issue that production systems will see, as
> filesystems with block size > page size cannot be mounted by the
> kernel; Pankaj is developing this functionality right now.
> 
> Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
> Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Didn't this already go in for-next?

If not,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
>  1 file changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> index ac904cc1a97b..969abc6efd70 100644
> --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> @@ -1059,10 +1059,7 @@ xfs_attr3_leaf_to_shortform(
>  
>  	trace_xfs_attr_leaf_to_sf(args);
>  
> -	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> -	if (!tmpbuffer)
> -		return -ENOMEM;
> -
> +	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
>  	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
>  
>  	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
> @@ -1125,7 +1122,7 @@ xfs_attr3_leaf_to_shortform(
>  	error = 0;
>  
>  out:
> -	kfree(tmpbuffer);
> +	kvfree(tmpbuffer);
>  	return error;
>  }
>  
> @@ -1533,7 +1530,7 @@ xfs_attr3_leaf_compact(
>  
>  	trace_xfs_attr_leaf_compact(args);
>  
> -	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> +	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
>  	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
>  	memset(bp->b_addr, 0, args->geo->blksize);
>  	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
> @@ -1571,7 +1568,7 @@ xfs_attr3_leaf_compact(
>  	 */
>  	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
>  
> -	kfree(tmpbuffer);
> +	kvfree(tmpbuffer);
>  }
>  
>  /*
> @@ -2250,7 +2247,7 @@ xfs_attr3_leaf_unbalance(
>  		struct xfs_attr_leafblock *tmp_leaf;
>  		struct xfs_attr3_icleaf_hdr tmphdr;
>  
> -		tmp_leaf = kzalloc(state->args->geo->blksize,
> +		tmp_leaf = kvzalloc(state->args->geo->blksize,
>  				GFP_KERNEL | __GFP_NOFAIL);
>  
>  		/*
> @@ -2291,7 +2288,7 @@ xfs_attr3_leaf_unbalance(
>  		}
>  		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
>  		savehdr = tmphdr; /* struct copy */
> -		kfree(tmp_leaf);
> +		kvfree(tmp_leaf);
>  	}
>  
>  	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
> -- 
> 2.34.1
> 
> 
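In short, the pattern applied in the hunks above (an illustrative
condensation of the first hunk, not new code):

	/*
	 * kvmalloc() first attempts the kmalloc() without __GFP_NOFAIL,
	 * so the order > 1 NOFAIL warning cannot fire; if that fails it
	 * falls back to a __GFP_NOFAIL vmalloc(), which cannot fail, so
	 * the old NULL check is no longer needed.
	 */
	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
	/* ... use tmpbuffer ... */
	kvfree(tmpbuffer);
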

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-25 20:10   ` Matthew Wilcox
  2024-04-26  0:47     ` Luis Chamberlain
@ 2024-04-26 15:49     ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-26 15:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Splitting a larger folio with a base order is supported using the
> > split_huge_page_to_list_to_order() API. However, using that API for LBS
> > results in a NULL ptr dereference error in the writeback path [1].
> > 
> > Refuse to split a folio if it has a minimum folio order requirement until
> > we can start using the split_huge_page_to_list_to_order() API. Splitting
> > the folio can be added as a later optimization.
> > 
> > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> 
> Obviously this has to be tracked down and fixed before this patchset can
> be merged ... I think I have some ideas.  Let me look a bit.  How
> would I go about reproducing this?

I am able to reproduce it in a VM with 4G RAM by running generic/447
(sometimes you have to run it twice) with a 16k block size on a 4k page
size system.

I have a suspicion about this series: https://lore.kernel.org/linux-fsdevel/20240215063649.2164017-1-hch@lst.de/
but I am still unsure why this happens when we split with LBS
configurations.

If you have kdevops installed, go with Luis's suggestion; otherwise,
this is my local config.

This is the diff I applied instead of this patch:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..63ee7b6ed03d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 {
        struct folio *folio = page_folio(page);
        struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+       unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping);
+
+       if (!folio_test_anon(folio))
+               new_order = max_t(unsigned int, mapping_min_order, new_order);
        /* reset xarray order to new order after split */
        XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
        struct anon_vma *anon_vma = NULL;
@@ -3117,6 +3121,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                        goto out;
                }
 
+               // XXX: Remove it later
+               VM_WARN_ON_FOLIO((new_order < mapping_min_order), folio);
                gfp = current_gfp_context(mapping_gfp_mask(mapping) &
                                                        GFP_RECLAIM_MASK);
 

xfstests is based on https://github.com/kdave/xfstests/tree/v2024.04.14

xfstests config:

[default]
FSTYP=xfs
RESULT_BASE=/root/results/
DUMP_CORRUPT_FS=1
CANON_DEVS=yes
RECREATE_TEST_DEV=true
TEST_DEV=/dev/nvme0n1
TEST_DIR=/media/test
SCRATCH_DEV=/dev/vdb
SCRATCH_MNT=/media/scratch
LOGWRITES_DEV=/dev/vdc

[16k_4ks]
MKFS_OPTIONS='-f -m reflink=1,rmapbt=1, -i sparse=1, -b size=16k, -s size=4k'

[nix-shell:~]# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
vdb     254:16   0   32G  0 disk /media/scratch
vdc     254:32   0   32G  0 disk 
nvme0n1 259:0    0   32G  0 disk /media/test

$ ./check -s 16k_4ks generic/447

BT:
[   74.170698] BUG: KASAN: null-ptr-deref in filemap_get_folios_tag+0x14b/0x510                                                                                                                                                                                                                                                                                                               
[   74.170938] Write of size 4 at addr 0000000000000036 by task kworker/u16:6/284                                                                                                                                                                                                                                                                                                             
[   74.170938]                                                                                                                                                                                                                                                                                                                                                                                
[   74.170938] CPU: 0 PID: 284 Comm: kworker/u16:6 Not tainted 6.9.0-rc4-00011-g4676d00b6f6f #7                                                                                                                                                                                                                                                                                               
[   74.170938] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014                                                                                                                                                                                                                                                               
[   74.170938] Workqueue: writeback wb_workfn (flush-254:16)                                                                                                                                                                                                                                                                                                                                  
[   74.170938] Call Trace:                                                                                                                                                                                                                                                                                                                                                                    
[   74.170938]  <TASK>                                                                                                                                                                                                                                                                                                                                                                        
[   74.170938]  dump_stack_lvl+0x51/0x70                                                                                                                                                                                                                                                                                                                                                      
[   74.170938]  kasan_report+0xab/0xe0                                                                                                                                                                                                                                                                                                                                                        
[   74.170938]  ? filemap_get_folios_tag+0x14b/0x510                                                                                                                                                                                                                                                                                                                                          
[   74.170938]  kasan_check_range+0x35/0x1b0                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  filemap_get_folios_tag+0x14b/0x510                                                                                                                                                                                                                                                                                                                                            
[   74.170938]  ? __pfx_filemap_get_folios_tag+0x10/0x10                                                                                                                                                                                                                                                                                                                                      
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  writeback_iter+0x508/0xcc0                                                                                                                                                                                                                                                                                                                                                    
[   74.170938]  ? __pfx_iomap_do_writepage+0x10/0x10                                                                                                                                                                                                                                                                                                                                          
[   74.170938]  write_cache_pages+0x80/0x100                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? __pfx_write_cache_pages+0x10/0x10                                                                                                                                                                                                                                                                                                                                           
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? _raw_spin_lock+0x87/0xe0                                                                                                                                                                                                                                                                                                                                                    
[   74.170938]  iomap_writepages+0x85/0xe0                                                                                                                                                                                                                                                                                                                                                    
[   74.170938]  xfs_vm_writepages+0xe3/0x140 [xfs]                                                                                                                                                                                                                                                                                                                                            
[   74.170938]  ? __pfx_xfs_vm_writepages+0x10/0x10 [xfs]                                                                                                                                                                                                                                                                                                                                     
[   74.170938]  ? kasan_save_track+0x10/0x30                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? __kasan_kmalloc+0x7b/0x90                                                                                                                                                                                                                                                                                                                                                   
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? virtqueue_add_split+0x605/0x1b00                                                                                                                                                                                                                                                                                                                                            
[   74.170938]  do_writepages+0x176/0x740                                                                                                                                                                                                                                                                                                                                                     
[   74.170938]  ? __pfx_do_writepages+0x10/0x10                                                                                                                                                                                                                                                                                                                                               
[   74.170938]  ? __pfx_virtqueue_add_split+0x10/0x10                                                                                                                                                                                                                                                                                                                                         
[   74.170938]  ? __pfx_update_sd_lb_stats.constprop.0+0x10/0x10                                                                                                                                                                                                                                                                                                                              
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? virtqueue_add_sgs+0xfe/0x130                                                                                                                                                                                                                                                                                                                                                
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? virtblk_add_req+0x15c/0x280                                                                                                                                                                                                                                                                                                                                                 
[   74.170938]  __writeback_single_inode+0x9f/0x840                                                                                                                                                                                                                                                                                                                                           
[   74.170938]  ? wbc_attach_and_unlock_inode+0x345/0x5d0                                                                                                                                                                                                                                                                                                                                     
[   74.170938]  writeback_sb_inodes+0x491/0xce0                                                                                                                                                                                                                                                                                                                                               
[   74.170938]  ? __pfx_wb_calc_thresh+0x10/0x10                                                                                                                                                                                                                                                                                                                                              
[   74.170938]  ? __pfx_writeback_sb_inodes+0x10/0x10                                                                                                                                                                                                                                                                                                                                         
[   74.170938]  ? __wb_calc_thresh+0x1a0/0x3c0                                                                                                                                                                                                                                                                                                                                                
[   74.170938]  ? __pfx_down_read_trylock+0x10/0x10                                                                                                                                                                                                                                                                                                                                           
[   74.170938]  ? wb_over_bg_thresh+0x16b/0x5e0                                                                                                                                                                                                                                                                                                                                               
[   74.170938]  ? __pfx_move_expired_inodes+0x10/0x10                                                                                                                                                                                                                                                                                                                                         
[   74.170938]  __writeback_inodes_wb+0xb7/0x200                                                                                                                                                                                                                                                                                                                                              
[   74.170938]  wb_writeback+0x2c4/0x660                                                                                                                                                                                                                                                                                                                                                      
[   74.170938]  ? __pfx_wb_writeback+0x10/0x10                                                                                                                                                                                                                                                                                                                                                
[   74.170938]  ? __pfx__raw_spin_lock_irq+0x10/0x10                                                                                                                                                                                                                                                                                                                                          
[   74.170938]  wb_workfn+0x54e/0xaf0                                                                                                                                                                                                                                                                                                                                                         
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? __pfx_wb_workfn+0x10/0x10                                                                                                                                                                                                                                                                                                                                                   
[   74.170938]  ? __pfx___schedule+0x10/0x10                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? __pfx__raw_spin_lock_irq+0x10/0x10                                                                                                                                                                                                                                                                                                                                          
[   74.170938]  process_one_work+0x622/0x1020                                                                                                                                                                                                                                                                                                                                                 
[   74.170938]  worker_thread+0x844/0x10e0                                                                                                                                                                                                                                                                                                                                                    
[   74.170938]  ? srso_return_thunk+0x5/0x5f                                                                                                                                                                                                                                                                                                                                                  
[   74.170938]  ? __kthread_parkme+0x82/0x150                                                                                                                                                                                                                                                                                                                                                 
[   74.170938]  ? __pfx_worker_thread+0x10/0x10                                                                                                                                                                                                                                                                                                                                               
[   74.170938]  kthread+0x2b4/0x380                                                                                                                                                                                                                                                                                                                                                           
[   74.170938]  ? __pfx_kthread+0x10/0x10                                                                                                                                                                                                                                                                                                                                                     
[   74.170938]  ret_from_fork+0x30/0x70                                                                                                                                                                                                                                                                                                                                                       
[   74.170938]  ? __pfx_kthread+0x10/0x10                                                                                                                                                                                                                                                                                                                                                     
[   74.170938]  ret_from_fork_asm+0x1a/0x30       
[   74.170938]  </TASK>

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-26  0:47     ` Luis Chamberlain
@ 2024-04-26 23:46       ` Luis Chamberlain
  2024-04-28  0:57         ` Luis Chamberlain
  0 siblings, 1 reply; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-26 23:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > using that API for LBS is resulting in a NULL ptr dereference
> > > error in the writeback path [1].
> > >
> > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > 
> >  How would I go about reproducing this?
> 
> Using kdevops this is easy:
> 
> make defconfig-lbs-xfs-small -j $(nproc)
> make -j $(nproc)

I forgot after the above:

make bringup

> make fstests
> make linux
> make fstests-baseline TESTS=generic/447 COUNT=10
> tail -f guestfs/*-xfs-reflink-16k-4ks/console.log

The above tail command needs a sudo prefix for now too.

> or
> sudo virsh list
> sudo virsh console ${foo}-xfs-reflink-16k-4ks
> 
> Where $foo is the value of CONFIG_KDEVOPS_HOSTS_PREFIX in .config for
> your kdevops run.
>
> I didn't have time to verify if the above commands for kdevops worked

I have now. I forgot to git push the commits to kdevops yesterday, but
I confirm the above steps can be used to reproduce this issue now.

  Luis


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-26  6:22   ` Christoph Hellwig
  2024-04-26 11:43     ` Pankaj Raghav (Samsung)
@ 2024-04-27  3:26     ` Matthew Wilcox
  2024-04-27  4:52       ` Christoph Hellwig
  1 sibling, 1 reply; 47+ messages in thread
From: Matthew Wilcox @ 2024-04-27  3:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Thu, Apr 25, 2024 at 11:22:35PM -0700, Christoph Hellwig wrote:
> On Thu, Apr 25, 2024 at 01:37:42PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> > 
> > If the block size > page size, this will send the contents of the page
> > next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> > 
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> 
> So what happened to the plan to make huge_zero_page a folio and have

There's a series of commits in linux-mm with the titles:

      sparc: use is_huge_zero_pmd()
      mm: add is_huge_zero_folio()
      mm: add pmd_folio()
      mm: convert migrate_vma_collect_pmd to use a folio
      mm: convert huge_zero_page to huge_zero_folio
      mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
      dax: use huge_zero_folio
      mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio

> it available for non-hugetlb setups?  Not only would this be cleaner
> and more efficient, but it would actually work for the case where you'd
> have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> seem impossible with 2MB folios.

It is available for non-hugetlb setups.  It is however allocated on
demand, so it might not be available.
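
To make the fallback concrete, here is a rough sketch (my illustration
under stated assumptions, not the series' code): use the huge zero
folio when it happens to exist, and loop PAGE_SIZE chunks of
ZERO_PAGE(0) when it does not. The zero_folio parameter stands in for
whatever on-demand handle ends up being exported.

/*
 * Hedged sketch: pad a direct IO bio with up to one fs block of
 * zeroes.  zero_folio may be NULL if the huge zero folio was never
 * allocated; in that case fall back to the order-0 zero page.
 */
static void dio_zero_block_sketch(struct bio *bio, struct folio *zero_folio,
				  u64 len)
{
	if (zero_folio && len <= folio_size(zero_folio)) {
		bio_add_folio_nofail(bio, zero_folio, len, 0);
		return;
	}

	while (len) {
		unsigned int n = min_t(u64, len, PAGE_SIZE);

		__bio_add_page(bio, ZERO_PAGE(0), n, 0);
		len -= n;
	}
}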

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 00/11] enable bs > ps in XFS
  2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (10 preceding siblings ...)
  2024-04-25 11:37 ` [PATCH v4 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-04-27  4:42 ` Ritesh Harjani
  2024-04-27  5:05   ` Darrick J. Wong
  2024-04-29 20:39   ` Pankaj Raghav (Samsung)
  11 siblings, 2 replies; 47+ messages in thread
From: Ritesh Harjani @ 2024-04-27  4:42 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung),
	willy, djwong, brauner, david, chandan.babu, akpm
  Cc: linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs, mcgrof,
	gost.dev, p.raghav

"Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> This is the fourth version of the series that enables block size > page size
> (Large Block Size) in XFS. The context and motivation can be seen in cover
> letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
> if someone would like more context on this effort.
>
> This series does not split a folio during truncation even though we have
> an API to do so due to some issues with writeback. While it is not a
> blocker, this feature can be added as a future improvement once we
> get the base patches upstream (See patch 7).
>
> A lot of emphasis has been put on testing using kdevops. The testing has
> been split into regression and progression.
>
> Regression testing:
> In regression testing, we ran the whole test suite to check for
> *regression on existing profiles due to the page cache changes.
>
> No regression was found with the patches added on top.
>
> Progression testing:
> For progression testing, we tested for 8k, 16k, 32k and 64k block sizes.
> To compare it with existing support, an ARM VM with 64k base page system
> (without our patches) was used as a reference to check for actual failures
> due to LBS support in a 4k base page size system.
>
> There are some tests that assumes block size < page size that needs to
> be fixed. I have a tree with fixes for xfstests here [6], which I will be
> sending soon to the list. Already a part of this has been upstreamed to
> fstest.
>
> No new failures were found with the LBS support.

I just did portability testing by creating XFS with 16k bs on an x86 VM
(4k pagesize) and creating some files + checksums. I then moved the
disk to a Power VM with 64k pagesize and mounted it. The mount
succeeded, and all the file checksums passed.

Then I did the reverse: I created a filesystem on the Power VM with 64k
blocksize and created 10 files with random data of 10MB each. I then
hotplugged this device out of the Power VM, plugged it into the x86 VM
and mounted it.

<Logs of the 2nd operation>
~# mount /dev/vdk /mnt1/
[   35.145350] XFS (vdk): EXPERIMENTAL: Filesystem with Large Block Size (65536 bytes) enabled.
[   35.149858] XFS (vdk): Mounting V5 Filesystem 91933a8b-1370-4931-97d1-c21213f31f8f
[   35.227459] XFS (vdk): Ending clean mount
[   35.235090] xfs filesystem being mounted at /mnt1 supports timestamps until 2038-01-19 (0x7fffffff)
~# cd /mnt1/
~# sha256sum -c checksums 
file-1.img: OK
file-2.img: OK
file-3.img: OK
file-4.img: OK
file-5.img: OK
file-6.img: OK
file-7.img: OK
file-8.img: OK
file-9.img: OK
file-10.img: OK

So thanks for this nice portability which this series offers :) 

-ritesh


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-27  3:26     ` Matthew Wilcox
@ 2024-04-27  4:52       ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-04-27  4:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christoph Hellwig, Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Sat, Apr 27, 2024 at 04:26:44AM +0100, Matthew Wilcox wrote:
> There's a series of commits in linux-mm with the titles:
> 
>       sparc: use is_huge_zero_pmd()
>       mm: add is_huge_zero_folio()
>       mm: add pmd_folio()
>       mm: convert migrate_vma_collect_pmd to use a folio
>       mm: convert huge_zero_page to huge_zero_folio
>       mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
>       dax: use huge_zero_folio
>       mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio
> 
> > it available for non-hugetlb setups?  Not only would this be cleaner
> > and more efficient, but it would actually work for the case where you'd
> > have to zero more than 1MB on a 4k PAGE_SIZE system, which doesn't
> > seem impossible with 2MB folios.
> 
> It is available for non-hugetlb setups.  It is however allocated on
> demand, so it might not be available.

We could just export get_huge_zero_page/put_huge_zero_page and make
sure it is available for block size > PAGE_SIZE file systems, or is
there a good argument against that?
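
For what it's worth, a minimal sketch of the consumer side under that
assumption (get_huge_zero_page()/put_huge_zero_page() exported with
their current in-kernel semantics, where get is on-demand and can
fail; not code from this series):

/*
 * Hedged sketch: pin the on-demand huge zero page for the lifetime of
 * a block size > PAGE_SIZE mount so the zeroing path can rely on it.
 */
static int lbs_hold_huge_zero_page(void)
{
	if (!get_huge_zero_page())
		return -ENOMEM;	/* caller falls back to ZERO_PAGE looping */
	return 0;		/* pair with put_huge_zero_page() at unmount */
}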


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 00/11] enable bs > ps in XFS
  2024-04-27  4:42 ` [PATCH v4 00/11] enable bs > ps in XFS Ritesh Harjani
@ 2024-04-27  5:05   ` Darrick J. Wong
  2024-04-29 20:39   ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-04-27  5:05 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: Pankaj Raghav (Samsung),
	willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Sat, Apr 27, 2024 at 10:12:38AM +0530, Ritesh Harjani wrote:
> "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:
> 
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > This is the fourth version of the series that enables block size > page size
> > (Large Block Size) in XFS. The context and motivation can be seen in cover
> > letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
> > if someone would like more context on this effort.
> >
> > This series does not split a folio during truncation even though we have
> > an API to do so due to some issues with writeback. While it is not a
> > blocker, this feature can be added as a future improvement once we
> > get the base patches upstream (See patch 7).
> >
> > A lot of emphasis has been put on testing using kdevops. The testing has
> > been split into regression and progression.
> >
> > Regression testing:
> > In regression testing, we ran the whole test suite to check for
> > *regression on existing profiles due to the page cache changes.
> >
> > No regression was found with the patches added on top.
> >
> > Progression testing:
> > For progression testing, we tested for 8k, 16k, 32k and 64k block sizes.
> > To compare it with existing support, an ARM VM with 64k base page system
> > (without our patches) was used as a reference to check for actual failures
> > due to LBS support in a 4k base page size system.
> >
> > There are some tests that assumes block size < page size that needs to
> > be fixed. I have a tree with fixes for xfstests here [6], which I will be
> > sending soon to the list. Already a part of this has been upstreamed to
> > fstest.
> >
> > No new failures were found with the LBS support.
> 
> I just did portability testing by creating XFS with 16k bs on x86 VM (4k
> pagesize), created some files + checksums. I then moved the disk to
> Power VM with 64k pagesize and mounted this. I was able to mount and
> all the file checksums passed.
> 
> Then I did the vice versa, created a filesystem on Power VM with 64k
> blocksize and created 10 files with random data of 10MB each. I then
> hotplugged this device out from Power and plugged it into x86 VM and
> mounted it.
> 
> <Logs of the 2nd operation>
> ~# mount /dev/vdk /mnt1/
> [   35.145350] XFS (vdk): EXPERIMENTAL: Filesystem with Large Block Size (65536 bytes) enabled.
> [   35.149858] XFS (vdk): Mounting V5 Filesystem 91933a8b-1370-4931-97d1-c21213f31f8f
> [   35.227459] XFS (vdk): Ending clean mount
> [   35.235090] xfs filesystem being mounted at /mnt1 supports timestamps until 2038-01-19 (0x7fffffff)
> ~# cd /mnt1/
> ~# sha256sum -c checksums 
> file-1.img: OK
> file-2.img: OK
> file-3.img: OK
> file-4.img: OK
> file-5.img: OK
> file-6.img: OK
> file-7.img: OK
> file-8.img: OK
> file-9.img: OK
> file-10.img: OK
> 
> So thanks for this nice portability which this series offers :) 

Yessss this is awesome to see this coming together after many years!

--D

> -ritesh
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-26 11:43     ` Pankaj Raghav (Samsung)
@ 2024-04-27  5:12       ` Christoph Hellwig
  2024-04-29 21:02         ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-04-27  5:12 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Christoph Hellwig, willy, djwong, brauner, david, chandan.babu,
	akpm, linux-fsdevel, hare, linux-kernel, linux-mm, linux-xfs,
	mcgrof, gost.dev, p.raghav

On Fri, Apr 26, 2024 at 11:43:01AM +0000, Pankaj Raghav (Samsung) wrote:
> Because allocating it during runtime will defeat the purpose.

Well, what runtime?  Either way it seems like we have the infrastructure
now based on the comment from willy.

> In any case, I would like to pursue huge_zero_page folio separately
> from this series. Also iomap_dio_zero() only pads a fs block with
> zeroes, which should never be > 64k for XFS.

Only if you are limited to 64k block size.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-26 23:46       ` Luis Chamberlain
@ 2024-04-28  0:57         ` Luis Chamberlain
  2024-04-29  3:56           ` Luis Chamberlain
  0 siblings, 1 reply; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-28  0:57 UTC (permalink / raw)
  To: Matthew Wilcox, ziy
  Cc: Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > 
> > > > using that API for LBS is resulting in a NULL ptr dereference
> > > > error in the writeback path [1].
> > > >
> > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > > 
> > >  How would I go about reproducing this?

Well, so the below fixes this, but I am not sure if this is correct.
folio_mark_dirty() at least says that a folio should not be truncated
while it is running. I am not sure if we should try to split folios
then, even though we check for writeback once.
truncate_inode_partial_folio() will folio_wait_writeback(), but it
calls split_folio() before the folio_test_dirty() check it uses to
claim that the truncate failed. But since the folio is locked, it is
not clear why this should be possible.

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 83955362d41c..90195506211a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 	if (new_order >= folio_order(folio))
 		return -EINVAL;
 
-	if (folio_test_writeback(folio))
+	if (folio_test_dirty(folio) || folio_test_writeback(folio))
 		return -EBUSY;
 
 	if (!folio_test_anon(folio)) {

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-04-26 15:12   ` Darrick J. Wong
@ 2024-04-28 20:59     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-28 20:59 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav

On Fri, Apr 26, 2024 at 08:12:43AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 25, 2024 at 01:37:38PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > filemap_create_folio() and do_read_cache_folio() were always allocating
> > folio of order 0. __filemap_get_folio was trying to allocate higher
> > order folios when fgp_flags had higher order hint set but it will default
> > to order 0 folio if higher order memory allocation fails.
> > 
> > Supporting mapping_min_order implies that we guarantee each folio in the
> > page cache has at least an order of mapping_min_order. When adding new
> > folios to the page cache we must also ensure the index used is aligned to
> > the mapping_min_order as the page cache requires the index to be aligned
> > to the order of the folio.
> 
> If we cannot find a folio of at least min_order size, what error is sent
> back?
> 
> If the answer is "the same error that you get if we cannot allocate a
> base page today (aka ENOMEM)", then I think I understand this enough to
> say
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Yes. We will get an ENOMEM if we cannot allocate a min_order size folio. :)
Thanks!
> 
> --D
> 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  mm/filemap.c | 24 +++++++++++++++++-------
> >  1 file changed, 17 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 30de18c4fd28..f0c0cfbbd134 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -858,6 +858,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
> >  
> >  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> >  	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
> > +	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
> > +			folio);
> >  	mapping_set_update(&xas, mapping);
> >  
> >  	if (!huge) {
> > @@ -1895,8 +1897,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> >  		folio_wait_stable(folio);
> >  no_page:
> >  	if (!folio && (fgp_flags & FGP_CREAT)) {
> > -		unsigned order = FGF_GET_ORDER(fgp_flags);
> > +		unsigned int min_order = mapping_min_folio_order(mapping);
> > +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
> >  		int err;
> > +		index = mapping_align_start_index(mapping, index);
> >  
> >  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
> >  			gfp |= __GFP_WRITE;
> > @@ -1936,7 +1940,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> >  				break;
> >  			folio_put(folio);
> >  			folio = NULL;
> > -		} while (order-- > 0);
> > +		} while (order-- > min_order);
> >  
> >  		if (err == -EEXIST)
> >  			goto repeat;
> > @@ -2425,13 +2429,16 @@ static int filemap_update_page(struct kiocb *iocb,
> >  }
> >  
> >  static int filemap_create_folio(struct file *file,
> > -		struct address_space *mapping, pgoff_t index,
> > +		struct address_space *mapping, loff_t pos,
> >  		struct folio_batch *fbatch)
> >  {
> >  	struct folio *folio;
> >  	int error;
> > +	unsigned int min_order = mapping_min_folio_order(mapping);
> > +	pgoff_t index;
> >  
> > -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> > +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> > +				    min_order);
> >  	if (!folio)
> >  		return -ENOMEM;
> >  
> > @@ -2449,6 +2456,8 @@ static int filemap_create_folio(struct file *file,
> >  	 * well to keep locking rules simple.
> >  	 */
> >  	filemap_invalidate_lock_shared(mapping);
> > +	/* index in PAGE units but aligned to min_order number of pages. */
> > +	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
> >  	error = filemap_add_folio(mapping, folio, index,
> >  			mapping_gfp_constraint(mapping, GFP_KERNEL));
> >  	if (error == -EEXIST)
> > @@ -2509,8 +2518,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
> >  	if (!folio_batch_count(fbatch)) {
> >  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
> >  			return -EAGAIN;
> > -		err = filemap_create_folio(filp, mapping,
> > -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> > +		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
> >  		if (err == AOP_TRUNCATED_PAGE)
> >  			goto retry;
> >  		return err;
> > @@ -3708,9 +3716,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
> >  repeat:
> >  	folio = filemap_get_folio(mapping, index);
> >  	if (IS_ERR(folio)) {
> > -		folio = filemap_alloc_folio(gfp, 0);
> > +		folio = filemap_alloc_folio(gfp,
> > +					    mapping_min_folio_order(mapping));
> >  		if (!folio)
> >  			return ERR_PTR(-ENOMEM);
> > +		index = mapping_align_start_index(mapping, index);
> >  		err = filemap_add_folio(mapping, folio, index, gfp);
> >  		if (unlikely(err)) {
> >  			folio_put(folio);
> > -- 
> > 2.34.1
> > 
> > 
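
As a worked example of the index alignment in the
filemap_create_folio() hunk above (my numbers, assuming 16k blocks on
a 4k page system, so min_order = 2 and PAGE_SHIFT = 12):

	loff_t pos = 0x7000;	/* byte offset, falls in page index 7 */
	unsigned int min_order = 2;
	pgoff_t index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
	/* (0x7000 >> 14) << 2 == 1 << 2 == 4: page index 7 rounded
	 * down to 4, the first page of the folio covering pos. */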

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers
  2024-04-26 15:18   ` Darrick J. Wong
@ 2024-04-28 21:06     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-28 21:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: willy, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev, p.raghav,
	Dave Chinner

On Fri, Apr 26, 2024 at 08:18:44AM -0700, Darrick J. Wong wrote:
> On Thu, Apr 25, 2024 at 01:37:43PM +0200, Pankaj Raghav (Samsung) wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Pankaj Raghav reported that when filesystem block size is larger
> > than page size, the xattr code can use kmalloc() for high order
> > allocations. This triggers a useless warning in the allocator as it
> > is a __GFP_NOFAIL allocation here:
> > 
> > static inline
> > struct page *rmqueue(struct zone *preferred_zone,
> >                         struct zone *zone, unsigned int order,
> >                         gfp_t gfp_flags, unsigned int alloc_flags,
> >                         int migratetype)
> > {
> >         struct page *page;
> > 
> >         /*
> >          * We most definitely don't want callers attempting to
> >          * allocate greater than order-1 page units with __GFP_NOFAIL.
> >          */
> > >>>>    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> > ...
> > 
> > Fix this by changing all these call sites to use kvmalloc(), which
> > will strip the NOFAIL from the kmalloc attempt and if that fails
> > will do a __GFP_NOFAIL vmalloc().
> > 
> > This is not an issue that production systems will see as
> > filesystems with block size > page size cannot be mounted by the
> > kernel; Pankaj is developing this functionality right now.
> > 
> > Reported-by: Pankaj Raghav <kernel@pankajraghav.com>
> > Fixes: f078d4ea8276 ("xfs: convert kmem_alloc() to kmalloc()")
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Didn't this already go in for-next?

I don't think so. I think Christoph suggested having it in the LBS
series and seeing if MM folks can fix the issue upstream. [1]

[1] https://www.spinics.net/lists/linux-xfs/msg83130.html
> 
> If not,
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 
> > ---
> >  fs/xfs/libxfs/xfs_attr_leaf.c | 15 ++++++---------
> >  1 file changed, 6 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c
> > index ac904cc1a97b..969abc6efd70 100644
> > --- a/fs/xfs/libxfs/xfs_attr_leaf.c
> > +++ b/fs/xfs/libxfs/xfs_attr_leaf.c
> > @@ -1059,10 +1059,7 @@ xfs_attr3_leaf_to_shortform(
> >  
> >  	trace_xfs_attr_leaf_to_sf(args);
> >  
> > -	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> > -	if (!tmpbuffer)
> > -		return -ENOMEM;
> > -
> > +	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> >  	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
> >  
> >  	leaf = (xfs_attr_leafblock_t *)tmpbuffer;
> > @@ -1125,7 +1122,7 @@ xfs_attr3_leaf_to_shortform(
> >  	error = 0;
> >  
> >  out:
> > -	kfree(tmpbuffer);
> > +	kvfree(tmpbuffer);
> >  	return error;
> >  }
> >  
> > @@ -1533,7 +1530,7 @@ xfs_attr3_leaf_compact(
> >  
> >  	trace_xfs_attr_leaf_compact(args);
> >  
> > -	tmpbuffer = kmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> > +	tmpbuffer = kvmalloc(args->geo->blksize, GFP_KERNEL | __GFP_NOFAIL);
> >  	memcpy(tmpbuffer, bp->b_addr, args->geo->blksize);
> >  	memset(bp->b_addr, 0, args->geo->blksize);
> >  	leaf_src = (xfs_attr_leafblock_t *)tmpbuffer;
> > @@ -1571,7 +1568,7 @@ xfs_attr3_leaf_compact(
> >  	 */
> >  	xfs_trans_log_buf(trans, bp, 0, args->geo->blksize - 1);
> >  
> > -	kfree(tmpbuffer);
> > +	kvfree(tmpbuffer);
> >  }
> >  
> >  /*
> > @@ -2250,7 +2247,7 @@ xfs_attr3_leaf_unbalance(
> >  		struct xfs_attr_leafblock *tmp_leaf;
> >  		struct xfs_attr3_icleaf_hdr tmphdr;
> >  
> > -		tmp_leaf = kzalloc(state->args->geo->blksize,
> > +		tmp_leaf = kvzalloc(state->args->geo->blksize,
> >  				GFP_KERNEL | __GFP_NOFAIL);
> >  
> >  		/*
> > @@ -2291,7 +2288,7 @@ xfs_attr3_leaf_unbalance(
> >  		}
> >  		memcpy(save_leaf, tmp_leaf, state->args->geo->blksize);
> >  		savehdr = tmphdr; /* struct copy */
> > -		kfree(tmp_leaf);
> > +		kvfree(tmp_leaf);
> >  	}
> >  
> >  	xfs_attr3_leaf_hdr_to_disk(state->args->geo, save_leaf, &savehdr);
> > -- 
> > 2.34.1
> > 
> > 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-28  0:57         ` Luis Chamberlain
@ 2024-04-29  3:56           ` Luis Chamberlain
  2024-04-29 14:29             ` Zi Yan
  0 siblings, 1 reply; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-29  3:56 UTC (permalink / raw)
  To: Vlastimil Babka, Sean Christopherson, Matthew Wilcox, ziy
  Cc: Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> > On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> > > On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> > > > On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> > > > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > > > 
> > > > > using that API for LBS is resulting in a NULL ptr dereference
> > > > > error in the writeback path [1].
> > > > >
> > > > > [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> > > > 
> > > >  How would I go about reproducing this?
> 
> Well so the below fixes this but I am not sure if this is correct.
> folio_mark_dirty() at least says that a folio should not be truncated
> while its running. I am not sure if we should try to split folios then
> even though we check for writeback once. truncate_inode_partial_folio()
> will folio_wait_writeback() but it will split_folio() before checking
> for claiming to fail to truncate with folio_test_dirty(). But since the
> folio is locked its not clear why this should be possible.
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 83955362d41c..90195506211a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  	if (new_order >= folio_order(folio))
>  		return -EINVAL;
>  
> -	if (folio_test_writeback(folio))
> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>  		return -EBUSY;
>  
>  	if (!folio_test_anon(folio)) {

I wondered what code path was causing this and triggering this NULL
pointer, so I just sprinkled a check here:

	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);

The answer was:

kcompactd() --> migrate_pages_batch()
                  --> try_split_folio --> split_folio_to_list() -->
		       split_huge_page_to_list_to_order()

Since it took running fstests generic/447 twice to reproduce the crash
with minorder and a 16k block size, I modified generic/447 as follows;
this speeds things up so a single run of the test reproduces it:

diff --git a/tests/generic/447 b/tests/generic/447
index 16b814ee7347..43050b58e8ba 100755
--- a/tests/generic/447
+++ b/tests/generic/447
@@ -36,6 +39,15 @@ _scratch_mount >> "$seqres.full" 2>&1
 testdir="$SCRATCH_MNT/test-$seq"
 mkdir "$testdir"
 
+runfile="$tmp.compaction"
+touch $runfile
+while [ -e $runfile ]; do
+	echo 1 > /proc/sys/vm/compact_memory
+	sleep 10
+done &
+compaction_pid=$!
+
+
 # Setup for one million blocks, but we'll accept stress testing down to
 # 2^17 blocks... that should be plenty for anyone.
 fnr=20
@@ -69,6 +81,8 @@ _scratch_cycle_mount
 echo "Delete file1"
 rm -rf $testdir/file1
 
+rm -f $runfile
+wait > /dev/null 2>&1
 # success, all done
 status=0
 exit

And I verified that moving the check only to the migrate_pages_batch()
path also fixes the crash:

diff --git a/mm/migrate.c b/mm/migrate.c
index 73a052a382f1..83b528eb7100 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
 	int rc;
 
 	folio_lock(folio);
+	if (folio_test_dirty(folio)) {
+		rc = -EBUSY;
+		goto out;
+	}
 	rc = split_folio_to_list(folio, split_folios);
+out:
 	folio_unlock(folio);
 	if (!rc)
 		list_move_tail(&folio->lru, split_folios);

However, I'd like compaction folks to review this. I see some
indications in the code that migration can race with truncation, but
that we consider it fine because we take the folio lock. However, here
we have a case where we see the folio clearly locked and the folio is
dirty. Other migration code seems to write the folio back and can wait;
here we just move on. Further reading on commit 0003e2a414687 ("mm: Add
AS_UNMOVABLE to mark mapping as completely unmovable") seems to hint
that migration is safe if the mapping either does not exist or the
mapping does exist but has mapping->a_ops->migrate_folio, so I'd like
further feedback on this. Another thing which requires review: if we
split a folio not down to order 0 but to the new min order, does the
accounting in migrate_pages_batch() require changing? And most
puzzling, why do we not see this with regular large folios, but we do
see it with minorder?

  Luis

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-29  3:56           ` Luis Chamberlain
@ 2024-04-29 14:29             ` Zi Yan
  2024-04-30  0:31               ` Luis Chamberlain
  0 siblings, 1 reply; 47+ messages in thread
From: Zi Yan @ 2024-04-29 14:29 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Vlastimil Babka, Sean Christopherson, Matthew Wilcox,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

[-- Attachment #1: Type: text/plain, Size: 5564 bytes --]

On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:

> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
>>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>>>
>>>>>> using that API for LBS is resulting in a NULL ptr dereference
>>>>>> error in the writeback path [1].
>>>>>>
>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
>>>>>
>>>>>  How would I go about reproducing this?
>>
>> Well so the below fixes this but I am not sure if this is correct.
>> folio_mark_dirty() at least says that a folio should not be truncated
>> while its running. I am not sure if we should try to split folios then
>> even though we check for writeback once. truncate_inode_partial_folio()
>> will folio_wait_writeback() but it will split_folio() before checking
>> for claiming to fail to truncate with folio_test_dirty(). But since the
>> folio is locked its not clear why this should be possible.
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 83955362d41c..90195506211a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>  	if (new_order >= folio_order(folio))
>>  		return -EINVAL;
>>
>> -	if (folio_test_writeback(folio))
>> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>>  		return -EBUSY;
>>
>>  	if (!folio_test_anon(folio)) {
>
> I wondered what code path is causing this and triggering this null
> pointer, so I just sprinkled a check here:
>
> 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
>
> The answer was:
>
> kcompactd() --> migrate_pages_batch()
>                   --> try_split_folio --> split_folio_to_list() -->
> 		       split_huge_page_to_list_to_order()
>

There are 3 try_split_folio() in migrate_pages_batch(). First one is to
split anonymous large folios that are on deferred split list, so not related;
second one is to split THPs when thp migration is not supported, but
this is compaction, so not related; third one is to split large folios
when there is no same size free page in the system, and this should be
the one.

> Since it took running fstests generic/447 twice to reproduce the crash
> with a minorder and 16k block size, modified generic/447 as followed and
> it helps to speed up the reproducer with just running the test once:
>
> diff --git a/tests/generic/447 b/tests/generic/447
> index 16b814ee7347..43050b58e8ba 100755
> --- a/tests/generic/447
> +++ b/tests/generic/447
> @@ -36,6 +39,15 @@ _scratch_mount >> "$seqres.full" 2>&1
>  testdir="$SCRATCH_MNT/test-$seq"
>  mkdir "$testdir"
>
> +runfile="$tmp.compaction"
> +touch $runfile
> +while [ -e $runfile ]; do
> +	echo 1 > /proc/sys/vm/compact_memory
> +	sleep 10
> +done &
> +compaction_pid=$!
> +
> +
>  # Setup for one million blocks, but we'll accept stress testing down to
>  # 2^17 blocks... that should be plenty for anyone.
>  fnr=20
> @@ -69,6 +81,8 @@ _scratch_cycle_mount
>  echo "Delete file1"
>  rm -rf $testdir/file1
>
> +rm -f $runfile
> +wait > /dev/null 2>&1
>  # success, all done
>  status=0
>  exit
>
> And I verified that moving the check only to the migrate_pages_batch()
> path also fixes the crash:
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 73a052a382f1..83b528eb7100 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>  	int rc;
>
>  	folio_lock(folio);
> +	if (folio_test_dirty(folio)) {
> +		rc = -EBUSY;
> +		goto out;
> +	}
>  	rc = split_folio_to_list(folio, split_folios);
> +out:
>  	folio_unlock(folio);
>  	if (!rc)
>  		list_move_tail(&folio->lru, split_folios);
>
> However I'd like compaction folks to review this. I see some indications
> in the code that migration can race with truncation but we feel fine by
> it by taking the folio lock. However here we have a case where we see
> the folio clearly locked and the folio is dirty. Other migraiton code
> seems to write back the code and can wait, here we just move on. Further
> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> as completely unmovable") seems to hint that migration is safe if the
> mapping either does not exist or the mapping does exist but has
> mapping->a_ops->migrate_folio so I'd like further feedback on this.

During migration, all page table entries pointing to this dirty folio
are invalid, and accesses to this folio will cause page fault and
wait on the migration entry. I am not sure we need to skip dirty folios.
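
For reference, the fault side being described looks roughly like the
do_swap_page() path (paraphrased, not verbatim):

	/* A PTE that was replaced by a migration entry makes the
	 * faulting task wait until the migration finishes. */
	entry = pte_to_swp_entry(vmf->orig_pte);
	if (is_migration_entry(entry))
		migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);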

> Another thing which requires review is if we we split a folio but not
> down to order 0 but to the new min order, does the accounting on
> migrate_pages_batch() require changing?  And most puzzling, why do we

What accounting are you referring to? split code should take care of it.

> not see this with regular large folios, but we do see it with minorder ?

I wonder if the split code handles folio->mapping->i_pages properly.
Does the i_pages store just folio pointers or also need all tail page
pointers? I am no expert in fs, thus need help.


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 00/11] enable bs > ps in XFS
  2024-04-27  4:42 ` [PATCH v4 00/11] enable bs > ps in XFS Ritesh Harjani
  2024-04-27  5:05   ` Darrick J. Wong
@ 2024-04-29 20:39   ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-29 20:39 UTC (permalink / raw)
  To: Ritesh Harjani
  Cc: willy, djwong, brauner, david, chandan.babu, akpm, linux-fsdevel,
	hare, linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev,
	p.raghav

On Sat, Apr 27, 2024 at 10:12:38AM +0530, Ritesh Harjani wrote:
> "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> writes:
> 
> > From: Pankaj Raghav <p.raghav@samsung.com>
> >
> > This is the fourth version of the series that enables block size > page size
> > (Large Block Size) in XFS. The context and motivation can be seen in cover
> > letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
> > if someone would like more context on this effort.
> >
> > This series does not split a folio during truncation even though we have
> > an API to do so due to some issues with writeback. While it is not a
> > blocker, this feature can be added as a future improvement once we
> > get the base patches upstream (See patch 7).
> >
> > A lot of emphasis has been put on testing using kdevops. The testing has
> > been split into regression and progression.
> >
> > Regression testing:
> > In regression testing, we ran the whole test suite to check for
> > *regression on existing profiles due to the page cache changes.
> >
> > No regression was found with the patches added on top.
> >
> > Progression testing:
> > For progression testing, we tested for 8k, 16k, 32k and 64k block sizes.
> > To compare it with existing support, an ARM VM with 64k base page system
> > (without our patches) was used as a reference to check for actual failures
> > due to LBS support in a 4k base page size system.
> >
> > There are some tests that assumes block size < page size that needs to
> > be fixed. I have a tree with fixes for xfstests here [6], which I will be
> > sending soon to the list. Already a part of this has been upstreamed to
> > fstest.
> >
> > No new failures were found with the LBS support.
> 
> I just did portability testing by creating XFS with 16k bs on x86 VM (4k
> pagesize), created some files + checksums. I then moved the disk to
> Power VM with 64k pagesize and mounted this. I was able to mount and
> all the file checksums passed.
> 
> Then I did the vice versa, created a filesystem on Power VM with 64k
> blocksize and created 10 files with random data of 10MB each. I then
> hotplugged this device out from Power and plugged it into x86 VM and
> mounted it.
> 
> <Logs of the 2nd operation>
> ~# mount /dev/vdk /mnt1/
> [   35.145350] XFS (vdk): EXPERIMENTAL: Filesystem with Large Block Size (65536 bytes) enabled.
> [   35.149858] XFS (vdk): Mounting V5 Filesystem 91933a8b-1370-4931-97d1-c21213f31f8f
> [   35.227459] XFS (vdk): Ending clean mount
> [   35.235090] xfs filesystem being mounted at /mnt1 supports timestamps until 2038-01-19 (0x7fffffff)
> ~# cd /mnt1/
> ~# sha256sum -c checksums 
> file-1.img: OK
> file-2.img: OK
> file-3.img: OK
> file-4.img: OK
> file-5.img: OK
> file-6.img: OK
> file-7.img: OK
> file-8.img: OK
> file-9.img: OK
> file-10.img: OK
> 
> So thanks for this nice portability which this series offers :) 

That is indeed nice. Thanks a lot for testing this, Ritesh. :)

> 
> -ritesh
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-04-27  5:12       ` Christoph Hellwig
@ 2024-04-29 21:02         ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 47+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-29 21:02 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: willy, djwong, brauner, david, chandan.babu, akpm, linux-fsdevel,
	hare, linux-kernel, linux-mm, linux-xfs, mcgrof, gost.dev,
	p.raghav

On Fri, Apr 26, 2024 at 10:12:01PM -0700, Christoph Hellwig wrote:
> On Fri, Apr 26, 2024 at 11:43:01AM +0000, Pankaj Raghav (Samsung) wrote:
> > Because allocating it during runtime will defeat the purpose.
> 
> Well, what runtime?  Either way it seems like we have the infrastructure
> now based on the comment from willy.

As willy pointed out in that reply, it is allocated on demand, so it
might still fail and we might have to fall back to looping.
And if we end up using the huge zero page, we should also make sure to
decrement the reference in iomap_dio_bio_end_io(), which is not needed
when we use ZERO_PAGE.
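
Roughly, the pairing I have in mind (a hedged sketch with made-up
names, not the prototype in [1] below):

/*
 * Sketch: drop the huge zero page reference at DIO completion iff
 * submission took one; ZERO_PAGE(0) is refcount-free and needs no
 * put.  The flag and the put_huge_zero_page() export are assumptions.
 */
static void dio_zero_done_sketch(bool used_huge_zero_page)
{
	if (used_huge_zero_page)
		put_huge_zero_page();
}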

FWIW, I did a prototype with mm_huge_zero_page() in one of the older
series [1] (I forgot to decrement the reference in end_io()), but I did
not get a final response on whether that is the direction we want to go.

Let me know your thoughts.

[1]https://lore.kernel.org/linux-xfs/3pqmgrlewo6ctcwakdvbvjqixac5en6irlipe5aiz6vkylfyni@2luhrs36ke5r/#r
> 
> > In any case, I would like to pursue huge_zero_page folio separately
> > from this series. Also iomap_dio_zero() only pads a fs block with
> > zeroes, which should never be > 64k for XFS.
> 
> Only if you are limited to 64k block size.
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-29 14:29             ` Zi Yan
@ 2024-04-30  0:31               ` Luis Chamberlain
  2024-04-30  0:49                 ` Luis Chamberlain
  2024-04-30  2:43                 ` Zi Yan
  0 siblings, 2 replies; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-30  0:31 UTC (permalink / raw)
  To: Zi Yan
  Cc: Vlastimil Babka, Sean Christopherson, Matthew Wilcox,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> 
> > On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> >> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> >>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> >>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> >>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> >>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
> >>>>>>
> >>>>>> using that API for LBS is resulting in a NULL ptr dereference
> >>>>>> error in the writeback path [1].
> >>>>>>
> >>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >>>>>
> >>>>>  How would I go about reproducing this?
> >>
> >> Well so the below fixes this but I am not sure if this is correct.
> >> folio_mark_dirty() at least says that a folio should not be truncated
> >> while its running. I am not sure if we should try to split folios then
> >> even though we check for writeback once. truncate_inode_partial_folio()
> >> will folio_wait_writeback() but it will split_folio() before checking
> >> for claiming to fail to truncate with folio_test_dirty(). But since the
> >> folio is locked its not clear why this should be possible.
> >>
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 83955362d41c..90195506211a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >>  	if (new_order >= folio_order(folio))
> >>  		return -EINVAL;
> >>
> >> -	if (folio_test_writeback(folio))
> >> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >>  		return -EBUSY;
> >>
> >>  	if (!folio_test_anon(folio)) {
> >
> > I wondered what code path is causing this and triggering this null
> > pointer, so I just sprinkled a check here:
> >
> > 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
> >
> > The answer was:
> >
> > kcompactd() --> migrate_pages_batch()
> >                   --> try_split_folio --> split_folio_to_list() -->
> > 		       split_huge_page_to_list_to_order()
> >
> 
> There are 3 try_split_folio() in migrate_pages_batch().

This is only true for linux-next; for v6.9-rc5, which this testing is
based on, there are only two.

> First one is to split anonymous large folios that are on deferred
> split list, so not related;

This is in linux-next and not v6.9-rc5.

> second one is to split THPs when thp migration is not supported, but
> this is compaction, so not related; third one is to split large folios
> when there is no same size free page in the system, and this should be
> the one.

Agreed, it is the case where migrate_folio_unmap() failed with -ENOMEM.
This also helps us enhance the reproducer further, which I'll do next.

> > And I verified that moving the check only to the migrate_pages_batch()
> > path also fixes the crash:
> >
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 73a052a382f1..83b528eb7100 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> >  	int rc;
> >
> >  	folio_lock(folio);
> > +	if (folio_test_dirty(folio)) {
> > +		rc = -EBUSY;
> > +		goto out;
> > +	}
> >  	rc = split_folio_to_list(folio, split_folios);
> > +out:
> >  	folio_unlock(folio);
> >  	if (!rc)
> >  		list_move_tail(&folio->lru, split_folios);
> >
> > However I'd like compaction folks to review this. I see some indications
> > in the code that migration can race with truncation but we feel fine by
> > it by taking the folio lock. However here we have a case where we see
> > the folio clearly locked and the folio is dirty. Other migraiton code
> > seems to write back the code and can wait, here we just move on. Further
> > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> > as completely unmovable") seems to hint that migration is safe if the
> > mapping either does not exist or the mapping does exist but has
> > mapping->a_ops->migrate_folio so I'd like further feedback on this.
> 
> During migration, all page table entries pointing to this dirty folio
> are invalid, and accesses to this folio will cause page fault and
> wait on the migration entry. I am not sure we need to skip dirty folios.

I see.. thanks!

> > Another thing which requires review is if we we split a folio but not
> > down to order 0 but to the new min order, does the accounting on
> > migrate_pages_batch() require changing?  And most puzzling, why do we
> 
> What accounting are you referring to? split code should take care of it.

The folio order can change after a split, and so I was concerned about
the nr_pages used in migrate_pages_batch(). But I see now that when
migrate_folio_unmap() first fails we try to split the folio, and if
that succeeds the caller will call migrate_pages_batch() again, with a
retry attempt of 1, only on the split folios. I also see that nr_pages
is just local to each list for each loop: first on the from list as we
unmap, and then on the unmap list as we move the folios.

> > not see this with regular large folios, but we do see it with minorder ?
> 
> I wonder if the split code handles folio->mapping->i_pages properly.
> Does the i_pages store just folio pointers or also need all tail page
> pointers? I am no expert in fs, thus need help.

mapping->i_pages stores folio pointers in the page cache or
swap/dax/shadow entries (xa_is_value(folio)). The folios, however, can
be special, and we special-case them with shmem_mapping(mapping) checks.
split_huge_page_to_list_to_order() doesn't get called with
swap/dax/shadow entries, and we also bail out on shmem_mapping(mapping)
already.
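
A minimal sketch of that convention (illustrative only; assumes
mapping and index are in scope):

	XA_STATE(xas, &mapping->i_pages, index);
	void *entry;

	rcu_read_lock();
	entry = xas_load(&xas);
	if (entry && !xa_is_value(entry)) {
		/* A folio pointer: one multi-index entry covers the
		 * whole large folio; tail pages are not stored. */
		struct folio *folio = entry;
	}
	rcu_read_unlock();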

  Luis

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-30  0:31               ` Luis Chamberlain
@ 2024-04-30  0:49                 ` Luis Chamberlain
  2024-04-30  2:43                 ` Zi Yan
  1 sibling, 0 replies; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-30  0:49 UTC (permalink / raw)
  To: Zi Yan
  Cc: Vlastimil Babka, Sean Christopherson, Matthew Wilcox,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Mon, Apr 29, 2024 at 05:31:04PM -0700, Luis Chamberlain wrote:
> On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> > On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> > > diff --git a/mm/migrate.c b/mm/migrate.c
> > > index 73a052a382f1..83b528eb7100 100644
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> > >  	int rc;
> > >
> > >  	folio_lock(folio);
> > > +	if (folio_test_dirty(folio)) {
> > > +		rc = -EBUSY;
> > > +		goto out;
> > > +	}
> > >  	rc = split_folio_to_list(folio, split_folios);
> > > +out:
> > >  	folio_unlock(folio);
> > >  	if (!rc)
> > >  		list_move_tail(&folio->lru, split_folios);
> > >
> > > However I'd like compaction folks to review this. I see some indications
> > > in the code that migration can race with truncation but we feel fine by
> > > it by taking the folio lock. However here we have a case where we see
> > > the folio clearly locked and the folio is dirty. Other migraiton code
> > > seems to write back the code and can wait, here we just move on. Further
> > > reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> > > as completely unmovable") seems to hint that migration is safe if the
> > > mapping either does not exist or the mapping does exist but has
> > > mapping->a_ops->migrate_folio so I'd like further feedback on this.
> > 
> > During migration, all page table entries pointing to this dirty folio
> > are invalid, and accesses to this folio will cause page fault and
> > wait on the migration entry. I am not sure we need to skip dirty folios.

That would seem to explain why we get this if we allow these to
continue, i.e., without the hunk above. So the question is: are we
sure we wait for migration to complete before doing writeback?

Apr 29 17:28:54 min-xfs-reflink-16k-4ks unknown: run fstests generic/447 at 2024-04-29 17:28:54
Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): EXPERIMENTAL: Filesystem with Large Block Size (16384 bytes) enabled.
Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Mounting V5 Filesystem 89a3d14d-8147-455d-8897-927a0b7baf65
Apr 29 17:28:55 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Ending clean mount
Apr 29 17:28:56 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Unmounting Filesystem 89a3d14d-8147-455d-8897-927a0b7baf65
Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): EXPERIMENTAL: Filesystem with Large Block Size (16384 bytes) enabled.
Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Mounting V5 Filesystem e29c3eee-18d1-4fec-9a17-076b3eccbd74
Apr 29 17:28:59 min-xfs-reflink-16k-4ks kernel: XFS (loop5): Ending clean mount
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: BUG: kernel NULL pointer dereference, address: 0000000000000036
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: #PF: supervisor read access in kernel mode
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: #PF: error_code(0x0000) - not-present page
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: PGD 0 P4D 0
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CPU: 3 PID: 2245 Comm: kworker/u36:5 Not tainted 6.9.0-rc5+ #26
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Workqueue: writeback wb_workfn (flush-7:5)
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RIP: 0010:filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1984 mm/filemap.c:2222) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Code: ad 05 86 00 48 89 c3 48 3d 06 04 00 00 74 e8 48 81 fb 02 04 00 00 0f 84 d0 00 00 00 48 85 db 0f 84 04 01 00 00 f6 c3 01 75 c4 <8b> 43 34 85 c0 0f 84 b7 00 00 00 8d 50 01 48 8d 73 34 f0 0f b1 53
All code
========
   0:	ad                   	lods   %ds:(%rsi),%eax
   1:	05 86 00 48 89       	add    $0x89480086,%eax
   6:	c3                   	ret
   7:	48 3d 06 04 00 00    	cmp    $0x406,%rax
   d:	74 e8                	je     0xfffffffffffffff7
   f:	48 81 fb 02 04 00 00 	cmp    $0x402,%rbx
  16:	0f 84 d0 00 00 00    	je     0xec
  1c:	48 85 db             	test   %rbx,%rbx
  1f:	0f 84 04 01 00 00    	je     0x129
  25:	f6 c3 01             	test   $0x1,%bl
  28:	75 c4                	jne    0xffffffffffffffee
  2a:*	8b 43 34             	mov    0x34(%rbx),%eax		<-- trapping instruction
  2d:	85 c0                	test   %eax,%eax
  2f:	0f 84 b7 00 00 00    	je     0xec
  35:	8d 50 01             	lea    0x1(%rax),%edx
  38:	48 8d 73 34          	lea    0x34(%rbx),%rsi
  3c:	f0                   	lock
  3d:	0f                   	.byte 0xf
  3e:	b1 53                	mov    $0x53,%cl

Code starting with the faulting instruction
===========================================
   0:	8b 43 34             	mov    0x34(%rbx),%eax
   3:	85 c0                	test   %eax,%eax
   5:	0f 84 b7 00 00 00    	je     0xc2
   b:	8d 50 01             	lea    0x1(%rax),%edx
   e:	48 8d 73 34          	lea    0x34(%rbx),%rsi
  12:	f0                   	lock
  13:	0f                   	.byte 0xf
  14:	b1 53                	mov    $0x53,%cl
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RSP: 0018:ffffad0a0396b8f8 EFLAGS: 00010246
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 00000000000ba200
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff9f408d20d488
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: R10: 0000000000000228 R11: 0000000000000000 R12: ffffffffffffffff
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: R13: ffffad0a0396bbb8 R14: ffffad0a0396bcb8 R15: ffff9f40633bfc00
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: FS:  0000000000000000(0000) GS:ffff9f40bbcc0000(0000) knlGS:0000000000000000
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CR2: 0000000000000036 CR3: 000000010bd44006 CR4: 0000000000770ef0
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: PKRU: 55555554
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Call Trace:
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel:  <TASK>
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? page_fault_oops (arch/x86/mm/fault.c:713) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? write_cache_pages (mm/page-writeback.c:2569) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? do_user_addr_fault (./include/linux/kprobes.h:591 (discriminator 1) arch/x86/mm/fault.c:1265 (discriminator 1)) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? exc_page_fault (./arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1984 mm/filemap.c:2222) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? filemap_get_folios_tag (mm/filemap.c:1972 mm/filemap.c:2222) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_iomap_do_writepage (fs/iomap/buffered-io.c:1963) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: writeback_iter (./include/linux/pagevec.h:91 mm/page-writeback.c:2421 mm/page-writeback.c:2520) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: write_cache_pages (mm/page-writeback.c:2568) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: iomap_writepages (fs/iomap/buffered-io.c:1984) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: do_writepages (mm/page-writeback.c:2612) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: __writeback_single_inode (fs/fs-writeback.c:1659) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? _raw_spin_lock (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:134 (discriminator 4) kernel/locking/spinlock.c:154 (discriminator 4)) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: writeback_sb_inodes (fs/fs-writeback.c:1943) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: __writeback_inodes_wb (fs/fs-writeback.c:2013) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: wb_writeback (fs/fs-writeback.c:2119) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: wb_workfn (fs/fs-writeback.c:2277 (discriminator 1) fs/fs-writeback.c:2304 (discriminator 1)) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: process_one_work (kernel/workqueue.c:3254) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2)) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_worker_thread (kernel/workqueue.c:3362) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: kthread (kernel/kthread.c:388) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_kthread (kernel/kthread.c:341) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ret_from_fork (arch/x86/kernel/process.c:147) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ? __pfx_kthread (kernel/kthread.c:341) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel:  </TASK>
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: Modules linked in: xfs nvme_fabrics nvme_core t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sunrpc 9p netfs kvm_intel kvm crct10dif_pclmul ghash_clmulni_intel sha512_ssse3 sha512_generic sha256_ssse3 sha1_ssse3 aesni_intel crypto_simd cryptd virtio_balloon pcspkr 9pnet_virtio virtio_console button evdev joydev serio_raw loop drm dm_mod autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic efivarfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 md_mod virtio_net net_failover failover virtio_blk crc32_pclmul crc32c_intel psmouse virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev virtio virtio_ring
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: CR2: 0000000000000036
Apr 29 17:29:20 min-xfs-reflink-16k-4ks kernel: ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-30  0:31               ` Luis Chamberlain
  2024-04-30  0:49                 ` Luis Chamberlain
@ 2024-04-30  2:43                 ` Zi Yan
  2024-04-30 19:27                   ` Luis Chamberlain
  1 sibling, 1 reply; 47+ messages in thread
From: Zi Yan @ 2024-04-30  2:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Vlastimil Babka, Sean Christopherson, Matthew Wilcox,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

[-- Attachment #1: Type: text/plain, Size: 6705 bytes --]

On 29 Apr 2024, at 20:31, Luis Chamberlain wrote:

> On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
>> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
>>
>>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
>>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
>>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
>>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
>>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
>>>>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>>>>>
>>>>>>>> using that API for LBS is resulting in a NULL ptr dereference
>>>>>>>> error in the writeback path [1].
>>>>>>>>
>>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
>>>>>>>
>>>>>>>  How would I go about reproducing this?
>>>>
>>>> Well, the below fixes this, but I am not sure if it is correct.
>>>> folio_mark_dirty() at least says that a folio should not be truncated
>>>> while it is running. I am not sure we should try to split folios then,
>>>> even though we check for writeback once. truncate_inode_partial_folio()
>>>> will folio_wait_writeback(), but it will split_folio() before checking
>>>> folio_test_dirty() to claim that the truncate failed. But since the
>>>> folio is locked, it is not clear why this should be possible.
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 83955362d41c..90195506211a 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>  	if (new_order >= folio_order(folio))
>>>>  		return -EINVAL;
>>>>
>>>> -	if (folio_test_writeback(folio))
>>>> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
>>>>  		return -EBUSY;
>>>>
>>>>  	if (!folio_test_anon(folio)) {
>>>
>>> I wondered what code path is causing this and triggering this null
>>> pointer, so I just sprinkled a check here:
>>>
>>> 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
>>>
>>> The answer was:
>>>
>>> kcompactd() --> migrate_pages_batch()
>>>                   --> try_split_folio --> split_folio_to_list() -->
>>> 		       split_huge_page_to_list_to_order()
>>>
>>
>> There are 3 try_split_folio() calls in migrate_pages_batch().
>
> This is only true for linux-next; for v6.9-rc5, on which this testing
> is based, there are only two.
>
>> The first one is to split anonymous large folios that are on the
>> deferred split list, so not related;
>
> This is in linux-next and not v6.9-rc5.
>
>> the second one is to split THPs when THP migration is not supported,
>> but this is compaction, so not related; the third one is to split
>> large folios when there is no free page of the same size in the
>> system, and this should be the one.
>
> Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
> also helps us enhance the reproducer further, which I'll do next.
>
>>> And I verified that moving the check only to the migrate_pages_batch()
>>> path also fixes the crash:
>>>
>>> diff --git a/mm/migrate.c b/mm/migrate.c
>>> index 73a052a382f1..83b528eb7100 100644
>>> --- a/mm/migrate.c
>>> +++ b/mm/migrate.c
>>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
>>>  	int rc;
>>>
>>>  	folio_lock(folio);
>>> +	if (folio_test_dirty(folio)) {
>>> +		rc = -EBUSY;
>>> +		goto out;
>>> +	}
>>>  	rc = split_folio_to_list(folio, split_folios);
>>> +out:
>>>  	folio_unlock(folio);
>>>  	if (!rc)
>>>  		list_move_tail(&folio->lru, split_folios);
>>>
>>> However I'd like compaction folks to review this. I see some indications
>>> in the code that migration can race with truncation, but we are fine with
>>> it because we take the folio lock. However, here we have a case where the
>>> folio is clearly locked and dirty. Other migration code seems to write
>>> back the folio and can wait; here we just move on. Further
>>> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
>>> as completely unmovable") seems to hint that migration is safe if the
>>> mapping either does not exist or the mapping does exist but has
>>> mapping->a_ops->migrate_folio so I'd like further feedback on this.
>>
>> During migration, all page table entries pointing to this dirty folio
>> are invalid, and accesses to this folio will cause page fault and
>> wait on the migration entry. I am not sure we need to skip dirty folios.
>
> I see.. thanks!
>
>>> Another thing which requires review is if we split a folio but not
>>> down to order 0 but to the new min order, does the accounting on
>>> migrate_pages_batch() require changing?  And most puzzling, why do we
>>
>> What accounting are you referring to? split code should take care of it.
>
> The folio order can change after split, and so I was concerned about the
> nr_pages used in migrate_pages_batch(). But I see now that when
> migrate_folio_unmap() first fails we try to split the folio, and if
> that succeeds the caller will call migrate_pages_batch() again, with a
> retry attempt of 1, only on the split folios. I also see that nr_pages
> is just local to each list for each loop: first on the from list to
> unmap, and after on the unmap list as we move the folios.
>
>>> not see this with regular large folios, but we do see it with minorder?
>>
>> I wonder if the split code handles folio->mapping->i_pages properly.
>> Does the i_pages store just folio pointers or also need all tail page
>> pointers? I am no expert in fs, thus need help.
>
> mapping->i_pages stores folio pointers in the page cache or
> swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
> special and we special-case them with shmem_mapping(mapping) checks.
> split_huge_page_to_list_to_order() doesn't get called with swap/dax/shadow
> entries, and we also bail out on shmem_mapping(mapping) already.

Hmm, I misunderstood the issue above. To clarify it, the error comes out
when a page cache folio with minorder is split to order-0, a NULL ptr
deref shows up in the writeback path. I thought the folio was split to
non-0 order. split_huge_page_to_list_to_order() should be fine, since
splitting to order-0 is not changed after my patches.

I wonder if you can isolate the issue by just splitting a dirty minorder
page cache folio, instead of having a folio split and migration going on
together. You can probably use debugfs to do that. Depending on the result,
we can narrow down the cause of the issue.
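
For example, something like this against a test file (the
<path>,<off_start>,<off_end>[,<new_order>] syntax is the one the
reproducer later in this thread also uses; the path is illustrative):

	# Split any large folios backing this file in the offset range
	# [0x0, 0x4000) down to order 2, with no migration involved:
	echo /media/scratch/foo,0x0,0x4000,2 > /sys/kernel/debug/split_huge_pages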


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-30  2:43                 ` Zi Yan
@ 2024-04-30 19:27                   ` Luis Chamberlain
  2024-05-01  4:13                     ` Matthew Wilcox
  0 siblings, 1 reply; 47+ messages in thread
From: Luis Chamberlain @ 2024-04-30 19:27 UTC (permalink / raw)
  To: Zi Yan, Andrew Morton
  Cc: Vlastimil Babka, Sean Christopherson, Matthew Wilcox,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, akpm, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Mon, Apr 29, 2024 at 10:43:16PM -0400, Zi Yan wrote:
> On 29 Apr 2024, at 20:31, Luis Chamberlain wrote:
> 
> > On Mon, Apr 29, 2024 at 10:29:29AM -0400, Zi Yan wrote:
> >> On 28 Apr 2024, at 23:56, Luis Chamberlain wrote:
> >>
> >>> On Sat, Apr 27, 2024 at 05:57:17PM -0700, Luis Chamberlain wrote:
> >>>> On Fri, Apr 26, 2024 at 04:46:11PM -0700, Luis Chamberlain wrote:
> >>>>> On Thu, Apr 25, 2024 at 05:47:28PM -0700, Luis Chamberlain wrote:
> >>>>>> On Thu, Apr 25, 2024 at 09:10:16PM +0100, Matthew Wilcox wrote:
> >>>>>>> On Thu, Apr 25, 2024 at 01:37:40PM +0200, Pankaj Raghav (Samsung) wrote:
> >>>>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
> >>>>>>>>
> >>>>>>>> using that API for LBS is resulting in a NULL ptr dereference
> >>>>>>>> error in the writeback path [1].
> >>>>>>>>
> >>>>>>>> [1] https://gist.github.com/mcgrof/d12f586ec6ebe32b2472b5d634c397df
> >>>>>>>
> >>>>>>>  How would I go about reproducing this?
> >>>>
> >>>> Well, the below fixes this, but I am not sure if it is correct.
> >>>> folio_mark_dirty() at least says that a folio should not be truncated
> >>>> while it is running. I am not sure we should try to split folios then,
> >>>> even though we check for writeback once. truncate_inode_partial_folio()
> >>>> will folio_wait_writeback(), but it will split_folio() before checking
> >>>> folio_test_dirty() to claim that the truncate failed. But since the
> >>>> folio is locked, it is not clear why this should be possible.
> >>>>
> >>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >>>> index 83955362d41c..90195506211a 100644
> >>>> --- a/mm/huge_memory.c
> >>>> +++ b/mm/huge_memory.c
> >>>> @@ -3058,7 +3058,7 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >>>>  	if (new_order >= folio_order(folio))
> >>>>  		return -EINVAL;
> >>>>
> >>>> -	if (folio_test_writeback(folio))
> >>>> +	if (folio_test_dirty(folio) || folio_test_writeback(folio))
> >>>>  		return -EBUSY;
> >>>>
> >>>>  	if (!folio_test_anon(folio)) {
> >>>
> >>> I wondered what code path is causing this and triggering this null
> >>> pointer, so I just sprinkled a check here:
> >>>
> >>> 	VM_BUG_ON_FOLIO(folio_test_dirty(folio), folio);
> >>>
> >>> The answer was:
> >>>
> >>> kcompactd() --> migrate_pages_batch()
> >>>                   --> try_split_folio --> split_folio_to_list() -->
> >>> 		       split_huge_page_to_list_to_order()
> >>>
> >>
> >> There are 3 try_split_folio() calls in migrate_pages_batch().
> >
> > This is only true for linux-next; for v6.9-rc5, on which this testing
> > is based, there are only two.
> >
> >> The first one is to split anonymous large folios that are on the
> >> deferred split list, so not related;
> >
> > This is in linux-next and not v6.9-rc5.
> >
> >> the second one is to split THPs when THP migration is not supported,
> >> but this is compaction, so not related; the third one is to split
> >> large folios when there is no free page of the same size in the
> >> system, and this should be the one.
> >
> > Agreed, the case where migrate_folio_unmap() failed with -ENOMEM. This
> > also helps us enhance the reproducer further, which I'll do next.
> >
> >>> And I verified that moving the check only to the migrate_pages_batch()
> >>> path also fixes the crash:
> >>>
> >>> diff --git a/mm/migrate.c b/mm/migrate.c
> >>> index 73a052a382f1..83b528eb7100 100644
> >>> --- a/mm/migrate.c
> >>> +++ b/mm/migrate.c
> >>> @@ -1484,7 +1484,12 @@ static inline int try_split_folio(struct folio *folio, struct list_head *split_f
> >>>  	int rc;
> >>>
> >>>  	folio_lock(folio);
> >>> +	if (folio_test_dirty(folio)) {
> >>> +		rc = -EBUSY;
> >>> +		goto out;
> >>> +	}
> >>>  	rc = split_folio_to_list(folio, split_folios);
> >>> +out:
> >>>  	folio_unlock(folio);
> >>>  	if (!rc)
> >>>  		list_move_tail(&folio->lru, split_folios);
> >>>
> >>> However I'd like compaction folks to review this. I see some indications
> >>> in the code that migration can race with truncation, but we are fine with
> >>> it because we take the folio lock. However, here we have a case where the
> >>> folio is clearly locked and dirty. Other migration code seems to write
> >>> back the folio and can wait; here we just move on. Further
> >>> reading on commit 0003e2a414687 ("mm: Add AS_UNMOVABLE to mark mapping
> >>> as completely unmovable") seems to hint that migration is safe if the
> >>> mapping either does not exist or the mapping does exist but has
> >>> mapping->a_ops->migrate_folio so I'd like further feedback on this.
> >>
> >> During migration, all page table entries pointing to this dirty folio
> >> are invalid, and accesses to this folio will cause page fault and
> >> wait on the migration entry. I am not sure we need to skip dirty folios.
> >
> > I see.. thanks!
> >
> >>> Another thing which requires review is if we split a folio but not
> >>> down to order 0 but to the new min order, does the accounting on
> >>> migrate_pages_batch() require changing?  And most puzzling, why do we
> >>
> >> What accounting are you referring to? split code should take care of it.
> >
> > The folio order can change after split, and so I was concerned about the
> > nr_pages used in migrate_pages_batch(). But I see now that when
> > migrate_folio_unmap() first fails we try to split the folio, and if
> > that succeeds the caller will call migrate_pages_batch() again, with a
> > retry attempt of 1, only on the split folios. I also see that nr_pages
> > is just local to each list for each loop: first on the from list to
> > unmap, and after on the unmap list as we move the folios.
> >
> >>> not see this with regular large folios, but we do see it with minorder?
> >>
> >> I wonder if the split code handles folio->mapping->i_pages properly.
> >> Does the i_pages store just folio pointers or also need all tail page
> >> pointers? I am no expert in fs, thus need help.
> >
> > mapping->i_pages stores folio pointers in the page cache or
> > swap/dax/shadow entries (xa_is_value(folio)). The folios however can be
> > special and we special-case them with shmem_mapping(mapping) checks.
> > split_huge_page_to_list_to_order() doesn't get called with swap/dax/shadow
> > entries, and we also bail out on shmem_mapping(mapping) already.
> 
> Hmm, I misunderstood the issue above. To clarify it, the error comes out
> when a page cache folio with minorder is split to order-0,

No, min order is used.

In order to support splits with min order we require an out-of-tree
patch that has not yet been posted:

https://github.com/linux-kdevops/linux/commit/e77a2a4fd6d9aa7e2641d5ea456ad0522c1e8a04

The important part is that if no order is specified we use the min order:

int split_folio_to_list(struct folio *folio, struct list_head *list)
{
	unsigned int min_order = 0;

	if (!folio_test_anon(folio))
		min_order = mapping_min_folio_order(folio->mapping);

	return split_huge_page_to_list_to_order(&folio->page, list, min_order);
}

and so compaction's try_split_folio() -->
   split_folio_to_list(folio, split_folios)

will use the min order implicitly due to the above.

So yes, we see a null ptr deref on the writeback path when min order is set.
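
For context, mapping_min_folio_order() is introduced by patch 02/11 of
this series. A minimal sketch of the idea, with the flag names and bit
layout assumed for illustration rather than copied from the patch:

	/*
	 * The minimum folio order is encoded in a bitfield of
	 * mapping->flags and read back on every allocation and split.
	 */
	static inline unsigned int
	mapping_min_folio_order(const struct address_space *mapping)
	{
		return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >>
			AS_FOLIO_ORDER_MIN;
	}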

> I wonder if you can isolate the issue by just splitting a dirty minorder
> page cache folio, instead of having a folio split and migration going on
> together. You can probably use debugfs to do that. Depending on the result,
> we can narrow down the cause of the issue.

That's what I had tried with my new fstests test, but now I see where it
also failed: on 4k filesystems it was trying to split to order 0, and
that had no issues, as you pointed out. We can now fine-tune the test
very well. I can now reproduce the crash on plain boring vanilla Linux
v6.9-rc6 on an xfs filesystem with a 4k block size on x86_64 by doing
this:

You may want the following appended to your kernel command line:

   dyndbg='file mm/huge_memory.c +p'

mkfs.xfs /dev/vdd
mkdir -p /media/scratch/
mount /dev/vdd /media/scratch/
FILE=/media/scratch/foo   # the file named in the dyndbg splat below

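# One loop keeps the file permanently dirty while the other repeatedly
# asks debugfs to split its folios, racing the split against writeback: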
while true; do dd if=/dev/zero of=$FILE bs=4M count=200 2> /dev/null; done &
while true; do sleep 2; echo $FILE,0x0,0x4000,2 > /sys/kernel/debug/split_huge_pages 0x400000 2> /dev/null; done

The crash:

Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: SGI XFS with ACLs, security attributes, realtime, scrub, repair, quota, fatal assert, debug enabled
Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: XFS (vdd): Mounting V5 Filesystem d1f9e444-f61c-4439-a2bf-61a13f6d8e81
Apr 30 10:37:09 debian12-xfs-reflink-4k kernel: XFS (vdd): Ending clean mount
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: huge_memory: split file-backed THPs in file: /media/scratch/foo, page offset: [0x0 - 0x200000]
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: BUG: kernel NULL pointer dereference, address: 0000000000000036
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: #PF: supervisor read access in kernel mode
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: #PF: error_code(0x0000) - not-present page
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: PGD 0 P4D 0
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CPU: 4 PID: 89 Comm: kworker/u37:2 Not tainted 6.9.0-rc6 #10
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Workqueue: writeback wb_workfn (flush-254:48)
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RIP: 0010:filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1980 mm/filemap.c:2218) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Code: bd 06 86 00 48 89 c3 48 3d 06 04 00 00 74 e8 48 81 fb 02 04 00 00 0f 84 d0 00 00 00 48 85 db 0f 84 04 01 00 00 f6 c3 01 75 c4 <8b> 43 34 85 c0 0f 84 b7 00 00 00 8d 50 01 48 8d 73 34 f0 0f b1 53
All code
========
   0:	bd 06 86 00 48       	mov    $0x48008606,%ebp
   5:	89 c3                	mov    %eax,%ebx
   7:	48 3d 06 04 00 00    	cmp    $0x406,%rax
   d:	74 e8                	je     0xfffffffffffffff7
   f:	48 81 fb 02 04 00 00 	cmp    $0x402,%rbx
  16:	0f 84 d0 00 00 00    	je     0xec
  1c:	48 85 db             	test   %rbx,%rbx
  1f:	0f 84 04 01 00 00    	je     0x129
  25:	f6 c3 01             	test   $0x1,%bl
  28:	75 c4                	jne    0xffffffffffffffee
  2a:*	8b 43 34             	mov    0x34(%rbx),%eax		<-- trapping instruction
  2d:	85 c0                	test   %eax,%eax
  2f:	0f 84 b7 00 00 00    	je     0xec
  35:	8d 50 01             	lea    0x1(%rax),%edx
  38:	48 8d 73 34          	lea    0x34(%rbx),%rsi
  3c:	f0                   	lock
  3d:	0f                   	.byte 0xf
  3e:	b1 53                	mov    $0x53,%cl

Code starting with the faulting instruction
===========================================
   0:	8b 43 34             	mov    0x34(%rbx),%eax
   3:	85 c0                	test   %eax,%eax
   5:	0f 84 b7 00 00 00    	je     0xc2
   b:	8d 50 01             	lea    0x1(%rax),%edx
   e:	48 8d 73 34          	lea    0x34(%rbx),%rsi
  12:	f0                   	lock
  13:	0f                   	.byte 0xf
  14:	b1 53                	mov    $0x53,%cl
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RSP: 0018:ffffa8f0c07cb8f8 EFLAGS: 00010246
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000018000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RDX: 0000000000000002 RSI: 0000000000000002 RDI: ffff987380564920
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: RBP: 0000000000000000 R08: ffffffffffffffff R09: 0000000000000000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: R10: 0000000000000228 R11: 0000000000000000 R12: ffffffffffffffff
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: R13: ffffa8f0c07cbbb8 R14: ffffa8f0c07cbcb8 R15: ffff98738c4ea800
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: FS:  0000000000000000(0000) GS:ffff9873fbd00000(0000) knlGS:0000000000000000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: CR2: 0000000000000036 CR3: 000000011aca8003 CR4: 0000000000770ef0
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: PKRU: 55555554
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: Call Trace:
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel:  <TASK>
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __die (arch/x86/kernel/dumpstack.c:421 arch/x86/kernel/dumpstack.c:434) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? page_fault_oops (arch/x86/mm/fault.c:713) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? do_user_addr_fault (./include/linux/kprobes.h:591 (discriminator 1) arch/x86/mm/fault.c:1265 (discriminator 1)) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? exc_page_fault (./arch/x86/include/asm/paravirt.h:693 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? filemap_get_folios_tag (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/atomic/atomic-arch-fallback.h:2426 ./include/linux/atomic/atomic-arch-fallback.h:2456 ./include/linux/atomic/atomic-instrumented.h:1518 ./include/linux/page_ref.h:238 ./include/linux/page_ref.h:247 ./include/linux/page_ref.h:280 ./include/linux/page_ref.h:313 mm/filemap.c:1980 mm/filemap.c:2218) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? filemap_get_folios_tag (mm/filemap.c:1968 mm/filemap.c:2218) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_iomap_do_writepage (fs/iomap/buffered-io.c:1963) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: writeback_iter (./include/linux/pagevec.h:91 mm/page-writeback.c:2421 mm/page-writeback.c:2520) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: write_cache_pages (mm/page-writeback.c:2568) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: iomap_writepages (fs/iomap/buffered-io.c:1984) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: xfs_vm_writepages (fs/xfs/xfs_aops.c:508) xfs
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: do_writepages (mm/page-writeback.c:2612) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? update_sd_lb_stats.constprop.0 (kernel/sched/fair.c:9902 (discriminator 2) kernel/sched/fair.c:10583 (discriminator 2)) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: __writeback_single_inode (fs/fs-writeback.c:1659) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: writeback_sb_inodes (fs/fs-writeback.c:1943) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: __writeback_inodes_wb (fs/fs-writeback.c:2013) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: wb_writeback (fs/fs-writeback.c:2119) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: wb_workfn (fs/fs-writeback.c:2277 (discriminator 1) fs/fs-writeback.c:2304 (discriminator 1)) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: process_one_work (kernel/workqueue.c:3254) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: worker_thread (kernel/workqueue.c:3329 (discriminator 2) kernel/workqueue.c:3416 (discriminator 2)) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? _raw_spin_lock_irqsave (./arch/x86/include/asm/atomic.h:115 (discriminator 4) ./include/linux/atomic/atomic-arch-fallback.h:2170 (discriminator 4) ./include/linux/atomic/atomic-instrumented.h:1302 (discriminator 4) ./include/asm-generic/qspinlock.h:111 (discriminator 4) ./include/linux/spinlock.h:187 (discriminator 4) ./include/linux/spinlock_api_smp.h:111 (discriminator 4) kernel/locking/spinlock.c:162 (discriminator 4)) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_worker_thread (kernel/workqueue.c:3362) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: kthread (kernel/kthread.c:388) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_kthread (kernel/kthread.c:341) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ret_from_fork (arch/x86/kernel/process.c:147) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ? __pfx_kthread (kernel/kthread.c:341) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel: ret_from_fork_asm (arch/x86/entry/entry_64.S:257) 
Apr 30 10:38:04 debian12-xfs-reflink-4k kernel:  </TASK>

The full decoded crash on v6.9-rc6:

https://gist.github.com/mcgrof/c44aaed21b99ae4ecf3d7fc6a1bb00bc

  Luis

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-04-30 19:27                   ` Luis Chamberlain
@ 2024-05-01  4:13                     ` Matthew Wilcox
  2024-05-01 14:28                       ` Matthew Wilcox
  0 siblings, 1 reply; 47+ messages in thread
From: Matthew Wilcox @ 2024-05-01  4:13 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Zi Yan, Andrew Morton, Vlastimil Babka, Sean Christopherson,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Tue, Apr 30, 2024 at 12:27:04PM -0700, Luis Chamberlain wrote:
>   2a:*	8b 43 34             	mov    0x34(%rbx),%eax		<-- trapping instruction
> RBX: 0000000000000002 RCX: 0000000000018000

Thanks, got it.  I'll send a patch in the morning, but I know exactly
what the problem is.  You're seeing sibling entries tagged as dirty.
That shouldn't happen; we should only see folios tagged as dirty.
The bug is in node_set_marks() which calls node_mark_all().  This works
fine when splitting to order 0, but we should only mark the first entry
of each order.  E.g., if we split to order 3, we should tag slots 0, 8,
16, 24, ..., 56.
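
In code terms, the marking step after a split should stride by the new
entry size instead of calling node_mark_all() (a sketch of the idea
only, not the actual fix):

	/* Only the canonical (first) slot of each new-order entry may
	 * carry the mark; the sibling slots after it must stay clear. */
	unsigned int i;

	for (i = 0; i < XA_CHUNK_SIZE; i += 1U << new_order)
		node_set_mark(node, i, mark);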

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement
  2024-05-01  4:13                     ` Matthew Wilcox
@ 2024-05-01 14:28                       ` Matthew Wilcox
  0 siblings, 0 replies; 47+ messages in thread
From: Matthew Wilcox @ 2024-05-01 14:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Zi Yan, Andrew Morton, Vlastimil Babka, Sean Christopherson,
	Pankaj Raghav (Samsung),
	djwong, brauner, david, chandan.babu, linux-fsdevel, hare,
	linux-kernel, linux-mm, linux-xfs, gost.dev, p.raghav

On Wed, May 01, 2024 at 05:13:59AM +0100, Matthew Wilcox wrote:
> On Tue, Apr 30, 2024 at 12:27:04PM -0700, Luis Chamberlain wrote:
> >   2a:*	8b 43 34             	mov    0x34(%rbx),%eax		<-- trapping instruction
> > RBX: 0000000000000002 RCX: 0000000000018000
> 
> Thanks, got it.  I'll send a patch in the morning, but I know exactly
> what the problem is.  You're seeing sibling entries tagged as dirty.
> That shouldn't happen; we should only see folios tagged as dirty.
> The bug is in node_set_marks() which calls node_mark_all().  This works
> fine when splitting to order 0, but we should only mark the first entry
> of each order.  E.g., if we split to order 3, we should tag slots 0, 8,
> 16, 24, ..., 56.

Confirmed:

+++ b/lib/test_xarray.c
@@ -1789,8 +1789,10 @@ static void check_split_1(struct xarray *xa, unsigned long index,
 {
        XA_STATE_ORDER(xas, xa, index, new_order);
        unsigned int i;
+       void *entry;

        xa_store_order(xa, index, order, xa, GFP_KERNEL);
+       xa_set_mark(xa, index, XA_MARK_1);

        xas_split_alloc(&xas, xa, order, GFP_KERNEL);
        xas_lock(&xas);
@@ -1807,6 +1809,12 @@ static void check_split_1(struct xarray *xa, unsigned long index,
        xa_set_mark(xa, index, XA_MARK_0);
        XA_BUG_ON(xa, !xa_get_mark(xa, index, XA_MARK_0));

+       xas_set_order(&xas, index, 0);
+       rcu_read_lock();
+       xas_for_each_marked(&xas, entry, ULONG_MAX, XA_MARK_1)
+               XA_BUG_ON(xa, xa_is_internal(entry));
+       rcu_read_unlock();
+
        xa_destroy(xa);
 }


spits out:

$ ./tools/testing/radix-tree/xarray
BUG at check_split_1:1815
xarray: 0x562b4043e580x head 0x50c0095cc082x flags 3000000 marks 1 1 0
0-63: node 0x50c0095cc080x max 0 parent (nil)x shift 3 count 1 values 0 array 0x562b4043e580x list 0x50c0095cc098x 0x50c0095cc098x marks 1 1 0
0-7: node 0x50c0095cc140x offset 0 parent 0x50c0095cc080x shift 0 count 8 values 4 array 0x562b4043e580x list 0x50c0095cc158x 0x50c0095cc158x marks 1 ff 0
0: value 0 (0x0) [0x1x]
1: sibling (slot 0)
2: value 2 (0x2) [0x5x]
3: sibling (slot 2)
4: value 4 (0x4) [0x9x]
5: sibling (slot 4)
6: value 6 (0x6) [0xdx]
7: sibling (slot 6)
xarray: ../../../lib/test_xarray.c:1815: check_split_1: Assertion `0' failed.
Aborted



^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2024-05-01 14:28 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-25 11:37 [PATCH v4 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 01/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-04-25 18:07   ` Hannes Reinecke
2024-04-26 15:09   ` Darrick J. Wong
2024-04-25 11:37 ` [PATCH v4 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-04-25 19:04   ` Hannes Reinecke
2024-04-26 15:12   ` Darrick J. Wong
2024-04-28 20:59     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 04/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-04-25 18:53   ` Matthew Wilcox
2024-04-25 11:37 ` [PATCH v4 05/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
2024-04-25 20:10   ` Matthew Wilcox
2024-04-26  0:47     ` Luis Chamberlain
2024-04-26 23:46       ` Luis Chamberlain
2024-04-28  0:57         ` Luis Chamberlain
2024-04-29  3:56           ` Luis Chamberlain
2024-04-29 14:29             ` Zi Yan
2024-04-30  0:31               ` Luis Chamberlain
2024-04-30  0:49                 ` Luis Chamberlain
2024-04-30  2:43                 ` Zi Yan
2024-04-30 19:27                   ` Luis Chamberlain
2024-05-01  4:13                     ` Matthew Wilcox
2024-05-01 14:28                       ` Matthew Wilcox
2024-04-26 15:49     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 06/11] filemap: cap PTE range to be created to i_size in folio_map_range() Pankaj Raghav (Samsung)
2024-04-25 20:24   ` Matthew Wilcox
2024-04-26 12:54     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 07/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-04-26  6:22   ` Christoph Hellwig
2024-04-26 11:43     ` Pankaj Raghav (Samsung)
2024-04-27  5:12       ` Christoph Hellwig
2024-04-29 21:02         ` Pankaj Raghav (Samsung)
2024-04-27  3:26     ` Matthew Wilcox
2024-04-27  4:52       ` Christoph Hellwig
2024-04-25 11:37 ` [PATCH v4 08/11] xfs: use kvmalloc for xattr buffers Pankaj Raghav (Samsung)
2024-04-26 15:18   ` Darrick J. Wong
2024-04-28 21:06     ` Pankaj Raghav (Samsung)
2024-04-25 11:37 ` [PATCH v4 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-04-26 15:15   ` Darrick J. Wong
2024-04-25 11:37 ` [PATCH v4 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-04-26 15:16   ` Darrick J. Wong
2024-04-25 11:37 ` [PATCH v4 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-04-26 15:18   ` Darrick J. Wong
2024-04-27  4:42 ` [PATCH v4 00/11] enable bs > ps in XFS Ritesh Harjani
2024-04-27  5:05   ` Darrick J. Wong
2024-04-29 20:39   ` Pankaj Raghav (Samsung)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).