* [PATCH v3 00/11] enable bs > ps in XFS
@ 2024-03-13 17:02 Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 01/11] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
                   ` (11 more replies)
  0 siblings, 12 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

This is the third version of the series that enables block size > page size
(Large Block Size) in XFS. The context and motivation can be seen in the cover
letter of RFC v1 [1]. We also recorded a talk about this effort at LPC [3]
for anyone who would like more context.

A lot of emphasis has been put on testing using kdevops. The testing has
been split into regression and progression testing.

Regression testing:
In regression testing, we ran the whole test suite to check for
*regressions on existing profiles due to the page cache changes.

No regression was found with the patches added on top.

*The baseline for regression testing was created using a SOAK_DURATION of
2.5 hours and about 7-8 XFS test clusters, running fstests in a loop over
70 times. We then scraped for critical failures (crashes, XFS or page
cache asserts, or hung tasks) and have reported these to the community
as well [4].

Progression testing:
For progression testing, we tested 8k, 16k, 32k and 64k block sizes.
To compare against existing support, an ARM VM with a 64k base page size
(without our patches) was used as a reference to separate failures caused
by LBS support on a 4k base page size system from pre-existing failures.

There are some common failures upstream for bs=64k that need to be
fixed [5].
There are also some tests that assume block size < page size and need to
be fixed. I have a tree with fixes for xfstests here [6], which I will
send to the list soon.

No new failures were found with the LBS support.

We've done some preliminary performance tests with fio on XFS with a 4k block
size against pmem and NVMe, with buffered IO and Direct IO, comparing vanilla
v6.8-rc4 against v6.8-rc4 with these patches applied, and detected no
regressions.

We also wrote an eBPF tool called blkalgn [7] to check whether IO sent to the
device is aligned and at least the filesystem block size in length.

Git tree:
https://github.com/linux-kdevops/linux/tree/large-block-minorder-6.8

[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[2] https://lore.kernel.org/linux-xfs/20240213093713.1753368-1-kernel@pankajraghav.com/
[3] https://www.youtube.com/watch?v=ar72r5Xf7x4
[4] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung
tasks and 2 memory management asserts.
[5] https://lore.kernel.org/linux-xfs/fe7fec1c-3b08-430f-9c95-ea76b237acf4@samsung.com/
[6] https://github.com/Panky-codes/xfstests/tree/lbs-fixes
[7] https://github.com/iovisor/bcc/pull/4813

Changes since v2:
- Simplified the filemap and readahead changes. (Thanks willy)
- Removed DEFINE_READAHEAD_ALIGN.
- Added minorder support to readahead_expand().

Changes since v1:
- Round up to nearest min nr pages in ra_init
- Calculate index in filemap_create instead of doing in
  filemap_get_pages
- Remove unnecessary BUG_ONs in the delete path
- Use check_shl_overflow instead of check_mul_overflow
- Cast to uint32_t instead of unsigned long in xfs_stat_blksize

Changes since RFC v2:
- Move order 1 patch above the 1st patch
- Remove order == 1 conditional in `fs: Allow fine-grained control of
folio sizes`. This fixed generic/630 that was reported in the previous version.
- Hide the max order and expose `mapping_set_folio_min_order` instead.
- Add new helper mapping_start_index_align and DEFINE_READAHEAD_ALIGN
- don't call `page_cache_ra_order` with min order in do_mmap_sync_readahead
- simplify ondemand readahead with only aligning the start index at the end
- Don't cap ra_pages based on bdi->io_pages
- use `check_mul_overflow` while calculating bytes in validate_fsb
- Remove config lbs option
- Add a warning while mounting a LBS kernel
- Add Acked-by and Reviewed-by from Hannes and Darrick.

Changes since RFC v1:
- Added willy's patch to enable order-1 folios.
- Unified common page cache effort from Hannes LBS work.
- Added a new helper min_nrpages and added CONFIG_THP for enabling mapping_large_folio_support
- Don't split a folio if it has minorder set. Remove the old code where we set extra pins if it has that requirement.
- Split the code in XFS between the validation of mapping count. Put the icache code changes with enabling bs > ps.
- Added CONFIG_XFS_LBS option
- align the index in do_read_cache_folio()
- Removed truncate changes
- Fixed generic/091 with iomap changes to iomap_dio_zero function.
- Took care of folio truncation scenario in page_cache_ra_unbounded() that happens after read_pages if a folio was found.
- Squashed and moved commits around
- Rebased on top of v6.8-rc4

Hannes Reinecke (1):
  readahead: rework loop in page_cache_ra_unbounded()

Luis Chamberlain (2):
  filemap: allocate mapping_min_order folios in the page cache
  readahead: round up file_ra_state->ra_pages to mapping_min_nrpages

Matthew Wilcox (Oracle) (2):
  mm: Support order-1 folios in the page cache
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (6):
  readahead: allocate folios with mapping_min_order in readahead
  mm: do not split a folio if it has minimum folio order requirement
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: expose block size in stat
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: enable block size larger than page size support

 fs/iomap/direct-io.c       |  13 ++++-
 fs/xfs/libxfs/xfs_ialloc.c |   5 ++
 fs/xfs/libxfs/xfs_shared.h |   3 ++
 fs/xfs/xfs_icache.c        |   6 ++-
 fs/xfs/xfs_iops.c          |   2 +-
 fs/xfs/xfs_mount.c         |  10 +++-
 fs/xfs/xfs_super.c         |  10 +---
 include/linux/huge_mm.h    |   7 ++-
 include/linux/pagemap.h    | 100 ++++++++++++++++++++++++++++--------
 mm/filemap.c               |  26 ++++++----
 mm/huge_memory.c           |  36 +++++++++++--
 mm/internal.h              |   4 +-
 mm/readahead.c             | 101 +++++++++++++++++++++++++++++--------
 13 files changed, 247 insertions(+), 76 deletions(-)


base-commit: e8f897f4afef0031fe618a8e94127a0934896aba
-- 
2.43.0



* [PATCH v3 01/11] mm: Support order-1 folios in the page cache
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Folios of order 1 have no space to store the deferred list.  This is
not a problem for the page cache as file-backed folios are never
placed on the deferred list.  All we need to do is prevent the core
MM from touching the deferred list for order 1 folios and remove the
code which prevented us from allocating order 1 folios.

Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h |  7 +++++--
 mm/filemap.c            |  2 --
 mm/huge_memory.c        | 23 ++++++++++++++++++-----
 mm/internal.h           |  4 +---
 mm/readahead.c          |  3 ---
 5 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..916a2a539517 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -263,7 +263,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
-void folio_prep_large_rmappable(struct folio *folio);
+struct folio *folio_prep_large_rmappable(struct folio *folio);
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
@@ -410,7 +410,10 @@ static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return 0;
 }
 
-static inline void folio_prep_large_rmappable(struct folio *folio) {}
+static inline struct folio *folio_prep_large_rmappable(struct folio *folio)
+{
+	return folio;
+}
 
 #define transparent_hugepage_flags 0UL
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 4a30de98a8c7..a1cb3ea55fb6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1912,8 +1912,6 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order == 1)
-				order = 0;
 			if (order > 0)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
 			folio = filemap_alloc_folio(alloc_gfp, order);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..81fd1ba57088 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -788,11 +788,15 @@ struct deferred_split *get_deferred_split_queue(struct folio *folio)
 }
 #endif
 
-void folio_prep_large_rmappable(struct folio *folio)
+struct folio *folio_prep_large_rmappable(struct folio *folio)
 {
-	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
-	INIT_LIST_HEAD(&folio->_deferred_list);
+	if (!folio || !folio_test_large(folio))
+		return folio;
+	if (folio_order(folio) > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
+
+	return folio;
 }
 
 static inline bool is_transparent_hugepage(struct folio *folio)
@@ -3082,7 +3086,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
-		if (!list_empty(&folio->_deferred_list)) {
+		if (folio_order(folio) > 1 &&
+		    !list_empty(&folio->_deferred_list)) {
 			ds_queue->split_queue_len--;
 			list_del(&folio->_deferred_list);
 		}
@@ -3133,6 +3138,9 @@ void folio_undo_large_rmappable(struct folio *folio)
 	struct deferred_split *ds_queue;
 	unsigned long flags;
 
+	if (folio_order(folio) <= 1)
+		return;
+
 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
@@ -3158,7 +3166,12 @@ void deferred_split_folio(struct folio *folio)
 #endif
 	unsigned long flags;
 
-	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
+	/*
+	 * Order 1 folios have no space for a deferred list, but we also
+	 * won't waste much memory by not adding them to the deferred list.
+	 */
+	if (folio_order(folio) <= 1)
+		return;
 
 	/*
 	 * The try_to_unmap() in page reclaim path might reach here too,
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..5174b5b0c344 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -419,9 +419,7 @@ static inline struct folio *page_rmappable_folio(struct page *page)
 {
 	struct folio *folio = (struct folio *)page;
 
-	if (folio && folio_order(folio) > 1)
-		folio_prep_large_rmappable(folio);
-	return folio;
+	return folio_prep_large_rmappable(folio);
 }
 
 static inline void prep_compound_head(struct page *page, unsigned int order)
diff --git a/mm/readahead.c b/mm/readahead.c
index 2648ec4f0494..369c70e2be42 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -516,9 +516,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		/* Don't allocate pages past EOF */
 		while (index + (1UL << order) - 1 > limit)
 			order--;
-		/* THP machinery does not support order-1 */
-		if (order == 1)
-			order = 0;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
 			break;
-- 
2.43.0



* [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 01/11] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-25 18:29   ` Matthew Wilcox
  2024-03-13 17:02 ` [PATCH v3 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Some filesystems want to be able to ensure that folios that are added to
the page cache are at least a certain size.
Add mapping_set_folio_min_order() to allow this level of control.
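
As a rough illustration of the intended usage (not part of this patch;
foo_setup_inode() and blocksize_bits are made-up placeholders), a filesystem
with a block size larger than the page size would call this from its inode
initialisation path:

	static void foo_setup_inode(struct inode *inode, unsigned int blocksize_bits)
	{
		/* e.g. blocksize_bits = 14 (16k), PAGE_SHIFT = 12 (4k) => order 2 */
		unsigned int min_order = 0;

		if (blocksize_bits > PAGE_SHIFT)
			min_order = blocksize_bits - PAGE_SHIFT;
		mapping_set_folio_min_order(inode->i_mapping, min_order);
	}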

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/pagemap.h | 100 ++++++++++++++++++++++++++++++++--------
 1 file changed, 80 insertions(+), 20 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..fc8eb9c94e9c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,13 +202,18 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13, /* Bit 8-17 are used for FOLIO_ORDER */
+	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -344,9 +349,47 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
+/*
+ * mapping_set_folio_min_order() - Set the minimum folio order
+ * @mapping: The address_space.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which base size of folio the VFS can use to cache the contents
+ * of the file.  This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_min_order(struct address_space *mapping,
+					       unsigned int min)
+{
+	if (min > MAX_PAGECACHE_ORDER)
+		min = MAX_PAGECACHE_ORDER;
+
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			 (min << AS_FOLIO_ORDER_MIN) |
+			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
- * @mapping: The file.
+ * @mapping: The address_space.
  *
  * The filesystem should call this function in its inode constructor to
  * indicate that the VFS can use large folios to cache the contents of
@@ -357,7 +400,37 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_min_order(mapping, 0);
+}
+
+static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}
+
+static inline unsigned long mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1UL << mapping_min_folio_order(mapping);
+}
+
+/**
+ * mapping_align_start_index() - Align starting index based on the min
+ * folio order of the page cache.
+ * @mapping: The address_space.
+ *
+ * Ensure the index used is aligned to the minimum folio order when adding
+ * new folios to the page cache by rounding down to the nearest minimum
+ * folio number of pages.
+ */
+static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
+						pgoff_t index)
+{
+	return round_down(index, mapping_min_folio_nrpages(mapping));
 }
 
 /*
@@ -367,7 +440,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
 	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	       (mapping_max_folio_order(mapping) > 0);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
@@ -528,19 +601,6 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
-- 
2.43.0



* [PATCH v3 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 01/11] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-15 13:21   ` Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

filemap_create_folio() and do_read_cache_folio() were always allocating
folios of order 0. __filemap_get_folio() was trying to allocate higher
order folios when fgp_flags had a higher order hint set, but it would fall
back to an order-0 folio if the higher order memory allocation failed.

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. When adding new
folios to the page cache we must also ensure the index used is aligned to
the mapping_min_order as the page cache requires the index to be aligned
to the order of the folio.
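
A rough worked example of the alignment (the numbers are illustrative, not
from this patch): with min_order = 2, i.e. 16k folios on a 4k page system,
a read at pos = 0x5000 maps to page index 5, which must be rounded down
before the folio is added to the page cache:

	index = pos >> PAGE_SHIFT;			/* 0x5000 >> 12 == 5 */
	index = round_down(index, 1UL << min_order);	/* 5 -> 4 */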

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/filemap.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index a1cb3ea55fb6..57889f206829 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -849,6 +849,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
+			folio);
 	mapping_set_update(&xas, mapping);
 
 	if (!huge) {
@@ -1886,8 +1888,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int min_order = mapping_min_folio_order(mapping);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
+		index = mapping_align_start_index(mapping, index);
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
 			gfp |= __GFP_WRITE;
@@ -1927,7 +1931,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2416,13 +2420,16 @@ static int filemap_update_page(struct kiocb *iocb,
 }
 
 static int filemap_create_folio(struct file *file,
-		struct address_space *mapping, pgoff_t index,
+		struct address_space *mapping, loff_t pos,
 		struct folio_batch *fbatch)
 {
 	struct folio *folio;
 	int error;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	pgoff_t index;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    min_order);
 	if (!folio)
 		return -ENOMEM;
 
@@ -2440,6 +2447,8 @@ static int filemap_create_folio(struct file *file,
 	 * well to keep locking rules simple.
 	 */
 	filemap_invalidate_lock_shared(mapping);
+	/* index in PAGE units but aligned to min_order number of pages. */
+	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
 	error = filemap_add_folio(mapping, folio, index,
 			mapping_gfp_constraint(mapping, GFP_KERNEL));
 	if (error == -EEXIST)
@@ -2500,8 +2509,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3662,9 +3670,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
+		index = mapping_align_start_index(mapping, index);
 		err = filemap_add_folio(mapping, folio, index, gfp);
 		if (unlikely(err)) {
 			folio_put(folio);
-- 
2.43.0



* [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-25 18:41   ` Matthew Wilcox
  2024-03-13 17:02 ` [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Hannes Reinecke <hare@suse.de>

Rework the loop in page_cache_ra_unbounded() to advance with
the number of pages in a folio instead of just one page at a time.
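
A rough illustration of the first hunk (the folio size is made up): if a
4-page folio is already present at the current position, the batch is
submitted and the iterator skips the whole folio,

	ractl->_index += folio_nr_pages(folio);		/* e.g. += 4, previously ++ */
	i = ractl->_index + ractl->_nr_pages - index;	/* no "- 1": the loop no longer does i++ */

rather than revisiting the same folio one page at a time.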

Signed-off-by: Hannes Reinecke <hare@suse.de>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
 mm/readahead.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 369c70e2be42..37b938f4b54f 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -208,7 +208,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -226,7 +226,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
@@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			folio_put(folio);
 			read_pages(ractl);
 			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 
 	/*
-- 
2.43.0



* [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-25 19:00   ` Matthew Wilcox
  2024-03-13 17:02 ` [PATCH v3 06/11] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

page_cache_ra_unbounded() was allocating single pages (order-0 folios)
if no folio was found at an index. Allocate mapping_min_order folios
instead, as we need to guarantee the minimum order if it is set.
When read_pages() is triggered and a page is already present, check
for truncation and move ractl->_index by mapping_min_nrpages if that
folio was truncated. This is done to keep the alignment
requirement while adding a folio to the page cache.

page_cache_ra_order() tries to allocate folios with a higher order if
the index aligns with that order. Modify it so that the order does not
go below the min_order requirement of the page cache.

When adding new folios to the page cache we must also ensure the index
used is aligned to the mapping_min_order as the page cache requires the
index to be aligned to the order of the folio.

readahead_expand() is called from readahead aops to extend the range of
the readahead, so this function can assume ractl->_index is already
aligned to min_order.
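
A rough worked example of the index/mark adjustment at the top of
page_cache_ra_unbounded() (the numbers are illustrative): with
min_nrpages = 4, index = 6, nr_to_read = 8 and lookahead_size = 2:

	index = round_down(6, 4);			/* 4 */
	ra_folio_index = round_up(6 + 8 - 2, 4);	/* 12 */
	mark = ra_folio_index - index;			/* 8: folio at i == 8 is marked for readahead */
	nr_to_read += 6 - 4;				/* 10, so the original range is still covered */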

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 86 ++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 14 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 37b938f4b54f..650834c033f0 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -206,9 +206,10 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		unsigned long nr_to_read, unsigned long lookahead_size)
 {
 	struct address_space *mapping = ractl->mapping;
-	unsigned long index = readahead_index(ractl);
+	unsigned long index = readahead_index(ractl), ra_folio_index;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i = 0;
+	unsigned long i = 0, mark;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -223,6 +224,22 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned int nofs = memalloc_nofs_save();
 
 	filemap_invalidate_lock_shared(mapping);
+	index = mapping_align_start_index(mapping, index);
+
+	/*
+	 * As iterator `i` is aligned to min_nrpages, round_up the
+	 * difference between nr_to_read and lookahead_size to mark the
+	 * index that only has lookahead or "async_region" to set the
+	 * readahead flag.
+	 */
+	ra_folio_index = round_up(readahead_index(ractl) + nr_to_read - lookahead_size,
+				  min_nrpages);
+	mark = ra_folio_index - index;
+	if (index != readahead_index(ractl)) {
+		nr_to_read += readahead_index(ractl) - index;
+		ractl->_index = index;
+	}
+
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
@@ -230,6 +247,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
+
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -239,23 +258,35 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_pages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
-		if (i == nr_to_read - lookahead_size)
+		if (i == mark)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
 		ractl->_nr_pages += folio_nr_pages(folio);
@@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
 	gfp_t gfp = readahead_gfp_mask(mapping);
+	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
 
-	if (!mapping_large_folio_support(mapping) || ra->size < 4)
+	/*
+	 * Fallback when size < min_nrpages as each folio should be
+	 * at least min_nrpages anyway.
+	 */
+	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
 		goto fallback;
 
 	limit = min(limit, index + ra->size - 1);
@@ -505,9 +542,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
 			new_order = MAX_PAGECACHE_ORDER;
 		while ((1 << new_order) > ra->size)
 			new_order--;
+		if (new_order < min_order)
+			new_order = min_order;
 	}
 
 	filemap_invalidate_lock_shared(mapping);
+	/*
+	 * If the new_order is greater than min_order and index is
+	 * already aligned to new_order, then this will be noop as index
+	 * aligned to new_order should also be aligned to min_order.
+	 */
+	ractl->_index = mapping_align_start_index(mapping, index);
+	index = readahead_index(ractl);
+
 	while (index <= limit) {
 		unsigned int order = new_order;
 
@@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
@@ -778,8 +825,15 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	new_index = new_start / PAGE_SIZE;
+	/*
+	 * Readahead code should have aligned the ractl->_index to
+	 * min_nrpages before calling readahead aops.
+	 */
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
@@ -789,9 +843,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -801,7 +857,7 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		ractl->_index = folio->index;
 	}
 
@@ -816,9 +872,11 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, min_order);
 		if (!folio)
 			return;
+
+		index = mapping_align_start_index(mapping, index);
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
 			folio_put(folio);
 			return;
@@ -828,10 +886,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += min_nrpages;
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += min_nrpages;
+			ra->async_size += min_nrpages;
 		}
 	}
 }
-- 
2.43.0



* [PATCH v3 06/11] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Luis Chamberlain <mcgrof@kernel.org>

As we will be adding multiples of mapping_min_nrpages to the page cache,
initialize file_ra_state->ra_pages with bdi->ra_pages rounded up to the
nearest multiple of mapping_min_nrpages.
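
A rough worked example (illustrative values): with 16k folios on a 4k page
system (mapping_min_nrpages = 4) and an admin-configured read_ahead_kb of 60
(bdi->ra_pages = 15), the initial window becomes round_up(15, 4) = 16 pages.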

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 650834c033f0..50194fddecf1 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -138,7 +138,8 @@
 void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
-	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
+	ra->ra_pages = round_up(inode_to_bdi(mapping->host)->ra_pages,
+				mapping_min_folio_nrpages(mapping));
 	ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
-- 
2.43.0



* [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 06/11] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-25 19:06   ` Matthew Wilcox
  2024-03-13 17:02 ` [PATCH v3 08/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

As we don't have a way to split a folio to any given lower folio
order yet, avoid splitting the folio in split_huge_page_to_list() if it
has a minimum folio order requirement.
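
For example (hypothetical numbers), an order-4 folio in a mapping with
mapping_min_folio_order() == 2 cannot currently be split, since splitting
always produces order-0 pages; split_huge_page_to_list() now bails out
with -EAGAIN for such mappings instead.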

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 mm/huge_memory.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 81fd1ba57088..6ec3417638a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3030,6 +3030,19 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			goto out;
 		}
 
+		/*
+		 * Do not split if mapping has minimum folio order
+		 * requirement.
+		 *
+		 * XXX: Once we have support for splitting to any lower
+		 * folio order, then it could be split based on the
+		 * min_folio_order.
+		 */
+		if (mapping_min_folio_order(mapping)) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
-- 
2.43.0



* [PATCH v3 08/11] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
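
A rough worked example (illustrative sizes): with a 16k block size on a 4k
page system, a 4k direct write into an unwritten block leaves len = 12k of
padding. The old single __bio_add_page(bio, ZERO_PAGE(0), len, 0) call built
a 12k bio vector starting at the 4k zero page, so 8k of adjacent memory was
written to disk; the new loop instead adds the 4k zero page three times.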

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.43.0



* [PATCH v3 09/11] xfs: expose block size in stat
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 08/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-13 17:02 ` [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.
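
For example (illustrative numbers), on a filesystem with a 16k block size
and 4k pages, stat() now reports st_blksize = 16384 instead of PAGE_SIZE
(4096), so tools like fsx generate 16k-aligned ranges for operations such
as fallocate() insert range.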

This change is based on a patch originally from Dave Chinner.[1]

[1] https://lwn.net/ml/linux-fsdevel/20181107063127.3902-16-david@fromorbit.com/

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/xfs_iops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..7ee829f7d708 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -543,7 +543,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return max_t(uint32_t, PAGE_SIZE, mp->m_sb.sb_blocksize);
 }
 
 STATIC int
-- 
2.43.0



* [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-25 19:15   ` Matthew Wilcox
  2024-03-13 17:02 ` [PATCH v3 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  2024-03-25 19:19 ` [PATCH v3 00/11] enable bs > ps in XFS Matthew Wilcox
  11 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always greater than the blocklog,
make the calculation generic so that the maximum page cache index can be
calculated correctly for LBS.
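
A rough worked example (illustrative values): with a 64k block size
(sb_blocklog = 16) on a 4k page system (PAGE_SHIFT = 12), the old
expression nblocks >> (PAGE_SHIFT - sb_blocklog) would shift by a negative
amount. The generic form instead computes:

	max_bytes = nblocks << sb_blocklog;	/* checked for overflow */
	max_index = max_bytes >> PAGE_SHIFT;	/* compared against ULONG_MAX */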

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_mount.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index aabb25dc3efa..9cf800586da7 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
 {
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
+	uint64_t max_index;
+	uint64_t max_bytes;
+
+	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
+		return -EFBIG;
 
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	max_index = max_bytes >> PAGE_SHIFT;
+
+	if (max_index > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.43.0



* [PATCH v3 11/11] xfs: enable block size larger than page size support
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-03-13 17:02 ` Pankaj Raghav (Samsung)
  2024-03-25 19:19 ` [PATCH v3 00/11] enable bs > ps in XFS Matthew Wilcox
  11 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-13 17:02 UTC (permalink / raw)
  To: willy, linux-xfs, linux-fsdevel
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

The page cache now has the ability to enforce a minimum order when
allocating a folio, which is a prerequisite for adding support for
block size > page size.
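
For example (illustrative numbers), a 64k block size filesystem on a 4k
page system gets min_folio_order = sb_blocklog - PAGE_SHIFT = 16 - 12 = 4,
i.e. every page cache folio for its inodes is at least 64k.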

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/libxfs/xfs_ialloc.c |  5 +++++
 fs/xfs/libxfs/xfs_shared.h |  3 +++
 fs/xfs/xfs_icache.c        |  6 ++++--
 fs/xfs/xfs_mount.c         |  1 -
 fs/xfs/xfs_super.c         | 10 ++--------
 5 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 2361a22035b0..c040bd6271fd 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2892,6 +2892,11 @@ xfs_ialloc_setup_geometry(
 		igeo->ialloc_align = mp->m_dalign;
 	else
 		igeo->ialloc_align = 0;
+
+	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
+		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
+	else
+		igeo->min_folio_order = 0;
 }
 
 /* Compute the location of the root directory inode that is laid out by mkfs. */
diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 4220d3584c1b..67ed406e7a81 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -188,6 +188,9 @@ struct xfs_ino_geometry {
 	/* precomputed value for di_flags2 */
 	uint64_t	new_diflags2;
 
+	/* minimum folio order of a page cache allocation */
+	unsigned int	min_folio_order;
+
 };
 
 #endif /* __XFS_SHARED_H__ */
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index dba514a2c84d..a1857000e2cd 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -88,7 +88,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	mapping_set_folio_min_order(VFS_I(ip)->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -323,7 +324,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	mapping_set_folio_min_order(inode->i_mapping,
+				    M_IGEO(mp)->min_folio_order);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 9cf800586da7..a77e927807e5 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,7 +131,6 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 	uint64_t max_index;
 	uint64_t max_bytes;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 98401de832ee..4f5f4cb772d4 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1624,16 +1624,10 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
 		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
+"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
+			mp->m_sb.sb_blocksize);
 	}
 
 	/* Ensure this filesystem fits in the page cache limits */
-- 
2.43.0



* Re: [PATCH v3 03/11] filemap: allocate mapping_min_order folios in the page cache
  2024-03-13 17:02 ` [PATCH v3 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
@ 2024-03-15 13:21   ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-15 13:21 UTC (permalink / raw)
  To: willy
  Cc: gost.dev, chandan.babu, hare, mcgrof, djwong, linux-mm,
	linux-kernel, david, akpm, Pankaj Raghav, linux-xfs,
	linux-fsdevel

Hi willy,

> filemap_create_folio() and do_read_cache_folio() were always allocating
> folios of order 0. __filemap_get_folio() was trying to allocate higher
> order folios when fgp_flags had a higher order hint set, but it would fall
> back to an order-0 folio if the higher order memory allocation failed.
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. When adding new
> folios to the page cache we must also ensure the index used is aligned to
> the mapping_min_order as the page cache requires the index to be aligned
> to the order of the folio.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/filemap.c | 24 +++++++++++++++++-------
>  1 file changed, 17 insertions(+), 7 deletions(-)

Are the changes more in line with what you had in mind?

> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a1cb3ea55fb6..57889f206829 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -849,6 +849,8 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
> +			folio);
>  	mapping_set_update(&xas, mapping);
>  
>  	if (!huge) {
> @@ -1886,8 +1888,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> -		unsigned order = FGF_GET_ORDER(fgp_flags);
> +		unsigned int min_order = mapping_min_folio_order(mapping);
> +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>  		int err;
> +		index = mapping_align_start_index(mapping, index);
>  
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
>  			gfp |= __GFP_WRITE;
> @@ -1927,7 +1931,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  				break;
>  			folio_put(folio);
>  			folio = NULL;
> -		} while (order-- > 0);
> +		} while (order-- > min_order);
>  
>  		if (err == -EEXIST)
>  			goto repeat;
> @@ -2416,13 +2420,16 @@ static int filemap_update_page(struct kiocb *iocb,
>  }
>  
>  static int filemap_create_folio(struct file *file,
> -		struct address_space *mapping, pgoff_t index,
> +		struct address_space *mapping, loff_t pos,
>  		struct folio_batch *fbatch)
>  {
>  	struct folio *folio;
>  	int error;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	pgoff_t index;
>  
> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> +				    min_order);
>  	if (!folio)
>  		return -ENOMEM;
>  
> @@ -2440,6 +2447,8 @@ static int filemap_create_folio(struct file *file,
>  	 * well to keep locking rules simple.
>  	 */
>  	filemap_invalidate_lock_shared(mapping);
> +	/* index in PAGE units but aligned to min_order number of pages. */
> +	index = (pos >> (PAGE_SHIFT + min_order)) << min_order;
>  	error = filemap_add_folio(mapping, folio, index,
>  			mapping_gfp_constraint(mapping, GFP_KERNEL));
>  	if (error == -EEXIST)
> @@ -2500,8 +2509,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	if (!folio_batch_count(fbatch)) {
>  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>  			return -EAGAIN;
> -		err = filemap_create_folio(filp, mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> +		err = filemap_create_folio(filp, mapping, iocb->ki_pos, fbatch);
>  		if (err == AOP_TRUNCATED_PAGE)
>  			goto retry;
>  		return err;
> @@ -3662,9 +3670,11 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  repeat:
>  	folio = filemap_get_folio(mapping, index);
>  	if (IS_ERR(folio)) {
> -		folio = filemap_alloc_folio(gfp, 0);
> +		folio = filemap_alloc_folio(gfp,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			return ERR_PTR(-ENOMEM);
> +		index = mapping_align_start_index(mapping, index);
>  		err = filemap_add_folio(mapping, folio, index, gfp);
>  		if (unlikely(err)) {
>  			folio_put(folio);
> -- 
> 2.43.0
> 

-- 
Pankaj Raghav


* Re: [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes
  2024-03-13 17:02 ` [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-03-25 18:29   ` Matthew Wilcox
  2024-03-26  8:44     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2024-03-25 18:29 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Wed, Mar 13, 2024 at 06:02:44PM +0100, Pankaj Raghav (Samsung) wrote:
> +/*
> + * mapping_set_folio_min_order() - Set the minimum folio order
> + * @mapping: The address_space.
> + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + *
> + * The filesystem should call this function in its inode constructor to
> + * indicate which base size of folio the VFS can use to cache the contents
> + * of the file.  This should only be used if the filesystem needs special
> + * handling of folio sizes (ie there is something the core cannot know).
> + * Do not tune it based on, eg, i_size.
> + *
> + * Context: This should not be called while the inode is active as it
> + * is non-atomic.
> + */
> +static inline void mapping_set_folio_min_order(struct address_space *mapping,
> +					       unsigned int min)
> +{
> +	if (min > MAX_PAGECACHE_ORDER)
> +		min = MAX_PAGECACHE_ORDER;
> +
> +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> +			 (min << AS_FOLIO_ORDER_MIN) |
> +			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> +}

I was surprised when I read this, which indicates it might be surprising
for others too.  I think it at least needs a comment to say that the
maximum will be set to the MAX_PAGECACHE_ORDER, because I was expecting
it to set max == min.  I guess that isn't what XFS wants, but someone
doing this to, eg, ext4 is going to have an unpleasant surprise when
they call into block_read_full_folio() and overrun 'arr'.

I'm still not entirely convinced this wouldn't be better to do as
mapping_set_folio_order_range() and have

static inline void mapping_set_folio_min_order(struct address_space *mapping,
		unsigned int min)
{
	mapping_set_folio_order_range(mapping, min, MAX_PAGECACHE_ORDER);
}
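
A minimal sketch of such a range helper, reusing the AS_FOLIO_ORDER_*
encoding from the quoted patch (illustrative only, not code from this
series), could look like:

static inline void mapping_set_folio_order_range(struct address_space *mapping,
		unsigned int min, unsigned int max)
{
	/* Clamp both ends to what the page cache can handle. */
	if (min > MAX_PAGECACHE_ORDER)
		min = MAX_PAGECACHE_ORDER;
	if (max > MAX_PAGECACHE_ORDER)
		max = MAX_PAGECACHE_ORDER;
	if (max < min)
		max = min;

	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
			 (min << AS_FOLIO_ORDER_MIN) |
			 (max << AS_FOLIO_ORDER_MAX);
}

With that, the limit checking lives in one place and the min-only helper
above stays a one-line wrapper.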


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-13 17:02 ` [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-03-25 18:41   ` Matthew Wilcox
  2024-03-26  8:56     ` Pankaj Raghav (Samsung)
                       ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Matthew Wilcox @ 2024-03-25 18:41 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index++;
> -			i = ractl->_index + ractl->_nr_pages - index - 1;
> +			ractl->_index += folio_nr_pages(folio);
> +			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
>  
> @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			folio_put(folio);
>  			read_pages(ractl);
>  			ractl->_index++;
> -			i = ractl->_index + ractl->_nr_pages - index - 1;
> +			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}

You changed index++ in the first hunk, but not the second hunk.  Is that
intentional?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead
  2024-03-13 17:02 ` [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
@ 2024-03-25 19:00   ` Matthew Wilcox
  2024-03-26 13:08     ` Pankaj Raghav (Samsung)
  2024-04-22 11:03     ` Pankaj Raghav (Samsung)
  0 siblings, 2 replies; 38+ messages in thread
From: Matthew Wilcox @ 2024-03-25 19:00 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Wed, Mar 13, 2024 at 06:02:47PM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> page_cache_ra_unbounded() was allocating single pages (0 order folios)
> if there was no folio found in an index. Allocate mapping_min_order folios
> as we need to guarantee the minimum order if it is set.
> When read_pages() is triggered and if a page is already present, check
> for truncation and move the ractl->_index by mapping_min_nrpages if that
> folio was truncated. This is done to ensure we keep the alignment
> requirement while adding a folio to the page cache.
> 
> page_cache_ra_order() tries to allocate folio to the page cache with a
> higher order if the index aligns with that order. Modify it so that the
> order does not go below the min_order requirement of the page cache.

This paragraph doesn't make sense.  We have an assertion that there's no
folio in the page cache with a lower order than the minimum, so this
seems to be describing a situation that can't happen.  Does it need to
be rephrased (because you're actually describing something else) or is
it just stale?

> @@ -239,23 +258,35 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index += folio_nr_pages(folio);
> +
> +			/*
> +			 * Move the ractl->_index by at least min_pages
> +			 * if the folio got truncated to respect the
> +			 * alignment constraint in the page cache.
> +			 *
> +			 */
> +			if (mapping != folio->mapping)
> +				nr_pages = min_nrpages;
> +
> +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> +			ractl->_index += nr_pages;
>  			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			break;
>  		if (filemap_add_folio(mapping, folio, index + i,
>  					gfp_mask) < 0) {
>  			folio_put(folio);
>  			read_pages(ractl);
> -			ractl->_index++;
> +			ractl->_index += min_nrpages;

Hah, you changed this here.  Please move into previous patch.

>  			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
> -		if (i == nr_to_read - lookahead_size)
> +		if (i == mark)
>  			folio_set_readahead(folio);
>  		ractl->_workingset |= folio_test_workingset(folio);
>  		ractl->_nr_pages += folio_nr_pages(folio);
> @@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  {
>  	struct address_space *mapping = ractl->mapping;
>  	pgoff_t index = readahead_index(ractl);
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
>  	pgoff_t mark = index + ra->size - ra->async_size;
>  	int err = 0;
>  	gfp_t gfp = readahead_gfp_mask(mapping);
> +	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
>  
> -	if (!mapping_large_folio_support(mapping) || ra->size < 4)
> +	/*
> +	 * Fallback when size < min_nrpages as each folio should be
> +	 * at least min_nrpages anyway.
> +	 */
> +	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
>  		goto fallback;
>  
>  	limit = min(limit, index + ra->size - 1);
> @@ -505,9 +542,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  			new_order = MAX_PAGECACHE_ORDER;
>  		while ((1 << new_order) > ra->size)
>  			new_order--;
> +		if (new_order < min_order)
> +			new_order = min_order;

I think these are the two lines you're describing in the paragraph that
doesn't make sense to me?  And if so, I think they're useless?

> @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  		if (index & ((1UL << order) - 1))
>  			order = __ffs(index);
>  		/* Don't allocate pages past EOF */
> -		while (index + (1UL << order) - 1 > limit)
> +		while (order > min_order && index + (1UL << order) - 1 > limit)
>  			order--;

This raises an interesting question that I don't know if we have a test
for.  POSIX says that if we mmap, let's say, the first 16kB of a 10kB
file, then we can store into offset 0-12287, but stores to offsets
12288-16383 get a signal (I forget if it's SEGV or BUS).  Thus far,
we've declined to even create folios in the page cache that would let us
create PTEs for offset 12288-16383, so I haven't paid too much attention
to this.  Now we're going to have folios that extend into that range, so
we need to be sure that when we mmap(), we only create PTEs that go as
far as 12287.

Can you check that we have such an fstest, and that we still pass it
with your patches applied and a suitably large block size?
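
For concreteness, the scenario boils down to something like this
userspace sketch (hand-written here, assuming 4k pages and the sizes
above; it is not an existing fstest):

#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf env;

static void bus_handler(int sig)
{
	siglongjmp(env, 1);
}

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	char *p;

	ftruncate(fd, 10 * 1024);	/* 10kB file */
	p = mmap(NULL, 16 * 1024, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	signal(SIGBUS, bus_handler);

	p[12287] = 'x';			/* last byte of the page containing EOF: fine */

	if (sigsetjmp(env, 1) == 0)
		p[12288] = 'x';		/* entirely beyond the EOF page: signal */
	else
		printf("got a signal at offset 12288\n");

	munmap(p, 16 * 1024);
	close(fd);
	return 0;
}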


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement
  2024-03-13 17:02 ` [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-03-25 19:06   ` Matthew Wilcox
  2024-03-26 16:10     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2024-03-25 19:06 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Wed, Mar 13, 2024 at 06:02:49PM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> As we don't have a way to split a folio to a any given lower folio
> order yet, avoid splitting the folio in split_huge_page_to_list() if it
> has a minimum folio order requirement.

FYI, Zi Yan's patch to do that is now in Andrew's tree.
c010d47f107f609b9f4d6a103b6dfc53889049e9 in current linux-next (dated
Feb 26)

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-03-13 17:02 ` [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-03-25 19:15   ` Matthew Wilcox
  2024-03-26  9:53     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2024-03-25 19:15 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Wed, Mar 13, 2024 at 06:02:52PM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> make the calculation generic so that page cache count can be calculated
> correctly for LBS.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/xfs/xfs_mount.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index aabb25dc3efa..9cf800586da7 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
>  {
>  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);

but ... you're still asserting that PAGE_SHIFT is larger than blocklog.
Shouldn't you delete that assertion?

>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> +	uint64_t max_index;
> +	uint64_t max_bytes;
> +
> +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> +		return -EFBIG;
>  
>  	/* Limited by ULONG_MAX of page cache index */
> -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> +	max_index = max_bytes >> PAGE_SHIFT;
> +
> +	if (max_index > ULONG_MAX)
>  		return -EFBIG;

This kind of depends on the implementation details of the page cache.
We have MAX_LFS_FILESIZE to abstract that; maybe that should be used
here?

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 00/11] enable bs > ps in XFS
  2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (10 preceding siblings ...)
  2024-03-13 17:02 ` [PATCH v3 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-03-25 19:19 ` Matthew Wilcox
  2024-03-26  9:53   ` Hannes Reinecke
  2024-03-26 14:54   ` Pankaj Raghav (Samsung)
  11 siblings, 2 replies; 38+ messages in thread
From: Matthew Wilcox @ 2024-03-25 19:19 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Wed, Mar 13, 2024 at 06:02:42PM +0100, Pankaj Raghav (Samsung) wrote:
> This is the third version of the series that enables block size > page size
> (Large Block Size) in XFS. The context and motivation can be seen in cover
> letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
> if someone would like more context on this effort.

Thank you.  This is a lot better.

I'm still trying to understand your opinion on the contents of the
file_ra_state.  Is it supposed to be properly aligned at all times, or
do we work with it in the terms of "desired number of pages" and then
force it to conform to the minimum-block-size reality right at the end?
Because you seem to be doing both at various points.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes
  2024-03-25 18:29   ` Matthew Wilcox
@ 2024-03-26  8:44     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26  8:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Mon, Mar 25, 2024 at 06:29:30PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:44PM +0100, Pankaj Raghav (Samsung) wrote:
> > +/*
> > + * mapping_set_folio_min_order() - Set the minimum folio order
> > + * @mapping: The address_space.
> > + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> > + *
> > + * The filesystem should call this function in its inode constructor to
> > + * indicate which base size of folio the VFS can use to cache the contents
> > + * of the file.  This should only be used if the filesystem needs special
> > + * handling of folio sizes (ie there is something the core cannot know).
> > + * Do not tune it based on, eg, i_size.
> > + *
> > + * Context: This should not be called while the inode is active as it
> > + * is non-atomic.
> > + */
> > +static inline void mapping_set_folio_min_order(struct address_space *mapping,
> > +					       unsigned int min)
> > +{
> > +	if (min > MAX_PAGECACHE_ORDER)
> > +		min = MAX_PAGECACHE_ORDER;
> > +
> > +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> > +			 (min << AS_FOLIO_ORDER_MIN) |
> > +			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> > +}
> 
> I was surprised when I read this, which indicates it might be surprising
> for others too.  I think it at least needs a comment to say that the
> maximum will be set to the MAX_PAGECACHE_ORDER, because I was expecting
> it to set max == min.  I guess that isn't what XFS wants, but someone
> doing this to, eg, ext4 is going to have an unpleasant surprise when
> they call into block_read_full_folio() and overrun 'arr'.
> 
> I'm still not entirely convinced this wouldn't be better to do as
> mapping_set_folio_order_range() and have
> 
> static inline void mapping_set_folio_min_order(struct address_space *mapping,
> 		unsigned int min)
> {
> 	mapping_set_folio_range(mapping, min, MAX_PAGECACHE_ORDER);
> }

I agree. Having a helper like this will make it more explicit. The
limits checking can also be done in this helper itself.

Also it makes mapping_set_large_folios() clearer:

static inline void mapping_set_large_folios(struct address_space *mapping)
{
      mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
}

instead of just calling mapping_set_folio_min_order(). Thanks.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-25 18:41   ` Matthew Wilcox
@ 2024-03-26  8:56     ` Pankaj Raghav (Samsung)
  2024-03-26  9:39     ` Hannes Reinecke
  2024-03-26 15:11     ` Pankaj Raghav (Samsung)
  2 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26  8:56 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Mon, Mar 25, 2024 at 06:41:01PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
> > @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  			 * not worth getting one just for that.
> >  			 */
> >  			read_pages(ractl);
> > -			ractl->_index++;
> > -			i = ractl->_index + ractl->_nr_pages - index - 1;
> > +			ractl->_index += folio_nr_pages(folio);
> > +			i = ractl->_index + ractl->_nr_pages - index;
> >  			continue;
> >  		}
> >  
> > @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  			folio_put(folio);
> >  			read_pages(ractl);
> >  			ractl->_index++;
> > -			i = ractl->_index + ractl->_nr_pages - index - 1;
> > +			i = ractl->_index + ractl->_nr_pages - index;
> >  			continue;
> >  		}
> 
> You changed index++ in the first hunk, but not the second hunk.  Is that
> intentional?

The reason I didn't use folio_nr_pages(folio) in the second hunk is
that we have already `put` the folio, so it is no longer valid to call
folio_nr_pages() on it, right? We take the reference in
filemap_alloc_folio() and drop it with folio_put() if the add fails.

Plus, in the second hunk, adding the order-0 folio failed at that index,
so we just move on to the next index. Once we have the min order
support, if adding a min-order folio fails, we move forward by min_order.

And your comment on the next patch:

> Hah, you changed this here.  Please move into previous patch.

We can't do that either because I am introducing the concept of min
order in the next patch.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-25 18:41   ` Matthew Wilcox
  2024-03-26  8:56     ` Pankaj Raghav (Samsung)
@ 2024-03-26  9:39     ` Hannes Reinecke
  2024-03-26  9:44       ` Pankaj Raghav
  2024-03-26 15:11     ` Pankaj Raghav (Samsung)
  2 siblings, 1 reply; 38+ messages in thread
From: Hannes Reinecke @ 2024-03-26  9:39 UTC (permalink / raw)
  To: Matthew Wilcox, Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On 3/25/24 19:41, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>   			 * not worth getting one just for that.
>>   			 */
>>   			read_pages(ractl);
>> -			ractl->_index++;
>> -			i = ractl->_index + ractl->_nr_pages - index - 1;
>> +			ractl->_index += folio_nr_pages(folio);
>> +			i = ractl->_index + ractl->_nr_pages - index;
>>   			continue;
>>   		}
>>   
>> @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>   			folio_put(folio);
>>   			read_pages(ractl);
>>   			ractl->_index++;
>> -			i = ractl->_index + ractl->_nr_pages - index - 1;
>> +			i = ractl->_index + ractl->_nr_pages - index;
>>   			continue;
>>   		}
> 
> You changed index++ in the first hunk, but not the second hunk.  Is that
> intentional?

Hmm. Looks like you are right; it should be modified, too.
Will fix it up.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-26  9:39     ` Hannes Reinecke
@ 2024-03-26  9:44       ` Pankaj Raghav
  2024-03-26 10:00         ` Hannes Reinecke
  0 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav @ 2024-03-26  9:44 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

Hi Hannes,

On 26/03/2024 10:39, Hannes Reinecke wrote:
> On 3/25/24 19:41, Matthew Wilcox wrote:
>> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
>>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>                * not worth getting one just for that.
>>>                */
>>>               read_pages(ractl);
>>> -            ractl->_index++;
>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>> +            ractl->_index += folio_nr_pages(folio);
>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>               continue;
>>>           }
>>>   @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>               folio_put(folio);
>>>               read_pages(ractl);
>>>               ractl->_index++;
>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>               continue;
>>>           }
>>
>> You changed index++ in the first hunk, but not the second hunk.  Is that
>> intentional?
> 
> Hmm. Looks you are right; it should be modified, too.
> Will be fixing it up.
> 
You initially had this in the second hunk as well:
ractl->_index += folio_nr_pages(folio);

and I changed it to what it is now.

The reason is in my reply to willy:
https://lore.kernel.org/linux-xfs/s4jn4t4betknd3y4ltfccqxyfktzdljiz7klgbqsrccmv3rwrd@orlwjz77oyxo/

Let me know if you agree with it.

> Cheers,
> 
> Hannes
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-03-25 19:15   ` Matthew Wilcox
@ 2024-03-26  9:53     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26  9:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Mon, Mar 25, 2024 at 07:15:59PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:52PM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> > make the calculation generic so that page cache count can be calculated
> > correctly for LBS.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  fs/xfs/xfs_mount.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index aabb25dc3efa..9cf800586da7 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -133,9 +133,16 @@ xfs_sb_validate_fsb_count(
> >  {
> >  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> 
> but ... you're still asserting that PAGE_SHIFT is larger than blocklog.
> Shouldn't you delete that assertion?

I do that in the next patch when we enable LBS support in XFS by setting
min order. So we still need the check here that will be removed in the
next patch.
> 
> >  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> > +	uint64_t max_index;
> > +	uint64_t max_bytes;
> > +
> > +	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
> > +		return -EFBIG;
> >  
> >  	/* Limited by ULONG_MAX of page cache index */
> > -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> > +	max_index = max_bytes >> PAGE_SHIFT;
> > +
> > +	if (max_index > ULONG_MAX)
> >  		return -EFBIG;
> 
> This kind of depends on the implementation details of the page cache.
> We have MAX_LFS_FILESIZE to abstract that; maybe that should be used
> here?

Makes sense. I will use it instead of using ULONG_MAX.
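
Something along these lines, presumably (a rough sketch of the reworked
check, not the final patch):

	uint64_t	max_bytes;

	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &max_bytes))
		return -EFBIG;

	/* Limited by the number of bytes the page cache can address */
	if (max_bytes > MAX_LFS_FILESIZE)
		return -EFBIG;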

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 00/11] enable bs > ps in XFS
  2024-03-25 19:19 ` [PATCH v3 00/11] enable bs > ps in XFS Matthew Wilcox
@ 2024-03-26  9:53   ` Hannes Reinecke
  2024-03-26 15:06     ` Pankaj Raghav
  2024-03-26 14:54   ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 38+ messages in thread
From: Hannes Reinecke @ 2024-03-26  9:53 UTC (permalink / raw)
  To: Matthew Wilcox, Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On 3/25/24 20:19, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:42PM +0100, Pankaj Raghav (Samsung) wrote:
>> This is the third version of the series that enables block size > page size
>> (Large Block Size) in XFS. The context and motivation can be seen in cover
>> letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
>> if someone would like more context on this effort.
> 
> Thank you.  This is a lot better.
> 
> I'm still trying to understand your opinion on the contents of the
> file_ra_state.  Is it supposed to be properly aligned at all times, or
> do we work with it in the terms of "desired number of pages" and then
> force it to conform to the minimum-block-size reality right at the end?
> Because you seem to be doing both at various points.

Guess what, that's what I had been pondering, too.
Each way has its benefits, I guess.

The question really is whether we keep the readahead iterator in units
of pages and convert the result, or modify the readahead iterator to
work on folios and convert the inputs.

Doesn't really matter much, but we need to decide. The former is 
probably easier on the caller, and the latter is easier on the consumer.
Take your pick; I really don't mind.

But we should document the result :-)

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-26  9:44       ` Pankaj Raghav
@ 2024-03-26 10:00         ` Hannes Reinecke
  2024-03-26 10:06           ` Pankaj Raghav
  0 siblings, 1 reply; 38+ messages in thread
From: Hannes Reinecke @ 2024-03-26 10:00 UTC (permalink / raw)
  To: Pankaj Raghav, Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On 3/26/24 10:44, Pankaj Raghav wrote:
> Hi Hannes,
> 
> On 26/03/2024 10:39, Hannes Reinecke wrote:
>> On 3/25/24 19:41, Matthew Wilcox wrote:
>>> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
>>>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>>                 * not worth getting one just for that.
>>>>                 */
>>>>                read_pages(ractl);
>>>> -            ractl->_index++;
>>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>>> +            ractl->_index += folio_nr_pages(folio);
>>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>>                continue;
>>>>            }
>>>>    @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>>                folio_put(folio);
>>>>                read_pages(ractl);
>>>>                ractl->_index++;
>>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>>                continue;
>>>>            }
>>>
>>> You changed index++ in the first hunk, but not the second hunk.  Is that
>>> intentional?
>>
>> Hmm. Looks you are right; it should be modified, too.
>> Will be fixing it up.
>>
> You initially had also in the second hunk:
> ractl->index += folio_nr_pages(folio);
> 
> and I changed it to what it is now.
> 
> The reason is in my reply to willy:
> https://lore.kernel.org/linux-xfs/s4jn4t4betknd3y4ltfccqxyfktzdljiz7klgbqsrccmv3rwrd@orlwjz77oyxo/
> 
> Let me know if you agree with it.
> 
Bah. That really is overly complicated. When we attempt a conversion 
that conversion should be stand-alone, not rely on some other patch 
modifications later on.
We definitely need to work on that to make it easier to review, even
without having to read the mail thread.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-26 10:00         ` Hannes Reinecke
@ 2024-03-26 10:06           ` Pankaj Raghav
  2024-03-26 10:55             ` Hannes Reinecke
  0 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav @ 2024-03-26 10:06 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On 26/03/2024 11:00, Hannes Reinecke wrote:
> On 3/26/24 10:44, Pankaj Raghav wrote:
>> Hi Hannes,
>>
>> On 26/03/2024 10:39, Hannes Reinecke wrote:
>>> On 3/25/24 19:41, Matthew Wilcox wrote:
>>>> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
>>>>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>>>                 * not worth getting one just for that.
>>>>>                 */
>>>>>                read_pages(ractl);
>>>>> -            ractl->_index++;
>>>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>>>> +            ractl->_index += folio_nr_pages(folio);
>>>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>>>                continue;
>>>>>            }
>>>>>    @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>>>                folio_put(folio);
>>>>>                read_pages(ractl);
>>>>>                ractl->_index++;
>>>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>>>                continue;
>>>>>            }
>>>>
>>>> You changed index++ in the first hunk, but not the second hunk.  Is that
>>>> intentional?
>>>
>>> Hmm. Looks you are right; it should be modified, too.
>>> Will be fixing it up.
>>>
>> You initially had also in the second hunk:
>> ractl->index += folio_nr_pages(folio);
>>
>> and I changed it to what it is now.
>>
>> The reason is in my reply to willy:
>> https://lore.kernel.org/linux-xfs/s4jn4t4betknd3y4ltfccqxyfktzdljiz7klgbqsrccmv3rwrd@orlwjz77oyxo/
>>
>> Let me know if you agree with it.
>>
> Bah. That really is overly complicated. When we attempt a conversion that conversion should be
> stand-alone, not rely on some other patch modifications later on.
> We definitely need to work on that to make it easier to review, even
> without having to read the mail thread.
> 

I don't understand what you mean by overly complicated. This conversion is standalone, and it is
wrong to use folio_nr_pages() after we `put` the folio. This patch just reworks the loop, and in the
next patch I add min order support to readahead.

This patch doesn't depend on the next patch.

> Cheers,
> 
> Hannes
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-26 10:06           ` Pankaj Raghav
@ 2024-03-26 10:55             ` Hannes Reinecke
  2024-03-26 13:41               ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 38+ messages in thread
From: Hannes Reinecke @ 2024-03-26 10:55 UTC (permalink / raw)
  To: Pankaj Raghav, Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On 3/26/24 11:06, Pankaj Raghav wrote:
> On 26/03/2024 11:00, Hannes Reinecke wrote:
>> On 3/26/24 10:44, Pankaj Raghav wrote:
>>> Hi Hannes,
>>>
>>> On 26/03/2024 10:39, Hannes Reinecke wrote:
>>>> On 3/25/24 19:41, Matthew Wilcox wrote:
>>>>> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
>>>>>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>>>>                  * not worth getting one just for that.
>>>>>>                  */
>>>>>>                 read_pages(ractl);
>>>>>> -            ractl->_index++;
>>>>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>>>>> +            ractl->_index += folio_nr_pages(folio);
>>>>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>>>>                 continue;
>>>>>>             }
>>>>>>     @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>>>>                 folio_put(folio);
>>>>>>                 read_pages(ractl);
>>>>>>                 ractl->_index++;
>>>>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>>>>> +            i = ractl->_index + ractl->_nr_pages - index;
>>>>>>                 continue;
>>>>>>             }
>>>>>
>>>>> You changed index++ in the first hunk, but not the second hunk.  Is that
>>>>> intentional?
>>>>
>>>> Hmm. Looks you are right; it should be modified, too.
>>>> Will be fixing it up.
>>>>
>>> You initially had also in the second hunk:
>>> ractl->index += folio_nr_pages(folio);
>>>
>>> and I changed it to what it is now.
>>>
>>> The reason is in my reply to willy:
>>> https://lore.kernel.org/linux-xfs/s4jn4t4betknd3y4ltfccqxyfktzdljiz7klgbqsrccmv3rwrd@orlwjz77oyxo/
>>>
>>> Let me know if you agree with it.
>>>
>> Bah. That really is overly complicated. When we attempt a conversion that conversion should be
>> stand-alone, not rely on some other patch modifications later on.
>> We definitely need to work on that to make it easier to review, even
>> without having to read the mail thread.
>>
> 
> I don't know understand what you mean by overly complicated. This conversion is standalone and it is
> wrong to use folio_nr_pages after we `put` the folio. This patch just reworks the loop and in the
> next patch I add min order support to readahead.
> 
> This patch doesn't depend on the next patch.
> 

Let me rephrase: what does 'ractl->_index' signify?
From my understanding it should be the index of the
first folio/page in ractl, right?

If so, I find it hard to understand how we _could_ increase it by one;
_index should _always_ be in units of the minimal pagemap size.
And if we don't have it here (as you suggested in the mail thread)
I'd rather move this patch _after_ the minimal pagesize is introduced
to ensure that _index is always incremented by the right amount.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead
  2024-03-25 19:00   ` Matthew Wilcox
@ 2024-03-26 13:08     ` Pankaj Raghav (Samsung)
  2024-04-22 11:03     ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26 13:08 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

> > 
> > page_cache_ra_unbounded() was allocating single pages (0 order folios)
> > if there was no folio found in an index. Allocate mapping_min_order folios
> > as we need to guarantee the minimum order if it is set.
> > When read_pages() is triggered and if a page is already present, check
> > for truncation and move the ractl->_index by mapping_min_nrpages if that
> > folio was truncated. This is done to ensure we keep the alignment
> > requirement while adding a folio to the page cache.
> > 
> > page_cache_ra_order() tries to allocate folio to the page cache with a
> > higher order if the index aligns with that order. Modify it so that the
> > order does not go below the min_order requirement of the page cache.
> 
> This paragraph doesn't make sense.  We have an assertion that there's no
> folio in the page cache with a lower order than the minimum, so this
> seems to be describing a situation that can't happen.  Does it need to
> be rephrased (because you're actually describing something else) or is
> it just stale?
> 
Hmm, maybe I need to explain better here.

> > @@ -489,12 +520,18 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  {
> >  	struct address_space *mapping = ractl->mapping;
> >  	pgoff_t index = readahead_index(ractl);
> > +	unsigned int min_order = mapping_min_folio_order(mapping);
> >  	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
> >  	pgoff_t mark = index + ra->size - ra->async_size;
> >  	int err = 0;
> >  	gfp_t gfp = readahead_gfp_mask(mapping);
> > +	unsigned int min_ra_size = max(4, mapping_min_folio_nrpages(mapping));
> >  
> > -	if (!mapping_large_folio_support(mapping) || ra->size < 4)

I had one question: I assume `4` was used here because we could not
support order-1 folios? As we now support order-1 folios, can this
simply fall back when ra->size < min_nrpages?

> > +	/*
> > +	 * Fallback when size < min_nrpages as each folio should be
> > +	 * at least min_nrpages anyway.
> > +	 */
> > +	if (!mapping_large_folio_support(mapping) || ra->size < min_ra_size)
> >  		goto fallback;
> >  
> >  	limit = min(limit, index + ra->size - 1);
> > @@ -505,9 +542,19 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  			new_order = MAX_PAGECACHE_ORDER;
> >  		while ((1 << new_order) > ra->size)
> >  			new_order--;
> > +		if (new_order < min_order)
> > +			new_order = min_order;
> 
> I think these are the two lines you're describing in the paragraph that
Yes. At least I tried to ;)

> doesn't make sense to me?  And if so, I think they're useless?

We need this because:
The caller might pass new_order = 0 (such as for mmap read-around),
but page_cache_ra_order() will still do the "right" thing by taking
care of the page cache alignment requirements. The previous revision
did this the other way around and it felt a bit all over the place.

One of the main changes I did for this revision is to adapt the
functions that allocate and add a folio so that they respect the min
order alignment, instead of doing it on the caller side.

The rationale for going this way is threefold:
- the callers of page_cache_ra_order() and page_cache_ra_unbounded()
  request a range for which we need to add folios and read them. They
  don't care about the folio size and alignment.

- file_ra_state also does not need any poking around because we will
  make sure to cover [ra->start, ra->start + ra->size) and update
  ra->size if we spilled over due to the min_order requirement (like it
  was done before).

- And this is also consistent with what we do in filemap, where we
  adjust the index before adding the folio to the page cache (a small
  sketch of that helper follows below).

Let me know if this approach makes sense.
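
For reference, the index alignment mentioned in the last point boils
down to something like this (a sketch of the mapping_align_start_index()
helper used in the filemap hunk quoted earlier):

static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
						pgoff_t index)
{
	/* Round the page index down to a min-order folio boundary. */
	return round_down(index, mapping_min_folio_nrpages(mapping));
}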
> 
> > @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  		if (index & ((1UL << order) - 1))
> >  			order = __ffs(index);
> >  		/* Don't allocate pages past EOF */
> > -		while (index + (1UL << order) - 1 > limit)
> > +		while (order > min_order && index + (1UL << order) - 1 > limit)
> >  			order--;
> 
> This raises an interesting question that I don't know if we have a test
> for.  POSIX says that if we mmap, let's say, the first 16kB of a 10kB
> file, then we can store into offset 0-12287, but stores to offsets
> 12288-16383 get a signal (I forget if it's SEGV or BUS).  Thus far,
> we've declined to even create folios in the page cache that would let us
> create PTEs for offset 12288-16383, so I haven't paid too much attention
> to this.  Now we're going to have folios that extend into that range, so
> we need to be sure that when we mmap(), we only create PTEs that go as
> far as 12287.
> 
> Can you check that we have such an fstest, and that we still pass it
> with your patches applied and a suitably large block size?

Hmmm, good point.
I think you mean this from the man page:

SIGBUS Attempted access to a page of the buffer that lies beyond the end
of the mapped file.  For an explanation of the treatment of the bytes in
the page that corresponds to the end of a mapped file that is not a 
multiple of the page size, see NOTES.

I will add a test case for this and see what happens. I am not sure
whether I need to make some changes or whether the current
implementation already holds.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-26 10:55             ` Hannes Reinecke
@ 2024-03-26 13:41               ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26 13:41 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, gost.dev, chandan.babu,
	mcgrof, djwong, linux-mm, linux-kernel, david, akpm,
	Pankaj Raghav

On Tue, Mar 26, 2024 at 11:55:06AM +0100, Hannes Reinecke wrote:
> > > Bah. That really is overly complicated. When we attempt a conversion that conversion should be
> > > stand-alone, not rely on some other patch modifications later on.
> > > We definitely need to work on that to make it easier to review, even
> > > without having to read the mail thread.
> > > 
> > 
> > I don't know understand what you mean by overly complicated. This conversion is standalone and it is
> > wrong to use folio_nr_pages after we `put` the folio. This patch just reworks the loop and in the
> > next patch I add min order support to readahead.
> > 
> > This patch doesn't depend on the next patch.
> > 
> 
> Let me rephrase: what does 'ractl->_index' signify?
> From my understanding it should be the index of the
> first folio/page in ractl, right?
> 
> If so I find it hard to understand how we _could_ increase it by one; _index
> should _always_ in units of the minimal pagemap size.

I still have not introduced the minimal pagemap size concept here. That
comes in the next patch. This patch only reworks the loop and should not
have any functional changes. So the minimal pagemap size unit here is 1.

And to your next question, how we could increase it by only one here:

// We come here if we didn't find any folio at index + i
...
folio = filemap_alloc_folio(gfp_mask, 0); // order 0 => 1 page
if (!folio)
	break;
if (filemap_add_folio(mapping, folio, index + i,
			gfp_mask) < 0) {
	folio_put(folio);
	read_pages(ractl);
	ractl->_index++;
	...

If we failed to add a folio of order 0 at (index + i), we put the folio
and start a read_pages() on whatever pages we added so far (ractl->_index to
ractl->_index + ractl->_nr_pages).

read_pages() updates ractl->_index to ractl->_index + ractl->_nr_pages, so
ractl->_index after read_pages() points to (index + i). As we had an
issue adding a folio of order 0, we skip that index by incrementing
ractl->_index by 1.
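
A small worked example of that bookkeeping (the numbers are made up for
illustration):

/* index = 100 at entry; five order-0 folios already added:          */
/*   ractl->_index = 100, ractl->_nr_pages = 5, i = 5 -> look at 105 */
/* filemap_add_folio() fails at 105:                                 */
/*   read_pages()    -> _index = 105, _nr_pages = 0 (100..104 read)  */
/*   ractl->_index++ -> _index = 106 (skip the page we failed to add)*/
/*   i = _index + _nr_pages - index = 106 + 0 - 100 = 6              */
/* so the loop carries on at index + i = 106.                        */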

Does this clarify it? In your original patch, you used folio_nr_pages()
here. As I said before, we already know that the folio we tried to add
had a size of one page, so we can just increment by 1, and we should not
use the folio to deduce its size after folio_put() as that is a
use-after-free.

> And if we don't have it here (as you suggested in the mailthread)
> I'd rather move this patch _after_ the minimal pagesize is introduced
> to ensure that _index is always incremented by the right amount.
> 

I intended to have it as two atomic changes, where a non-functional
change helps with the functional change that comes later. If it is
confusing, I could also combine this with the next patch?

Or, I could have it as the first patch, before I start adding the
concept of folio_min_order. That would make it clear that it is intended
to be a non-functional change?
--
Pankaj

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 00/11] enable bs > ps in XFS
  2024-03-25 19:19 ` [PATCH v3 00/11] enable bs > ps in XFS Matthew Wilcox
  2024-03-26  9:53   ` Hannes Reinecke
@ 2024-03-26 14:54   ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26 14:54 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Mon, Mar 25, 2024 at 07:19:07PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:42PM +0100, Pankaj Raghav (Samsung) wrote:
> > This is the third version of the series that enables block size > page size
> > (Large Block Size) in XFS. The context and motivation can be seen in cover
> > letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
> > if someone would like more context on this effort.
> 
> Thank you.  This is a lot better.

Thanks.
> 
> I'm still trying to understand your opinion on the contents of the
> file_ra_state.  Is it supposed to be properly aligned at all times, or
> do we work with it in the terms of "desired number of pages" and then
> force it to conform to the minimum-block-size reality right at the end?

The intention of the patches is to do the latter. Apart from the patch
that rounds up file_ra_state->ra_pages, I don't poke file_ra_state
anywhere, and it is updated only at the end, after we enforce the
minimum-block-size constraint (in page_cache_ra_order()).
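
That rounding amounts to roughly the following (a sketch; the exact form
is in the patch that touches ra_pages):

	ra->ra_pages = round_up(ra->ra_pages, mapping_min_folio_nrpages(mapping));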

> Because you seem to be doing both at various points.
Could you elaborate more on where I do both? Maybe I am missing
something and I could change it in the next series.

The previous series was trying to do both, but in this series I
intentionally stuck to updating the ra_state only at the end.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 00/11] enable bs > ps in XFS
  2024-03-26  9:53   ` Hannes Reinecke
@ 2024-03-26 15:06     ` Pankaj Raghav
  0 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav @ 2024-03-26 15:06 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, mcgrof, djwong,
	linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On 26/03/2024 10:53, Hannes Reinecke wrote:
> On 3/25/24 20:19, Matthew Wilcox wrote:
>> On Wed, Mar 13, 2024 at 06:02:42PM +0100, Pankaj Raghav (Samsung) wrote:
>>> This is the third version of the series that enables block size > page size
>>> (Large Block Size) in XFS. The context and motivation can be seen in cover
>>> letter of the RFC v1[1]. We also recorded a talk about this effort at LPC [3],
>>> if someone would like more context on this effort.
>>
>> Thank you.  This is a lot better.
>>
>> I'm still trying to understand your opinion on the contents of the
>> file_ra_state.  Is it supposed to be properly aligned at all times, or
>> do we work with it in the terms of "desired number of pages" and then
>> force it to conform to the minimum-block-size reality right at the end?
>> Because you seem to be doing both at various points.
> 
> Guess what, that's what I had been pondering, too.
> Each way has its benefits, I guess.
> 
> Question really is do we keep the readahead iterator in units of pages,
> and convert the result, or do we modify the readahead iterator to work
> on folios, and convert the inputs.
> 
> Doesn't really matter much, but we need to decide. The former is probably easier on the caller, and
> the latter is easier on the consumer.
> Take your pick; I really don't mind.
> 
> But we should document the result :-)
> 

Having experimented with both approaches, I prefer the latter as it looks more consistent and
contains the changes to a few functions.

> Cheers,
> 
> Hannes
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded()
  2024-03-25 18:41   ` Matthew Wilcox
  2024-03-26  8:56     ` Pankaj Raghav (Samsung)
  2024-03-26  9:39     ` Hannes Reinecke
@ 2024-03-26 15:11     ` Pankaj Raghav (Samsung)
  2 siblings, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26 15:11 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Mon, Mar 25, 2024 at 06:41:01PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:46PM +0100, Pankaj Raghav (Samsung) wrote:
> > @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  			 * not worth getting one just for that.
> >  			 */
> >  			read_pages(ractl);
> > -			ractl->_index++;
> > -			i = ractl->_index + ractl->_nr_pages - index - 1;
> > +			ractl->_index += folio_nr_pages(folio);
> > +			i = ractl->_index + ractl->_nr_pages - index;
> >  			continue;
> >  		}
> >  
> > @@ -252,13 +252,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  			folio_put(folio);
> >  			read_pages(ractl);
> >  			ractl->_index++;
> > -			i = ractl->_index + ractl->_nr_pages - index - 1;
> > +			i = ractl->_index + ractl->_nr_pages - index;
> >  			continue;
> >  		}
> 
> You changed index++ in the first hunk, but not the second hunk.  Is that
> intentional?
After having some back and forth with Hannes, I see where the confusion
is coming from.

I intended this to be a non-functional change that helps with adding 
min_order support later.

As this is a non-functional change, I will move this patch to be at the
start of the series as preparation patches before we start adding min_order
helpers and support.

--
Pankaj

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement
  2024-03-25 19:06   ` Matthew Wilcox
@ 2024-03-26 16:10     ` Pankaj Raghav (Samsung)
  2024-03-26 16:23       ` Zi Yan
  0 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-03-26 16:10 UTC (permalink / raw)
  To: Matthew Wilcox, ziy
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

On Mon, Mar 25, 2024 at 07:06:04PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 13, 2024 at 06:02:49PM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > As we don't have a way to split a folio to a any given lower folio
> > order yet, avoid splitting the folio in split_huge_page_to_list() if it
> > has a minimum folio order requirement.
> 
> FYI, Zi Yan's patch to do that is now in Andrew's tree.
> c010d47f107f609b9f4d6a103b6dfc53889049e9 in current linux-next (dated
> Feb 26)

Yes, I started playing with the patches, but I am getting a race
condition resulting in a null-ptr-deref for which I don't have a good
answer yet.

@Zi Yan: did you encounter any issue like this when you were testing?

I did the following change (just a prototype) instead of this patch:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9859aa4f7553..63ee7b6ed03d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 {
        struct folio *folio = page_folio(page);
        struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+       unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping);
+
+       if (!folio_test_anon(folio))
+               new_order = max_t(unsigned int, mapping_min_order, new_order);
        /* reset xarray order to new order after split */
        XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
        struct anon_vma *anon_vma = NULL;
@@ -3117,6 +3121,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                        goto out;
                }
 
+               // XXX: Remove it later
+               VM_WARN_ON_FOLIO((new_order < mapping_min_order), folio);
                gfp = current_gfp_context(mapping_gfp_mask(mapping) &
                                                        GFP_RECLAIM_MASK);

I am getting a random null-ptr deref when I run generic/176 multiple
times with different blksizes. I still don't have a minimal reproducer 
for this issue. Race condition during writeback:

filemap_get_folios_tag+0x171/0x5c0:
arch_atomic_read at arch/x86/include/asm/atomic.h:23
(inlined by) raw_atomic_read at include/linux/atomic/atomic-arch-fallback.h:457
(inlined by) raw_atomic_fetch_add_unless at include/linux/atomic/atomic-arch-fallback.h:2426
(inlined by) raw_atomic_add_unless at include/linux/atomic/atomic-arch-fallback.h:2456
(inlined by) atomic_add_unless at include/linux/atomic/atomic-instrumented.h:1518
(inlined by) page_ref_add_unless at include/linux/page_ref.h:238
(inlined by) folio_ref_add_unless at include/linux/page_ref.h:247
(inlined by) folio_ref_try_add_rcu at include/linux/page_ref.h:280
(inlined by) folio_try_get_rcu at include/linux/page_ref.h:313
(inlined by) find_get_entry at mm/filemap.c:1984
(inlined by) filemap_get_folios_tag at mm/filemap.c:2222



[  537.863105] ==================================================================                                                                                                                                                                                                                                             
[  537.863968] BUG: KASAN: null-ptr-deref in filemap_get_folios_tag+0x171/0x5c0                                                                                                                                                                                                                                               
[  537.864581] Write of size 4 at addr 0000000000000036 by task kworker/u32:5/366                                                                                                                                                                                                                                             
[  537.865123]                                                                                                                                                       
[  537.865293] CPU: 6 PID: 366 Comm: kworker/u32:5 Not tainted 6.8.0-11739-g7d0c6e7b5a7d-dirty #795                                                                                                                                                                                                                           
[  537.867201] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014                                                                                                                                                                                               
[  537.868444] Workqueue: writeback wb_workfn (flush-254:32)                                                                                                   
[  537.869055] Call Trace:                                                                                                                                     
[  537.869341]  <TASK>                                                                                                                                         
[  537.869569]  dump_stack_lvl+0x4f/0x70                                                                                                                       
[  537.869938]  kasan_report+0xbd/0xf0                                                                                                                         
[  537.870293]  ? filemap_get_folios_tag+0x171/0x5c0                                                                                                           
[  537.870767]  ? filemap_get_folios_tag+0x171/0x5c0                                                                                                           
[  537.871578]  kasan_check_range+0x101/0x1b0                                                                                                                  
[  537.871893]  filemap_get_folios_tag+0x171/0x5c0                                                                                                                                                                                                                                                                            
[  537.872269]  ? __pfx_filemap_get_folios_tag+0x10/0x10                                                                                                       
[  537.872857]  ? __pfx___submit_bio+0x10/0x10                                                                                                                 
[  537.873326]  ? mlock_drain_local+0x234/0x3f0                                                                                                                
[  537.873938]  writeback_iter+0x59a/0xe00                                                                                                                     
[  537.874477]  ? __pfx_iomap_do_writepage+0x10/0x10                                                                                                           
[  537.874969]  write_cache_pages+0x7f/0x100                                                                                                                   
[  537.875396]  ? __pfx_write_cache_pages+0x10/0x10                                                                                                            
[  537.875892]  ? do_raw_spin_lock+0x12d/0x270                                                                                                                 
[  537.876345]  ? __pfx_do_raw_spin_lock+0x10/0x10                                                                                                                                                                                                                                                                            
[  537.876804]  iomap_writepages+0x88/0xf0                                                                                                                     
[  537.877186]  xfs_vm_writepages+0x120/0x190                                                                                                                  
[  537.877705]  ? __pfx_xfs_vm_writepages+0x10/0x10                                                                                                            
[  537.878161]  ? lock_release+0x36f/0x670                                                                                                                                                                                                                                                                                    
[  537.878521]  ? __wb_calc_thresh+0xe5/0x3b0                                                                                                                                                                                                                                                                                 
[  537.878892]  ? __pfx_lock_release+0x10/0x10                                                                                                                 
[  537.879308]  do_writepages+0x170/0x7a0                                                                                                                                                                                                                                                                                     
[  537.879676]  ? __pfx_do_writepages+0x10/0x10                                                                                                                                                                                                                                                                               
[  537.880182]  ? writeback_sb_inodes+0x312/0xe40                                                                                                                                                                                                                                                                             
[  537.880689]  ? reacquire_held_locks+0x1f1/0x4a0                                                                                                                                                                                                                                                                            
[  537.881193]  ? writeback_sb_inodes+0x312/0xe40                                                                                                                                                                                                                                                                             
[  537.881665]  ? find_held_lock+0x2d/0x110                                                                                                                                                                                                                                                                                   
[  537.882104]  ? lock_release+0x36f/0x670                                                                                                                                                                                                                                                                                    
[  537.883344]  ? wbc_attach_and_unlock_inode+0x3b8/0x710                                                                                                                                                                                                                                                                     
[  537.883853]  ? __pfx_lock_release+0x10/0x10                                                                                                                                                                                                                                                                                
[  537.884229]  ? __pfx_lock_release+0x10/0x10                                                                                                                                                                                                                                                                                
[  537.884604]  ? lock_acquire+0x138/0x2f0                                                                                                                                                                                                                                                                                    
[  537.884952]  __writeback_single_inode+0xd4/0xa60                                                                                                            
[  537.885369]  writeback_sb_inodes+0x4cf/0xe40                                                                                                                
[  537.885760]  ? __pfx_writeback_sb_inodes+0x10/0x10                                                                                                                                                                                                                                                                         
[  537.886208]  ? __pfx_move_expired_inodes+0x10/0x10                                                                                                          
[  537.886640]  __writeback_inodes_wb+0xb4/0x200                                                                                                               
[  537.887037]  wb_writeback+0x55b/0x7c0                                                                                                                       
[  537.887372]  ? __pfx_wb_writeback+0x10/0x10                                                                                                                                                                                                                                                                                
[  537.887750]  ? lock_acquire+0x138/0x2f0                                                                                                                     
[  537.888094]  ? __pfx_register_lock_class+0x10/0x10                                                                                                          
[  537.888521]  wb_workfn+0x648/0xbb0                                                                                                                          
[  537.888835]  ? __pfx_wb_workfn+0x10/0x10                                                                                                                                                                                                                                                                                   
[  537.889192]  ? lock_acquire+0x128/0x2f0                                                                                                                     
[  537.889539]  ? lock_acquire+0x138/0x2f0                                                                                                                     
[  537.889890]  process_one_work+0x7ff/0x1710                                                                                                                                                                                                                                                                                 
[  537.890272]  ? __pfx_process_one_work+0x10/0x10                                                                                                             
[  537.890685]  ? assign_work+0x16c/0x240                                                                                                                                                                                                                                                                                     
[  537.891026]  worker_thread+0x6e8/0x12b0                                                                                                                     
[  537.891381]  ? __pfx_worker_thread+0x10/0x10                                                                                                                
[  537.891768]  kthread+0x2ad/0x380                                                                                                                                                                                                                                                                                           
[  537.892064]  ? __pfx_kthread+0x10/0x10                                                                                                                                                                                                                                                                                     
[  537.892403]  ret_from_fork+0x2d/0x70                                                                                                                                                                                                                                                                                       
[  537.892728]  ? __pfx_kthread+0x10/0x10                                                                                                                                                                                                                                                                                     
[  537.893068]  ret_from_fork_asm+0x1a/0x30                                                                                                                                                                                                                                                                                   
[  537.893434]  </TASK>

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement
  2024-03-26 16:10     ` Pankaj Raghav (Samsung)
@ 2024-03-26 16:23       ` Zi Yan
  2024-03-26 16:33         ` Pankaj Raghav
  0 siblings, 1 reply; 38+ messages in thread
From: Zi Yan @ 2024-03-26 16:23 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, gost.dev, chandan.babu,
	hare, mcgrof, djwong, linux-mm, linux-kernel, david, akpm,
	Pankaj Raghav

[-- Attachment #1: Type: text/plain, Size: 6733 bytes --]

On 26 Mar 2024, at 12:10, Pankaj Raghav (Samsung) wrote:

> On Mon, Mar 25, 2024 at 07:06:04PM +0000, Matthew Wilcox wrote:
>> On Wed, Mar 13, 2024 at 06:02:49PM +0100, Pankaj Raghav (Samsung) wrote:
>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>
>>> As we don't have a way to split a folio to any given lower folio
>>> order yet, avoid splitting the folio in split_huge_page_to_list() if it
>>> has a minimum folio order requirement.
>>
>> FYI, Zi Yan's patch to do that is now in Andrew's tree.
>> c010d47f107f609b9f4d6a103b6dfc53889049e9 in current linux-next (dated
>> Feb 26)
>
> Yes, I started playing with the patches but I am getting a race condition
> resulting in a null-ptr-deref for which I don't have a good answer
> yet.
>
> @zi yan Did you encounter any issue like this when you were testing?
>
> I did the following change (just a prototype) instead of this patch:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9859aa4f7553..63ee7b6ed03d 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  {
>         struct folio *folio = page_folio(page);
>         struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> +       unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping);

I am not sure if this is right, since the folio can be anonymous, in which case
folio->mapping does not point to a struct address_space.

> +
> +       if (!folio_test_anon(folio))
> +               new_order = max_t(unsigned int, mapping_min_order, new_order);
>         /* reset xarray order to new order after split */
>         XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
>         struct anon_vma *anon_vma = NULL;
> @@ -3117,6 +3121,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                         goto out;
>                 }
>
> +               // XXX: Remove it later
> +               VM_WARN_ON_FOLIO((new_order < mapping_min_order), folio);
>                 gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>                                                         GFP_RECLAIM_MASK);
>
> I am getting a random null-ptr deref when I run generic/176 multiple
> times with different blksizes. I still don't have a minimal reproducer
> for this issue. Race condition during writeback:
>
> filemap_get_folios_tag+0x171/0x5c0:
> arch_atomic_read at arch/x86/include/asm/atomic.h:23
> (inlined by) raw_atomic_read at include/linux/atomic/atomic-arch-fallback.h:457
> (inlined by) raw_atomic_fetch_add_unless at include/linux/atomic/atomic-arch-fallback.h:2426
> (inlined by) raw_atomic_add_unless at include/linux/atomic/atomic-arch-fallback.h:2456
> (inlined by) atomic_add_unless at include/linux/atomic/atomic-instrumented.h:1518
> (inlined by) page_ref_add_unless at include/linux/page_ref.h:238
> (inlined by) folio_ref_add_unless at include/linux/page_ref.h:247
> (inlined by) folio_ref_try_add_rcu at include/linux/page_ref.h:280
> (inlined by) folio_try_get_rcu at include/linux/page_ref.h:313
> (inlined by) find_get_entry at mm/filemap.c:1984
> (inlined by) filemap_get_folios_tag at mm/filemap.c:2222
>
>
>
> [  537.863105] ==================================================================
> [  537.863968] BUG: KASAN: null-ptr-deref in filemap_get_folios_tag+0x171/0x5c0
> [  537.864581] Write of size 4 at addr 0000000000000036 by task kworker/u32:5/366
> [  537.865123]
> [  537.865293] CPU: 6 PID: 366 Comm: kworker/u32:5 Not tainted 6.8.0-11739-g7d0c6e7b5a7d-dirty #795
> [  537.867201] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [  537.868444] Workqueue: writeback wb_workfn (flush-254:32)
> [  537.869055] Call Trace:
> [  537.869341]  <TASK>
> [  537.869569]  dump_stack_lvl+0x4f/0x70
> [  537.869938]  kasan_report+0xbd/0xf0
> [  537.870293]  ? filemap_get_folios_tag+0x171/0x5c0
> [  537.870767]  ? filemap_get_folios_tag+0x171/0x5c0
> [  537.871578]  kasan_check_range+0x101/0x1b0
> [  537.871893]  filemap_get_folios_tag+0x171/0x5c0
> [  537.872269]  ? __pfx_filemap_get_folios_tag+0x10/0x10
> [  537.872857]  ? __pfx___submit_bio+0x10/0x10
> [  537.873326]  ? mlock_drain_local+0x234/0x3f0
> [  537.873938]  writeback_iter+0x59a/0xe00
> [  537.874477]  ? __pfx_iomap_do_writepage+0x10/0x10
> [  537.874969]  write_cache_pages+0x7f/0x100
> [  537.875396]  ? __pfx_write_cache_pages+0x10/0x10
> [  537.875892]  ? do_raw_spin_lock+0x12d/0x270
> [  537.876345]  ? __pfx_do_raw_spin_lock+0x10/0x10
> [  537.876804]  iomap_writepages+0x88/0xf0
> [  537.877186]  xfs_vm_writepages+0x120/0x190
> [  537.877705]  ? __pfx_xfs_vm_writepages+0x10/0x10
> [  537.878161]  ? lock_release+0x36f/0x670
> [  537.878521]  ? __wb_calc_thresh+0xe5/0x3b0
> [  537.878892]  ? __pfx_lock_release+0x10/0x10
> [  537.879308]  do_writepages+0x170/0x7a0
> [  537.879676]  ? __pfx_do_writepages+0x10/0x10
> [  537.880182]  ? writeback_sb_inodes+0x312/0xe40
> [  537.880689]  ? reacquire_held_locks+0x1f1/0x4a0
> [  537.881193]  ? writeback_sb_inodes+0x312/0xe40
> [  537.881665]  ? find_held_lock+0x2d/0x110
> [  537.882104]  ? lock_release+0x36f/0x670
> [  537.883344]  ? wbc_attach_and_unlock_inode+0x3b8/0x710
> [  537.883853]  ? __pfx_lock_release+0x10/0x10
> [  537.884229]  ? __pfx_lock_release+0x10/0x10
> [  537.884604]  ? lock_acquire+0x138/0x2f0
> [  537.884952]  __writeback_single_inode+0xd4/0xa60
> [  537.885369]  writeback_sb_inodes+0x4cf/0xe40
> [  537.885760]  ? __pfx_writeback_sb_inodes+0x10/0x10
> [  537.886208]  ? __pfx_move_expired_inodes+0x10/0x10
> [  537.886640]  __writeback_inodes_wb+0xb4/0x200
> [  537.887037]  wb_writeback+0x55b/0x7c0
> [  537.887372]  ? __pfx_wb_writeback+0x10/0x10
> [  537.887750]  ? lock_acquire+0x138/0x2f0
> [  537.888094]  ? __pfx_register_lock_class+0x10/0x10
> [  537.888521]  wb_workfn+0x648/0xbb0
> [  537.888835]  ? __pfx_wb_workfn+0x10/0x10
> [  537.889192]  ? lock_acquire+0x128/0x2f0
> [  537.889539]  ? lock_acquire+0x138/0x2f0
> [  537.889890]  process_one_work+0x7ff/0x1710
> [  537.890272]  ? __pfx_process_one_work+0x10/0x10
> [  537.890685]  ? assign_work+0x16c/0x240
> [  537.891026]  worker_thread+0x6e8/0x12b0
> [  537.891381]  ? __pfx_worker_thread+0x10/0x10
> [  537.891768]  kthread+0x2ad/0x380
> [  537.892064]  ? __pfx_kthread+0x10/0x10
> [  537.892403]  ret_from_fork+0x2d/0x70
> [  537.892728]  ? __pfx_kthread+0x10/0x10
> [  537.893068]  ret_from_fork_asm+0x1a/0x30
> [  537.893434]  </TASK>


--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement
  2024-03-26 16:23       ` Zi Yan
@ 2024-03-26 16:33         ` Pankaj Raghav
  2024-03-26 16:38           ` Zi Yan
  0 siblings, 1 reply; 38+ messages in thread
From: Pankaj Raghav @ 2024-03-26 16:33 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, gost.dev, chandan.babu,
	hare, mcgrof, djwong, linux-mm, linux-kernel, david, akpm,
	Pankaj Raghav

On 26/03/2024 17:23, Zi Yan wrote:
> On 26 Mar 2024, at 12:10, Pankaj Raghav (Samsung) wrote:
> 
>> On Mon, Mar 25, 2024 at 07:06:04PM +0000, Matthew Wilcox wrote:
>>> On Wed, Mar 13, 2024 at 06:02:49PM +0100, Pankaj Raghav (Samsung) wrote:
>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>
>>>> As we don't have a way to split a folio to any given lower folio
>>>> order yet, avoid splitting the folio in split_huge_page_to_list() if it
>>>> has a minimum folio order requirement.
>>>
>>> FYI, Zi Yan's patch to do that is now in Andrew's tree.
>>> c010d47f107f609b9f4d6a103b6dfc53889049e9 in current linux-next (dated
>>> Feb 26)
>>
>> Yes, I started playing with the patches but I am getting a race condition
>> resulting in a null-ptr-deref for which I don't have a good answer
>> yet.
>>
>> @zi yan Did you encounter any issue like this when you were testing?
>>
>> I did the following change (just a prototype) instead of this patch:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 9859aa4f7553..63ee7b6ed03d 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>  {
>>         struct folio *folio = page_folio(page);
>>         struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> +       unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping);
> 
> I am not sure if this is right, since the folio can be anonymous, in which case
> folio->mapping does not point to a struct address_space.
> 
>> +
>> +       if (!folio_test_anon(folio))

Hmm, but I update new_order only if the folio is not anonymous. Do you think it is still
wrong?

>> +               new_order = max_t(unsigned int, mapping_min_order, new_order);
>>         /* reset xarray order to new order after split */
>>         XA_STATE_ORDER(xas, &folio->mapping->i_pages, folio->index, new_order);
>>         struct anon_vma *anon_vma = NULL;
>> @@ -3117,6 +3121,8 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>                         goto out;
>>                 }
>>
>> +               // XXX: Remove it later
>> +               VM_WARN_ON_FOLIO((new_order < mapping_min_order), folio);
>>                 gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>>                                                         GFP_RECLAIM_MASK);
>>
>> I am getting a random null-ptr deref when I run generic/176 multiple
>> times with different blksizes. I still don't have a minimal reproducer
>> for this issue. Race condition during writeback:
>>
>> filemap_get_folios_tag+0x171/0x5c0:
>> arch_atomic_read at arch/x86/include/asm/atomic.h:23
>> (inlined by) raw_atomic_read at include/linux/atomic/atomic-arch-fallback.h:457
>> (inlined by) raw_atomic_fetch_add_unless at include/linux/atomic/atomic-arch-fallback.h:2426
>> (inlined by) raw_atomic_add_unless at include/linux/atomic/atomic-arch-fallback.h:2456
>> (inlined by) atomic_add_unless at include/linux/atomic/atomic-instrumented.h:1518
>> (inlined by) page_ref_add_unless at include/linux/page_ref.h:238
>> (inlined by) folio_ref_add_unless at include/linux/page_ref.h:247
>> (inlined by) folio_ref_try_add_rcu at include/linux/page_ref.h:280
>> (inlined by) folio_try_get_rcu at include/linux/page_ref.h:313
>> (inlined by) find_get_entry at mm/filemap.c:1984
>> (inlined by) filemap_get_folios_tag at mm/filemap.c:2222
>>
> 
> 
> --
> Best Regards,
> Yan, Zi

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement
  2024-03-26 16:33         ` Pankaj Raghav
@ 2024-03-26 16:38           ` Zi Yan
  0 siblings, 0 replies; 38+ messages in thread
From: Zi Yan @ 2024-03-26 16:38 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Matthew Wilcox, linux-xfs, linux-fsdevel, gost.dev, chandan.babu,
	hare, mcgrof, djwong, linux-mm, linux-kernel, david, akpm,
	Pankaj Raghav

[-- Attachment #1: Type: text/plain, Size: 2109 bytes --]

On 26 Mar 2024, at 12:33, Pankaj Raghav wrote:

> On 26/03/2024 17:23, Zi Yan wrote:
>> On 26 Mar 2024, at 12:10, Pankaj Raghav (Samsung) wrote:
>>
>>> On Mon, Mar 25, 2024 at 07:06:04PM +0000, Matthew Wilcox wrote:
>>>> On Wed, Mar 13, 2024 at 06:02:49PM +0100, Pankaj Raghav (Samsung) wrote:
>>>>> From: Pankaj Raghav <p.raghav@samsung.com>
>>>>>
>>>>> As we don't have a way to split a folio to any given lower folio
>>>>> order yet, avoid splitting the folio in split_huge_page_to_list() if it
>>>>> has a minimum folio order requirement.
>>>>
>>>> FYI, Zi Yan's patch to do that is now in Andrew's tree.
>>>> c010d47f107f609b9f4d6a103b6dfc53889049e9 in current linux-next (dated
>>>> Feb 26)
>>>
>>> Yes, I started playing with the patches but I am getting a race condition
>>> resulting in a null-ptr-deref for which I don't have a good answer
>>> yet.
>>>
>>> @zi yan Did you encounter any issue like this when you were testing?
>>>
>>> I did the following change (just a prototype) instead of this patch:
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 9859aa4f7553..63ee7b6ed03d 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3041,6 +3041,10 @@ int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>  {
>>>         struct folio *folio = page_folio(page);
>>>         struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>> +       unsigned int mapping_min_order = mapping_min_folio_order(folio->mapping);
>>
>> I am not sure if this is right, since the folio can be anonymous, in which case
>> folio->mapping does not point to a struct address_space.
>>
>>> +
>>> +       if (!folio_test_anon(folio))
>
> Hmm, but I update new_order only if the folio is not anonymous. Do you think it is still
> wrong?

For an anonymous folio, folio->mapping has its last bit set and points to a possible
struct anon_vma. I do not know what ->flags will be or whether it is accessible in that case.
See: https://elixir.bootlin.com/linux/latest/source/include/linux/page-flags.h#L608
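
A minimal sketch of one way to express that guard, purely for illustration (not part
of the posted patches; folio_split_min_order() is a hypothetical helper name, while
mapping_min_folio_order() is the helper introduced earlier in this series):

static unsigned int folio_split_min_order(struct folio *folio)
{
	/*
	 * For anonymous folios ->mapping is a tagged pointer to an
	 * anon_vma, so it must never be treated as an address_space;
	 * there is no minimum folio order requirement in that case.
	 */
	if (folio_test_anon(folio))
		return 0;
	return mapping_min_folio_order(folio->mapping);
}

The clamp in split_huge_page_to_list_to_order() would then become
new_order = max_t(unsigned int, folio_split_min_order(folio), new_order);
so the anonymous path never interprets ->mapping as mapping state.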

--
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead
  2024-03-25 19:00   ` Matthew Wilcox
  2024-03-26 13:08     ` Pankaj Raghav (Samsung)
@ 2024-04-22 11:03     ` Pankaj Raghav (Samsung)
  1 sibling, 0 replies; 38+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-04-22 11:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, gost.dev, chandan.babu, hare, mcgrof,
	djwong, linux-mm, linux-kernel, david, akpm, Pankaj Raghav

> > @@ -515,7 +562,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
> >  		if (index & ((1UL << order) - 1))
> >  			order = __ffs(index);
> >  		/* Don't allocate pages past EOF */
> > -		while (index + (1UL << order) - 1 > limit)
> > +		while (order > min_order && index + (1UL << order) - 1 > limit)
> >  			order--;
> 
> This raises an interesting question that I don't know if we have a test
> for.  POSIX says that if we mmap, let's say, the first 16kB of a 10kB
> file, then we can store into offset 0-12287, but stores to offsets
> 12288-16383 get a signal (I forget if it's SEGV or BUS).  Thus far,
> we've declined to even create folios in the page cache that would let us
> create PTEs for offset 12288-16383, so I haven't paid too much attention
> to this.  Now we're going to have folios that extend into that range, so
> we need to be sure that when we mmap(), we only create PTEs that go as
> far as 12287.
> 
> Can you check that we have such an fstest, and that we still pass it
> with your patches applied and a suitably large block size?
> 

So the mmap is giving the correct SIGBUS error when we try to do this:
dd if=/dev/zero of=./test bs=10k count=1; 
xfs_io -c "mmap -w 0 16384" -c "mwrite 13000 10" test

Logs on bs=64k ps=4k system:
root@debian:/media/test# dd if=/dev/zero of=./test bs=10k count=1;
root@debian:/media/test# du -sh test 
64K     test
root@debian:/media/test# ls -l --block-size=k test 
-rw-r--r-- 1 root root 10K Apr 22 10:42 test
root@debian:/media/test# xfs_io -c "mmap  0 16384" -c "mwrite 13000 10" test
Bus error

The check in filemap_fault takes care of this:

max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(index >= max_idx))
        return VM_FAULT_SIGBUS;

The same operation for read should also give a bus error, but it didn't.
Further investigation showed that fault_around() does not take this
condition into account for LBS configurations. When I set fault_around_bytes
to 4096, things worked as expected, as fault_around is then skipped for reads.

I have a patch that returns SIGBUS for the following read operation as well:
dd if=/dev/zero of=./test bs=10k count=1; 
xfs_io -c "mmap -r 0 16384" -c "mread 13000 10" test

This is the patch I have for now that fixes the fault_around() logic for LBS
configurations:

diff --git a/mm/filemap.c b/mm/filemap.c
index f0c0cfbbd134..259531dd297b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3600,12 +3600,15 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
        }
        do {
                unsigned long end;
+               unsigned long i_size;
 
                addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
                vmf->pte += xas.xa_index - last_pgoff;
                last_pgoff = xas.xa_index;
                end = folio_next_index(folio) - 1;
-               nr_pages = min(end, end_pgoff) - xas.xa_index + 1;
+               i_size = DIV_ROUND_UP(i_size_read(mapping->host),
+                                     PAGE_SIZE) - 1;
+               nr_pages = min3(end, end_pgoff, i_size) - xas.xa_index + 1;
 
                if (!folio_test_large(folio))
                        ret |= filemap_map_order0_folio(vmf,

I will send a new version of the series this week after doing some more
testing.

^ permalink raw reply related	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2024-04-22 11:12 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-13 17:02 [PATCH v3 00/11] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 01/11] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 02/11] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-03-25 18:29   ` Matthew Wilcox
2024-03-26  8:44     ` Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 03/11] filemap: allocate mapping_min_order folios in the page cache Pankaj Raghav (Samsung)
2024-03-15 13:21   ` Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 04/11] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
2024-03-25 18:41   ` Matthew Wilcox
2024-03-26  8:56     ` Pankaj Raghav (Samsung)
2024-03-26  9:39     ` Hannes Reinecke
2024-03-26  9:44       ` Pankaj Raghav
2024-03-26 10:00         ` Hannes Reinecke
2024-03-26 10:06           ` Pankaj Raghav
2024-03-26 10:55             ` Hannes Reinecke
2024-03-26 13:41               ` Pankaj Raghav (Samsung)
2024-03-26 15:11     ` Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 05/11] readahead: allocate folios with mapping_min_order in readahead Pankaj Raghav (Samsung)
2024-03-25 19:00   ` Matthew Wilcox
2024-03-26 13:08     ` Pankaj Raghav (Samsung)
2024-04-22 11:03     ` Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 06/11] readahead: round up file_ra_state->ra_pages to mapping_min_nrpages Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 07/11] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
2024-03-25 19:06   ` Matthew Wilcox
2024-03-26 16:10     ` Pankaj Raghav (Samsung)
2024-03-26 16:23       ` Zi Yan
2024-03-26 16:33         ` Pankaj Raghav
2024-03-26 16:38           ` Zi Yan
2024-03-13 17:02 ` [PATCH v3 08/11] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 09/11] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 10/11] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-03-25 19:15   ` Matthew Wilcox
2024-03-26  9:53     ` Pankaj Raghav (Samsung)
2024-03-13 17:02 ` [PATCH v3 11/11] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-03-25 19:19 ` [PATCH v3 00/11] enable bs > ps in XFS Matthew Wilcox
2024-03-26  9:53   ` Hannes Reinecke
2024-03-26 15:06     ` Pankaj Raghav
2024-03-26 14:54   ` Pankaj Raghav (Samsung)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.