linux-kernel.vger.kernel.org archive mirror
* [RFC v2 00/14] enable bs > ps in XFS
@ 2024-02-13  9:36 Pankaj Raghav (Samsung)
  2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
                   ` (13 more replies)
  0 siblings, 14 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:36 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

This is the second version of the series that enables block size > page size
(Large Block Size) in XFS. This version has various bug fixes and suggestions
collected from v1 [1]. The context and motivation can be found in the cover
letter of v1. We also recorded a talk about this effort at LPC [2], for anyone
who would like more context.

A lot of emphasis has been put on testing using kdevops and on fixing some
critical bugs for LBS.

The testing has been split into regression and progression.

Regression testing:
In regression testing, CONFIG_XFS_LBS was disabled and we ran the whole
test suite with SOAK_DURATION=2.5 hours to check for regressions on existing
profiles due to the page cache changes.

No regression was found with the patches added on top.

* The baseline for regression was created using a SOAK_DURATION of 2.5 hours
and about 7-8 XFS test clusters, running fstests in a loop over 70 times. We
then scraped for critical failures (crashes, XFS or page cache asserts, or
hung tasks) and reported these to the community as well [3].

Progression testing:
For progression testing, CONFIG_XFS_LBS was enabled and we tested 8k, 16k,
32k and 64k block sizes. To compare against existing support, an ARM VM with
a 64k base page size (without our patches) was used as a reference to
distinguish actual failures caused by LBS support on a 4k base page size
system.

There are some common failures upstream for bs=64k that need to be fixed [4].
There are also some tests that assume block size < page size and need to be
fixed. I have a tree with fixes for xfstests here [5], which I will be
sending to the list soon.

The only real failure for LBS currently is:
- generic/630 for 8k block size. This test exercises a dedupe race; it fails
only for 8k block size and does not fail on a 64k base page system.

We've done some preliminary performance tests with fio on XFS with a 4k block
size against pmem and NVMe, with buffered IO and Direct IO, on vanilla
v6.8-rc4 vs v6.8-rc4 with these patches applied, and detected no regressions.

We also wrote an eBPF tool called blkalgn [6] to check that IO sent to the
device is aligned and at least the filesystem block size in length.

Git tree: https://github.com/linux-kdevops/linux/tree/large-block-minorder-6.8.0-rc4

Changes since RFC v1:
- Added willy's patch to enable order-1 folios.
- Unified the common page cache effort with Hannes' LBS work.
- Added a new helper min_nrpages and a CONFIG_THP check for enabling mapping_large_folio_support.
- Don't split a folio if it has minorder set. Removed the old code where we set extra pins if it has that requirement.
- Split the XFS code so that the mapping count validation is a separate change; the icache changes go together with enabling bs > ps.
- Added the CONFIG_XFS_LBS option.
- Align the index in do_read_cache_folio().
- Removed the truncate changes.
- Fixed generic/091 with iomap changes to the iomap_dio_zero() function.
- Took care of the folio truncation scenario in page_cache_ra_unbounded() that happens after read_pages() if a folio was found.
- Squashed and moved commits around.
- Rebased on top of v6.8-rc4.

[1] https://lore.kernel.org/lkml/20230915183848.1018717-1-kernel@pankajraghav.com/
[2] https://www.youtube.com/watch?v=ar72r5Xf7x4
[3] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
489 non-critical issues and 55 critical issues. We've determined and reported
that the 55 critical issues all fall into 5 common XFS asserts or hung tasks
and 2 memory management asserts.
[4] https://lore.kernel.org/linux-xfs/fe7fec1c-3b08-430f-9c95-ea76b237acf4@samsung.com/
[5] https://github.com/Panky-codes/xfstests/tree/lbs-fixes
[6] https://github.com/iovisor/bcc/pull/4813

Dave Chinner (1):
  xfs: expose block size in stat

Hannes Reinecke (1):
  readahead: rework loop in page_cache_ra_unbounded()

Luis Chamberlain (3):
  filemap: align the index to mapping_min_order in the page cache
  readahead: set file_ra_state->ra_pages to be at least
    mapping_min_order
  readahead: align index to mapping_min_order in ondemand_ra and
    force_ra

Matthew Wilcox (Oracle) (2):
  fs: Allow fine-grained control of folio sizes
  mm: Support order-1 folios in the page cache

Pankaj Raghav (7):
  filemap: use mapping_min_order while allocating folios
  readahead: allocate folios with mapping_min_order in
    ra_(unbounded|order)
  mm: do not split a folio if it has minimum folio order requirement
  iomap: fix iomap_dio_zero() for fs bs > system page size
  xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  xfs: add an experimental CONFIG_XFS_LBS option
  xfs: enable block size larger than page size support

 fs/iomap/direct-io.c    | 13 +++++-
 fs/xfs/Kconfig          | 11 +++++
 fs/xfs/xfs_icache.c     |  8 +++-
 fs/xfs/xfs_iops.c       |  4 +-
 fs/xfs/xfs_mount.c      | 10 ++++-
 fs/xfs/xfs_super.c      |  8 ++--
 include/linux/huge_mm.h |  7 ++-
 include/linux/pagemap.h | 92 ++++++++++++++++++++++++++++++--------
 mm/filemap.c            | 61 ++++++++++++++++++++------
 mm/huge_memory.c        | 36 ++++++++++++---
 mm/internal.h           |  4 +-
 mm/readahead.c          | 97 ++++++++++++++++++++++++++++++++---------
 12 files changed, 276 insertions(+), 75 deletions(-)


base-commit: 841c35169323cd833294798e58b9bf63fa4fa1de
-- 
2.43.0


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 12:03   ` Hannes Reinecke
                     ` (2 more replies)
  2024-02-13  9:37 ` [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
                   ` (12 subsequent siblings)
  13 siblings, 3 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Some filesystems want to be able to limit the maximum size of folios,
and some want to be able to ensure that folios are at least a certain
size.  Add mapping_set_folio_orders() to allow this level of control.
The max folio order parameter is currently ignored; the maximum is always
set to MAX_PAGECACHE_ORDER.
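
For reference, the two orders are packed into bits 8-17 of mapping->flags.
Below is a minimal userspace sketch of that encoding; the macro values mirror
the ones added in this patch, the rest is illustration only:

#include <stdio.h>

#define AS_FOLIO_ORDER_MIN	8
#define AS_FOLIO_ORDER_MAX	13
#define AS_FOLIO_ORDER_MIN_MASK	0x00001f00
#define AS_FOLIO_ORDER_MAX_MASK	0x0003e000
#define AS_FOLIO_ORDER_MASK	(AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)

/* pack a (min, max) folio order pair into the mapping flags word */
static unsigned long set_folio_orders(unsigned long flags, unsigned int min,
				      unsigned int max)
{
	return (flags & ~AS_FOLIO_ORDER_MASK) |
	       ((unsigned long)min << AS_FOLIO_ORDER_MIN) |
	       ((unsigned long)max << AS_FOLIO_ORDER_MAX);
}

int main(void)
{
	/* e.g. a 64k block size filesystem on a 4k page system: min order 4 */
	unsigned long flags = set_folio_orders(0, 4, 8);

	printf("min order %lu, max order %lu\n",
	       (flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN,
	       (flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX);
	return 0;
}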

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/pagemap.h | 92 ++++++++++++++++++++++++++++++++---------
 1 file changed, 73 insertions(+), 19 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 2df35e65557d..5618f762187b 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,13 +202,18 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
-	AS_STABLE_WRITES,	/* must wait for writeback before modifying
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
 				   folio contents */
-	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13, /* Bit 8-17 are used for FOLIO_ORDER */
+	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -344,6 +349,53 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/*
+ * There are some parts of the kernel which assume that PMD entries
+ * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
+ * limit the maximum allocation order to PMD size.  I'm not aware of any
+ * assumptions about maximum order if THP are disabled, but 8 seems like
+ * a good order (that's 1MB if you're using 4kB pages)
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
+#else
+#define MAX_PAGECACHE_ORDER	8
+#endif
+
+/*
+ * mapping_set_folio_orders() - Set the range of folio sizes supported.
+ * @mapping: The file.
+ * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ * @max: Maximum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which sizes of folio the VFS can use to cache the contents
+ * of the file.  This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+					    unsigned int min, unsigned int max)
+{
+	if (min == 1)
+		min = 2;
+	if (max < min)
+		max = min;
+	if (max > MAX_PAGECACHE_ORDER)
+		max = MAX_PAGECACHE_ORDER;
+
+	/*
+	 * XXX: max is ignored as only minimum folio order is supported
+	 * currently.
+	 */
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			 (min << AS_FOLIO_ORDER_MIN) |
+			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
  * @mapping: The file.
@@ -357,7 +409,22 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
+}
+
+static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
+}
+
+static inline unsigned int mapping_min_folio_nrpages(struct address_space *mapping)
+{
+	return 1U << mapping_min_folio_order(mapping);
 }
 
 /*
@@ -367,7 +434,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
 	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	       (mapping_max_folio_order(mapping) > 0);
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
@@ -528,19 +595,6 @@ static inline void *detach_page_private(struct page *page)
 	return folio_detach_private(page_folio(page));
 }
 
-/*
- * There are some parts of the kernel which assume that PMD entries
- * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
- * limit the maximum allocation order to PMD size.  I'm not aware of any
- * assumptions about maximum order if THP are disabled, but 8 seems like
- * a good order (that's 1MB if you're using 4kB pages)
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
-#else
-#define MAX_PAGECACHE_ORDER	8
-#endif
-
 #ifdef CONFIG_NUMA
 struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
 #else
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 12:20   ` Hannes Reinecke
  2024-02-13 22:00   ` Dave Chinner
  2024-02-13  9:37 ` [RFC v2 03/14] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Luis Chamberlain <mcgrof@kernel.org>

Supporting mapping_min_order implies that we guarantee each folio in the
page cache has at least an order of mapping_min_order. So when adding new
folios to the page cache we must ensure the index used is aligned to the
mapping_min_order as the page cache requires the index to be aligned to
the order of the folio.

A folio of an order higher than min_order is, by definition, a multiple of
min_order in size. If an index is aligned to an order higher than min_order,
it is also aligned to min_order.

This effectively introduces no functional change when the min order is not
set, other than a few rounding computations that result in the same value.
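
As a concrete example, with a 16k block size on a 4k page system (min order 2,
min_nrpages 4), the rounding applied to the read indices looks like this
minimal userspace sketch (the values are illustrative):

#include <stdio.h>

/* power-of-two rounding, as done by the kernel's round_down()/round_up() */
#define round_down(x, y)	((x) & ~((y) - 1))
#define round_up(x, y)		round_down((x) + (y) - 1, (y))

int main(void)
{
	unsigned long min_nrpages = 4;	/* 1 << mapping_min_folio_order() */
	unsigned long index = 7;	/* page index derived from iocb->ki_pos */
	unsigned long last_index = 9;	/* index one past the end of the read */

	printf("index %lu -> %lu, last_index %lu -> %lu\n",
	       index, round_down(index, min_nrpages),
	       last_index, round_up(last_index, min_nrpages));
	return 0;
}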

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/filemap.c | 34 ++++++++++++++++++++++++++--------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 750e779c23db..323a8e169581 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2479,14 +2479,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 {
 	struct file *filp = iocb->ki_filp;
 	struct address_space *mapping = filp->f_mapping;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 	struct file_ra_state *ra = &filp->f_ra;
-	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
+	pgoff_t index = round_down(iocb->ki_pos >> PAGE_SHIFT, min_nrpages);
 	pgoff_t last_index;
 	struct folio *folio;
 	int err = 0;
 
 	/* "last_index" is the index of the page beyond the end of the read */
 	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
+	last_index = round_up(last_index, min_nrpages);
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -2502,8 +2504,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	if (!folio_batch_count(fbatch)) {
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
-		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+		err = filemap_create_folio(filp, mapping, index, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3095,7 +3096,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
 	struct address_space *mapping = file->f_mapping;
-	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	unsigned int min_nrpages = mapping_min_folio_nrpages(file->f_mapping);
+	pgoff_t index = round_down(vmf->pgoff, min_nrpages);
+	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
 	struct file *fpin = NULL;
 	unsigned long vm_flags = vmf->vma->vm_flags;
 	unsigned int mmap_miss;
@@ -3147,10 +3151,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
 	 */
 	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
+	ra->start = round_down(ra->start, min_nrpages);
 	ra->size = ra->ra_pages;
 	ra->async_size = ra->ra_pages / 4;
 	ractl._index = ra->start;
-	page_cache_ra_order(&ractl, ra, 0);
+	page_cache_ra_order(&ractl, ra, min_order);
 	return fpin;
 }
 
@@ -3164,7 +3169,9 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 {
 	struct file *file = vmf->vma->vm_file;
 	struct file_ra_state *ra = &file->f_ra;
-	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
+	unsigned int min_nrpages = mapping_min_folio_nrpages(file->f_mapping);
+	pgoff_t index = round_down(vmf->pgoff, min_nrpages);
+	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, index);
 	struct file *fpin = NULL;
 	unsigned int mmap_miss;
 
@@ -3212,13 +3219,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	struct file *file = vmf->vma->vm_file;
 	struct file *fpin = NULL;
 	struct address_space *mapping = file->f_mapping;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	unsigned int nrpages = 1UL << min_order;
 	struct inode *inode = mapping->host;
-	pgoff_t max_idx, index = vmf->pgoff;
+	pgoff_t max_idx, index = round_down(vmf->pgoff, nrpages);
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	bool mapping_locked = false;
 
 	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	max_idx = round_up(max_idx, nrpages);
+
 	if (unlikely(index >= max_idx))
 		return VM_FAULT_SIGBUS;
 
@@ -3317,13 +3328,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 	 * We must recheck i_size under page lock.
 	 */
 	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
+	max_idx = round_up(max_idx, nrpages);
+
 	if (unlikely(index >= max_idx)) {
 		folio_unlock(folio);
 		folio_put(folio);
 		return VM_FAULT_SIGBUS;
 	}
 
-	vmf->page = folio_file_page(folio, index);
+	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
 	return ret | VM_FAULT_LOCKED;
 
 page_not_uptodate:
@@ -3658,6 +3673,9 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 {
 	struct folio *folio;
 	int err;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
+
+	index = round_down(index, min_nrpages);
 
 	if (!filler)
 		filler = mapping->a_ops->read_folio;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 03/14] filemap: use mapping_min_order while allocating folios
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
  2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
  2024-02-13  9:37 ` [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 14:58   ` Hannes Reinecke
                     ` (2 more replies)
  2024-02-13  9:37 ` [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav (Samsung)
                   ` (10 subsequent siblings)
  13 siblings, 3 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

filemap_create_folio() and do_read_cache_folio() were always allocating
order-0 folios. __filemap_get_folio() would try to allocate higher order
folios when fgp_flags had a higher order hint set, but it falls back to an
order-0 folio if the higher order memory allocation fails.

Now that we have the notion of mapping_min_order, make sure these functions
allocate folios of at least mapping_min_order, as we need to guarantee that
in the page cache.

Add some additional VM_BUG_ON() checks in page_cache_delete[batch]() and
__filemap_add_folio() to catch errors where we delete or add folios that
have an order less than min_order.
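
With a minimum order in place, the __filemap_get_folio() allocation fallback
below stops at min_order instead of order 0. A condensed userspace sketch of
that loop, assuming min_order = 2 and a requested order of 4:

#include <stdio.h>

int main(void)
{
	unsigned int min_order = 2, order = 4;

	do {
		/* filemap_alloc_folio(alloc_gfp, order) would be tried here */
		printf("try order %u (%u pages)\n", order, 1U << order);
	} while (order-- > min_order);	/* stops after trying min_order */
	return 0;
}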

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/filemap.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 323a8e169581..7a6e15c47150 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -127,6 +127,7 @@
 static void page_cache_delete(struct address_space *mapping,
 				   struct folio *folio, void *shadow)
 {
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	XA_STATE(xas, &mapping->i_pages, folio->index);
 	long nr = 1;
 
@@ -135,6 +136,7 @@ static void page_cache_delete(struct address_space *mapping,
 	xas_set_order(&xas, folio->index, folio_order(folio));
 	nr = folio_nr_pages(folio);
 
+	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
 	xas_store(&xas, shadow);
@@ -277,6 +279,7 @@ void filemap_remove_folio(struct folio *folio)
 static void page_cache_delete_batch(struct address_space *mapping,
 			     struct folio_batch *fbatch)
 {
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index);
 	long total_pages = 0;
 	int i = 0;
@@ -305,6 +308,7 @@ static void page_cache_delete_batch(struct address_space *mapping,
 
 		WARN_ON_ONCE(!folio_test_locked(folio));
 
+		VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
 		folio->mapping = NULL;
 		/* Leave folio->index set: truncation lookup relies on it */
 
@@ -846,6 +850,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 	int huge = folio_test_hugetlb(folio);
 	bool charged = false;
 	long nr = 1;
+	unsigned int min_order = mapping_min_folio_order(mapping);
 
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
@@ -896,6 +901,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
 			}
 		}
 
+		VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
 		xas_store(&xas, folio);
 		if (xas_error(&xas))
 			goto unlock;
@@ -1847,6 +1853,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		fgf_t fgp_flags, gfp_t gfp)
 {
 	struct folio *folio;
+	unsigned int min_order = mapping_min_folio_order(mapping);
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
+
+	index = round_down(index, min_nrpages);
 
 repeat:
 	folio = filemap_get_entry(mapping, index);
@@ -1886,7 +1896,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 		folio_wait_stable(folio);
 no_page:
 	if (!folio && (fgp_flags & FGP_CREAT)) {
-		unsigned order = FGF_GET_ORDER(fgp_flags);
+		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
 		int err;
 
 		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
@@ -1914,8 +1924,13 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			err = -ENOMEM;
 			if (order == 1)
 				order = 0;
+			if (order < min_order)
+				order = min_order;
 			if (order > 0)
 				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
+
+			VM_BUG_ON(index & ((1UL << order) - 1));
+
 			folio = filemap_alloc_folio(alloc_gfp, order);
 			if (!folio)
 				continue;
@@ -1929,7 +1944,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 				break;
 			folio_put(folio);
 			folio = NULL;
-		} while (order-- > 0);
+		} while (order-- > min_order);
 
 		if (err == -EEXIST)
 			goto repeat;
@@ -2424,7 +2439,8 @@ static int filemap_create_folio(struct file *file,
 	struct folio *folio;
 	int error;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    mapping_min_folio_order(mapping));
 	if (!folio)
 		return -ENOMEM;
 
@@ -3682,7 +3698,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		err = filemap_add_folio(mapping, folio, index, gfp);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (2 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 03/14] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 14:59   ` Hannes Reinecke
                     ` (2 more replies)
  2024-02-13  9:37 ` [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
                   ` (9 subsequent siblings)
  13 siblings, 3 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Luis Chamberlain <mcgrof@kernel.org>

Set file_ra_state->ra_pages in file_ra_state_init() to be at least
mapping_min_folio_nrpages() pages if bdi->ra_pages is less than that.
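
For example, with a 64k block size on 4k pages (min_nrpages = 16), a bdi
default of 8 readahead pages gets raised to 16 as long as the device's
io_pages allows it. A minimal userspace sketch of the clamp, with made-up
bdi values:

#include <stdio.h>

int main(void)
{
	unsigned int min_nrpages = 16;	/* mapping_min_folio_nrpages() */
	unsigned int ra_pages = 8;	/* bdi->ra_pages */
	unsigned int io_pages = 32;	/* bdi->io_pages */

	if (ra_pages < min_nrpages && min_nrpages < io_pages)
		ra_pages = min_nrpages;
	printf("ra_pages = %u\n", ra_pages);	/* 16 */
	return 0;
}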

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/readahead.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/readahead.c b/mm/readahead.c
index 2648ec4f0494..4fa7d0e65706 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -138,7 +138,12 @@
 void
 file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
 {
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
+	unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
+
 	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
+	if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
+		ra->ra_pages = min_nrpages;
 	ra->prev_pos = -1;
 }
 EXPORT_SYMBOL_GPL(file_ra_state_init);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (3 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 15:00   ` Hannes Reinecke
                     ` (2 more replies)
  2024-02-13  9:37 ` [RFC v2 06/14] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
                   ` (8 subsequent siblings)
  13 siblings, 3 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Luis Chamberlain <mcgrof@kernel.org>

Align the ra->start and ra->size to mapping_min_order in
ondemand_readahead(), and align the index to mapping_min_order in
force_page_cache_ra(). This will ensure that the folios allocated for
readahead that are added to the page cache are aligned to
mapping_min_order.
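
For the force_page_cache_ra() case, an unaligned index is rounded down and
nr_to_read is grown by the difference so the same range is still covered.
A minimal userspace sketch of that adjustment, assuming min_nrpages = 4:

#include <stdio.h>

#define round_down(x, y)	((x) & ~((y) - 1))

int main(void)
{
	unsigned long min_nrpages = 4;
	unsigned long index = 10, nr_to_read = 8;

	if (index & (min_nrpages - 1)) {	/* !IS_ALIGNED(index, min_nrpages) */
		unsigned long old_index = index;

		index = round_down(index, min_nrpages);
		nr_to_read += old_index - index;
	}
	printf("index %lu, nr_to_read %lu\n", index, nr_to_read);	/* 8, 10 */
	return 0;
}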

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 4fa7d0e65706..5e1ec7705c78 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -315,6 +315,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	unsigned long max_pages, index;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
 		return;
@@ -324,6 +325,13 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	 * be up to the optimal hardware IO size
 	 */
 	index = readahead_index(ractl);
+	if (!IS_ALIGNED(index, min_nrpages)) {
+		unsigned long old_index = index;
+
+		index = round_down(index, min_nrpages);
+		nr_to_read += (old_index - index);
+	}
+
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
 	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
 	while (nr_to_read) {
@@ -332,6 +340,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
 		ractl->_index = index;
+		VM_BUG_ON(!IS_ALIGNED(index, min_nrpages));
 		do_page_cache_ra(ractl, this_chunk, 0);
 
 		index += this_chunk;
@@ -344,11 +353,20 @@ void force_page_cache_ra(struct readahead_control *ractl,
  * for small size, x 4 for medium, and x 2 for large
  * for 128k (32 page) max ra
  * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
+ *
+ * For higher order address space requirements we ensure no initial reads
+ * are ever less than the min number of pages required.
+ *
+ * We *always* cap the max io size allowed by the device.
  */
-static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
+static unsigned long get_init_ra_size(unsigned long size,
+				      unsigned int min_nrpages,
+				      unsigned long max)
 {
 	unsigned long newsize = roundup_pow_of_two(size);
 
+	newsize = max_t(unsigned long, newsize, min_nrpages);
+
 	if (newsize <= max / 32)
 		newsize = newsize * 4;
 	else if (newsize <= max / 4)
@@ -356,6 +374,8 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
 	else
 		newsize = max;
 
+	VM_BUG_ON(newsize & (min_nrpages - 1));
+
 	return newsize;
 }
 
@@ -364,14 +384,16 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
  *  return it as the new window size.
  */
 static unsigned long get_next_ra_size(struct file_ra_state *ra,
+				      unsigned int min_nrpages,
 				      unsigned long max)
 {
-	unsigned long cur = ra->size;
+	unsigned long cur = max(ra->size, min_nrpages);
 
 	if (cur < max / 16)
 		return 4 * cur;
 	if (cur <= max / 2)
 		return 2 * cur;
+
 	return max;
 }
 
@@ -561,7 +583,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	unsigned long add_pages;
 	pgoff_t index = readahead_index(ractl);
 	pgoff_t expected, prev_index;
-	unsigned int order = folio ? folio_order(folio) : 0;
+	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
+	unsigned int min_nrpages = mapping_min_folio_nrpages(ractl->mapping);
+	unsigned int order = folio ? folio_order(folio) : min_order;
+
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 
 	/*
 	 * If the request exceeds the readahead window, allow the read to
@@ -583,8 +609,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
 	if (index == expected || index == (ra->start + ra->size)) {
-		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->start += round_down(ra->size, min_nrpages);
+		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -603,13 +629,18 @@ static void ondemand_readahead(struct readahead_control *ractl,
 				max_pages);
 		rcu_read_unlock();
 
+		start = round_down(start, min_nrpages);
+
+		VM_BUG_ON(folio->index & (folio_nr_pages(folio) - 1));
+
 		if (!start || start - index > max_pages)
 			return;
 
 		ra->start = start;
 		ra->size = start - index;	/* old async_size */
+
 		ra->size += req_size;
-		ra->size = get_next_ra_size(ra, max_pages);
+		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
 		ra->async_size = ra->size;
 		goto readit;
 	}
@@ -646,7 +677,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 
 initial_readahead:
 	ra->start = index;
-	ra->size = get_init_ra_size(req_size, max_pages);
+	ra->size = get_init_ra_size(req_size, min_nrpages, max_pages);
 	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
 
 readit:
@@ -657,7 +688,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	 * Take care of maximum IO pages as above.
 	 */
 	if (index == ra->start && ra->size == ra->async_size) {
-		add_pages = get_next_ra_size(ra, max_pages);
+		add_pages = get_next_ra_size(ra, min_nrpages, max_pages);
 		if (ra->size + add_pages <= max_pages) {
 			ra->async_size = add_pages;
 			ra->size += add_pages;
@@ -668,6 +699,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
 	}
 
 	ractl->_index = ra->start;
+	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
 	page_cache_ra_order(ractl, ra, order);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 06/14] readahead: rework loop in page_cache_ra_unbounded()
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (4 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 16:47   ` Darrick J. Wong
  2024-02-13  9:37 ` [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Hannes Reinecke <hare@suse.de>

Rework the loop in page_cache_ra_unbounded() to advance with
the number of pages in a folio instead of just one page at a time.
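
A minimal userspace sketch of the new iteration pattern, with made-up folio
orders standing in for folios already present in the cache:

#include <stdio.h>

int main(void)
{
	/* orders of folios assumed to already sit in the page cache */
	unsigned int folio_orders[] = { 2, 2, 4 };
	unsigned long nr_to_read = 24;
	unsigned long i = 0;
	unsigned int f = 0;

	while (i < nr_to_read && f < 3) {
		unsigned long nr_pages = 1UL << folio_orders[f++];

		printf("offset %lu: skip a %lu-page folio\n", i, nr_pages);
		i += nr_pages;	/* was "i++" before this rework */
	}
	return 0;
}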

Signed-off-by: Hannes Reinecke <hare@suse.de>
Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 mm/readahead.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 5e1ec7705c78..13b62cbd3b79 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -213,7 +213,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -231,7 +231,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	while (i < nr_to_read) {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
@@ -244,8 +244,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -257,13 +257,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			folio_put(folio);
 			read_pages(ractl);
 			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
 	}
 
 	/*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order)
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (5 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 06/14] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 15:01   ` Hannes Reinecke
  2024-02-13 16:47   ` Darrick J. Wong
  2024-02-13  9:37 ` [RFC v2 08/14] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
                   ` (6 subsequent siblings)
  13 siblings, 2 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

Allocate folios with at least mapping_min_order in
page_cache_ra_unbounded() and page_cache_ra_order() as we need to
guarantee a minimum order in the page cache.
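
In page_cache_ra_order() the EOF clamp below now stops at min_order rather
than shrinking the order all the way down. A minimal userspace sketch of that
clamping, assuming min_order = 2 and an EOF limit close to the index:

#include <stdio.h>

int main(void)
{
	unsigned int min_order = 2, order = 4;
	unsigned long index = 16, limit = 18;	/* last page index inside i_size */

	/* Don't allocate pages past EOF, but never drop below min_order */
	while (order > min_order && index + (1UL << order) - 1 > limit)
		order--;
	if (order < min_order)
		order = min_order;
	printf("order %u (%lu pages)\n", order, 1UL << order);	/* 2 (4 pages) */
	return 0;
}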

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/readahead.c | 30 ++++++++++++++++++++++++++----
 1 file changed, 26 insertions(+), 4 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 13b62cbd3b79..a361fba18674 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -214,6 +214,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
 	unsigned long i = 0;
+	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -235,6 +236,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
+			long nr_pages = folio_nr_pages(folio);
+
 			/*
 			 * Page already present?  Kick off the current batch
 			 * of contiguous pages before continuing with the
@@ -244,19 +247,31 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index += folio_nr_pages(folio);
+
+			/*
+			 * Move the ractl->_index by at least min_pages
+			 * if the folio got truncated to respect the
+			 * alignment constraint in the page cache.
+			 *
+			 */
+			if (mapping != folio->mapping)
+				nr_pages = min_nrpages;
+
+			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
+			ractl->_index += nr_pages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+					    mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
+			ractl->_index += min_nrpages;
 			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
@@ -516,6 +531,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
 {
 	struct address_space *mapping = ractl->mapping;
 	pgoff_t index = readahead_index(ractl);
+	unsigned int min_order = mapping_min_folio_order(mapping);
 	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
 	pgoff_t mark = index + ra->size - ra->async_size;
 	int err = 0;
@@ -542,11 +558,17 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		if (index & ((1UL << order) - 1))
 			order = __ffs(index);
 		/* Don't allocate pages past EOF */
-		while (index + (1UL << order) - 1 > limit)
+		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
 		/* THP machinery does not support order-1 */
 		if (order == 1)
 			order = 0;
+
+		if (order < min_order)
+			order = min_order;
+
+		VM_BUG_ON(index & ((1UL << order) - 1));
+
 		err = ra_alloc_folio(ractl, index, mark, order, gfp);
 		if (err)
 			break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 08/14] mm: do not split a folio if it has minimum folio order requirement
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (6 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 15:02   ` Hannes Reinecke
  2024-02-13  9:37 ` [RFC v2 09/14] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

As we don't have a way to split a folio to any given lower folio
order yet, avoid splitting the folio in split_huge_page_to_list() if it
has a minimum folio order requirement.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/huge_memory.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94c958f7ebb5..d897efc51025 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3026,6 +3026,19 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			goto out;
 		}
 
+		/*
+		 * Do not split if mapping has minimum folio order
+		 * requirement.
+		 *
+		 * XXX: Once we have support for splitting to any lower
+		 * folio order, then it could be split based on the
+		 * min_folio_order.
+		 */
+		if (mapping_min_folio_order(mapping)) {
+			ret = -EAGAIN;
+			goto out;
+		}
+
 		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
 							GFP_RECLAIM_MASK);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 09/14] mm: Support order-1 folios in the page cache
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (7 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 08/14] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 15:03   ` Hannes Reinecke
  2024-02-13  9:37 ` [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Folios of order 1 have no space to store the deferred list.  This is
not a problem for the page cache as file-backed folios are never
placed on the deferred list.  All we need to do is prevent the core
MM from touching the deferred list for order 1 folios and remove the
code which prevented us from allocating order 1 folios.

Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/huge_mm.h |  7 +++++--
 mm/filemap.c            |  2 --
 mm/huge_memory.c        | 23 ++++++++++++++++++-----
 mm/internal.h           |  4 +---
 mm/readahead.c          |  3 ---
 5 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5adb86af35fc..916a2a539517 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -263,7 +263,7 @@ unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags);
 
-void folio_prep_large_rmappable(struct folio *folio);
+struct folio *folio_prep_large_rmappable(struct folio *folio);
 bool can_split_folio(struct folio *folio, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
 static inline int split_huge_page(struct page *page)
@@ -410,7 +410,10 @@ static inline unsigned long thp_vma_allowable_orders(struct vm_area_struct *vma,
 	return 0;
 }
 
-static inline void folio_prep_large_rmappable(struct folio *folio) {}
+static inline struct folio *folio_prep_large_rmappable(struct folio *folio)
+{
+	return folio;
+}
 
 #define transparent_hugepage_flags 0UL
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 7a6e15c47150..c8205a534532 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1922,8 +1922,6 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp_t alloc_gfp = gfp;
 
 			err = -ENOMEM;
-			if (order == 1)
-				order = 0;
 			if (order < min_order)
 				order = min_order;
 			if (order > 0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d897efc51025..6ec3417638a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -788,11 +788,15 @@ struct deferred_split *get_deferred_split_queue(struct folio *folio)
 }
 #endif
 
-void folio_prep_large_rmappable(struct folio *folio)
+struct folio *folio_prep_large_rmappable(struct folio *folio)
 {
-	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
-	INIT_LIST_HEAD(&folio->_deferred_list);
+	if (!folio || !folio_test_large(folio))
+		return folio;
+	if (folio_order(folio) > 1)
+		INIT_LIST_HEAD(&folio->_deferred_list);
 	folio_set_large_rmappable(folio);
+
+	return folio;
 }
 
 static inline bool is_transparent_hugepage(struct folio *folio)
@@ -3095,7 +3099,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
-		if (!list_empty(&folio->_deferred_list)) {
+		if (folio_order(folio) > 1 &&
+		    !list_empty(&folio->_deferred_list)) {
 			ds_queue->split_queue_len--;
 			list_del(&folio->_deferred_list);
 		}
@@ -3146,6 +3151,9 @@ void folio_undo_large_rmappable(struct folio *folio)
 	struct deferred_split *ds_queue;
 	unsigned long flags;
 
+	if (folio_order(folio) <= 1)
+		return;
+
 	/*
 	 * At this point, there is no one trying to add the folio to
 	 * deferred_list. If folio is not in deferred_list, it's safe
@@ -3171,7 +3179,12 @@ void deferred_split_folio(struct folio *folio)
 #endif
 	unsigned long flags;
 
-	VM_BUG_ON_FOLIO(folio_order(folio) < 2, folio);
+	/*
+	 * Order 1 folios have no space for a deferred list, but we also
+	 * won't waste much memory by not adding them to the deferred list.
+	 */
+	if (folio_order(folio) <= 1)
+		return;
 
 	/*
 	 * The try_to_unmap() in page reclaim path might reach here too,
diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..5174b5b0c344 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -419,9 +419,7 @@ static inline struct folio *page_rmappable_folio(struct page *page)
 {
 	struct folio *folio = (struct folio *)page;
 
-	if (folio && folio_order(folio) > 1)
-		folio_prep_large_rmappable(folio);
-	return folio;
+	return folio_prep_large_rmappable(folio);
 }
 
 static inline void prep_compound_head(struct page *page, unsigned int order)
diff --git a/mm/readahead.c b/mm/readahead.c
index a361fba18674..7d5f6a8792a8 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -560,9 +560,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
 		/* Don't allocate pages past EOF */
 		while (order > min_order && index + (1UL << order) - 1 > limit)
 			order--;
-		/* THP machinery does not support order-1 */
-		if (order == 1)
-			order = 0;
 
 		if (order < min_order)
 			order = min_order;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (8 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 09/14] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 15:06   ` Hannes Reinecke
  2024-02-13 16:30   ` Darrick J. Wong
  2024-02-13  9:37 ` [RFC v2 11/14] xfs: expose block size in stat Pankaj Raghav (Samsung)
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
< fs block size. iomap_dio_zero() has an implicit assumption that fs block
size < page_size. This is true for most filesystems at the moment.

If the block size > page size, this will send the contents of the page
next to the zero page (as len > PAGE_SIZE) to the underlying block device,
causing FS corruption.

iomap is a generic infrastructure and it should not make any assumptions
about the fs block size and the page size of the system.
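
For example, a 4k direct write at the start of a 64k block on a 4k page
system leaves 60k to zero, which the loop below now builds out of repeated
PAGE_SIZE chunks of ZERO_PAGE. A minimal userspace sketch of that arithmetic:

#include <stdio.h>

#define PAGE_SIZE	4096UL

int main(void)
{
	unsigned long block_size = 65536, written = 4096;
	unsigned long len = block_size - written;	/* bytes left to zero */
	unsigned int vecs = 0;

	while (len) {
		unsigned long io_len = len < PAGE_SIZE ? len : PAGE_SIZE;

		vecs++;		/* __bio_add_page(bio, ZERO_PAGE(0), io_len, 0) */
		len -= io_len;
	}
	printf("%u bio vecs of ZERO_PAGE\n", vecs);	/* 15 */
	return 0;
}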

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/iomap/direct-io.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..04f6c5548136 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
 	struct page *page = ZERO_PAGE(0);
 	struct bio *bio;
 
-	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
+	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
+
+	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
+				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
 	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
 				  GFP_KERNEL);
+
 	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
 	bio->bi_private = dio;
 	bio->bi_end_io = iomap_dio_bio_end_io;
 
-	__bio_add_page(bio, page, len, 0);
+	while (len) {
+		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
+
+		__bio_add_page(bio, page, io_len, 0);
+		len -= io_len;
+	}
 	iomap_dio_submit_bio(iter, dio, bio, pos);
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 11/14] xfs: expose block size in stat
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (9 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 16:27   ` Darrick J. Wong
  2024-02-13  9:37 ` [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david, Dave Chinner

From: Dave Chinner <dchinner@redhat.com>

For block size larger than page size, the unit of efficient IO is
the block size, not the page size. Leaving stat() to report
PAGE_SIZE as the block size causes test programs like fsx to issue
illegal ranges for operations that require block size alignment
(e.g. fallocate() insert range). Hence update the preferred IO size
to reflect the block size in this case.
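
A minimal userspace check of the effect (the mount path is just a
placeholder): on a 64k block size XFS filesystem with this change,
st_blksize is expected to report 65536 rather than the 4k page size.

#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat st;

	if (stat("/mnt/xfs/file", &st) == 0)
		printf("st_blksize = %ld\n", (long)st.st_blksize);
	else
		perror("stat");
	return 0;
}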

Signed-off-by: Dave Chinner <dchinner@redhat.com>
[mcgrof: forward rebase in consideration for commit
dd2d535e3fb29d ("xfs: cleanup calculating the stat optimal I/O size")]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/xfs/xfs_iops.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index a0d77f5f512e..8791a9d80897 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -515,6 +515,8 @@ xfs_stat_blksize(
 	struct xfs_inode	*ip)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	unsigned long	default_size = max_t(unsigned long, PAGE_SIZE,
+					     mp->m_sb.sb_blocksize);
 
 	/*
 	 * If the file blocks are being allocated from a realtime volume, then
@@ -543,7 +545,7 @@ xfs_stat_blksize(
 			return 1U << mp->m_allocsize_log;
 	}
 
-	return PAGE_SIZE;
+	return default_size;
 }
 
 STATIC int
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (10 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 11/14] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 16:26   ` Darrick J. Wong
  2024-02-13  9:37 ` [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option Pankaj Raghav (Samsung)
  2024-02-13  9:37 ` [RFC v2 14/14] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  13 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
make the calculation generic so that the page cache count can be calculated
correctly for LBS.
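
As an illustration of why the old expression breaks, with a 64k block size
(sb_blocklog = 16) on a 4k page system (PAGE_SHIFT = 12) the shift amount
PAGE_SHIFT - sb_blocklog becomes negative; the generic form computes the
byte count first. A minimal userspace sketch with made-up numbers:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	unsigned int page_shift = 12, sb_blocklog = 16;
	uint64_t nblocks = 1000000;
	uint64_t bytes = nblocks << sb_blocklog;
	unsigned long mapping_count = bytes >> page_shift;

	printf("mapping_count = %lu pages\n", mapping_count);	/* 16000000 */
	return 0;
}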

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_mount.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index aabb25dc3efa..bfbaaecaf668 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -133,9 +133,13 @@ xfs_sb_validate_fsb_count(
 {
 	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
+	unsigned long mapping_count;
+	uint64_t bytes = nblocks << sbp->sb_blocklog;
+
+	mapping_count = bytes >> PAGE_SHIFT;
 
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	if (mapping_count > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (11 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 16:39   ` Darrick J. Wong
  2024-02-13 21:19   ` Dave Chinner
  2024-02-13  9:37 ` [RFC v2 14/14] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  13 siblings, 2 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

Add an experimental CONFIG_XFS_LBS option to enable LBS support in XFS.
Retain the ASSERT for PAGE_SHIFT if CONFIG_XFS_LBS is not enabled.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/Kconfig     | 11 +++++++++++
 fs/xfs/xfs_mount.c |  4 +++-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
index 567fb37274d3..6b0db2f7dc13 100644
--- a/fs/xfs/Kconfig
+++ b/fs/xfs/Kconfig
@@ -216,3 +216,14 @@ config XFS_ASSERT_FATAL
 	  result in warnings.
 
 	  This behavior can be modified at runtime via sysfs.
+
+config XFS_LBS
+	bool "XFS large block size support (EXPERIMENTAL)"
+	depends on XFS_FS
+	help
+	  Set Y to enable support for filesystem block size > system's
+	  base page size.
+
+	  This feature is considered EXPERIMENTAL.  Use with caution!
+
+	  If unsure, say N.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bfbaaecaf668..596aa2cdefbc 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,11 +131,13 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 	unsigned long mapping_count;
 	uint64_t bytes = nblocks << sbp->sb_blocklog;
 
+	if (!IS_ENABLED(CONFIG_XFS_LBS))
+		ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
+
 	mapping_count = bytes >> PAGE_SHIFT;
 
 	/* Limited by ULONG_MAX of page cache index */
-- 
2.43.0



* [RFC v2 14/14] xfs: enable block size larger than page size support
  2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
                   ` (12 preceding siblings ...)
  2024-02-13  9:37 ` [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option Pankaj Raghav (Samsung)
@ 2024-02-13  9:37 ` Pankaj Raghav (Samsung)
  2024-02-13 16:20   ` Darrick J. Wong
  2024-02-13 21:34   ` Dave Chinner
  13 siblings, 2 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13  9:37 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, hare, willy, linux-mm, david

From: Pankaj Raghav <p.raghav@samsung.com>

Page cache now has the ability to have a minimum order when allocating
a folio which is a prerequisite to add support for block size > page
size. Enable it in XFS under CONFIG_XFS_LBS.
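
As a worked example of the order calculation below (assuming a 4k base
page, PAGE_SHIFT = 12, and a 16k filesystem block):

	min_order = ilog2(16384) - PAGE_SHIFT = 14 - 12 = 2

i.e. every folio cached for such an inode is at least order 2 (16k).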

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 fs/xfs/xfs_icache.c | 8 ++++++--
 fs/xfs/xfs_super.c  | 8 +++-----
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
index dba514a2c84d..9de81caf7ad4 100644
--- a/fs/xfs/xfs_icache.c
+++ b/fs/xfs/xfs_icache.c
@@ -73,6 +73,7 @@ xfs_inode_alloc(
 	xfs_ino_t		ino)
 {
 	struct xfs_inode	*ip;
+	int			min_order = 0;
 
 	/*
 	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
@@ -88,7 +89,8 @@ xfs_inode_alloc(
 	/* VFS doesn't initialise i_mode or i_state! */
 	VFS_I(ip)->i_mode = 0;
 	VFS_I(ip)->i_state = 0;
-	mapping_set_large_folios(VFS_I(ip)->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 
 	XFS_STATS_INC(mp, vn_active);
 	ASSERT(atomic_read(&ip->i_pincount) == 0);
@@ -313,6 +315,7 @@ xfs_reinit_inode(
 	dev_t			dev = inode->i_rdev;
 	kuid_t			uid = inode->i_uid;
 	kgid_t			gid = inode->i_gid;
+	int			min_order = 0;
 
 	error = inode_init_always(mp->m_super, inode);
 
@@ -323,7 +326,8 @@ xfs_reinit_inode(
 	inode->i_rdev = dev;
 	inode->i_uid = uid;
 	inode->i_gid = gid;
-	mapping_set_large_folios(inode->i_mapping);
+	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
+	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);
 	return error;
 }
 
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 5a2512d20bd0..6a3f0f6727eb 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1625,13 +1625,11 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
-	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
+	if (!IS_ENABLED(CONFIG_XFS_LBS) && mp->m_sb.sb_blocksize > PAGE_SIZE) {
 		xfs_warn(mp,
 		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
+		"Only pagesize (%ld) or less will currently work. "
+		"Enable Experimental CONFIG_XFS_LBS for this support",
 				mp->m_sb.sb_blocksize, PAGE_SIZE);
 		error = -ENOSYS;
 		goto out_free_sb;
-- 
2.43.0



* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
@ 2024-02-13 12:03   ` Hannes Reinecke
  2024-02-13 16:34   ` Darrick J. Wong
  2024-02-14 18:49   ` Matthew Wilcox
  2 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 12:03 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Some filesystems want to be able to limit the maximum size of folios,
> and some want to be able to ensure that folios are at least a certain
> size.  Add mapping_set_folio_orders() to allow this level of control.
> The max folio order parameter is ignored and it is always set to
> MAX_PAGECACHE_ORDER.
> 
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   include/linux/pagemap.h | 92 ++++++++++++++++++++++++++++++++---------
>   1 file changed, 73 insertions(+), 19 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes




* Re: [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache
  2024-02-13  9:37 ` [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
@ 2024-02-13 12:20   ` Hannes Reinecke
  2024-02-13 21:13     ` Pankaj Raghav (Samsung)
  2024-02-13 22:00   ` Dave Chinner
  1 sibling, 1 reply; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 12:20 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. So when adding new
> folios to the page cache we must ensure the index used is aligned to the
> mapping_min_order as the page cache requires the index to be aligned to
> the order of the folio.
> 
> A higher order folio than min_order by definition is a multiple of the
> min_order. If an index is aligned to an order higher than a min_order, it
> will also be aligned to the min order.
> 
> This effectively introduces no new functional changes when min order is
> not set other than a few rounding computations that should result in the
> same value.
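
(Worked example of the alignment claim above: with min_order == 2,
indices must be multiples of 4; a folio of order 4 must sit at an index
that is a multiple of 16, and every multiple of 16 is also a multiple
of 4, so the min-order alignment is preserved.)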
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   mm/filemap.c | 34 ++++++++++++++++++++++++++--------
>   1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 750e779c23db..323a8e169581 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2479,14 +2479,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>   {
>   	struct file *filp = iocb->ki_filp;
>   	struct address_space *mapping = filp->f_mapping;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>   	struct file_ra_state *ra = &filp->f_ra;
> -	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
> +	pgoff_t index = round_down(iocb->ki_pos >> PAGE_SHIFT, min_nrpages);
>   	pgoff_t last_index;
>   	struct folio *folio;
>   	int err = 0;
>   
>   	/* "last_index" is the index of the page beyond the end of the read */
>   	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	last_index = round_up(last_index, min_nrpages);

Huh? 'last_index' is unset here; sure you mean 'iocb->ki_pos + count' ?

And what's the rationale to replace 'DIV_ROUND_UP' with 'round_up' ?
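
(A note on the quoted hunk: DIV_ROUND_UP() is kept on the line above,
and round_up() is applied on top of its result, so the read window is
only widened to the next min-order boundary.  Sketch of the arithmetic,
assuming min_nrpages == 4, i.e. 16k blocks on 4k pages:

	last_index = DIV_ROUND_UP(10000, 4096);	/* == 3 */
	last_index = round_up(last_index, 4);	/* == 4 */

so the window ends on a whole min-order folio.)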

>   retry:
>   	if (fatal_signal_pending(current))
>   		return -EINTR;
> @@ -2502,8 +2504,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>   	if (!folio_batch_count(fbatch)) {
>   		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>   			return -EAGAIN;
> -		err = filemap_create_folio(filp, mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> +		err = filemap_create_folio(filp, mapping, index, fbatch);
>   		if (err == AOP_TRUNCATED_PAGE)
>   			goto retry;
>   		return err;
> @@ -3095,7 +3096,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>   	struct file *file = vmf->vma->vm_file;
>   	struct file_ra_state *ra = &file->f_ra;
>   	struct address_space *mapping = file->f_mapping;
> -	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(file->f_mapping);
> +	pgoff_t index = round_down(vmf->pgoff, min_nrpages);
> +	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
>   	struct file *fpin = NULL;
>   	unsigned long vm_flags = vmf->vma->vm_flags;
>   	unsigned int mmap_miss;
> @@ -3147,10 +3151,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>   	 */
>   	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>   	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> +	ra->start = round_down(ra->start, min_nrpages);
>   	ra->size = ra->ra_pages;
>   	ra->async_size = ra->ra_pages / 4;
>   	ractl._index = ra->start;
> -	page_cache_ra_order(&ractl, ra, 0);
> +	page_cache_ra_order(&ractl, ra, min_order);
>   	return fpin;
>   }
>   
> @@ -3164,7 +3169,9 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>   {
>   	struct file *file = vmf->vma->vm_file;
>   	struct file_ra_state *ra = &file->f_ra;
> -	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(file->f_mapping);
> +	pgoff_t index = round_down(vmf->pgoff, min_nrpages);
> +	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, index);
>   	struct file *fpin = NULL;
>   	unsigned int mmap_miss;
>   
> @@ -3212,13 +3219,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>   	struct file *file = vmf->vma->vm_file;
>   	struct file *fpin = NULL;
>   	struct address_space *mapping = file->f_mapping;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int nrpages = 1UL << min_order;
>   	struct inode *inode = mapping->host;
> -	pgoff_t max_idx, index = vmf->pgoff;
> +	pgoff_t max_idx, index = round_down(vmf->pgoff, nrpages);
>   	struct folio *folio;
>   	vm_fault_t ret = 0;
>   	bool mapping_locked = false;
>   
>   	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +	max_idx = round_up(max_idx, nrpages);
> +

Same here.
(And the additional whitespace can probably go ...)

>   	if (unlikely(index >= max_idx))
>   		return VM_FAULT_SIGBUS;
>   
> @@ -3317,13 +3328,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>   	 * We must recheck i_size under page lock.
>   	 */
>   	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +	max_idx = round_up(max_idx, nrpages);
> +

See above.

>   	if (unlikely(index >= max_idx)) {
>   		folio_unlock(folio);
>   		folio_put(folio);
>   		return VM_FAULT_SIGBUS;
>   	}
>   
> -	vmf->page = folio_file_page(folio, index);
> +	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);
>   	return ret | VM_FAULT_LOCKED;
>   
>   page_not_uptodate:
> @@ -3658,6 +3673,9 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>   {
>   	struct folio *folio;
>   	int err;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +
> +	index = round_down(index, min_nrpages);
>   
>   	if (!filler)
>   		filler = mapping->a_ops->read_folio;

Cheers,

Hannes



* Re: [RFC v2 03/14] filemap: use mapping_min_order while allocating folios
  2024-02-13  9:37 ` [RFC v2 03/14] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
@ 2024-02-13 14:58   ` Hannes Reinecke
  2024-02-13 16:38   ` Darrick J. Wong
  2024-02-13 22:05   ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 14:58 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folio of order 0. __filemap_get_folio was trying to allocate higher
> order folios when fgp_flags had higher order hint set but it will default
> to order 0 folio if higher order memory allocation fails.
> 
> As we bring the notion of mapping_min_order, make sure these functions
> allocate at least folio of mapping_min_order as we need to guarantee it
> in the page cache.
> 
> Add some additional VM_BUG_ON() in page_cache_delete[batch] and
> __filemap_add_folio to catch errors where we delete or add folios that
> has order less than min_order.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   mm/filemap.c | 25 +++++++++++++++++++++----
>   1 file changed, 21 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes



* Re: [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  2024-02-13  9:37 ` [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav (Samsung)
@ 2024-02-13 14:59   ` Hannes Reinecke
  2024-02-13 16:46   ` Darrick J. Wong
  2024-02-13 22:09   ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 14:59 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Set the file_ra_state->ra_pages in file_ra_state_init() to be at least
> mapping_min_order of pages if the bdi->ra_pages is less than that.
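
(Sketch of the effect, assuming 16k blocks on 4k pages, so min_nrpages
== 4: a bdi advertising ra_pages == 2 with io_pages == 32 is bumped to
ra->ra_pages == 4, so readahead never asks for less than one min-order
folio.)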
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   mm/readahead.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2648ec4f0494..4fa7d0e65706 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -138,7 +138,12 @@
>   void
>   file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
>   {
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +	unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
> +
>   	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
> +	if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
> +		ra->ra_pages = min_nrpages;
>   	ra->prev_pos = -1;
>   }
>   EXPORT_SYMBOL_GPL(file_ra_state_init);

Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes



* Re: [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra
  2024-02-13  9:37 ` [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
@ 2024-02-13 15:00   ` Hannes Reinecke
  2024-02-13 16:46   ` Darrick J. Wong
  2024-02-13 22:29   ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 15:00 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Align the ra->start and ra->size to mapping_min_order in
> ondemand_readahead(), and align the index to mapping_min_order in
> force_page_cache_ra(). This will ensure that the folios allocated for
> readahead that are added to the page cache are aligned to
> mapping_min_order.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   mm/readahead.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>   1 file changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 4fa7d0e65706..5e1ec7705c78 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -315,6 +315,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
>   	struct file_ra_state *ra = ractl->ra;
>   	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>   	unsigned long max_pages, index;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>   
>   	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
>   		return;
> @@ -324,6 +325,13 @@ void force_page_cache_ra(struct readahead_control *ractl,
>   	 * be up to the optimal hardware IO size
>   	 */
>   	index = readahead_index(ractl);
> +	if (!IS_ALIGNED(index, min_nrpages)) {
> +		unsigned long old_index = index;
> +
> +		index = round_down(index, min_nrpages);
> +		nr_to_read += (old_index - index);
> +	}
> +
>   	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
>   	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
>   	while (nr_to_read) {
> @@ -332,6 +340,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
>   		if (this_chunk > nr_to_read)
>   			this_chunk = nr_to_read;
>   		ractl->_index = index;
> +		VM_BUG_ON(!IS_ALIGNED(index, min_nrpages));
>   		do_page_cache_ra(ractl, this_chunk, 0);
>   
>   		index += this_chunk;
> @@ -344,11 +353,20 @@ void force_page_cache_ra(struct readahead_control *ractl,
>    * for small size, x 4 for medium, and x 2 for large
>    * for 128k (32 page) max ra
>    * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
> + *
> + * For higher order address space requirements we ensure no initial reads
> + * are ever less than the min number of pages required.
> + *
> + * We *always* cap the max io size allowed by the device.
>    */
> -static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
> +static unsigned long get_init_ra_size(unsigned long size,
> +				      unsigned int min_nrpages,
> +				      unsigned long max)
>   {
>   	unsigned long newsize = roundup_pow_of_two(size);
>   
> +	newsize = max_t(unsigned long, newsize, min_nrpages);
> +
>   	if (newsize <= max / 32)
>   		newsize = newsize * 4;
>   	else if (newsize <= max / 4)
> @@ -356,6 +374,8 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
>   	else
>   		newsize = max;
>   
> +	VM_BUG_ON(newsize & (min_nrpages - 1));
> +
>   	return newsize;
>   }
>   
> @@ -364,14 +384,16 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
>    *  return it as the new window size.
>    */
>   static unsigned long get_next_ra_size(struct file_ra_state *ra,
> +				      unsigned int min_nrpages,
>   				      unsigned long max)
>   {
> -	unsigned long cur = ra->size;
> +	unsigned long cur = max(ra->size, min_nrpages);
>   
>   	if (cur < max / 16)
>   		return 4 * cur;
>   	if (cur <= max / 2)
>   		return 2 * cur;
> +
>   	return max;
>   }
>   
> @@ -561,7 +583,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	unsigned long add_pages;
>   	pgoff_t index = readahead_index(ractl);
>   	pgoff_t expected, prev_index;
> -	unsigned int order = folio ? folio_order(folio) : 0;
> +	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(ractl->mapping);
> +	unsigned int order = folio ? folio_order(folio) : min_order;
> +
> +	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
>   
>   	/*
>   	 * If the request exceeds the readahead window, allow the read to
> @@ -583,8 +609,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	expected = round_down(ra->start + ra->size - ra->async_size,
>   			1UL << order);
>   	if (index == expected || index == (ra->start + ra->size)) {
> -		ra->start += ra->size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->start += round_down(ra->size, min_nrpages);
> +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
>   		ra->async_size = ra->size;
>   		goto readit;
>   	}
> @@ -603,13 +629,18 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   				max_pages);
>   		rcu_read_unlock();
>   
> +		start = round_down(start, min_nrpages);
> +
> +		VM_BUG_ON(folio->index & (folio_nr_pages(folio) - 1));
> +
>   		if (!start || start - index > max_pages)
>   			return;
>   
>   		ra->start = start;
>   		ra->size = start - index;	/* old async_size */
> +

Stale whitespace.

>   		ra->size += req_size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
>   		ra->async_size = ra->size;
>   		goto readit;
>   	}
> @@ -646,7 +677,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   
>   initial_readahead:
>   	ra->start = index;
> -	ra->size = get_init_ra_size(req_size, max_pages);
> +	ra->size = get_init_ra_size(req_size, min_nrpages, max_pages);
>   	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
>   
>   readit:
> @@ -657,7 +688,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	 * Take care of maximum IO pages as above.
>   	 */
>   	if (index == ra->start && ra->size == ra->async_size) {
> -		add_pages = get_next_ra_size(ra, max_pages);
> +		add_pages = get_next_ra_size(ra, min_nrpages, max_pages);
>   		if (ra->size + add_pages <= max_pages) {
>   			ra->async_size = add_pages;
>   			ra->size += add_pages;
> @@ -668,6 +699,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>   	}
>   
>   	ractl->_index = ra->start;
> +	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
>   	page_cache_ra_order(ractl, ra, order);
>   }
>   
Otherwise looks good.

Cheers,

Hannes



* Re: [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order)
  2024-02-13  9:37 ` [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
@ 2024-02-13 15:01   ` Hannes Reinecke
  2024-02-13 16:47   ` Darrick J. Wong
  1 sibling, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 15:01 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Allocate folios with at least mapping_min_order in
> page_cache_ra_unbounded() and page_cache_ra_order() as we need to
> guarantee a minimum order in the page cache.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   mm/readahead.c | 30 ++++++++++++++++++++++++++----
>   1 file changed, 26 insertions(+), 4 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes




* Re: [RFC v2 08/14] mm: do not split a folio if it has minimum folio order requirement
  2024-02-13  9:37 ` [RFC v2 08/14] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
@ 2024-02-13 15:02   ` Hannes Reinecke
  0 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 15:02 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> As we don't have a way to split a folio to any given lower folio
> order yet, avoid splitting the folio in split_huge_page_to_list() if it
> has a minimum folio order requirement.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>   mm/huge_memory.c | 13 +++++++++++++
>   1 file changed, 13 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 94c958f7ebb5..d897efc51025 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3026,6 +3026,19 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
>   			goto out;
>   		}
>   
> +		/*
> +		 * Do not split if mapping has minimum folio order
> +		 * requirement.
> +		 *
> +		 * XXX: Once we have support for splitting to any lower
> +		 * folio order, then it could be split based on the
> +		 * min_folio_order.
> +		 */
> +		if (mapping_min_folio_order(mapping)) {
> +			ret = -EAGAIN;
> +			goto out;
> +		}
> +
>   		gfp = current_gfp_context(mapping_gfp_mask(mapping) &
>   							GFP_RECLAIM_MASK);
>   
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes



* Re: [RFC v2 09/14] mm: Support order-1 folios in the page cache
  2024-02-13  9:37 ` [RFC v2 09/14] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
@ 2024-02-13 15:03   ` Hannes Reinecke
  0 siblings, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 15:03 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Folios of order 1 have no space to store the deferred list.  This is
> not a problem for the page cache as file-backed folios are never
> placed on the deferred list.  All we need to do is prevent the core
> MM from touching the deferred list for order 1 folios and remove the
> code which prevented us from allocating order 1 folios.
> 
> Link: https://lore.kernel.org/linux-mm/90344ea7-4eec-47ee-5996-0c22f42d6a6a@google.com/
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> ---
>   include/linux/huge_mm.h |  7 +++++--
>   mm/filemap.c            |  2 --
>   mm/huge_memory.c        | 23 ++++++++++++++++++-----
>   mm/internal.h           |  4 +---
>   mm/readahead.c          |  3 ---
>   5 files changed, 24 insertions(+), 15 deletions(-)
> 
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes




* Re: [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-02-13  9:37 ` [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
@ 2024-02-13 15:06   ` Hannes Reinecke
  2024-02-13 16:30   ` Darrick J. Wong
  1 sibling, 0 replies; 64+ messages in thread
From: Hannes Reinecke @ 2024-02-13 15:06 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung), linux-xfs, linux-fsdevel
  Cc: mcgrof, gost.dev, akpm, kbusch, djwong, chandan.babu, p.raghav,
	linux-kernel, willy, linux-mm, david

On 2/13/24 10:37, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
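
(Worked example of the corruption case described above, assuming 4k
pages and a 16k block: a sub-block direct write may need 12k of zero
padding, but ZERO_PAGE(0) is a single 4k page, so the old single
__bio_add_page(bio, page, 12288, 0) call would pull in whatever happens
to sit in the two pages following the zero page.)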
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>   fs/iomap/direct-io.c | 13 +++++++++++--
>   1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>   	struct page *page = ZERO_PAGE(0);
>   	struct bio *bio;
>   
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> +
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>   	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>   				  GFP_KERNEL);
> +
>   	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>   	bio->bi_private = dio;
>   	bio->bi_end_io = iomap_dio_bio_end_io;
>   
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>   	iomap_dio_submit_bio(iter, dio, bio, pos);
>   }
>   
Reviewed-by: Hannes Reinecke <hare@suse.de>

Cheers,

Hannes



* Re: [RFC v2 14/14] xfs: enable block size larger than page size support
  2024-02-13  9:37 ` [RFC v2 14/14] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
@ 2024-02-13 16:20   ` Darrick J. Wong
  2024-02-14 16:40     ` Pankaj Raghav (Samsung)
  2024-02-13 21:34   ` Dave Chinner
  1 sibling, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:20 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:13AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Page cache now has the ability to have a minimum order when allocating
> a folio which is a prerequisite to add support for block size > page
> size. Enable it in XFS under CONFIG_XFS_LBS.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/xfs/xfs_icache.c | 8 ++++++--
>  fs/xfs/xfs_super.c  | 8 +++-----
>  2 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index dba514a2c84d..9de81caf7ad4 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -73,6 +73,7 @@ xfs_inode_alloc(
>  	xfs_ino_t		ino)
>  {
>  	struct xfs_inode	*ip;
> +	int			min_order = 0;
>  
>  	/*
>  	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
> @@ -88,7 +89,8 @@ xfs_inode_alloc(
>  	/* VFS doesn't initialise i_mode or i_state! */
>  	VFS_I(ip)->i_mode = 0;
>  	VFS_I(ip)->i_state = 0;
> -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> +	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
> +	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
>  
>  	XFS_STATS_INC(mp, vn_active);
>  	ASSERT(atomic_read(&ip->i_pincount) == 0);
> @@ -313,6 +315,7 @@ xfs_reinit_inode(
>  	dev_t			dev = inode->i_rdev;
>  	kuid_t			uid = inode->i_uid;
>  	kgid_t			gid = inode->i_gid;
> +	int			min_order = 0;
>  
>  	error = inode_init_always(mp->m_super, inode);
>  
> @@ -323,7 +326,8 @@ xfs_reinit_inode(
>  	inode->i_rdev = dev;
>  	inode->i_uid = uid;
>  	inode->i_gid = gid;
> -	mapping_set_large_folios(inode->i_mapping);
> +	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
> +	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);

Twice now I've seen this, which makes me think "refactor this into a
single function."

But then, this is really just:

	mapping_set_folio_orders(inode->i_mapping,
			max(0, inode->i_sb->s_blocksize_bits - PAGE_SHIFT),
			MAX_PAGECACHE_ORDER);

Can we make that a generic inode_set_pagecache_orders helper?
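
Something along these lines, perhaps (an untested sketch of a
hypothetical helper; the name is only the one suggested above, and it
relies on mapping_set_folio_orders() and MAX_PAGECACHE_ORDER from
patch 01):

	/* derive the page cache folio order range from the fs block size */
	static inline void inode_set_pagecache_orders(struct inode *inode)
	{
		unsigned int min_order = 0;

		if (inode->i_sb->s_blocksize_bits > PAGE_SHIFT)
			min_order = inode->i_sb->s_blocksize_bits - PAGE_SHIFT;

		mapping_set_folio_orders(inode->i_mapping, min_order,
					 MAX_PAGECACHE_ORDER);
	}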

>  	return error;
>  }
>  
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 5a2512d20bd0..6a3f0f6727eb 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1625,13 +1625,11 @@ xfs_fs_fill_super(
>  		goto out_free_sb;
>  	}
>  
> -	/*
> -	 * Until this is fixed only page-sized or smaller data blocks work.
> -	 */
> -	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> +	if (!IS_ENABLED(CONFIG_XFS_LBS) && mp->m_sb.sb_blocksize > PAGE_SIZE) {
>  		xfs_warn(mp,
>  		"File system with blocksize %d bytes. "
> -		"Only pagesize (%ld) or less will currently work.",
> +		"Only pagesize (%ld) or less will currently work. "
> +		"Enable Experimental CONFIG_XFS_LBS for this support",
>  				mp->m_sb.sb_blocksize, PAGE_SIZE);

Please log a warning about the EXPERIMENTAL bs>ps feature being used
on this mount for the CONFIG_XFS_LBS=y case.

--D

>  		error = -ENOSYS;
>  		goto out_free_sb;
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-02-13  9:37 ` [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
@ 2024-02-13 16:26   ` Darrick J. Wong
  2024-02-13 21:48     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:26 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:11AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> make the calculation generic so that page cache count can be calculated
> correctly for LBS.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/xfs/xfs_mount.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index aabb25dc3efa..bfbaaecaf668 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -133,9 +133,13 @@ xfs_sb_validate_fsb_count(
>  {
>  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> +	unsigned long mapping_count;

Nit: indenting

	unsigned long		mapping_count;

> +	uint64_t bytes = nblocks << sbp->sb_blocklog;

What happens if someone feeds us a garbage fs with sb_blocklog > 64?
Or did we check that previously, so an overflow isn't possible?

> +
> +	mapping_count = bytes >> PAGE_SHIFT;

Does this result in truncation when unsigned long is 32 bits?
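
A sketch of one way to make both concerns explicit (assuming
check_shl_overflow() from <linux/overflow.h> is acceptable here, and
keeping the count in a u64 until after the ULONG_MAX check):

	uint64_t		bytes, count;

	if (check_shl_overflow(nblocks, sbp->sb_blocklog, &bytes))
		return -EFBIG;

	count = bytes >> PAGE_SHIFT;

	/* Limited by ULONG_MAX of page cache index */
	if (count > ULONG_MAX)
		return -EFBIG;
	return 0;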

--D

>  
>  	/* Limited by ULONG_MAX of page cache index */
> -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> +	if (mapping_count > ULONG_MAX)
>  		return -EFBIG;
>  	return 0;
>  }
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 11/14] xfs: expose block size in stat
  2024-02-13  9:37 ` [RFC v2 11/14] xfs: expose block size in stat Pankaj Raghav (Samsung)
@ 2024-02-13 16:27   ` Darrick J. Wong
  2024-02-13 21:32     ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:27 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david, Dave Chinner

On Tue, Feb 13, 2024 at 10:37:10AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For block size larger than page size, the unit of efficient IO is
> the block size, not the page size. Leaving stat() to report
> PAGE_SIZE as the block size causes test programs like fsx to issue
> illegal ranges for operations that require block size alignment
> (e.g. fallocate() insert range). Hence update the preferred IO size
> to reflect the block size in this case.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> [mcgrof: forward rebase in consideration for commit
> dd2d535e3fb29d ("xfs: cleanup calculating the stat optimal I/O size")]
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  fs/xfs/xfs_iops.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index a0d77f5f512e..8791a9d80897 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -515,6 +515,8 @@ xfs_stat_blksize(
>  	struct xfs_inode	*ip)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	unsigned long	default_size = max_t(unsigned long, PAGE_SIZE,
> +					     mp->m_sb.sb_blocksize);

Nit: wonky indentation, but...

>  
>  	/*
>  	 * If the file blocks are being allocated from a realtime volume, then
> @@ -543,7 +545,7 @@ xfs_stat_blksize(
>  			return 1U << mp->m_allocsize_log;
>  	}
>  
> -	return PAGE_SIZE;
> +	return default_size;

...why not return max_t(...) directly here?
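
i.e. something like (sketch):

	return max_t(unsigned long, PAGE_SIZE, mp->m_sb.sb_blocksize);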

--D

>  }
>  
>  STATIC int
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-02-13  9:37 ` [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
  2024-02-13 15:06   ` Hannes Reinecke
@ 2024-02-13 16:30   ` Darrick J. Wong
  2024-02-13 21:27     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:30 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:09AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> size < page_size. This is true for most filesystems at the moment.
> 
> If the block size > page size, this will send the contents of the page
> next to the zero page (as len > PAGE_SIZE) to the underlying block device,
> causing FS corruption.
> 
> iomap is a generic infrastructure and it should not make any assumptions
> about the fs block size and the page size of the system.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/iomap/direct-io.c | 13 +++++++++++--
>  1 file changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index bcd3f8cf5ea4..04f6c5548136 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
>  	struct page *page = ZERO_PAGE(0);
>  	struct bio *bio;
>  
> -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> +
> +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
>  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
>  				  GFP_KERNEL);
> +
>  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
>  	bio->bi_private = dio;
>  	bio->bi_end_io = iomap_dio_bio_end_io;
>  
> -	__bio_add_page(bio, page, len, 0);
> +	while (len) {
> +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);

What was the result of all that discussion about using the PMD-sized
zero-folio the last time this patch was submitted?  Did that prove to be
unwieldy, or did it require enough extra surgery to become its own
series?

(The code here looks good to me.)

--D

> +
> +		__bio_add_page(bio, page, io_len, 0);
> +		len -= io_len;
> +	}
>  	iomap_dio_submit_bio(iter, dio, bio, pos);
>  }
>  
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
  2024-02-13 12:03   ` Hannes Reinecke
@ 2024-02-13 16:34   ` Darrick J. Wong
  2024-02-13 21:05     ` Pankaj Raghav (Samsung)
  2024-02-14 18:49   ` Matthew Wilcox
  2 siblings, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:34 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:00AM +0100, Pankaj Raghav (Samsung) wrote:
> From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> 
> Some filesystems want to be able to limit the maximum size of folios,
> and some want to be able to ensure that folios are at least a certain
> size.  Add mapping_set_folio_orders() to allow this level of control.
> The max folio order parameter is ignored and it is always set to
> MAX_PAGECACHE_ORDER.

Why?  If MAX_PAGECACHE_ORDER is 8 and I instead pass in max==3, I'm
going to be surprised by my constraint being ignored.  Maybe I said that
because I'm not prepared to handle an order-7 folio; or some customer
will have some weird desire to twist this knob to make their workflow
faster.
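
A quick illustration of the behaviour with the patch as quoted (not a
suggestion, just what the current code ends up doing):

	mapping_set_folio_orders(mapping, 0, 3);
	/* mapping_max_folio_order(mapping) == MAX_PAGECACHE_ORDER, not 3 */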

--D

> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  include/linux/pagemap.h | 92 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 73 insertions(+), 19 deletions(-)
> 
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2df35e65557d..5618f762187b 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -202,13 +202,18 @@ enum mapping_flags {
>  	AS_EXITING	= 4, 	/* final truncate in progress */
>  	/* writeback related tags are not used */
>  	AS_NO_WRITEBACK_TAGS = 5,
> -	AS_LARGE_FOLIO_SUPPORT = 6,
> -	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
> -	AS_STABLE_WRITES,	/* must wait for writeback before modifying
> +	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
> +	AS_STABLE_WRITES = 7,	/* must wait for writeback before modifying
>  				   folio contents */
> -	AS_UNMOVABLE,		/* The mapping cannot be moved, ever */
> +	AS_FOLIO_ORDER_MIN = 8,
> +	AS_FOLIO_ORDER_MAX = 13, /* Bit 8-17 are used for FOLIO_ORDER */
> +	AS_UNMOVABLE = 18,		/* The mapping cannot be moved, ever */
>  };
>  
> +#define AS_FOLIO_ORDER_MIN_MASK 0x00001f00
> +#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
> +#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
> +
>  /**
>   * mapping_set_error - record a writeback error in the address_space
>   * @mapping: the mapping in which an error should be set
> @@ -344,6 +349,53 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>  	m->gfp_mask = mask;
>  }
>  
> +/*
> + * There are some parts of the kernel which assume that PMD entries
> + * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> + * limit the maximum allocation order to PMD size.  I'm not aware of any
> + * assumptions about maximum order if THP are disabled, but 8 seems like
> + * a good order (that's 1MB if you're using 4kB pages)
> + */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> +#else
> +#define MAX_PAGECACHE_ORDER	8
> +#endif
> +
> +/*
> + * mapping_set_folio_orders() - Set the range of folio sizes supported.
> + * @mapping: The file.
> + * @min: Minimum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + * @max: Maximum folio order (between 0-MAX_PAGECACHE_ORDER inclusive).
> + *
> + * The filesystem should call this function in its inode constructor to
> + * indicate which sizes of folio the VFS can use to cache the contents
> + * of the file.  This should only be used if the filesystem needs special
> + * handling of folio sizes (ie there is something the core cannot know).
> + * Do not tune it based on, eg, i_size.
> + *
> + * Context: This should not be called while the inode is active as it
> + * is non-atomic.
> + */
> +static inline void mapping_set_folio_orders(struct address_space *mapping,
> +					    unsigned int min, unsigned int max)
> +{
> +	if (min == 1)
> +		min = 2;
> +	if (max < min)
> +		max = min;
> +	if (max > MAX_PAGECACHE_ORDER)
> +		max = MAX_PAGECACHE_ORDER;
> +
> +	/*
> +	 * XXX: max is ignored as only minimum folio order is supported
> +	 * currently.
> +	 */
> +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> +			 (min << AS_FOLIO_ORDER_MIN) |
> +			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> +}
> +
>  /**
>   * mapping_set_large_folios() - Indicate the file supports large folios.
>   * @mapping: The file.
> @@ -357,7 +409,22 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
>   */
>  static inline void mapping_set_large_folios(struct address_space *mapping)
>  {
> -	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	mapping_set_folio_orders(mapping, 0, MAX_PAGECACHE_ORDER);
> +}
> +
> +static inline unsigned int mapping_max_folio_order(struct address_space *mapping)
> +{
> +	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
> +}
> +
> +static inline unsigned int mapping_min_folio_order(struct address_space *mapping)
> +{
> +	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
> +}
> +
> +static inline unsigned int mapping_min_folio_nrpages(struct address_space *mapping)
> +{
> +	return 1U << mapping_min_folio_order(mapping);
>  }
>  
>  /*
> @@ -367,7 +434,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
>  static inline bool mapping_large_folio_support(struct address_space *mapping)
>  {
>  	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> -		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
> +	       (mapping_max_folio_order(mapping) > 0);
>  }
>  
>  static inline int filemap_nr_thps(struct address_space *mapping)
> @@ -528,19 +595,6 @@ static inline void *detach_page_private(struct page *page)
>  	return folio_detach_private(page_folio(page));
>  }
>  
> -/*
> - * There are some parts of the kernel which assume that PMD entries
> - * are exactly HPAGE_PMD_ORDER.  Those should be fixed, but until then,
> - * limit the maximum allocation order to PMD size.  I'm not aware of any
> - * assumptions about maximum order if THP are disabled, but 8 seems like
> - * a good order (that's 1MB if you're using 4kB pages)
> - */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define MAX_PAGECACHE_ORDER	HPAGE_PMD_ORDER
> -#else
> -#define MAX_PAGECACHE_ORDER	8
> -#endif
> -
>  #ifdef CONFIG_NUMA
>  struct folio *filemap_alloc_folio(gfp_t gfp, unsigned int order);
>  #else
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 03/14] filemap: use mapping_min_order while allocating folios
  2024-02-13  9:37 ` [RFC v2 03/14] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
  2024-02-13 14:58   ` Hannes Reinecke
@ 2024-02-13 16:38   ` Darrick J. Wong
  2024-02-13 22:05   ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:38 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:02AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folio of order 0. __filemap_get_folio was trying to allocate higher
> order folios when fgp_flags had higher order hint set but it will default
> to order 0 folio if higher order memory allocation fails.
> 
> As we bring the notion of mapping_min_order, make sure these functions
> allocate at least folio of mapping_min_order as we need to guarantee it
> in the page cache.
> 
> Add some additional VM_BUG_ON() in page_cache_delete[batch] and
> __filemap_add_folio to catch errors where we delete or add folios that
> has order less than min_order.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Looks good to me,
Acked-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  mm/filemap.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 323a8e169581..7a6e15c47150 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -127,6 +127,7 @@
>  static void page_cache_delete(struct address_space *mapping,
>  				   struct folio *folio, void *shadow)
>  {
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	XA_STATE(xas, &mapping->i_pages, folio->index);
>  	long nr = 1;
>  
> @@ -135,6 +136,7 @@ static void page_cache_delete(struct address_space *mapping,
>  	xas_set_order(&xas, folio->index, folio_order(folio));
>  	nr = folio_nr_pages(folio);
>  
> +	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  
>  	xas_store(&xas, shadow);
> @@ -277,6 +279,7 @@ void filemap_remove_folio(struct folio *folio)
>  static void page_cache_delete_batch(struct address_space *mapping,
>  			     struct folio_batch *fbatch)
>  {
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	XA_STATE(xas, &mapping->i_pages, fbatch->folios[0]->index);
>  	long total_pages = 0;
>  	int i = 0;
> @@ -305,6 +308,7 @@ static void page_cache_delete_batch(struct address_space *mapping,
>  
>  		WARN_ON_ONCE(!folio_test_locked(folio));
>  
> +		VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
>  		folio->mapping = NULL;
>  		/* Leave folio->index set: truncation lookup relies on it */
>  
> @@ -846,6 +850,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>  	int huge = folio_test_hugetlb(folio);
>  	bool charged = false;
>  	long nr = 1;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>  	VM_BUG_ON_FOLIO(folio_test_swapbacked(folio), folio);
> @@ -896,6 +901,7 @@ noinline int __filemap_add_folio(struct address_space *mapping,
>  			}
>  		}
>  
> +		VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
>  		xas_store(&xas, folio);
>  		if (xas_error(&xas))
>  			goto unlock;
> @@ -1847,6 +1853,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		fgf_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +
> +	index = round_down(index, min_nrpages);
>  
>  repeat:
>  	folio = filemap_get_entry(mapping, index);
> @@ -1886,7 +1896,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		folio_wait_stable(folio);
>  no_page:
>  	if (!folio && (fgp_flags & FGP_CREAT)) {
> -		unsigned order = FGF_GET_ORDER(fgp_flags);
> +		unsigned int order = max(min_order, FGF_GET_ORDER(fgp_flags));
>  		int err;
>  
>  		if ((fgp_flags & FGP_WRITE) && mapping_can_writeback(mapping))
> @@ -1914,8 +1924,13 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  			err = -ENOMEM;
>  			if (order == 1)
>  				order = 0;
> +			if (order < min_order)
> +				order = min_order;
>  			if (order > 0)
>  				alloc_gfp |= __GFP_NORETRY | __GFP_NOWARN;
> +
> +			VM_BUG_ON(index & ((1UL << order) - 1));
> +
>  			folio = filemap_alloc_folio(alloc_gfp, order);
>  			if (!folio)
>  				continue;
> @@ -1929,7 +1944,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  				break;
>  			folio_put(folio);
>  			folio = NULL;
> -		} while (order-- > 0);
> +		} while (order-- > min_order);
>  
>  		if (err == -EEXIST)
>  			goto repeat;
> @@ -2424,7 +2439,8 @@ static int filemap_create_folio(struct file *file,
>  	struct folio *folio;
>  	int error;
>  
> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> +				    mapping_min_folio_order(mapping));
>  	if (!folio)
>  		return -ENOMEM;
>  
> @@ -3682,7 +3698,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  repeat:
>  	folio = filemap_get_folio(mapping, index);
>  	if (IS_ERR(folio)) {
> -		folio = filemap_alloc_folio(gfp, 0);
> +		folio = filemap_alloc_folio(gfp,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			return ERR_PTR(-ENOMEM);
>  		err = filemap_add_folio(mapping, folio, index, gfp);
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option
  2024-02-13  9:37 ` [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option Pankaj Raghav (Samsung)
@ 2024-02-13 16:39   ` Darrick J. Wong
  2024-02-13 21:19   ` Dave Chinner
  1 sibling, 0 replies; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:39 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:12AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Add an experimental CONFIG_XFS_LBS option to enable LBS support in XFS.
> Retain the ASSERT for PAGE_SHIFT if CONFIG_XFS_LBS is not enabled.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

With the changes I suggested in the next patch,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

(I think you ought to combine this patch and the next patch after
improving the EXPERIMENTAL log messaging.)

--D

> ---
>  fs/xfs/Kconfig     | 11 +++++++++++
>  fs/xfs/xfs_mount.c |  4 +++-
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/Kconfig b/fs/xfs/Kconfig
> index 567fb37274d3..6b0db2f7dc13 100644
> --- a/fs/xfs/Kconfig
> +++ b/fs/xfs/Kconfig
> @@ -216,3 +216,14 @@ config XFS_ASSERT_FATAL
>  	  result in warnings.
>  
>  	  This behavior can be modified at runtime via sysfs.
> +
> +config XFS_LBS
> +	bool "XFS large block size support (EXPERIMENTAL)"
> +	depends on XFS_FS
> +	help
> +	  Set Y to enable support for filesystem block size > system's
> +	  base page size.
> +
> +	  This feature is considered EXPERIMENTAL.  Use with caution!
> +
> +	  If unsure, say N.
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index bfbaaecaf668..596aa2cdefbc 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -131,11 +131,13 @@ xfs_sb_validate_fsb_count(
>  	xfs_sb_t	*sbp,
>  	uint64_t	nblocks)
>  {
> -	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
>  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
>  	unsigned long mapping_count;
>  	uint64_t bytes = nblocks << sbp->sb_blocklog;
>  
> +	if (!IS_ENABLED(CONFIG_XFS_LBS))
> +		ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> +
>  	mapping_count = bytes >> PAGE_SHIFT;
>  
>  	/* Limited by ULONG_MAX of page cache index */
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  2024-02-13  9:37 ` [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav (Samsung)
  2024-02-13 14:59   ` Hannes Reinecke
@ 2024-02-13 16:46   ` Darrick J. Wong
  2024-02-13 22:09   ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:46 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:03AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Set the file_ra_state->ra_pages in file_ra_state_init() to be at least
> mapping_min_order of pages if the bdi->ra_pages is less than that.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Looks good to me,
Acked-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  mm/readahead.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2648ec4f0494..4fa7d0e65706 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -138,7 +138,12 @@
>  void
>  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
>  {
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +	unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
> +
>  	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
> +	if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
> +		ra->ra_pages = min_nrpages;
>  	ra->prev_pos = -1;
>  }
>  EXPORT_SYMBOL_GPL(file_ra_state_init);
> -- 
> 2.43.0
> 
> 


* Re: [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra
  2024-02-13  9:37 ` [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
  2024-02-13 15:00   ` Hannes Reinecke
@ 2024-02-13 16:46   ` Darrick J. Wong
  2024-02-13 22:29   ` Dave Chinner
  2 siblings, 0 replies; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:46 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:04AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Align the ra->start and ra->size to mapping_min_order in
> ondemand_readahead(), and align the index to mapping_min_order in
> force_page_cache_ra(). This will ensure that the folios allocated for
> readahead that are added to the page cache are aligned to
> mapping_min_order.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Acked-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  mm/readahead.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 4fa7d0e65706..5e1ec7705c78 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -315,6 +315,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
>  	struct file_ra_state *ra = ractl->ra;
>  	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>  	unsigned long max_pages, index;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>  
>  	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
>  		return;
> @@ -324,6 +325,13 @@ void force_page_cache_ra(struct readahead_control *ractl,
>  	 * be up to the optimal hardware IO size
>  	 */
>  	index = readahead_index(ractl);
> +	if (!IS_ALIGNED(index, min_nrpages)) {
> +		unsigned long old_index = index;
> +
> +		index = round_down(index, min_nrpages);
> +		nr_to_read += (old_index - index);
> +	}
> +
>  	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
>  	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
>  	while (nr_to_read) {
> @@ -332,6 +340,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
>  		if (this_chunk > nr_to_read)
>  			this_chunk = nr_to_read;
>  		ractl->_index = index;
> +		VM_BUG_ON(!IS_ALIGNED(index, min_nrpages));
>  		do_page_cache_ra(ractl, this_chunk, 0);
>  
>  		index += this_chunk;
> @@ -344,11 +353,20 @@ void force_page_cache_ra(struct readahead_control *ractl,
>   * for small size, x 4 for medium, and x 2 for large
>   * for 128k (32 page) max ra
>   * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
> + *
> + * For higher order address space requirements we ensure no initial reads
> + * are ever less than the min number of pages required.
> + *
> + * We *always* cap the max io size allowed by the device.
>   */
> -static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
> +static unsigned long get_init_ra_size(unsigned long size,
> +				      unsigned int min_nrpages,
> +				      unsigned long max)
>  {
>  	unsigned long newsize = roundup_pow_of_two(size);
>  
> +	newsize = max_t(unsigned long, newsize, min_nrpages);
> +
>  	if (newsize <= max / 32)
>  		newsize = newsize * 4;
>  	else if (newsize <= max / 4)
> @@ -356,6 +374,8 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
>  	else
>  		newsize = max;
>  
> +	VM_BUG_ON(newsize & (min_nrpages - 1));
> +
>  	return newsize;
>  }
>  
> @@ -364,14 +384,16 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
>   *  return it as the new window size.
>   */
>  static unsigned long get_next_ra_size(struct file_ra_state *ra,
> +				      unsigned int min_nrpages,
>  				      unsigned long max)
>  {
> -	unsigned long cur = ra->size;
> +	unsigned long cur = max(ra->size, min_nrpages);
>  
>  	if (cur < max / 16)
>  		return 4 * cur;
>  	if (cur <= max / 2)
>  		return 2 * cur;
> +
>  	return max;
>  }
>  
> @@ -561,7 +583,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  	unsigned long add_pages;
>  	pgoff_t index = readahead_index(ractl);
>  	pgoff_t expected, prev_index;
> -	unsigned int order = folio ? folio_order(folio) : 0;
> +	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(ractl->mapping);
> +	unsigned int order = folio ? folio_order(folio) : min_order;
> +
> +	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
>  
>  	/*
>  	 * If the request exceeds the readahead window, allow the read to
> @@ -583,8 +609,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  	expected = round_down(ra->start + ra->size - ra->async_size,
>  			1UL << order);
>  	if (index == expected || index == (ra->start + ra->size)) {
> -		ra->start += ra->size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->start += round_down(ra->size, min_nrpages);
> +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
>  		ra->async_size = ra->size;
>  		goto readit;
>  	}
> @@ -603,13 +629,18 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  				max_pages);
>  		rcu_read_unlock();
>  
> +		start = round_down(start, min_nrpages);
> +
> +		VM_BUG_ON(folio->index & (folio_nr_pages(folio) - 1));
> +
>  		if (!start || start - index > max_pages)
>  			return;
>  
>  		ra->start = start;
>  		ra->size = start - index;	/* old async_size */
> +
>  		ra->size += req_size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
>  		ra->async_size = ra->size;
>  		goto readit;
>  	}
> @@ -646,7 +677,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  
>  initial_readahead:
>  	ra->start = index;
> -	ra->size = get_init_ra_size(req_size, max_pages);
> +	ra->size = get_init_ra_size(req_size, min_nrpages, max_pages);
>  	ra->async_size = ra->size > req_size ? ra->size - req_size : ra->size;
>  
>  readit:
> @@ -657,7 +688,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  	 * Take care of maximum IO pages as above.
>  	 */
>  	if (index == ra->start && ra->size == ra->async_size) {
> -		add_pages = get_next_ra_size(ra, max_pages);
> +		add_pages = get_next_ra_size(ra, min_nrpages, max_pages);
>  		if (ra->size + add_pages <= max_pages) {
>  			ra->async_size = add_pages;
>  			ra->size += add_pages;
> @@ -668,6 +699,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  	}
>  
>  	ractl->_index = ra->start;
> +	VM_BUG_ON(!IS_ALIGNED(ractl->_index, min_nrpages));
>  	page_cache_ra_order(ractl, ra, order);
>  }
>  
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 06/14] readahead: rework loop in page_cache_ra_unbounded()
  2024-02-13  9:37 ` [RFC v2 06/14] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
@ 2024-02-13 16:47   ` Darrick J. Wong
  0 siblings, 0 replies; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:47 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:05AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Hannes Reinecke <hare@suse.de>
> 
> Rework the loop in page_cache_ra_unbounded() to advance with
> the number of pages in a folio instead of just one page at a time.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> Co-developed-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

Acked-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  mm/readahead.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 5e1ec7705c78..13b62cbd3b79 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -213,7 +213,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	struct address_space *mapping = ractl->mapping;
>  	unsigned long index = readahead_index(ractl);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
> -	unsigned long i;
> +	unsigned long i = 0;
>  
>  	/*
>  	 * Partway through the readahead operation, we will have added
> @@ -231,7 +231,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	/*
>  	 * Preallocate as many pages as we will need.
>  	 */
> -	for (i = 0; i < nr_to_read; i++) {
> +	while (i < nr_to_read) {
>  		struct folio *folio = xa_load(&mapping->i_pages, index + i);
>  
>  		if (folio && !xa_is_value(folio)) {
> @@ -244,8 +244,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index++;
> -			i = ractl->_index + ractl->_nr_pages - index - 1;
> +			ractl->_index += folio_nr_pages(folio);
> +			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
>  
> @@ -257,13 +257,14 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			folio_put(folio);
>  			read_pages(ractl);
>  			ractl->_index++;
> -			i = ractl->_index + ractl->_nr_pages - index - 1;
> +			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
>  		if (i == nr_to_read - lookahead_size)
>  			folio_set_readahead(folio);
>  		ractl->_workingset |= folio_test_workingset(folio);
> -		ractl->_nr_pages++;
> +		ractl->_nr_pages += folio_nr_pages(folio);
> +		i += folio_nr_pages(folio);
>  	}
>  
>  	/*
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order)
  2024-02-13  9:37 ` [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
  2024-02-13 15:01   ` Hannes Reinecke
@ 2024-02-13 16:47   ` Darrick J. Wong
  1 sibling, 0 replies; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 16:47 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:37:06AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Allocate folios with at least mapping_min_order in
> page_cache_ra_unbounded() and page_cache_ra_order() as we need to
> guarantee a minimum order in the page cache.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>

Acked-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  mm/readahead.c | 30 ++++++++++++++++++++++++++----
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 13b62cbd3b79..a361fba18674 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -214,6 +214,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  	unsigned long index = readahead_index(ractl);
>  	gfp_t gfp_mask = readahead_gfp_mask(mapping);
>  	unsigned long i = 0;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>  
>  	/*
>  	 * Partway through the readahead operation, we will have added
> @@ -235,6 +236,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  		struct folio *folio = xa_load(&mapping->i_pages, index + i);
>  
>  		if (folio && !xa_is_value(folio)) {
> +			long nr_pages = folio_nr_pages(folio);
> +
>  			/*
>  			 * Page already present?  Kick off the current batch
>  			 * of contiguous pages before continuing with the
> @@ -244,19 +247,31 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index += folio_nr_pages(folio);
> +
> +			/*
> +			 * Move the ractl->_index by at least min_pages
> +			 * if the folio got truncated to respect the
> +			 * alignment constraint in the page cache.
> +			 *
> +			 */
> +			if (mapping != folio->mapping)
> +				nr_pages = min_nrpages;
> +
> +			VM_BUG_ON_FOLIO(nr_pages < min_nrpages, folio);
> +			ractl->_index += nr_pages;
>  			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask,
> +					    mapping_min_folio_order(mapping));
>  		if (!folio)
>  			break;
>  		if (filemap_add_folio(mapping, folio, index + i,
>  					gfp_mask) < 0) {
>  			folio_put(folio);
>  			read_pages(ractl);
> -			ractl->_index++;
> +			ractl->_index += min_nrpages;
>  			i = ractl->_index + ractl->_nr_pages - index;
>  			continue;
>  		}
> @@ -516,6 +531,7 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  {
>  	struct address_space *mapping = ractl->mapping;
>  	pgoff_t index = readahead_index(ractl);
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	pgoff_t limit = (i_size_read(mapping->host) - 1) >> PAGE_SHIFT;
>  	pgoff_t mark = index + ra->size - ra->async_size;
>  	int err = 0;
> @@ -542,11 +558,17 @@ void page_cache_ra_order(struct readahead_control *ractl,
>  		if (index & ((1UL << order) - 1))
>  			order = __ffs(index);
>  		/* Don't allocate pages past EOF */
> -		while (index + (1UL << order) - 1 > limit)
> +		while (order > min_order && index + (1UL << order) - 1 > limit)
>  			order--;
>  		/* THP machinery does not support order-1 */
>  		if (order == 1)
>  			order = 0;
> +
> +		if (order < min_order)
> +			order = min_order;
> +
> +		VM_BUG_ON(index & ((1UL << order) - 1));
> +
>  		err = ra_alloc_folio(ractl, index, mark, order, gfp);
>  		if (err)
>  			break;
> -- 
> 2.43.0
> 
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13 16:34   ` Darrick J. Wong
@ 2024-02-13 21:05     ` Pankaj Raghav (Samsung)
  2024-02-13 21:29       ` Darrick J. Wong
  0 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13 21:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 08:34:31AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 10:37:00AM +0100, Pankaj Raghav (Samsung) wrote:
> > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > 
> > Some filesystems want to be able to limit the maximum size of folios,
> > and some want to be able to ensure that folios are at least a certain
> > size.  Add mapping_set_folio_orders() to allow this level of control.
> > The max folio order parameter is ignored and it is always set to
> > MAX_PAGECACHE_ORDER.
> 
> Why?  If MAX_PAGECACHE_ORDER is 8 and I instead pass in max==3, I'm
> going to be surprised by my constraint being ignored.  Maybe I said that
> because I'm not prepared to handle an order-7 folio; or some customer
> will have some weird desire to twist this knob to make their workflow
> faster.
> 
> --D
Maybe I should have been more explicit. We are planning to add support
for min order in the first round, and we want to add support for max order
once the min order support is upstreamed. This was done mainly to reduce
the scope and testing burden of this series.

I definitely agree there are use cases for setting the max order. It is
also the feedback we got from LPC.

So one idea would be to not expose the max option until we add support
for max order? That way filesystems can only set the min_order with the
initial support?
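
Something like the below is what I have in mind for the min-only variant
(just a sketch; mapping_set_folio_min_order() is a hypothetical name, and
max stays pinned to MAX_PAGECACHE_ORDER internally):

static inline void mapping_set_folio_min_order(struct address_space *mapping,
					       unsigned int min)
{
	if (min == 1)
		min = 2;	/* THP machinery does not support order-1 */
	if (min > MAX_PAGECACHE_ORDER)
		min = MAX_PAGECACHE_ORDER;

	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
			 (min << AS_FOLIO_ORDER_MIN) |
			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
}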

> > +static inline void mapping_set_folio_orders(struct address_space *mapping,
> > +					    unsigned int min, unsigned int max)
> > +{
> > +	if (min == 1)
> > +		min = 2;
> > +	if (max < min)
> > +		max = min;
> > +	if (max > MAX_PAGECACHE_ORDER)
> > +		max = MAX_PAGECACHE_ORDER;
> > +
> > +	/*
> > +	 * XXX: max is ignored as only minimum folio order is supported
> > +	 * currently.
> > +	 */
> > +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> > +			 (min << AS_FOLIO_ORDER_MIN) |
> > +			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> > +}
> > +

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache
  2024-02-13 12:20   ` Hannes Reinecke
@ 2024-02-13 21:13     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13 21:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, willy, linux-mm, david

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 750e779c23db..323a8e169581 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2479,14 +2479,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
> >   {
> >   	struct file *filp = iocb->ki_filp;
> >   	struct address_space *mapping = filp->f_mapping;
> > +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> >   	struct file_ra_state *ra = &filp->f_ra;
> > -	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
> > +	pgoff_t index = round_down(iocb->ki_pos >> PAGE_SHIFT, min_nrpages);
> >   	pgoff_t last_index;
> >   	struct folio *folio;
> >   	int err = 0;
> >   	/* "last_index" is the index of the page beyond the end of the read */
> >   	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> > +	last_index = round_up(last_index, min_nrpages);
> 
> Huh? 'last_index' is unset here; sure you mean 'iocb->ki_pos + count' ?
> 
> And what's the rationale to replace 'DIV_ROUND_UP' with 'round_up' ?

Actually we are not replacing DIV_ROUND_UP; we are adding a second
round_up to a multiple of min_nrpages after converting the byte count
into a page index to get last_index. The main idea is to align
last_index to min_order, if it is set. AFAIK, there is no single helper
that can do both steps in one shot.
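
For example, assuming a 16k block size on a 4k page machine
(min_nrpages == 4) and a read that ends at byte 5000:

	last_index = DIV_ROUND_UP(5000, 4096);	/* -> 2, in page units */
	last_index = round_up(last_index, 4);	/* -> 4, aligned to min order */

With min_nrpages == 1 the second step is a no-op, so existing
filesystems see no change.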

> 
> >   retry:
> >   	if (fatal_signal_pending(current))
> >   		return -EINTR;

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option
  2024-02-13  9:37 ` [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option Pankaj Raghav (Samsung)
  2024-02-13 16:39   ` Darrick J. Wong
@ 2024-02-13 21:19   ` Dave Chinner
  2024-02-13 21:54     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 21:19 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:37:12AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Add an experimental CONFIG_XFS_LBS option to enable LBS support in XFS.
> Retain the ASSERT for PAGE_SHIFT if CONFIG_XFS_LBS is not enabled.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>

NAK.

There is no reason for this to exist - the same code is run
regardless of the state of this config variable, just with a
difference in min folio order. All it does is increase the test
matrix arbitrarily - now we have two kernel configs we have to test
and there's no good reason for doing that.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-02-13 16:30   ` Darrick J. Wong
@ 2024-02-13 21:27     ` Pankaj Raghav (Samsung)
  2024-02-13 21:30       ` Darrick J. Wong
  0 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13 21:27 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 08:30:37AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 10:37:09AM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > size < page_size. This is true for most filesystems at the moment.
> > 
> > If the block size > page size, this will send the contents of the page
> > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > causing FS corruption.
> > 
> > iomap is a generic infrastructure and it should not make any assumptions
> > about the fs block size and the page size of the system.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  fs/iomap/direct-io.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > index bcd3f8cf5ea4..04f6c5548136 100644
> > --- a/fs/iomap/direct-io.c
> > +++ b/fs/iomap/direct-io.c
> > @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> >  	struct page *page = ZERO_PAGE(0);
> >  	struct bio *bio;
> >  
> > -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> > +
> > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> >  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> >  				  GFP_KERNEL);
> > +
> >  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> >  	bio->bi_private = dio;
> >  	bio->bi_end_io = iomap_dio_bio_end_io;
> >  
> > -	__bio_add_page(bio, page, len, 0);
> > +	while (len) {
> > +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> 
> What was the result of all that discussion about using the PMD-sized
> zero-folio the last time this patch was submitted?  Did that prove to be
> > unwieldy, or did it require enough extra surgery to become its own
> series?
> 

It proved a bit unwieldy to me, at least, as I did not know any
straightforward way to do it at the time. So I thought I would keep this
approach as it is, and add support for the PMD-sized zero folio as a later
improvement.
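
Roughly what I have in mind for that later improvement (only a sketch;
zero_folio and the init-time allocation are hypothetical, nothing like
this exists in iomap yet):

	static struct folio *zero_folio;

	/* allocated once at init, big enough for the largest supported
	 * fs block size: */
	zero_folio = folio_alloc(GFP_KERNEL | __GFP_ZERO, MAX_PAGECACHE_ORDER);

	/* then iomap_dio_zero() can add the whole range in one go: */
	__bio_add_page(bio, folio_page(zero_folio, 0), len, 0);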

> (The code here looks good to me.)

Thanks!
> 
> --D
> 
> > +
> > +		__bio_add_page(bio, page, io_len, 0);
> > +		len -= io_len;
> > +	}
> >  	iomap_dio_submit_bio(iter, dio, bio, pos);
> >  }
> >  
> > -- 
> > 2.43.0
> > 
> > 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13 21:05     ` Pankaj Raghav (Samsung)
@ 2024-02-13 21:29       ` Darrick J. Wong
  2024-02-14 19:00         ` Matthew Wilcox
  0 siblings, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 21:29 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:05:54PM +0100, Pankaj Raghav (Samsung) wrote:
> On Tue, Feb 13, 2024 at 08:34:31AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 10:37:00AM +0100, Pankaj Raghav (Samsung) wrote:
> > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > 
> > > Some filesystems want to be able to limit the maximum size of folios,
> > > and some want to be able to ensure that folios are at least a certain
> > > size.  Add mapping_set_folio_orders() to allow this level of control.
> > > The max folio order parameter is ignored and it is always set to
> > > MAX_PAGECACHE_ORDER.
> > 
> > Why?  If MAX_PAGECACHE_ORDER is 8 and I instead pass in max==3, I'm
> > going to be surprised by my constraint being ignored.  Maybe I said that
> > because I'm not prepared to handle an order-7 folio; or some customer
> > will have some weird desire to twist this knob to make their workflow
> > faster.
> > 
> > --D
> Maybe I should have been explicit. We are planning to add support
> for min order in the first round, and we want to add support for max order
> once the min order support is upstreamed. It was done mainly to reduce
> the scope and testing of this series.
> 
> I definitely agree there are usecases for setting the max order. It is
> also the feedback we got from LPC.
> 
> So one idea would be not to expose max option until we add the support
> for max order? So filesystems can only set the min_order with the
> initial support?

Yeah, there's really no point in having an argument that's deliberately
ignored.

--D

> > > +static inline void mapping_set_folio_orders(struct address_space *mapping,
> > > +					    unsigned int min, unsigned int max)
> > > +{
> > > +	if (min == 1)
> > > +		min = 2;
> > > +	if (max < min)
> > > +		max = min;
> > > +	if (max > MAX_PAGECACHE_ORDER)
> > > +		max = MAX_PAGECACHE_ORDER;
> > > +
> > > +	/*
> > > +	 * XXX: max is ignored as only minimum folio order is supported
> > > +	 * currently.
> > > +	 */
> > > +	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
> > > +			 (min << AS_FOLIO_ORDER_MIN) |
> > > +			 (MAX_PAGECACHE_ORDER << AS_FOLIO_ORDER_MAX);
> > > +}
> > > +
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-02-13 21:27     ` Pankaj Raghav (Samsung)
@ 2024-02-13 21:30       ` Darrick J. Wong
  2024-02-14 15:13         ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 64+ messages in thread
From: Darrick J. Wong @ 2024-02-13 21:30 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 10:27:32PM +0100, Pankaj Raghav (Samsung) wrote:
> On Tue, Feb 13, 2024 at 08:30:37AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 10:37:09AM +0100, Pankaj Raghav (Samsung) wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > iomap_dio_zero() will pad a fs block with zeroes if the direct IO size
> > > < fs block size. iomap_dio_zero() has an implicit assumption that fs block
> > > size < page_size. This is true for most filesystems at the moment.
> > > 
> > > If the block size > page size, this will send the contents of the page
> > > next to zero page(as len > PAGE_SIZE) to the underlying block device,
> > > causing FS corruption.
> > > 
> > > iomap is a generic infrastructure and it should not make any assumptions
> > > about the fs block size and the page size of the system.
> > > 
> > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > > ---
> > >  fs/iomap/direct-io.c | 13 +++++++++++--
> > >  1 file changed, 11 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index bcd3f8cf5ea4..04f6c5548136 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -239,14 +239,23 @@ static void iomap_dio_zero(const struct iomap_iter *iter, struct iomap_dio *dio,
> > >  	struct page *page = ZERO_PAGE(0);
> > >  	struct bio *bio;
> > >  
> > > -	bio = iomap_dio_alloc_bio(iter, dio, 1, REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > > +	WARN_ON_ONCE(len > (BIO_MAX_VECS * PAGE_SIZE));
> > > +
> > > +	bio = iomap_dio_alloc_bio(iter, dio, BIO_MAX_VECS,
> > > +				  REQ_OP_WRITE | REQ_SYNC | REQ_IDLE);
> > >  	fscrypt_set_bio_crypt_ctx(bio, inode, pos >> inode->i_blkbits,
> > >  				  GFP_KERNEL);
> > > +
> > >  	bio->bi_iter.bi_sector = iomap_sector(&iter->iomap, pos);
> > >  	bio->bi_private = dio;
> > >  	bio->bi_end_io = iomap_dio_bio_end_io;
> > >  
> > > -	__bio_add_page(bio, page, len, 0);
> > > +	while (len) {
> > > +		unsigned int io_len = min_t(unsigned int, len, PAGE_SIZE);
> > 
> > What was the result of all that discussion about using the PMD-sized
> > zero-folio the last time this patch was submitted?  Did that prove to be
> > > unwieldy, or did it require enough extra surgery to become its own
> > series?
> > 
> 
> It proved a bit unwieldy to me, at least, as I did not know any
> straightforward way to do it at the time. So I thought I would keep this
> approach as it is, and add support for the PMD-sized zero folio as a later
> improvement.
> 
> > (The code here looks good to me.)
> 
> Thanks!

In that case I'll throw it on the testing pile and let's ask brauner to
merge this for 6.9 if nothing blows up.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> > 
> > --D
> > 
> > > +
> > > +		__bio_add_page(bio, page, io_len, 0);
> > > +		len -= io_len;
> > > +	}
> > >  	iomap_dio_submit_bio(iter, dio, bio, pos);
> > >  }
> > >  
> > > -- 
> > > 2.43.0
> > > 
> > > 
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 11/14] xfs: expose block size in stat
  2024-02-13 16:27   ` Darrick J. Wong
@ 2024-02-13 21:32     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13 21:32 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david, Dave Chinner

> > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > index a0d77f5f512e..8791a9d80897 100644
> > --- a/fs/xfs/xfs_iops.c
> > +++ b/fs/xfs/xfs_iops.c
> > @@ -515,6 +515,8 @@ xfs_stat_blksize(
> >  	struct xfs_inode	*ip)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	unsigned long	default_size = max_t(unsigned long, PAGE_SIZE,
> > +					     mp->m_sb.sb_blocksize);
> 
> Nit: wonky indentation, but...
> 
> >  
> >  	/*
> >  	 * If the file blocks are being allocated from a realtime volume, then
> > @@ -543,7 +545,7 @@ xfs_stat_blksize(
> >  			return 1U << mp->m_allocsize_log;
> >  	}
> >  
> > -	return PAGE_SIZE;
> > +	return default_size;
> 
> ...why not return max_t(...) directly here?

Sounds good. I will add this change.
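
i.e. end up with just:

	return max_t(unsigned long, PAGE_SIZE, mp->m_sb.sb_blocksize);
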
> 
> --D
> 
> >  }
> >  
> >  STATIC int
> > -- 
> > 2.43.0
> > 
> > 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 14/14] xfs: enable block size larger than page size support
  2024-02-13  9:37 ` [RFC v2 14/14] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
  2024-02-13 16:20   ` Darrick J. Wong
@ 2024-02-13 21:34   ` Dave Chinner
  2024-02-14 16:35     ` Pankaj Raghav (Samsung)
  1 sibling, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 21:34 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:37:13AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> Page cache now has the ability to have a minimum order when allocating
> a folio which is a prerequisite to add support for block size > page
> size. Enable it in XFS under CONFIG_XFS_LBS.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  fs/xfs/xfs_icache.c | 8 ++++++--
>  fs/xfs/xfs_super.c  | 8 +++-----
>  2 files changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index dba514a2c84d..9de81caf7ad4 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -73,6 +73,7 @@ xfs_inode_alloc(
>  	xfs_ino_t		ino)
>  {
>  	struct xfs_inode	*ip;
> +	int			min_order = 0;
>  
>  	/*
>  	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
> @@ -88,7 +89,8 @@ xfs_inode_alloc(
>  	/* VFS doesn't initialise i_mode or i_state! */
>  	VFS_I(ip)->i_mode = 0;
>  	VFS_I(ip)->i_state = 0;
> -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> +	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
> +	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);

That's pretty nasty. You're using max() to hide underflow in the
subtraction to clamp the value to zero. And you don't need ilog2()
because we have the log of the block size in the superblock already.

	int			min_order = 0;
	.....
	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
		min_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;

But, really why recalculate this -constant- on every inode
allocation?  That's a very hot path, so this should be set in the
M_IGEO(mp) structure (mp->m_ino_geo) at mount time and then the code
is simply:

	mapping_set_folio_orders(VFS_I(ip)->i_mapping,
			M_IGEO(mp)->min_folio_order, MAX_PAGECACHE_ORDER);

We already access the M_IGEO(mp) structure on every inode allocation,
so there's little in the way of additional cost here....
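
i.e. something like this at mount time (min_folio_order being a new
xfs_ino_geometry field, obviously):

	/* e.g. in xfs_ialloc_setup_geometry(): */
	struct xfs_ino_geometry	*igeo = M_IGEO(mp);

	igeo->min_folio_order = 0;
	if (mp->m_sb.sb_blocklog > PAGE_SHIFT)
		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;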

> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 5a2512d20bd0..6a3f0f6727eb 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1625,13 +1625,11 @@ xfs_fs_fill_super(
>  		goto out_free_sb;
>  	}
>  
> -	/*
> -	 * Until this is fixed only page-sized or smaller data blocks work.
> -	 */
> -	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> +	if (!IS_ENABLED(CONFIG_XFS_LBS) && mp->m_sb.sb_blocksize > PAGE_SIZE) {
>  		xfs_warn(mp,
>  		"File system with blocksize %d bytes. "
> -		"Only pagesize (%ld) or less will currently work.",
> +		"Only pagesize (%ld) or less will currently work. "
> +		"Enable Experimental CONFIG_XFS_LBS for this support",
>  				mp->m_sb.sb_blocksize, PAGE_SIZE);
>  		error = -ENOSYS;
>  		goto out_free_sb;

This should just issue a warning if bs > ps.

	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
  		xfs_warn(mp,
"EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
			mp->m_sb.sb_blocksize);
	}

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-02-13 16:26   ` Darrick J. Wong
@ 2024-02-13 21:48     ` Pankaj Raghav (Samsung)
  2024-02-13 22:44       ` Dave Chinner
  0 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13 21:48 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

On Tue, Feb 13, 2024 at 08:26:11AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 10:37:11AM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> > make the calculation generic so that page cache count can be calculated
> > correctly for LBS.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > ---
> >  fs/xfs/xfs_mount.c | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index aabb25dc3efa..bfbaaecaf668 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -133,9 +133,13 @@ xfs_sb_validate_fsb_count(
> >  {
> >  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> >  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> > +	unsigned long mapping_count;
> 
> Nit: indenting
> 
> 	unsigned long		mapping_count;

I will add this in the next revision.
> 
> > +	uint64_t bytes = nblocks << sbp->sb_blocklog;
> 
> What happens if someone feeds us a garbage fs with sb_blocklog > 64?
> Or did we check that previously, so an overflow isn't possible?
> 
I was thinking of the possibility of an overflow, but at the moment the
blocklog is capped at 16 (65536 bytes), right? mkfs refuses any block
size larger than 64k. And we have a check for this in xfs_validate_sb_common()
in the kernel, which will catch it before this happens?

> > +
> > +	mapping_count = bytes >> PAGE_SHIFT;
> 
> Does this result in truncation when unsigned long is 32 bits?

Ah, good point. So it is better to use uint64_t for mapping_count as
well to be on the safe side?

> 
> --D
> 
> >  
> >  	/* Limited by ULONG_MAX of page cache index */
> > -	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
> > +	if (mapping_count > ULONG_MAX)
> >  		return -EFBIG;
> >  	return 0;
> >  }
> > -- 
> > 2.43.0
> > 
> > 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option
  2024-02-13 21:19   ` Dave Chinner
@ 2024-02-13 21:54     ` Pankaj Raghav (Samsung)
  2024-02-13 22:45       ` Dave Chinner
  0 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-13 21:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Wed, Feb 14, 2024 at 08:19:23AM +1100, Dave Chinner wrote:
> On Tue, Feb 13, 2024 at 10:37:12AM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > Add an experimental CONFIG_XFS_LBS option to enable LBS support in XFS.
> > Retain the ASSERT for PAGE_SHIFT if CONFIG_XFS_LBS is not enabled.
> > 
> > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> 
> NAK.
> 
> There is no reason for this to exist - the same code is run
> regardless of the state of this config variable, just with a
> difference in min folio order. All it does is increase the test
> matrix arbitrarily - now we have two kernel configs we have to test
> and there's no good reason for doing that.

I did not have this CONFIG in the first round but I thought it might
help retain the existing behaviour until we deem the feature stable.

But I get your point. So we remove this CONFIG and just have an
experimental warning during mount when people are using the LBS support?

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache
  2024-02-13  9:37 ` [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
  2024-02-13 12:20   ` Hannes Reinecke
@ 2024-02-13 22:00   ` Dave Chinner
  1 sibling, 0 replies; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 22:00 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:37:01AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Supporting mapping_min_order implies that we guarantee each folio in the
> page cache has at least an order of mapping_min_order. So when adding new
> folios to the page cache we must ensure the index used is aligned to the
> mapping_min_order as the page cache requires the index to be aligned to
> the order of the folio.
> 
> A higher order folio than min_order by definition is a multiple of the
> min_order. If an index is aligned to an order higher than a min_order, it
> will also be aligned to the min order.
> 
> This effectively introduces no new functional changes when min order is
> not set other than a few rounding computations that should result in the
> same value.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/filemap.c | 34 ++++++++++++++++++++++++++--------
>  1 file changed, 26 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 750e779c23db..323a8e169581 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2479,14 +2479,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  {
>  	struct file *filp = iocb->ki_filp;
>  	struct address_space *mapping = filp->f_mapping;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>  	struct file_ra_state *ra = &filp->f_ra;
> -	pgoff_t index = iocb->ki_pos >> PAGE_SHIFT;
> +	pgoff_t index = round_down(iocb->ki_pos >> PAGE_SHIFT, min_nrpages);

This is pretty magic - this patch adds a bunch of what appear to be
random rounding operations for some undocumented reason.

>  	pgoff_t last_index;
>  	struct folio *folio;
>  	int err = 0;
>  
>  	/* "last_index" is the index of the page beyond the end of the read */
>  	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	last_index = round_up(last_index, min_nrpages);

Same here - this is pretty nasty - we round up twice, but no obvious
reason as to why the second round up exists or why it can't be done
by the DIV_ROUND_UP() call. Just looking at the code it's impossible
to reason why this is being done, let alone determine if it has been
implemented correctly.

>  retry:
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -2502,8 +2504,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	if (!folio_batch_count(fbatch)) {
>  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>  			return -EAGAIN;
> -		err = filemap_create_folio(filp, mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> +		err = filemap_create_folio(filp, mapping, index, fbatch);
>  		if (err == AOP_TRUNCATED_PAGE)
>  			goto retry;
>  		return err;
> @@ -3095,7 +3096,10 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	struct file *file = vmf->vma->vm_file;
>  	struct file_ra_state *ra = &file->f_ra;
>  	struct address_space *mapping = file->f_mapping;
> -	DEFINE_READAHEAD(ractl, file, ra, mapping, vmf->pgoff);
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(file->f_mapping);

Why use file->f_mapping here and not mapping? And why not just

	unsigned int min_nrpages = 1U << min_order;

So it's obvious how the index alignment is related to the folio
order?

> +	pgoff_t index = round_down(vmf->pgoff, min_nrpages);
> +	DEFINE_READAHEAD(ractl, file, ra, mapping, index);
>  	struct file *fpin = NULL;
>  	unsigned long vm_flags = vmf->vma->vm_flags;
>  	unsigned int mmap_miss;
> @@ -3147,10 +3151,11 @@ static struct file *do_sync_mmap_readahead(struct vm_fault *vmf)
>  	 */
>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);
>  	ra->start = max_t(long, 0, vmf->pgoff - ra->ra_pages / 2);
> +	ra->start = round_down(ra->start, min_nrpages);

Again, another random rounding operation....

>  	ra->size = ra->ra_pages;
>  	ra->async_size = ra->ra_pages / 4;
>  	ractl._index = ra->start;
> -	page_cache_ra_order(&ractl, ra, 0);
> +	page_cache_ra_order(&ractl, ra, min_order);
>  	return fpin;
>  }
>  
> @@ -3164,7 +3169,9 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
>  {
>  	struct file *file = vmf->vma->vm_file;
>  	struct file_ra_state *ra = &file->f_ra;
> -	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, vmf->pgoff);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(file->f_mapping);
> +	pgoff_t index = round_down(vmf->pgoff, min_nrpages);
> +	DEFINE_READAHEAD(ractl, file, ra, file->f_mapping, index);

Ok, this is begging for a new DEFINE_READAHEAD_ALIGNED() macro which
internally grabs the mapping_min_folio_nrpages() from the mapping
passed to the macro.
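
Something along the lines of the existing DEFINE_READAHEAD(), using the
alignment helper suggested further down (sketch only):

#define DEFINE_READAHEAD_ALIGNED(ractl, f, r, m, i)			\
	struct readahead_control ractl = {				\
		.file = f,						\
		.mapping = m,						\
		.ra = r,						\
		._index = mapping_align_start_index(m, i),		\
	}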

>  	struct file *fpin = NULL;
>  	unsigned int mmap_miss;
>  
> @@ -3212,13 +3219,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	struct file *file = vmf->vma->vm_file;
>  	struct file *fpin = NULL;
>  	struct address_space *mapping = file->f_mapping;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int nrpages = 1UL << min_order;

You didn't use mapping_min_folio_nrpages() for this.

At least initialise all the variables the same way in the same
patch!

>  	struct inode *inode = mapping->host;
> -	pgoff_t max_idx, index = vmf->pgoff;
> +	pgoff_t max_idx, index = round_down(vmf->pgoff, nrpages);

Yup, I can't help but think that, with how many times this is being
repeated in this patchset, a helper or two is in order:

	index = mapping_align_start_index(mapping, vmf->pgoff);

And then most of the calls to mapping_min_folio_order() and
mapping_min_folio_nrpages() can be removed from this code, too.
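
Where those helpers would just be trivial wrappers around the rounding,
something like:

static inline pgoff_t
mapping_align_start_index(struct address_space *mapping, pgoff_t index)
{
	return round_down(index, mapping_min_folio_nrpages(mapping));
}

static inline pgoff_t
mapping_align_end_index(struct address_space *mapping, pgoff_t index)
{
	return round_up(index, mapping_min_folio_nrpages(mapping));
}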


>  	struct folio *folio;
>  	vm_fault_t ret = 0;
>  	bool mapping_locked = false;
>  
>  	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +	max_idx = round_up(max_idx, nrpages);

	max_index = mapping_align_end_index(mapping,
			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));

>  	if (unlikely(index >= max_idx))
>  		return VM_FAULT_SIGBUS;
>  
> @@ -3317,13 +3328,17 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>  	 * We must recheck i_size under page lock.
>  	 */
>  	max_idx = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> +	max_idx = round_up(max_idx, nrpages);

Same again:

	max_index = mapping_align_end_index(mapping,
			DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
> +
>  	if (unlikely(index >= max_idx)) {
>  		folio_unlock(folio);
>  		folio_put(folio);
>  		return VM_FAULT_SIGBUS;
>  	}
>  
> -	vmf->page = folio_file_page(folio, index);
> +	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
> +
> +	vmf->page = folio_file_page(folio, vmf->pgoff);
>  	return ret | VM_FAULT_LOCKED;
>  
>  page_not_uptodate:
> @@ -3658,6 +3673,9 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  {
>  	struct folio *folio;
>  	int err;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +
> +	index = round_down(index, min_nrpages);

And more magic rounding.

	index = mapping_align_start_index(mapping, index);

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 03/14] filemap: use mapping_min_order while allocating folios
  2024-02-13  9:37 ` [RFC v2 03/14] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
  2024-02-13 14:58   ` Hannes Reinecke
  2024-02-13 16:38   ` Darrick J. Wong
@ 2024-02-13 22:05   ` Dave Chinner
  2024-02-14 10:13     ` Pankaj Raghav (Samsung)
  2 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 22:05 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:37:02AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
> 
> filemap_create_folio() and do_read_cache_folio() were always allocating
> folio of order 0. __filemap_get_folio was trying to allocate higher
> order folios when fgp_flags had higher order hint set but it will default
> to order 0 folio if higher order memory allocation fails.
> 
> As we bring the notion of mapping_min_order, make sure these functions
> allocate at least folio of mapping_min_order as we need to guarantee it
> in the page cache.
> 
> Add some additional VM_BUG_ON() in page_cache_delete[batch] and
> __filemap_add_folio to catch errors where we delete or add folios that
> has order less than min_order.
> 
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/filemap.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 323a8e169581..7a6e15c47150 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -127,6 +127,7 @@
>  static void page_cache_delete(struct address_space *mapping,
>  				   struct folio *folio, void *shadow)
>  {
> +	unsigned int min_order = mapping_min_folio_order(mapping);
>  	XA_STATE(xas, &mapping->i_pages, folio->index);
>  	long nr = 1;
>  
> @@ -135,6 +136,7 @@ static void page_cache_delete(struct address_space *mapping,
>  	xas_set_order(&xas, folio->index, folio_order(folio));
>  	nr = folio_nr_pages(folio);
>  
> +	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
>  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

If you are only using min_order in the VM_BUG_ON_FOLIO() macro, then
please just do:

	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
			folio);

There is no need to clutter up the function with variables that are
only used in one debug-only check.

> @@ -1847,6 +1853,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
>  		fgf_t fgp_flags, gfp_t gfp)
>  {
>  	struct folio *folio;
> +	unsigned int min_order = mapping_min_folio_order(mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +
> +	index = round_down(index, min_nrpages);

	index = mapping_align_start_index(mapping, index);

The rest of the function only cares about min_order, not
min_nrpages....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  2024-02-13  9:37 ` [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav (Samsung)
  2024-02-13 14:59   ` Hannes Reinecke
  2024-02-13 16:46   ` Darrick J. Wong
@ 2024-02-13 22:09   ` Dave Chinner
  2024-02-14 13:32     ` Pankaj Raghav (Samsung)
  2 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 22:09 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:37:03AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Set the file_ra_state->ra_pages in file_ra_state_init() to be at least
> mapping_min_order of pages if the bdi->ra_pages is less than that.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> ---
>  mm/readahead.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 2648ec4f0494..4fa7d0e65706 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -138,7 +138,12 @@
>  void
>  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
>  {
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> +	unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
> +
>  	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
> +	if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
> +		ra->ra_pages = min_nrpages;

Why do we want to clamp readahead in this case to io_pages?

We're still going to be allocating a min_order folio in the page
cache, but it is far more efficient to initialise the entire folio
in a single readahead pass than it is to only partially fill it
with data here and then have to issue and wait for more IO to bring
the folio fully up to date before we can read data out of it,
right?
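
i.e. why not simply do something like:

	ra->ra_pages = max(ra->ra_pages, min_nrpages);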

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra
  2024-02-13  9:37 ` [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
  2024-02-13 15:00   ` Hannes Reinecke
  2024-02-13 16:46   ` Darrick J. Wong
@ 2024-02-13 22:29   ` Dave Chinner
  2024-02-14 15:10     ` Pankaj Raghav (Samsung)
  2 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 22:29 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:37:04AM +0100, Pankaj Raghav (Samsung) wrote:
> From: Luis Chamberlain <mcgrof@kernel.org>
> 
> Align the ra->start and ra->size to mapping_min_order in
> ondemand_readahead(), and align the index to mapping_min_order in
> force_page_cache_ra(). This will ensure that the folios allocated for
> readahead that are added to the page cache are aligned to
> mapping_min_order.
> 
> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> ---
>  mm/readahead.c | 48 ++++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 4fa7d0e65706..5e1ec7705c78 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -315,6 +315,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
>  	struct file_ra_state *ra = ractl->ra;
>  	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
>  	unsigned long max_pages, index;
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
>  
>  	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
>  		return;
> @@ -324,6 +325,13 @@ void force_page_cache_ra(struct readahead_control *ractl,
>  	 * be up to the optimal hardware IO size
>  	 */
>  	index = readahead_index(ractl);
> +	if (!IS_ALIGNED(index, min_nrpages)) {
> +		unsigned long old_index = index;
> +
> +		index = round_down(index, min_nrpages);
> +		nr_to_read += (old_index - index);
> +	}

	new_index = mapping_align_start_index(mapping, index);
	if (new_index != index) {
		nr_to_read += index - new_index;
		index = new_index
	}

> +
>  	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
>  	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);

This needs to have a size of at least the minimum folio order size
so readahead can fill entire folios, not get neutered to the maximum
IO size the underlying storage supports.
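
i.e. after clamping to max_pages, something like:

	nr_to_read = max_t(unsigned long, nr_to_read, min_nrpages);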

>  	while (nr_to_read) {
> @@ -332,6 +340,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
>  		if (this_chunk > nr_to_read)
>  			this_chunk = nr_to_read;
>  		ractl->_index = index;
> +		VM_BUG_ON(!IS_ALIGNED(index, min_nrpages));
>  		do_page_cache_ra(ractl, this_chunk, 0);
>  
>  		index += this_chunk;
> @@ -344,11 +353,20 @@ void force_page_cache_ra(struct readahead_control *ractl,
>   * for small size, x 4 for medium, and x 2 for large
>   * for 128k (32 page) max ra
>   * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
> + *
> + * For higher order address space requirements we ensure no initial reads
> + * are ever less than the min number of pages required.
> + *
> + * We *always* cap the max io size allowed by the device.
>   */
> -static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
> +static unsigned long get_init_ra_size(unsigned long size,
> +				      unsigned int min_nrpages,
> +				      unsigned long max)
>  {
>  	unsigned long newsize = roundup_pow_of_two(size);
>  
> +	newsize = max_t(unsigned long, newsize, min_nrpages);

This really doesn't need to care about min_nrpages. That rounding
can be done in the caller when the new size is returned.

>  	if (newsize <= max / 32)
>  		newsize = newsize * 4;
>  	else if (newsize <= max / 4)
> @@ -356,6 +374,8 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
>  	else
>  		newsize = max;
>  
> +	VM_BUG_ON(newsize & (min_nrpages - 1));
> +
>  	return newsize;
>  }
>  
> @@ -364,14 +384,16 @@ static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
>   *  return it as the new window size.
>   */
>  static unsigned long get_next_ra_size(struct file_ra_state *ra,
> +				      unsigned int min_nrpages,
>  				      unsigned long max)
>  {
> -	unsigned long cur = ra->size;
> +	unsigned long cur = max(ra->size, min_nrpages);
>  
>  	if (cur < max / 16)
>  		return 4 * cur;
>  	if (cur <= max / 2)
>  		return 2 * cur;
> +
>  	return max;

Ditto.

>  }
>  
> @@ -561,7 +583,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  	unsigned long add_pages;
>  	pgoff_t index = readahead_index(ractl);
>  	pgoff_t expected, prev_index;
> -	unsigned int order = folio ? folio_order(folio) : 0;
> +	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
> +	unsigned int min_nrpages = mapping_min_folio_nrpages(ractl->mapping);
> +	unsigned int order = folio ? folio_order(folio) : min_order;

Huh? If we have a folio, then the order is whatever that folio is,
otherwise we use min_order. What if the folio is larger than
min_order? Doesn't that mean that this:

> @@ -583,8 +609,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  	expected = round_down(ra->start + ra->size - ra->async_size,
>  			1UL << order);
>  	if (index == expected || index == (ra->start + ra->size)) {
> -		ra->start += ra->size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->start += round_down(ra->size, min_nrpages);
> +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);

may set up the incorrect readahead range because the folio order is
larger than min_nrpages?

>  		ra->async_size = ra->size;
>  		goto readit;
>  	}
> @@ -603,13 +629,18 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  				max_pages);
>  		rcu_read_unlock();
>  
> +		start = round_down(start, min_nrpages);

		start = mapping_align_start_index(mapping, start);
> +
> +		VM_BUG_ON(folio->index & (folio_nr_pages(folio) - 1));
> +
>  		if (!start || start - index > max_pages)
>  			return;
>  
>  		ra->start = start;
>  		ra->size = start - index;	/* old async_size */
> +
>  		ra->size += req_size;
> -		ra->size = get_next_ra_size(ra, max_pages);
> +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);

		ra->size = max(min_nrpages, get_next_ra_size(ra, max_pages));

>  		ra->async_size = ra->size;
>  		goto readit;
>  	}
> @@ -646,7 +677,7 @@ static void ondemand_readahead(struct readahead_control *ractl,
>  
>  initial_readahead:
>  	ra->start = index;
> -	ra->size = get_init_ra_size(req_size, max_pages);
> +	ra->size = get_init_ra_size(req_size, min_nrpages, max_pages);

	ra->size = max(min_nrpages, get_init_ra_size(req_size, max_pages));

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-02-13 21:48     ` Pankaj Raghav (Samsung)
@ 2024-02-13 22:44       ` Dave Chinner
  2024-02-14 15:51         ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 22:44 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, mcgrof, gost.dev,
	akpm, kbusch, chandan.babu, p.raghav, linux-kernel, hare, willy,
	linux-mm

On Tue, Feb 13, 2024 at 10:48:17PM +0100, Pankaj Raghav (Samsung) wrote:
> On Tue, Feb 13, 2024 at 08:26:11AM -0800, Darrick J. Wong wrote:
> > On Tue, Feb 13, 2024 at 10:37:11AM +0100, Pankaj Raghav (Samsung) wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > Instead of assuming that PAGE_SHIFT is always higher than the blocklog,
> > > make the calculation generic so that page cache count can be calculated
> > > correctly for LBS.
> > > 
> > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > > ---
> > >  fs/xfs/xfs_mount.c | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index aabb25dc3efa..bfbaaecaf668 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -133,9 +133,13 @@ xfs_sb_validate_fsb_count(
> > >  {
> > >  	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
> > >  	ASSERT(sbp->sb_blocklog >= BBSHIFT);
> > > +	unsigned long mapping_count;
> > 
> > Nit: indenting
> > 
> > 	unsigned long		mapping_count;
> 
> I will add this in the next revision.
> > 
> > > +	uint64_t bytes = nblocks << sbp->sb_blocklog;
> > 
> > What happens if someone feeds us a garbage fs with sb_blocklog > 64?
> > Or did we check that previously, so an overflow isn't possible?
> > 
> I was thinking of the possibility of an overflow, but at the moment the
> blocklog is capped at 16 (65536 bytes), right? mkfs refuses any block
> size larger than 64k. And we have a check for this in xfs_validate_sb_common()
> in the kernel, which will catch it before this happens?

The sb_blocklog is checked in the superblock verifier when we first read in the
superblock:

	    sbp->sb_blocksize < XFS_MIN_BLOCKSIZE                       ||
            sbp->sb_blocksize > XFS_MAX_BLOCKSIZE                       ||
            sbp->sb_blocklog < XFS_MIN_BLOCKSIZE_LOG                    ||
            sbp->sb_blocklog > XFS_MAX_BLOCKSIZE_LOG                    ||
            sbp->sb_blocksize != (1 << sbp->sb_blocklog)                ||

#define XFS_MAX_BLOCKSIZE_LOG 16

However, we pass mp->m_sb.sb_dblocks or m_sb.sb_rblocks to this
function, and they are validated by the same verifier as invalid
if:

	    sbp->sb_dblocks > XFS_MAX_DBLOCKS(sbp)

#define XFS_MAX_DBLOCKS(s) ((xfs_rfsblock_t)(s)->sb_agcount *
                                             (s)->sb_agblocks)

Which means as long as someone can corrupt some combination of
sb_dblocks, sb_agcount and sb_agblocks that allows sb_dblocks to be
greater than 2^48 on a 64kB fsb fs, then the above code:

	uint64_t bytes = nblocks << sbp->sb_blocklog;

will overflow.

I also suspect that we can feed a huge rtdev to this new code
and have it overflow without needing to corrupt the superblock in
any way....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option
  2024-02-13 21:54     ` Pankaj Raghav (Samsung)
@ 2024-02-13 22:45       ` Dave Chinner
  0 siblings, 0 replies; 64+ messages in thread
From: Dave Chinner @ 2024-02-13 22:45 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Tue, Feb 13, 2024 at 10:54:07PM +0100, Pankaj Raghav (Samsung) wrote:
> On Wed, Feb 14, 2024 at 08:19:23AM +1100, Dave Chinner wrote:
> > On Tue, Feb 13, 2024 at 10:37:12AM +0100, Pankaj Raghav (Samsung) wrote:
> > > From: Pankaj Raghav <p.raghav@samsung.com>
> > > 
> > > Add an experimental CONFIG_XFS_LBS option to enable LBS support in XFS.
> > > Retain the ASSERT for PAGE_SHIFT if CONFIG_XFS_LBS is not enabled.
> > > 
> > > Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
> > 
> > NAK.
> > 
> > There it no reason for this existing - the same code is run
> > regardless of the state of this config variable just with a
> > difference in min folio order. All it does is increase the test
> > matrix arbitrarily - now we have two kernel configs we have to test
> > and there's no good reason for doing that.
> 
> I did not have this CONFIG in the first round but I thought it might
> help retain the existing behaviour until we deem the feature stable.
> 
> But I get your point. So we remove this CONFIG and just have an
> experimental warning during mount when people are using the LBS support?

Yes.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 03/14] filemap: use mapping_min_order while allocating folios
  2024-02-13 22:05   ` Dave Chinner
@ 2024-02-14 10:13     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 10:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

> > +++ b/mm/filemap.c
> > @@ -127,6 +127,7 @@
> >  static void page_cache_delete(struct address_space *mapping,
> >  				   struct folio *folio, void *shadow)
> >  {
> > +	unsigned int min_order = mapping_min_folio_order(mapping);
> >  	XA_STATE(xas, &mapping->i_pages, folio->index);
> >  	long nr = 1;
> >  
> > @@ -135,6 +136,7 @@ static void page_cache_delete(struct address_space *mapping,
> >  	xas_set_order(&xas, folio->index, folio_order(folio));
> >  	nr = folio_nr_pages(folio);
> >  
> > +	VM_BUG_ON_FOLIO(folio_order(folio) < min_order, folio);
> >  	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> 
> If you are only using min_order in the VM_BUG_ON_FOLIO() macro, then
> please just do:
> 
> 	VM_BUG_ON_FOLIO(folio_order(folio) < mapping_min_folio_order(mapping),
> 			folio);
> 
> There is no need to clutter up the function with variables that are
> only used in one debug-only check.
> 
Got it. I will fold it in.

> > @@ -1847,6 +1853,10 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
> >  		fgf_t fgp_flags, gfp_t gfp)
> >  {
> >  	struct folio *folio;
> > +	unsigned int min_order = mapping_min_folio_order(mapping);
> > +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> > +
> > +	index = round_down(index, min_nrpages);
> 
> 	index = mapping_align_start_index(mapping, index);

I will add this helper. Makes the intent more clear. Thanks.
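
For reference, a minimal sketch of what such a helper could look like,
assuming the mapping_min_folio_nrpages() helper from this series (the
name is your suggestion; this is not code from the posted patches):

static inline pgoff_t mapping_align_start_index(struct address_space *mapping,
						pgoff_t index)
{
	/* Keep page cache indices aligned to a min-order folio boundary. */
	return round_down(index, mapping_min_folio_nrpages(mapping));
}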

> 
> The rest of the function only cares about min_order, not
> min_nrpages....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  2024-02-13 22:09   ` Dave Chinner
@ 2024-02-14 13:32     ` Pankaj Raghav (Samsung)
  2024-02-14 13:53       ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 13:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Wed, Feb 14, 2024 at 09:09:53AM +1100, Dave Chinner wrote:
> On Tue, Feb 13, 2024 at 10:37:03AM +0100, Pankaj Raghav (Samsung) wrote:
> > From: Luis Chamberlain <mcgrof@kernel.org>
> > 
> > Set the file_ra_state->ra_pages in file_ra_state_init() to be at least
> > mapping_min_order of pages if the bdi->ra_pages is less than that.
> > 
> > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > ---
> >  mm/readahead.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 2648ec4f0494..4fa7d0e65706 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -138,7 +138,12 @@
> >  void
> >  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
> >  {
> > +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> > +	unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
> > +
> >  	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
> > +	if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
> > +		ra->ra_pages = min_nrpages;
> 
> Why do we want to clamp readahead in this case to io_pages?
> 
> We're still going to be allocating a min_order folio in the page
> cache, but it is far more efficient to initialise the entire folio
> all in a single readahead pass than it is to only partially fill it
> with data here and then have to issue and wait for more IO to bring
> the folio fully up to date before we can read out data out of it,
> right?

We are not clamping it to io_pages. ra_pages is set to min_nrpages if
bdi->ra_pages is less than min_nrpages. The io_pages parameter is only
used as a sanity check so that min_nrpages does not go beyond it.

So maybe this is not the right place to check whether we can send at
least min_nrpages to the backing device; perhaps that check should be
done during mount instead?

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order
  2024-02-14 13:32     ` Pankaj Raghav (Samsung)
@ 2024-02-14 13:53       ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 13:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Wed, Feb 14, 2024 at 02:32:20PM +0100, Pankaj Raghav (Samsung) wrote:
> On Wed, Feb 14, 2024 at 09:09:53AM +1100, Dave Chinner wrote:
> > On Tue, Feb 13, 2024 at 10:37:03AM +0100, Pankaj Raghav (Samsung) wrote:
> > > From: Luis Chamberlain <mcgrof@kernel.org>
> > > 
> > > Set the file_ra_state->ra_pages in file_ra_state_init() to be at least
> > > mapping_min_order of pages if the bdi->ra_pages is less than that.
> > > 
> > > Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
> > > ---
> > >  mm/readahead.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/mm/readahead.c b/mm/readahead.c
> > > index 2648ec4f0494..4fa7d0e65706 100644
> > > --- a/mm/readahead.c
> > > +++ b/mm/readahead.c
> > > @@ -138,7 +138,12 @@
> > >  void
> > >  file_ra_state_init(struct file_ra_state *ra, struct address_space *mapping)
> > >  {
> > > +	unsigned int min_nrpages = mapping_min_folio_nrpages(mapping);
> > > +	unsigned int max_pages = inode_to_bdi(mapping->host)->io_pages;
> > > +
> > >  	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
> > > +	if (ra->ra_pages < min_nrpages && min_nrpages < max_pages)
> > > +		ra->ra_pages = min_nrpages;
> > 
> > Why do we want to clamp readahead in this case to io_pages?
> > 
> > We're still going to be allocating a min_order folio in the page
> > cache, but it is far more efficient to initialise the entire folio
> > all in a single readahead pass than it is to only partially fill it
> > with data here and then have to issue and wait for more IO to bring
> > the folio fully up to date before we can read out data out of it,
> > right?

I think I misunderstood your question. I got more context after seeing
your next response.

You are right; I will remove the clamp to io_pages. A single FSB might
then be split into multiple IOs if the underlying block device has
io_pages < min_nrpages.
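
Something along these lines, i.e. the same initialisation without the
io_pages clamp (just a sketch of the direction, not the final patch):

	unsigned long min_nrpages = mapping_min_folio_nrpages(mapping);

	ra->ra_pages = inode_to_bdi(mapping->host)->ra_pages;
	/* always allow readahead to fill at least one min-order folio */
	if (ra->ra_pages < min_nrpages)
		ra->ra_pages = min_nrpages;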

> 
> We are not clamping it to io_pages. ra_pages is set to min_nrpages if
> bdi->ra_pages is less than min_nrpages. The io_pages parameter is only
> used as a sanity check so that min_nrpages does not go beyond it.
> 
> So maybe this is not the right place to check whether we can send at
> least min_nrpages to the backing device; perhaps that check should be
> done during mount instead?
> 
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra
  2024-02-13 22:29   ` Dave Chinner
@ 2024-02-14 15:10     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 15:10 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

> > @@ -324,6 +325,13 @@ void force_page_cache_ra(struct readahead_control *ractl,
> >  	 * be up to the optimal hardware IO size
> >  	 */
> >  	index = readahead_index(ractl);
> > +	if (!IS_ALIGNED(index, min_nrpages)) {
> > +		unsigned long old_index = index;
> > +
> > +		index = round_down(index, min_nrpages);
> > +		nr_to_read += (old_index - index);
> > +	}
> 
> 	new_index = mapping_align_start_index(mapping, index);
> 	if (new_index != index) {
> 		nr_to_read += index - new_index;
> 		index = new_index
Looks good.

> 	}
> 
> > +
> >  	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
> >  	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
> 
> This needs to have a size of at least the minimum folio order size
> so readahead can fill entire folios, not get neutered to the maximum
> IO size the underlying storage supports.

So something like:

> >  	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
> >  	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
nr_to_read = max(nr_to_read, min_nrpages);

> 
> > + * For higher order address space requirements we ensure no initial reads
> > + * are ever less than the min number of pages required.
> > + *
> > + * We *always* cap the max io size allowed by the device.
> >   */
> > -static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
> > +static unsigned long get_init_ra_size(unsigned long size,
> > +				      unsigned int min_nrpages,
> > +				      unsigned long max)
> >  {
> >  	unsigned long newsize = roundup_pow_of_two(size);
> >  
> > +	newsize = max_t(unsigned long, newsize, min_nrpages);
> 
> This really doesn't need to care about min_nrpages. That rounding
> can be done in the caller when the new size is returned.

Sounds good.

> 
> >  	if (newsize <= max / 32)
> >  		newsize = newsize * 4;
> 
> >  
> >  
> > @@ -561,7 +583,11 @@ static void ondemand_readahead(struct readahead_control *ractl,
> >  	unsigned long add_pages;
> >  	pgoff_t index = readahead_index(ractl);
> >  	pgoff_t expected, prev_index;
> > -	unsigned int order = folio ? folio_order(folio) : 0;
> > +	unsigned int min_order = mapping_min_folio_order(ractl->mapping);
> > +	unsigned int min_nrpages = mapping_min_folio_nrpages(ractl->mapping);
> > +	unsigned int order = folio ? folio_order(folio) : min_order;
> 
> Huh? If we have a folio, then the order is whatever that folio is,
> otherwise we use min_order. What if the folio is larger than
> min_order? Doesn't that mean that this:
> 
> > @@ -583,8 +609,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
> >  	expected = round_down(ra->start + ra->size - ra->async_size,
> >  			1UL << order);
> >  	if (index == expected || index == (ra->start + ra->size)) {
> > -		ra->start += ra->size;
> > -		ra->size = get_next_ra_size(ra, max_pages);
> > +		ra->start += round_down(ra->size, min_nrpages);
> > +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
> 
> may set up the incorrect readahead range because the folio order is
> larger than min_nrpages?

Hmm... So I think we should just increment ra->start by ra->size, and
round the new size we get from get_next_ra_size() up to a multiple of
min_nrpages. Then we will not disturb the readahead range and will
always grow it in multiples of min_nrpages:

ra->start += ra->size;
ra->size = round_up(get_next_ra_size(ra, max_pages), min_nrpages);

> 
> >  		ra->async_size = ra->size;
> >  		goto readit;
> >  	}
> > @@ -603,13 +629,18 @@ static void ondemand_readahead(struct readahead_control *ractl,
> >  				max_pages);
> >  		rcu_read_unlock();
> >  
> > +		start = round_down(start, min_nrpages);
> 
> 		start = mapping_align_start_index(mapping, start);
> > +
> > +		VM_BUG_ON(folio->index & (folio_nr_pages(folio) - 1));
> > +
> >  		if (!start || start - index > max_pages)
> >  			return;
> >  
> >  		ra->start = start;
> >  		ra->size = start - index;	/* old async_size */
> > +
> >  		ra->size += req_size;
> > -		ra->size = get_next_ra_size(ra, max_pages);
> > +		ra->size = get_next_ra_size(ra, min_nrpages, max_pages);
> 
> 		ra->size = max(min_nrpages, get_next_ra_size(ra, max_pages));

If this is a round_up of the size instead of a max operation, we can
always keep ra->start aligned to min_nrpages. See my reasoning in the
previous comment.
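
For this hunk, that would look something like (a sketch of the same
round_up approach applied to the quoted code above, untested):

		ra->start = start;
		ra->size = start - index;	/* old async_size */
		ra->size += req_size;
		ra->size = round_up(get_next_ra_size(ra, max_pages), min_nrpages);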

--
Pankaj

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size
  2024-02-13 21:30       ` Darrick J. Wong
@ 2024-02-14 15:13         ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 15:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david, brauner

> 
> In that case I'll throw it on the testing pile and let's ask brauner to
> merge this for 6.9 if nothing blows up.
> 
Sounds good. Thanks.

> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> 
> --D
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count()
  2024-02-13 22:44       ` Dave Chinner
@ 2024-02-14 15:51         ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 15:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, mcgrof, gost.dev,
	akpm, kbusch, chandan.babu, p.raghav, linux-kernel, hare, willy,
	linux-mm

> > I was thinking of the possibility of an overflow, but at the moment the
> > blocklog is capped at 16 (65536 bytes), right? mkfs refuses any block
> > size larger than 64k. And we have a check for this in xfs_validate_sb_common()
> > in the kernel, which will catch it before this happens?
> 
> The sb_blocklog is checked in the superblock verifier when we first read in the
> superblock:
> 
> 	    sbp->sb_blocksize < XFS_MIN_BLOCKSIZE                       ||
>             sbp->sb_blocksize > XFS_MAX_BLOCKSIZE                       ||
>             sbp->sb_blocklog < XFS_MIN_BLOCKSIZE_LOG                    ||
>             sbp->sb_blocklog > XFS_MAX_BLOCKSIZE_LOG                    ||
>             sbp->sb_blocksize != (1 << sbp->sb_blocklog)                ||
> 
> #define XFS_MAX_BLOCKSIZE_LOG 16
> 
> However, we pass mp->m_sb.sb_dblocks or m_sb.sb_rblocks to this
> function, and they are validated by the same verifier as invalid
> if:
> 
> 	    sbp->sb_dblocks > XFS_MAX_DBLOCKS(sbp)
> 
> #define XFS_MAX_DBLOCKS(s) ((xfs_rfsblock_t)(s)->sb_agcount *
>                                              (s)->sb_agblocks)
> 
> Which means as long as someone can corrupt some combination of
> sb_dblocks, sb_agcount and sb_agblocks that allows sb_dblocks to be
> greater than 2^48 on a 64kB fsb fs, then the above code:
> 
> 	uint64_t bytes = nblocks << sbp->sb_blocklog;
> 
> will overflow.
> 
> I also suspect that we can feed a huge rtdev to this new code
> and have it overflow without needing to corrupt the superblock in
> any way....

So we could use check_mul_overflow() to detect these cases:

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 596aa2cdefbc..23faa993fb80 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -132,8 +132,12 @@ xfs_sb_validate_fsb_count(
        uint64_t        nblocks)
 {
        ASSERT(sbp->sb_blocklog >= BBSHIFT);
-       unsigned long mapping_count;
-       uint64_t bytes = nblocks << sbp->sb_blocklog;
+       uint64_t mapping_count;
+       uint64_t bytes;
+
+       if (check_mul_overflow(nblocks, (1 << sbp->sb_blocklog), &bytes))
+               return -EFBIG;
 
        if (!IS_ENABLED(CONFIG_XFS_LBS))
                ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [RFC v2 14/14] xfs: enable block size larger than page size support
  2024-02-13 21:34   ` Dave Chinner
@ 2024-02-14 16:35     ` Pankaj Raghav (Samsung)
  2024-02-15 22:17       ` Dave Chinner
  0 siblings, 1 reply; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 16:35 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

> >  	struct xfs_inode	*ip;
> > +	int			min_order = 0;
> >  
> >  	/*
> >  	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
> > @@ -88,7 +89,8 @@ xfs_inode_alloc(
> >  	/* VFS doesn't initialise i_mode or i_state! */
> >  	VFS_I(ip)->i_mode = 0;
> >  	VFS_I(ip)->i_state = 0;
> > -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > +	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
> > +	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
> 
> That's pretty nasty. You're using max() to hide underflow in the
> subtraction to clamp the value to zero. And you don't need ilog2()
> because we have the log of the block size in the superblock already.
> 
> 	int			min_order = 0;
> 	.....
> 	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> 		min_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
How is it underflowing if I am comparing two values of type int?

> 
> But, really why recalculate this -constant- on every inode
> allocation?  That's a very hot path, so this should be set in the
> M_IGEO(mp) structure (mp->m_ino_geo) at mount time and then the code
> is simply:
> 
> 	mapping_set_folio_orders(VFS_I(ip)->i_mapping,
> 			M_IGEO(mp)->min_folio_order, MAX_PAGECACHE_ORDER);
> 

That is a good idea. I will add this change in the next revision.
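
A rough sketch of the mount-time setup (min_folio_order is the field you
suggest above; where exactly it gets initialised, e.g. somewhere like
xfs_ialloc_setup_geometry(), is just my assumption):

	struct xfs_ino_geometry	*igeo = M_IGEO(mp);

	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
		igeo->min_folio_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
	else
		igeo->min_folio_order = 0;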

> We already access the M_IGEO(mp) structure on every inode allocation,
> so there's little in the way of additional cost here....
> 
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 5a2512d20bd0..6a3f0f6727eb 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1625,13 +1625,11 @@ xfs_fs_fill_super(
> >  		goto out_free_sb;
> >  	}
> >  
> > -	/*
> > -	 * Until this is fixed only page-sized or smaller data blocks work.
> > -	 */
> > -	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > +	if (!IS_ENABLED(CONFIG_XFS_LBS) && mp->m_sb.sb_blocksize > PAGE_SIZE) {
> >  		xfs_warn(mp,
> >  		"File system with blocksize %d bytes. "
> > -		"Only pagesize (%ld) or less will currently work.",
> > +		"Only pagesize (%ld) or less will currently work. "
> > +		"Enable Experimental CONFIG_XFS_LBS for this support",
> >  				mp->m_sb.sb_blocksize, PAGE_SIZE);
> >  		error = -ENOSYS;
> >  		goto out_free_sb;
> 
> This should just issue a warning if bs > ps.
> 
> 	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
>   		xfs_warn(mp,
> "EXPERIMENTAL: Filesystem with Large Block Size (%d bytes) enabled.",
> 			mp->m_sb.sb_blocksize);
> 	}

Yes! Luis already told me to add a warning here but I missed it before
sending the patches out.

> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 14/14] xfs: enable block size larger than page size support
  2024-02-13 16:20   ` Darrick J. Wong
@ 2024-02-14 16:40     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-14 16:40 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm,
	david

> > @@ -323,7 +326,8 @@ xfs_reinit_inode(
> >  	inode->i_rdev = dev;
> >  	inode->i_uid = uid;
> >  	inode->i_gid = gid;
> > -	mapping_set_large_folios(inode->i_mapping);
> > +	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
> > +	mapping_set_folio_orders(inode->i_mapping, min_order, MAX_PAGECACHE_ORDER);
> 
> Twice now I've seen this, which makes me think "refactor this into a
> single function."
> 
> But then, this is really just:
> 
> 	mapping_set_folio_orders(inode->i_mapping,
> 			max(0, inode->i_sb->s_blocksize_bits - PAGE_SHIFT),
> 			MAX_PAGECACHE_ORDER);
> 
> Can we make that a generic inode_set_pagecache_orders helper?

Chinner suggested an alternative to stuff the min_order value in
mp->m_ino_geo. Then it will just be a call to:

mapping_set_folio_orders(VFS_I(ip)->i_mapping,
			M_IGEO(mp)->min_folio_order, MAX_PAGECACHE_ORDER);
> 
> >  	return error;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 5a2512d20bd0..6a3f0f6727eb 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1625,13 +1625,11 @@ xfs_fs_fill_super(
> >  		goto out_free_sb;
> >  	}
> >  
> > -	/*
> > -	 * Until this is fixed only page-sized or smaller data blocks work.
> > -	 */
> > -	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > +	if (!IS_ENABLED(CONFIG_XFS_LBS) && mp->m_sb.sb_blocksize > PAGE_SIZE) {
> >  		xfs_warn(mp,
> >  		"File system with blocksize %d bytes. "
> > -		"Only pagesize (%ld) or less will currently work.",
> > +		"Only pagesize (%ld) or less will currently work. "
> > +		"Enable Experimental CONFIG_XFS_LBS for this support",
> >  				mp->m_sb.sb_blocksize, PAGE_SIZE);
> 
> Please log a warning about the EXPERIMENTAL bs>ps feature being used
> on this mount for the CONFIG_XFS_LBS=y case.
> 
Yes! I will do it as a part of the next revision.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
  2024-02-13 12:03   ` Hannes Reinecke
  2024-02-13 16:34   ` Darrick J. Wong
@ 2024-02-14 18:49   ` Matthew Wilcox
  2024-02-15 10:21     ` Pankaj Raghav (Samsung)
  2 siblings, 1 reply; 64+ messages in thread
From: Matthew Wilcox @ 2024-02-14 18:49 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, linux-mm, david

On Tue, Feb 13, 2024 at 10:37:00AM +0100, Pankaj Raghav (Samsung) wrote:
> +static inline void mapping_set_folio_orders(struct address_space *mapping,
> +					    unsigned int min, unsigned int max)
> +{
> +	if (min == 1)
> +		min = 2;

If you order the "support order-1 folios" patch first, you can drop
these two lines.

> +static inline unsigned int mapping_min_folio_nrpages(struct address_space *mapping)

I'm not sure if you need this, but it should return unsigned long, not
unsigned int.  With 64KiB pages on Arm, a PMD page is 512MiB (order 13)
and a PUD page will be order 26, which is far too close to 2^32 for
my comfort.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-13 21:29       ` Darrick J. Wong
@ 2024-02-14 19:00         ` Matthew Wilcox
  2024-02-15 10:34           ` Pankaj Raghav (Samsung)
  0 siblings, 1 reply; 64+ messages in thread
From: Matthew Wilcox @ 2024-02-14 19:00 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Pankaj Raghav (Samsung),
	linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch,
	chandan.babu, p.raghav, linux-kernel, hare, linux-mm, david

On Tue, Feb 13, 2024 at 01:29:14PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 13, 2024 at 10:05:54PM +0100, Pankaj Raghav (Samsung) wrote:
> > On Tue, Feb 13, 2024 at 08:34:31AM -0800, Darrick J. Wong wrote:
> > > On Tue, Feb 13, 2024 at 10:37:00AM +0100, Pankaj Raghav (Samsung) wrote:
> > > > From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
> > > > 
> > > > Some filesystems want to be able to limit the maximum size of folios,
> > > > and some want to be able to ensure that folios are at least a certain
> > > > size.  Add mapping_set_folio_orders() to allow this level of control.
> > > > The max folio order parameter is ignored and it is always set to
> > > > MAX_PAGECACHE_ORDER.
> > > 
> > > Why?  If MAX_PAGECACHE_ORDER is 8 and I instead pass in max==3, I'm
> > > going to be surprised by my constraint being ignored.  Maybe I said that
> > > because I'm not prepared to handle an order-7 folio; or some customer
> > > will have some weird desire to twist this knob to make their workflow
> > > faster.
> > > 
> > > --D
> > Maybe I should have been explicit. We are planning to add support
> > for min order in the first round, and we want to add support for max order
> > once the min order support is upstreamed. It was done mainly to reduce
> > the scope and testing of this series.
> > 
> > I definitely agree there are usecases for setting the max order. It is
> > also the feedback we got from LPC.
> > 
> > So one idea would be not to expose max option until we add the support
> > for max order? So filesystems can only set the min_order with the
> > initial support?
> 
> Yeah, there's really no point in having an argument that's deliberately
> ignored.

I favour introducing the right APIs even if they're not fully implemented.
We have no filesystems today that need this, so it doesn't need to
be implemented, but if we have to go back and add it, it's more churn
for every filesystem.  I'm open to better ideas about the API; I think
for a lot of filesystems they only want to set the minimum, so maybe
introducing that API now would be a good thing.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-14 18:49   ` Matthew Wilcox
@ 2024-02-15 10:21     ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-15 10:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, linux-mm, david

On Wed, Feb 14, 2024 at 06:49:49PM +0000, Matthew Wilcox wrote:
> On Tue, Feb 13, 2024 at 10:37:00AM +0100, Pankaj Raghav (Samsung) wrote:
> > +static inline void mapping_set_folio_orders(struct address_space *mapping,
> > +					    unsigned int min, unsigned int max)
> > +{
> > +	if (min == 1)
> > +		min = 2;
> 
> If you order the "support order-1 folios" patch first, you can drop
> these two lines.
> 
Thanks for pointing this out. I actually forgot to update this later in
my series.

The only failure I was seeing for LBS with 8k block size (generic/630)
gets fixed by this change as well. :)

> > +static inline unsigned int mapping_min_folio_nrpages(struct address_space *mapping)
> 
> I'm not sure if you need this, but it should return unsigned long, not
> unsigned int.  With 64KiB pages on Arm, a PMD page is 512MiB (order 13)
> and a PUD page will be order 26, which is far too close to 2^32 for
> my comfort.

There were some suggestions from Chinner which might make this function
go away. But in case I need it, I will update it to unsigned long to be
on the safe side.
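
In case it stays, a sketch of the widened return type (building on
mapping_min_folio_order() from patch 1; illustration only, not the
posted code):

static inline unsigned long
mapping_min_folio_nrpages(struct address_space *mapping)
{
	return 1UL << mapping_min_folio_order(mapping);
}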

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 01/14] fs: Allow fine-grained control of folio sizes
  2024-02-14 19:00         ` Matthew Wilcox
@ 2024-02-15 10:34           ` Pankaj Raghav (Samsung)
  0 siblings, 0 replies; 64+ messages in thread
From: Pankaj Raghav (Samsung) @ 2024-02-15 10:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Darrick J. Wong, linux-xfs, linux-fsdevel, mcgrof, gost.dev,
	akpm, kbusch, chandan.babu, p.raghav, linux-kernel, hare,
	linux-mm, david

> > > Maybe I should have been explicit. We are planning to add support
> > > for min order in the first round, and we want to add support for max order
> > > once the min order support is upstreamed. It was done mainly to reduce
> > > the scope and testing of this series.
> > > 
> > > I definitely agree there are usecases for setting the max order. It is
> > > also the feedback we got from LPC.
> > > 
> > > So one idea would be not to expose max option until we add the support
> > > for max order? So filesystems can only set the min_order with the
> > > initial support?
> > 
> > Yeah, there's really no point in having an argument that's deliberately
> > ignored.
> 
> I favour introducing the right APIs even if they're not fully implemented.
> We have no filesystems today that need this, so it doesn't need to
> be implemented, but if we have to go back and add it, it's more churn
> for every filesystem.  I'm open to better ideas about the API; I think
> for a lot of filesystems they only want to set the minimum, so maybe
> introducing that API now would be a good thing.

I will introduce a new API that only exposes the min order for now. I
agree with you that I don't see a lot of filesystems other than XFS
using this in the near future.
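
A rough sketch of what that could look like (the name is hypothetical;
it would simply wrap mapping_set_folio_orders() from this patch):

static inline void mapping_set_folio_min_order(struct address_space *mapping,
					       unsigned int min)
{
	mapping_set_folio_orders(mapping, min, MAX_PAGECACHE_ORDER);
}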

We deduce the min order from the filesystem block size, but we don't have
any mechanism in place for userspace to set the max order for a filesystem.
So that also needs to be thought through and discussed with the
community.

I hope to start working on max_order immediately after upstreaming the
min_order feature.

--
Pankaj

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [RFC v2 14/14] xfs: enable block size larger than page size support
  2024-02-14 16:35     ` Pankaj Raghav (Samsung)
@ 2024-02-15 22:17       ` Dave Chinner
  0 siblings, 0 replies; 64+ messages in thread
From: Dave Chinner @ 2024-02-15 22:17 UTC (permalink / raw)
  To: Pankaj Raghav (Samsung)
  Cc: linux-xfs, linux-fsdevel, mcgrof, gost.dev, akpm, kbusch, djwong,
	chandan.babu, p.raghav, linux-kernel, hare, willy, linux-mm

On Wed, Feb 14, 2024 at 05:35:49PM +0100, Pankaj Raghav (Samsung) wrote:
> > >  	struct xfs_inode	*ip;
> > > +	int			min_order = 0;
> > >  
> > >  	/*
> > >  	 * XXX: If this didn't occur in transactions, we could drop GFP_NOFAIL
> > > @@ -88,7 +89,8 @@ xfs_inode_alloc(
> > >  	/* VFS doesn't initialise i_mode or i_state! */
> > >  	VFS_I(ip)->i_mode = 0;
> > >  	VFS_I(ip)->i_state = 0;
> > > -	mapping_set_large_folios(VFS_I(ip)->i_mapping);
> > > +	min_order = max(min_order, ilog2(mp->m_sb.sb_blocksize) - PAGE_SHIFT);
> > > +	mapping_set_folio_orders(VFS_I(ip)->i_mapping, min_order, MAX_PAGECACHE_ORDER);
> > 
> > That's pretty nasty. You're using max() to hide underflow in the
> > subtraction to clamp the value to zero. And you don't need ilog2()
> > because we have the log of the block size in the superblock already.
> > 
> > 	int			min_order = 0;
> > 	.....
> > 	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
> > 		min_order = mp->m_sb.sb_blocklog - PAGE_SHIFT;
> how is it underflowing if I am comparing two values of type int?

Folio order is supposed to be unsigned. Negative orders are not
valid values.  So you're hacking around an unsigned underflow by
using signed ints, then hiding the fact that the unsigned subtraction
would underflow behind a max(0, underflowing calc) construct
that works only because you're using signed ints rather than
unsigned ints for the order.

It also implicitly relies on the initial order value being zero at that
point in time, so if we change that initial value before this check in
future, the check may not function correctly.

Please: use unsigned ints for order, and explicitly write the
code so it doesn't ever need negative values that could underflow.
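
For illustration, the earlier suggestion written with an unsigned type
(a sketch along those lines, not code from the series):

	unsigned int		min_order = 0;
	.....
	if (mp->m_sb.sb_blocksize > PAGE_SIZE)
		min_order = mp->m_sb.sb_blocklog - PAGE_SHIFT; /* cannot underflow here */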

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2024-02-15 22:17 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-02-13  9:36 [RFC v2 00/14] enable bs > ps in XFS Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 01/14] fs: Allow fine-grained control of folio sizes Pankaj Raghav (Samsung)
2024-02-13 12:03   ` Hannes Reinecke
2024-02-13 16:34   ` Darrick J. Wong
2024-02-13 21:05     ` Pankaj Raghav (Samsung)
2024-02-13 21:29       ` Darrick J. Wong
2024-02-14 19:00         ` Matthew Wilcox
2024-02-15 10:34           ` Pankaj Raghav (Samsung)
2024-02-14 18:49   ` Matthew Wilcox
2024-02-15 10:21     ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 02/14] filemap: align the index to mapping_min_order in the page cache Pankaj Raghav (Samsung)
2024-02-13 12:20   ` Hannes Reinecke
2024-02-13 21:13     ` Pankaj Raghav (Samsung)
2024-02-13 22:00   ` Dave Chinner
2024-02-13  9:37 ` [RFC v2 03/14] filemap: use mapping_min_order while allocating folios Pankaj Raghav (Samsung)
2024-02-13 14:58   ` Hannes Reinecke
2024-02-13 16:38   ` Darrick J. Wong
2024-02-13 22:05   ` Dave Chinner
2024-02-14 10:13     ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 04/14] readahead: set file_ra_state->ra_pages to be at least mapping_min_order Pankaj Raghav (Samsung)
2024-02-13 14:59   ` Hannes Reinecke
2024-02-13 16:46   ` Darrick J. Wong
2024-02-13 22:09   ` Dave Chinner
2024-02-14 13:32     ` Pankaj Raghav (Samsung)
2024-02-14 13:53       ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 05/14] readahead: align index to mapping_min_order in ondemand_ra and force_ra Pankaj Raghav (Samsung)
2024-02-13 15:00   ` Hannes Reinecke
2024-02-13 16:46   ` Darrick J. Wong
2024-02-13 22:29   ` Dave Chinner
2024-02-14 15:10     ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 06/14] readahead: rework loop in page_cache_ra_unbounded() Pankaj Raghav (Samsung)
2024-02-13 16:47   ` Darrick J. Wong
2024-02-13  9:37 ` [RFC v2 07/14] readahead: allocate folios with mapping_min_order in ra_(unbounded|order) Pankaj Raghav (Samsung)
2024-02-13 15:01   ` Hannes Reinecke
2024-02-13 16:47   ` Darrick J. Wong
2024-02-13  9:37 ` [RFC v2 08/14] mm: do not split a folio if it has minimum folio order requirement Pankaj Raghav (Samsung)
2024-02-13 15:02   ` Hannes Reinecke
2024-02-13  9:37 ` [RFC v2 09/14] mm: Support order-1 folios in the page cache Pankaj Raghav (Samsung)
2024-02-13 15:03   ` Hannes Reinecke
2024-02-13  9:37 ` [RFC v2 10/14] iomap: fix iomap_dio_zero() for fs bs > system page size Pankaj Raghav (Samsung)
2024-02-13 15:06   ` Hannes Reinecke
2024-02-13 16:30   ` Darrick J. Wong
2024-02-13 21:27     ` Pankaj Raghav (Samsung)
2024-02-13 21:30       ` Darrick J. Wong
2024-02-14 15:13         ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 11/14] xfs: expose block size in stat Pankaj Raghav (Samsung)
2024-02-13 16:27   ` Darrick J. Wong
2024-02-13 21:32     ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 12/14] xfs: make the calculation generic in xfs_sb_validate_fsb_count() Pankaj Raghav (Samsung)
2024-02-13 16:26   ` Darrick J. Wong
2024-02-13 21:48     ` Pankaj Raghav (Samsung)
2024-02-13 22:44       ` Dave Chinner
2024-02-14 15:51         ` Pankaj Raghav (Samsung)
2024-02-13  9:37 ` [RFC v2 13/14] xfs: add an experimental CONFIG_XFS_LBS option Pankaj Raghav (Samsung)
2024-02-13 16:39   ` Darrick J. Wong
2024-02-13 21:19   ` Dave Chinner
2024-02-13 21:54     ` Pankaj Raghav (Samsung)
2024-02-13 22:45       ` Dave Chinner
2024-02-13  9:37 ` [RFC v2 14/14] xfs: enable block size larger than page size support Pankaj Raghav (Samsung)
2024-02-13 16:20   ` Darrick J. Wong
2024-02-14 16:40     ` Pankaj Raghav (Samsung)
2024-02-13 21:34   ` Dave Chinner
2024-02-14 16:35     ` Pankaj Raghav (Samsung)
2024-02-15 22:17       ` Dave Chinner
