All of lore.kernel.org
* [PATCH 0/7] RFC: high-order folio support for I/O
@ 2023-06-14 11:46 Hannes Reinecke
  2023-06-14 11:46 ` [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages Hannes Reinecke
                   ` (9 more replies)
  0 siblings, 10 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

Hi all,

now, that was easy.
Thanks to willy and his recent patchset to support large folios in
gfs2, it turns out that most of the work to support high-order folios
for I/O is already done.
It only needs two rather obvious patches: one to allocate folios with
the order derived from the mapping blocksize, and one to adjust
readahead to avoid reading past the end of the device.
But with these two patches (plus the patchset from hch to switch
the block device over to iomap, and the patchset from ritesh to
support sub-blocksize iomap buffers) I can now do:

# modprobe brd rd_size=524288 rd_blksize=16384
# mkfs.xfs -b size=16384 /dev/ram0

It still fails when trying to mount the device:

XFS (ram0): Cannot set_blocksize to 16384 on device ram0

but to my understanding this is being worked on.

Christoph, any chance to have an updated submission of your
patchset to convert block devices over to iomap?
I don't actually need the last one to switch off buffer heads,
but the others really do help for this case.

The entire tree can be found at:

git.kernel.org:/pub/scm/linux/kernel/git/hare/scsi-devel.git
branch brd.v2

Happy hacking!

Hannes Reinecke (6):
  brd: convert to folios
  brd: abstract page_size conventions
  brd: make sector size configurable
  brd: make logical sector size configurable
  mm/filemap: allocate folios with mapping blocksize
  mm/readahead: align readahead down to mapping blocksize

Pankaj Raghav (1):
  brd: use XArray instead of radix-tree to index backing pages

 drivers/block/brd.c     | 320 +++++++++++++++++++++-------------------
 include/linux/pagemap.h |   7 +
 mm/filemap.c            |   7 +-
 mm/readahead.c          |  10 +-
 4 files changed, 186 insertions(+), 158 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
  2023-06-14 12:45   ` Matthew Wilcox
  2023-06-14 11:46 ` [PATCH 2/7] brd: convert to folios Hannes Reinecke
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Pankaj Raghav

From: Pankaj Raghav <p.raghav@samsung.com>

XArray was introduced to hold large arrays of pointers with a simple
API. The XArray API also provides array semantics, which simplifies the
way we store and access the backing pages, and the code becomes
significantly easier to understand.

No performance difference was noticed between the two implementations
using fio with direct=1 [1].

[1] Performance in KIOPS:

          |  radix-tree |    XArray  |   Diff
          |             |            |
write     |    315      |     313    |   -0.6%
randwrite |    286      |     290    |   +1.3%
read      |    330      |     335    |   +1.5%
randread  |    309      |     312    |   +0.9%

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/block/brd.c | 93 ++++++++++++---------------------------------
 1 file changed, 24 insertions(+), 69 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index bcad9b926b0c..2f71376afc71 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -19,7 +19,7 @@
 #include <linux/highmem.h>
 #include <linux/mutex.h>
 #include <linux/pagemap.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
 #include <linux/fs.h>
 #include <linux/slab.h>
 #include <linux/backing-dev.h>
@@ -28,7 +28,7 @@
 #include <linux/uaccess.h>
 
 /*
- * Each block ramdisk device has a radix_tree brd_pages of pages that stores
+ * Each block ramdisk device has a xarray brd_pages of pages that stores
  * the pages containing the block device's contents. A brd page's ->index is
  * its offset in PAGE_SIZE units. This is similar to, but in no way connected
  * with, the kernel's pagecache or buffer cache (which sit above our block
@@ -40,11 +40,9 @@ struct brd_device {
 	struct list_head	brd_list;
 
 	/*
-	 * Backing store of pages and lock to protect it. This is the contents
-	 * of the block device.
+	 * Backing store of pages. This is the contents of the block device.
 	 */
-	spinlock_t		brd_lock;
-	struct radix_tree_root	brd_pages;
+	struct xarray	        brd_pages;
 	u64			brd_nr_pages;
 };
 
@@ -56,21 +54,8 @@ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 	pgoff_t idx;
 	struct page *page;
 
-	/*
-	 * The page lifetime is protected by the fact that we have opened the
-	 * device node -- brd pages will never be deleted under us, so we
-	 * don't need any further locking or refcounting.
-	 *
-	 * This is strictly true for the radix-tree nodes as well (ie. we
-	 * don't actually need the rcu_read_lock()), however that is not a
-	 * documented feature of the radix-tree API so it is better to be
-	 * safe here (we don't have total exclusion from radix tree updates
-	 * here, only deletes).
-	 */
-	rcu_read_lock();
 	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
-	page = radix_tree_lookup(&brd->brd_pages, idx);
-	rcu_read_unlock();
+	page = xa_load(&brd->brd_pages, idx);
 
 	BUG_ON(page && page->index != idx);
 
@@ -83,7 +68,7 @@ static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
-	struct page *page;
+	struct page *page, *cur;
 	int ret = 0;
 
 	page = brd_lookup_page(brd, sector);
@@ -94,71 +79,42 @@ static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
 	if (!page)
 		return -ENOMEM;
 
-	if (radix_tree_maybe_preload(gfp)) {
-		__free_page(page);
-		return -ENOMEM;
-	}
+	xa_lock(&brd->brd_pages);
 
-	spin_lock(&brd->brd_lock);
 	idx = sector >> PAGE_SECTORS_SHIFT;
 	page->index = idx;
-	if (radix_tree_insert(&brd->brd_pages, idx, page)) {
+
+	cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
+
+	if (unlikely(cur)) {
 		__free_page(page);
-		page = radix_tree_lookup(&brd->brd_pages, idx);
-		if (!page)
-			ret = -ENOMEM;
-		else if (page->index != idx)
+		ret = xa_err(cur);
+		if (!ret && (cur->index != idx))
 			ret = -EIO;
 	} else {
 		brd->brd_nr_pages++;
 	}
-	spin_unlock(&brd->brd_lock);
 
-	radix_tree_preload_end();
+	xa_unlock(&brd->brd_pages);
+
 	return ret;
 }
 
 /*
- * Free all backing store pages and radix tree. This must only be called when
+ * Free all backing store pages and xarray. This must only be called when
  * there are no other users of the device.
  */
-#define FREE_BATCH 16
 static void brd_free_pages(struct brd_device *brd)
 {
-	unsigned long pos = 0;
-	struct page *pages[FREE_BATCH];
-	int nr_pages;
-
-	do {
-		int i;
-
-		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
-				(void **)pages, pos, FREE_BATCH);
-
-		for (i = 0; i < nr_pages; i++) {
-			void *ret;
-
-			BUG_ON(pages[i]->index < pos);
-			pos = pages[i]->index;
-			ret = radix_tree_delete(&brd->brd_pages, pos);
-			BUG_ON(!ret || ret != pages[i]);
-			__free_page(pages[i]);
-		}
-
-		pos++;
+	struct page *page;
+	pgoff_t idx;
 
-		/*
-		 * It takes 3.4 seconds to remove 80GiB ramdisk.
-		 * So, we need cond_resched to avoid stalling the CPU.
-		 */
-		cond_resched();
+	xa_for_each(&brd->brd_pages, idx, page) {
+		__free_page(page);
+		cond_resched_rcu();
+	}
 
-		/*
-		 * This assumes radix_tree_gang_lookup always returns as
-		 * many pages as possible. If the radix-tree code changes,
-		 * so will this have to.
-		 */
-	} while (nr_pages == FREE_BATCH);
+	xa_destroy(&brd->brd_pages);
 }
 
 /*
@@ -372,8 +328,7 @@ static int brd_alloc(int i)
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
 
-	spin_lock_init(&brd->brd_lock);
-	INIT_RADIX_TREE(&brd->brd_pages, GFP_ATOMIC);
+	xa_init(&brd->brd_pages);
 
 	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
 	if (!IS_ERR_OR_NULL(brd_debugfs_dir))
-- 
2.35.3



* [PATCH 2/7] brd: convert to folios
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
  2023-06-14 11:46 ` [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
  2023-06-14 13:45   ` Matthew Wilcox
  2023-06-14 11:46 ` [PATCH 3/7] brd: abstract page_size conventions Hannes Reinecke
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

Convert the driver to work on folios instead of pages.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 150 ++++++++++++++++++++++----------------------
 1 file changed, 74 insertions(+), 76 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 2f71376afc71..24769d010fee 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -28,11 +28,10 @@
 #include <linux/uaccess.h>
 
 /*
- * Each block ramdisk device has a xarray brd_pages of pages that stores
- * the pages containing the block device's contents. A brd page's ->index is
- * its offset in PAGE_SIZE units. This is similar to, but in no way connected
- * with, the kernel's pagecache or buffer cache (which sit above our block
- * device).
+ * Each block ramdisk device has a xarray of folios that stores the folios
+ * containing the block device's contents. A brd folio's ->index is its offset
+ * in PAGE_SIZE units. This is similar to, but in no way connected with,
+ * the kernel's pagecache or buffer cache (which sit above our block device).
  */
 struct brd_device {
 	int			brd_number;
@@ -40,81 +39,81 @@ struct brd_device {
 	struct list_head	brd_list;
 
 	/*
-	 * Backing store of pages. This is the contents of the block device.
+	 * Backing store of folios. This is the contents of the block device.
 	 */
-	struct xarray	        brd_pages;
-	u64			brd_nr_pages;
+	struct xarray	        brd_folios;
+	u64			brd_nr_folios;
 };
 
 /*
- * Look up and return a brd's page for a given sector.
+ * Look up and return a brd's folio for a given sector.
  */
-static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+static struct folio *brd_lookup_folio(struct brd_device *brd, sector_t sector)
 {
 	pgoff_t idx;
-	struct page *page;
+	struct folio *folio;
 
-	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
-	page = xa_load(&brd->brd_pages, idx);
+	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to folio index */
+	folio = xa_load(&brd->brd_folios, idx);
 
-	BUG_ON(page && page->index != idx);
+	BUG_ON(folio && folio->index != idx);
 
-	return page;
+	return folio;
 }
 
 /*
- * Insert a new page for a given sector, if one does not already exist.
+ * Insert a new folio for a given sector, if one does not already exist.
  */
-static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
+static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
-	struct page *page, *cur;
+	struct folio *folio, *cur;
 	int ret = 0;
 
-	page = brd_lookup_page(brd, sector);
-	if (page)
+	folio = brd_lookup_folio(brd, sector);
+	if (folio)
 		return 0;
 
-	page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM);
-	if (!page)
+	folio = folio_alloc(gfp | __GFP_ZERO | __GFP_HIGHMEM, 0);
+	if (!folio)
 		return -ENOMEM;
 
-	xa_lock(&brd->brd_pages);
+	xa_lock(&brd->brd_folios);
 
 	idx = sector >> PAGE_SECTORS_SHIFT;
-	page->index = idx;
+	folio->index = idx;
 
-	cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
+	cur = __xa_cmpxchg(&brd->brd_folios, idx, NULL, folio, gfp);
 
 	if (unlikely(cur)) {
-		__free_page(page);
+		folio_put(folio);
 		ret = xa_err(cur);
 		if (!ret && (cur->index != idx))
 			ret = -EIO;
 	} else {
-		brd->brd_nr_pages++;
+		brd->brd_nr_folios++;
 	}
 
-	xa_unlock(&brd->brd_pages);
+	xa_unlock(&brd->brd_folios);
 
 	return ret;
 }
 
 /*
- * Free all backing store pages and xarray. This must only be called when
+ * Free all backing store folios and xarray. This must only be called when
  * there are no other users of the device.
  */
-static void brd_free_pages(struct brd_device *brd)
+static void brd_free_folios(struct brd_device *brd)
 {
-	struct page *page;
+	struct folio *folio;
 	pgoff_t idx;
 
-	xa_for_each(&brd->brd_pages, idx, page) {
-		__free_page(page);
+	xa_for_each(&brd->brd_folios, idx, folio) {
+		folio_put(folio);
 		cond_resched_rcu();
 	}
 
-	xa_destroy(&brd->brd_pages);
+	xa_destroy(&brd->brd_folios);
 }
 
 /*
@@ -128,12 +127,12 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 	int ret;
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	ret = brd_insert_page(brd, sector, gfp);
+	ret = brd_insert_folio(brd, sector, gfp);
 	if (ret)
 		return ret;
 	if (copy < n) {
 		sector += copy >> SECTOR_SHIFT;
-		ret = brd_insert_page(brd, sector, gfp);
+		ret = brd_insert_folio(brd, sector, gfp);
 	}
 	return ret;
 }
@@ -144,29 +143,29 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 static void copy_to_brd(struct brd_device *brd, const void *src,
 			sector_t sector, size_t n)
 {
-	struct page *page;
+	struct folio *folio;
 	void *dst;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	page = brd_lookup_page(brd, sector);
-	BUG_ON(!page);
+	folio = brd_lookup_folio(brd, sector);
+	BUG_ON(!folio);
 
-	dst = kmap_atomic(page);
-	memcpy(dst + offset, src, copy);
-	kunmap_atomic(dst);
+	dst = kmap_local_folio(folio, offset);
+	memcpy(dst, src, copy);
+	kunmap_local(dst);
 
 	if (copy < n) {
 		src += copy;
 		sector += copy >> SECTOR_SHIFT;
 		copy = n - copy;
-		page = brd_lookup_page(brd, sector);
-		BUG_ON(!page);
+		folio = brd_lookup_folio(brd, sector);
+		BUG_ON(!folio);
 
-		dst = kmap_atomic(page);
+		dst = kmap_local_folio(folio, 0);
 		memcpy(dst, src, copy);
-		kunmap_atomic(dst);
+		kunmap_local(dst);
 	}
 }
 
@@ -176,17 +175,17 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 static void copy_from_brd(void *dst, struct brd_device *brd,
 			sector_t sector, size_t n)
 {
-	struct page *page;
+	struct folio *folio;
 	void *src;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	page = brd_lookup_page(brd, sector);
-	if (page) {
-		src = kmap_atomic(page);
-		memcpy(dst, src + offset, copy);
-		kunmap_atomic(src);
+	folio = brd_lookup_folio(brd, sector);
+	if (folio) {
+		src = kmap_local_folio(folio, offset);
+		memcpy(dst, src, copy);
+		kunmap_local(src);
 	} else
 		memset(dst, 0, copy);
 
@@ -194,20 +193,20 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 		dst += copy;
 		sector += copy >> SECTOR_SHIFT;
 		copy = n - copy;
-		page = brd_lookup_page(brd, sector);
-		if (page) {
-			src = kmap_atomic(page);
+		folio = brd_lookup_folio(brd, sector);
+		if (folio) {
+			src = kmap_local_folio(folio, 0);
 			memcpy(dst, src, copy);
-			kunmap_atomic(src);
+			kunmap_local(src);
 		} else
 			memset(dst, 0, copy);
 	}
 }
 
 /*
- * Process a single bvec of a bio.
+ * Process a single folio of a bio.
  */
-static int brd_do_bvec(struct brd_device *brd, struct page *page,
+static int brd_do_folio(struct brd_device *brd, struct folio *folio,
 			unsigned int len, unsigned int off, blk_opf_t opf,
 			sector_t sector)
 {
@@ -217,7 +216,7 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
 	if (op_is_write(opf)) {
 		/*
 		 * Must use NOIO because we don't want to recurse back into the
-		 * block or filesystem layers from page reclaim.
+		 * block or filesystem layers from folio reclaim.
 		 */
 		gfp_t gfp = opf & REQ_NOWAIT ? GFP_NOWAIT : GFP_NOIO;
 
@@ -226,15 +225,15 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
 			goto out;
 	}
 
-	mem = kmap_atomic(page);
+	mem = kmap_local_folio(folio, off);
 	if (!op_is_write(opf)) {
-		copy_from_brd(mem + off, brd, sector, len);
-		flush_dcache_page(page);
+		copy_from_brd(mem, brd, sector, len);
+		flush_dcache_folio(folio);
 	} else {
-		flush_dcache_page(page);
-		copy_to_brd(brd, mem + off, sector, len);
+		flush_dcache_folio(folio);
+		copy_to_brd(brd, mem, sector, len);
 	}
-	kunmap_atomic(mem);
+	kunmap_local(mem);
 
 out:
 	return err;
@@ -244,19 +243,18 @@ static void brd_submit_bio(struct bio *bio)
 {
 	struct brd_device *brd = bio->bi_bdev->bd_disk->private_data;
 	sector_t sector = bio->bi_iter.bi_sector;
-	struct bio_vec bvec;
-	struct bvec_iter iter;
+	struct folio_iter iter;
 
-	bio_for_each_segment(bvec, bio, iter) {
-		unsigned int len = bvec.bv_len;
+	bio_for_each_folio_all(iter, bio) {
+		unsigned int len = iter.length;
 		int err;
 
 		/* Don't support un-aligned buffer */
-		WARN_ON_ONCE((bvec.bv_offset & (SECTOR_SIZE - 1)) ||
+		WARN_ON_ONCE((iter.offset & (SECTOR_SIZE - 1)) ||
 				(len & (SECTOR_SIZE - 1)));
 
-		err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
-				  bio->bi_opf, sector);
+		err = brd_do_folio(brd, iter.folio, len, iter.offset,
+				   bio->bi_opf, sector);
 		if (err) {
 			if (err == -ENOMEM && bio->bi_opf & REQ_NOWAIT) {
 				bio_wouldblock_error(bio);
@@ -328,12 +326,12 @@ static int brd_alloc(int i)
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
 
-	xa_init(&brd->brd_pages);
+	xa_init(&brd->brd_folios);
 
 	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
 	if (!IS_ERR_OR_NULL(brd_debugfs_dir))
 		debugfs_create_u64(buf, 0444, brd_debugfs_dir,
-				&brd->brd_nr_pages);
+				&brd->brd_nr_folios);
 
 	disk = brd->brd_disk = blk_alloc_disk(NUMA_NO_NODE);
 	if (!disk)
@@ -388,7 +386,7 @@ static void brd_cleanup(void)
 	list_for_each_entry_safe(brd, next, &brd_devices, brd_list) {
 		del_gendisk(brd->brd_disk);
 		put_disk(brd->brd_disk);
-		brd_free_pages(brd);
+		brd_free_folios(brd);
 		list_del(&brd->brd_list);
 		kfree(brd);
 	}
@@ -419,7 +417,7 @@ static int __init brd_init(void)
 
 	brd_check_and_reset_par();
 
-	brd_debugfs_dir = debugfs_create_dir("ramdisk_pages", NULL);
+	brd_debugfs_dir = debugfs_create_dir("ramdisk_folios", NULL);
 
 	for (i = 0; i < rd_nr; i++) {
 		err = brd_alloc(i);
-- 
2.35.3



* [PATCH 3/7] brd: abstract page_size conventions
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
  2023-06-14 11:46 ` [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages Hannes Reinecke
  2023-06-14 11:46 ` [PATCH 2/7] brd: convert to folios Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
  2023-06-14 11:46 ` [PATCH 4/7] brd: make sector size configurable Hannes Reinecke
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

In preparation for changing the block size, abstract away references
to PAGE_SIZE and friends.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 24769d010fee..71d3d8af8b0d 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -45,6 +45,23 @@ struct brd_device {
 	u64			brd_nr_folios;
 };
 
+#define BRD_SECTOR_SHIFT(b) (PAGE_SHIFT - SECTOR_SHIFT)
+
+static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
+{
+	pgoff_t idx;
+
+	idx = sector >> BRD_SECTOR_SHIFT(brd);
+	return idx;
+}
+
+static int brd_sector_offset(struct brd_device *brd, sector_t sector)
+{
+	unsigned int rd_sector_mask = (1 << BRD_SECTOR_SHIFT(brd)) - 1;
+
+	return ((unsigned int)sector & rd_sector_mask) << SECTOR_SHIFT;
+}
+
 /*
  * Look up and return a brd's folio for a given sector.
  */
@@ -53,7 +70,7 @@ static struct folio *brd_lookup_folio(struct brd_device *brd, sector_t sector)
 	pgoff_t idx;
 	struct folio *folio;
 
-	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to folio index */
+	idx = brd_sector_index(brd, sector); /* sector to folio index */
 	folio = xa_load(&brd->brd_folios, idx);
 
 	BUG_ON(folio && folio->index != idx);
@@ -68,19 +85,20 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
 	struct folio *folio, *cur;
+	unsigned int rd_sector_order = get_order(PAGE_SIZE);
 	int ret = 0;
 
 	folio = brd_lookup_folio(brd, sector);
 	if (folio)
 		return 0;
 
-	folio = folio_alloc(gfp | __GFP_ZERO | __GFP_HIGHMEM, 0);
+	folio = folio_alloc(gfp | __GFP_ZERO | __GFP_HIGHMEM, rd_sector_order);
 	if (!folio)
 		return -ENOMEM;
 
 	xa_lock(&brd->brd_folios);
 
-	idx = sector >> PAGE_SECTORS_SHIFT;
+	idx = brd_sector_index(brd, sector);
 	folio->index = idx;
 
 	cur = __xa_cmpxchg(&brd->brd_folios, idx, NULL, folio, gfp);
@@ -122,11 +140,12 @@ static void brd_free_folios(struct brd_device *brd)
 static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 			     gfp_t gfp)
 {
-	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 	int ret;
 
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, rd_sector_size - offset);
 	ret = brd_insert_folio(brd, sector, gfp);
 	if (ret)
 		return ret;
@@ -145,10 +164,11 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 {
 	struct folio *folio;
 	void *dst;
-	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, rd_sector_size - offset);
 	folio = brd_lookup_folio(brd, sector);
 	BUG_ON(!folio);
 
@@ -177,10 +197,11 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 {
 	struct folio *folio;
 	void *src;
-	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, rd_sector_size - offset);
 	folio = brd_lookup_folio(brd, sector);
 	if (folio) {
 		src = kmap_local_folio(folio, offset);
-- 
2.35.3



* [PATCH 4/7] brd: make sector size configurable
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (2 preceding siblings ...)
  2023-06-14 11:46 ` [PATCH 3/7] brd: abstract page_size conventions Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
  2023-06-14 12:55   ` Matthew Wilcox
  2023-06-15  2:17   ` Dave Chinner
  2023-06-14 11:46 ` [PATCH 5/7] brd: make logical " Hannes Reinecke
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

Add a module option 'rd_blksize' to allow the user to change
the sector size of the RAM disks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 50 +++++++++++++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 71d3d8af8b0d..2ebb5532a204 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -30,7 +30,7 @@
 /*
  * Each block ramdisk device has a xarray of folios that stores the folios
  * containing the block device's contents. A brd folio's ->index is its offset
- * in PAGE_SIZE units. This is similar to, but in no way connected with,
+ * in brd_sector_size units. This is similar to, but in no way connected with,
  * the kernel's pagecache or buffer cache (which sit above our block device).
  */
 struct brd_device {
@@ -43,9 +43,11 @@ struct brd_device {
 	 */
 	struct xarray	        brd_folios;
 	u64			brd_nr_folios;
+	unsigned int		brd_sector_shift;
+	unsigned int		brd_sector_size;
 };
 
-#define BRD_SECTOR_SHIFT(b) (PAGE_SHIFT - SECTOR_SHIFT)
+#define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - SECTOR_SHIFT)
 
 static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
 {
@@ -85,7 +87,7 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
 	struct folio *folio, *cur;
-	unsigned int rd_sector_order = get_order(PAGE_SIZE);
+	unsigned int rd_sector_order = get_order(brd->brd_sector_size);
 	int ret = 0;
 
 	folio = brd_lookup_folio(brd, sector);
@@ -140,7 +142,7 @@ static void brd_free_folios(struct brd_device *brd)
 static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 			     gfp_t gfp)
 {
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 	int ret;
@@ -164,7 +166,7 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 {
 	struct folio *folio;
 	void *dst;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
@@ -197,7 +199,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 {
 	struct folio *folio;
 	void *src;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
@@ -310,6 +312,10 @@ static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
+static unsigned int rd_blksize = PAGE_SIZE;
+module_param(rd_blksize, uint, 0444);
+MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -336,6 +342,7 @@ static int brd_alloc(int i)
 	struct brd_device *brd;
 	struct gendisk *disk;
 	char buf[DISK_NAME_LEN];
+	unsigned int rd_max_sectors;
 	int err = -ENOMEM;
 
 	list_for_each_entry(brd, &brd_devices, brd_list)
@@ -346,6 +353,25 @@ static int brd_alloc(int i)
 		return -ENOMEM;
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
+	brd->brd_sector_shift = ilog2(rd_blksize);
+	if ((1ULL << brd->brd_sector_shift) != rd_blksize) {
+		pr_err("rd_blksize %d is not supported\n", rd_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	if (rd_blksize < SECTOR_SIZE) {
+		pr_err("rd_blksize must be at least 512 bytes\n");
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	/* We can't allocate more than MAX_ORDER pages per folio */
+	rd_max_sectors = (1ULL << MAX_ORDER) << (PAGE_SHIFT - SECTOR_SHIFT);
+	if ((rd_blksize >> SECTOR_SHIFT) > rd_max_sectors) {
+		pr_err("rd_blksize %d too large\n", rd_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	brd->brd_sector_size = rd_blksize;
 
 	xa_init(&brd->brd_folios);
 
@@ -365,15 +391,9 @@ static int brd_alloc(int i)
 	disk->private_data	= brd;
 	strscpy(disk->disk_name, buf, DISK_NAME_LEN);
 	set_capacity(disk, rd_size * 2);
-	
-	/*
-	 * This is so fdisk will align partitions on 4k, because of
-	 * direct_access API needing 4k alignment, returning a PFN
-	 * (This is only a problem on very small devices <= 4M,
-	 *  otherwise fdisk will align on 1M. Regardless this call
-	 *  is harmless)
-	 */
-	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
+
+	blk_queue_physical_block_size(disk->queue, rd_blksize);
+	blk_queue_max_hw_sectors(disk->queue, 1ULL << (MAX_ORDER + PAGE_SECTORS_SHIFT));
 
 	/* Tell the block layer that this is not a rotational device */
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
-- 
2.35.3



* [PATCH 5/7] brd: make logical sector size configurable
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (3 preceding siblings ...)
  2023-06-14 11:46 ` [PATCH 4/7] brd: make sector size configurable Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
  2023-06-14 11:46 ` [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize Hannes Reinecke
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

Add a module option 'rd_logical_blksize' to allow the user to change
the logical sector size of the RAM disks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 38 ++++++++++++++++++++++++++++++--------
 1 file changed, 30 insertions(+), 8 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 2ebb5532a204..a9f3c6591e75 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -45,9 +45,11 @@ struct brd_device {
 	u64			brd_nr_folios;
 	unsigned int		brd_sector_shift;
 	unsigned int		brd_sector_size;
+	unsigned int		brd_logical_sector_shift;
+	unsigned int		brd_logical_sector_size;
 };
 
-#define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - SECTOR_SHIFT)
+#define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - (b)->brd_logical_sector_shift)
 
 static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
 {
@@ -61,7 +63,7 @@ static int brd_sector_offset(struct brd_device *brd, sector_t sector)
 {
 	unsigned int rd_sector_mask = (1 << BRD_SECTOR_SHIFT(brd)) - 1;
 
-	return ((unsigned int)sector & rd_sector_mask) << SECTOR_SHIFT;
+	return ((unsigned int)sector & rd_sector_mask) << brd->brd_logical_sector_shift;
 }
 
 /*
@@ -152,7 +154,7 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 	if (ret)
 		return ret;
 	if (copy < n) {
-		sector += copy >> SECTOR_SHIFT;
+		sector += copy >> brd->brd_logical_sector_shift;
 		ret = brd_insert_folio(brd, sector, gfp);
 	}
 	return ret;
@@ -180,7 +182,7 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 
 	if (copy < n) {
 		src += copy;
-		sector += copy >> SECTOR_SHIFT;
+		sector += copy >> brd->brd_logical_sector_shift;
 		copy = n - copy;
 		folio = brd_lookup_folio(brd, sector);
 		BUG_ON(!folio);
@@ -214,7 +216,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 
 	if (copy < n) {
 		dst += copy;
-		sector += copy >> SECTOR_SHIFT;
+		sector += copy >> brd->brd_logical_sector_shift;
 		copy = n - copy;
 		folio = brd_lookup_folio(brd, sector);
 		if (folio) {
@@ -273,8 +275,8 @@ static void brd_submit_bio(struct bio *bio)
 		int err;
 
 		/* Don't support un-aligned buffer */
-		WARN_ON_ONCE((iter.offset & (SECTOR_SIZE - 1)) ||
-				(len & (SECTOR_SIZE - 1)));
+		WARN_ON_ONCE((iter.offset & (brd->brd_logical_sector_size - 1)) ||
+				(len & (brd->brd_logical_sector_size - 1)));
 
 		err = brd_do_folio(brd, iter.folio, len, iter.offset,
 				   bio->bi_opf, sector);
@@ -286,7 +288,7 @@ static void brd_submit_bio(struct bio *bio)
 			bio_io_error(bio);
 			return;
 		}
-		sector += len >> SECTOR_SHIFT;
+		sector += len >> brd->brd_logical_sector_shift;
 	}
 
 	bio_endio(bio);
@@ -316,6 +318,10 @@ static unsigned int rd_blksize = PAGE_SIZE;
 module_param(rd_blksize, uint, 0444);
 MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
 
+static unsigned int rd_logical_blksize = SECTOR_SIZE;
+module_param(rd_logical_blksize, uint, 0444);
+MODULE_PARM_DESC(rd_logical_blksize, "Logical blocksize of each RAM disk in bytes.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -373,6 +379,21 @@ static int brd_alloc(int i)
 	}
 	brd->brd_sector_size = rd_blksize;
 
+	brd->brd_logical_sector_shift = ilog2(rd_logical_blksize);
+	if ((1ULL << brd->brd_logical_sector_shift) != rd_logical_blksize) {
+		pr_err("rd_logical_blksize %d is not supported\n",
+		       rd_logical_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	if (rd_logical_blksize > rd_blksize) {
+		pr_err("rd_logical_blksize %d larger than rd_blksize %d\n",
+		       rd_logical_blksize, rd_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	brd->brd_logical_sector_size = rd_logical_blksize;
+
 	xa_init(&brd->brd_folios);
 
 	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
@@ -393,6 +414,7 @@ static int brd_alloc(int i)
 	set_capacity(disk, rd_size * 2);
 
 	blk_queue_physical_block_size(disk->queue, rd_blksize);
+	blk_queue_logical_block_size(disk->queue, rd_logical_blksize);
 	blk_queue_max_hw_sectors(disk->queue, 1ULL << (MAX_ORDER + PAGE_SECTORS_SHIFT));
 
 	/* Tell the block layer that this is not a rotational device */
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (4 preceding siblings ...)
  2023-06-14 11:46 ` [PATCH 5/7] brd: make logical " Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
       [not found]   ` <CGME20230619080901eucas1p224e67aa31866d2ad8d259b2209c2db67@eucas1p2.samsung.com>
  2023-06-14 11:46 ` [PATCH 7/7] mm/readahead: align readahead down to " Hannes Reinecke
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

The mapping has an underlying blocksize (by virtue of
mapping->host->i_blkbits), so if the mapping blocksize
is larger than the pagesize we should allocate folios
in the correct order.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/pagemap.h | 7 +++++++
 mm/filemap.c            | 7 ++++---
 mm/readahead.c          | 6 +++---
 3 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 716953ee1ebd..9ea1a9724d64 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -494,6 +494,13 @@ static inline gfp_t readahead_gfp_mask(struct address_space *x)
 	return mapping_gfp_mask(x) | __GFP_NORETRY | __GFP_NOWARN;
 }
 
+static inline int mapping_get_order(struct address_space *x)
+{
+	if (x->host->i_blkbits > PAGE_SHIFT)
+		return x->host->i_blkbits - PAGE_SHIFT;
+	return 0;
+}
+
 typedef int filler_t(struct file *, struct folio *);
 
 pgoff_t page_cache_next_miss(struct address_space *mapping,
diff --git a/mm/filemap.c b/mm/filemap.c
index 4be20e82e4c3..6f08d04995d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1936,7 +1936,7 @@ struct folio *__filemap_get_folio(struct address_space *mapping, pgoff_t index,
 			gfp |= GFP_NOWAIT | __GFP_NOWARN;
 		}
 
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp, mapping_get_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 
@@ -2495,7 +2495,8 @@ static int filemap_create_folio(struct file *file,
 	struct folio *folio;
 	int error;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    mapping_get_order(mapping));
 	if (!folio)
 		return -ENOMEM;
 
@@ -3646,7 +3647,7 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp, mapping_get_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		err = filemap_add_folio(mapping, folio, index, gfp);
diff --git a/mm/readahead.c b/mm/readahead.c
index 47afbca1d122..031935b78af7 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -245,7 +245,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
@@ -806,7 +806,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
 		if (!folio)
 			return;
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
@@ -833,7 +833,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
 		if (!folio)
 			return;
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
-- 
2.35.3



* [PATCH 7/7] mm/readahead: align readahead down to mapping blocksize
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (5 preceding siblings ...)
  2023-06-14 11:46 ` [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize Hannes Reinecke
@ 2023-06-14 11:46 ` Hannes Reinecke
  2023-06-14 13:17 ` [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 11:46 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Hannes Reinecke

If the blocksize of the mapping is larger than the page size we need
to align down readahead to avoid reading past the end of the device.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 mm/readahead.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/readahead.c b/mm/readahead.c
index 031935b78af7..91a7dbf4fa04 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -285,6 +285,7 @@ static void do_page_cache_ra(struct readahead_control *ractl,
 	struct inode *inode = ractl->mapping->host;
 	unsigned long index = readahead_index(ractl);
 	loff_t isize = i_size_read(inode);
+	unsigned int iblksize = i_blocksize(inode);
 	pgoff_t end_index;	/* The last page we want to read */
 
 	if (isize == 0)
@@ -293,6 +294,9 @@ static void do_page_cache_ra(struct readahead_control *ractl,
 	end_index = (isize - 1) >> PAGE_SHIFT;
 	if (index > end_index)
 		return;
+	if (iblksize > PAGE_SIZE)
+		end_index = ALIGN_DOWN(end_index, iblksize >> PAGE_SHIFT);
+
 	/* Don't read past the page containing the last byte of the file */
 	if (nr_to_read > end_index - index)
 		nr_to_read = end_index - index + 1;
-- 
2.35.3



* Re: [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages
  2023-06-14 11:46 ` [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages Hannes Reinecke
@ 2023-06-14 12:45   ` Matthew Wilcox
  2023-06-14 12:50     ` Pankaj Raghav
  0 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-14 12:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain, Pankaj Raghav

On Wed, Jun 14, 2023 at 01:46:31PM +0200, Hannes Reinecke wrote:
>  static void brd_free_pages(struct brd_device *brd)
>  {
> -	unsigned long pos = 0;
> -	struct page *pages[FREE_BATCH];
> -	int nr_pages;
> -
> -	do {
> -		int i;
> -
> -		nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
> -				(void **)pages, pos, FREE_BATCH);
> -
> -		for (i = 0; i < nr_pages; i++) {
> -			void *ret;
> -
> -			BUG_ON(pages[i]->index < pos);
> -			pos = pages[i]->index;
> -			ret = radix_tree_delete(&brd->brd_pages, pos);
> -			BUG_ON(!ret || ret != pages[i]);
> -			__free_page(pages[i]);
> -		}
> -
> -		pos++;
> +	struct page *page;
> +	pgoff_t idx;
>  
> -		/*
> -		 * It takes 3.4 seconds to remove 80GiB ramdisk.
> -		 * So, we need cond_resched to avoid stalling the CPU.
> -		 */
> -		cond_resched();
> +	xa_for_each(&brd->brd_pages, idx, page) {
> +		__free_page(page);
> +		cond_resched_rcu();

This should be a regular cond_resched().  The body of the loop is run
without the RCU read lock held.  Surprised none of the bots have noticed
an unlock-underflow.  Perhaps they don't test brd ;-)

With that fixed,

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>


* Re: [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages
  2023-06-14 12:45   ` Matthew Wilcox
@ 2023-06-14 12:50     ` Pankaj Raghav
  2023-06-14 13:03       ` Hannes Reinecke
  0 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav @ 2023-06-14 12:50 UTC (permalink / raw)
  To: Matthew Wilcox, Hannes Reinecke
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

>>  
>> -		/*
>> -		 * It takes 3.4 seconds to remove 80GiB ramdisk.
>> -		 * So, we need cond_resched to avoid stalling the CPU.
>> -		 */
>> -		cond_resched();
>> +	xa_for_each(&brd->brd_pages, idx, page) {
>> +		__free_page(page);
>> +		cond_resched_rcu();
> 
> This should be a regular cond_resched().  The body of the loop is run
> without the RCU read lock held.  Surprised none of the bots have noticed
> an unlock-underflow.  Perhaps they don't test brd ;-)
> 
> With that fixed,
> 
> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

This patch is already queued up for 6.5 in Jens's tree.
I will send this as a fix soon. Thanks.


* Re: [PATCH 4/7] brd: make sector size configurable
  2023-06-14 11:46 ` [PATCH 4/7] brd: make sector size configurable Hannes Reinecke
@ 2023-06-14 12:55   ` Matthew Wilcox
  2023-06-14 13:02     ` Hannes Reinecke
  2023-06-15  2:17   ` Dave Chinner
  1 sibling, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-14 12:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On Wed, Jun 14, 2023 at 01:46:34PM +0200, Hannes Reinecke wrote:
> @@ -43,9 +43,11 @@ struct brd_device {
>  	 */
>  	struct xarray	        brd_folios;
>  	u64			brd_nr_folios;
> +	unsigned int		brd_sector_shift;
> +	unsigned int		brd_sector_size;
>  };
>  
> -#define BRD_SECTOR_SHIFT(b) (PAGE_SHIFT - SECTOR_SHIFT)
> +#define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - SECTOR_SHIFT)
>  
>  static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
>  {
> @@ -85,7 +87,7 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
>  {
>  	pgoff_t idx;
>  	struct folio *folio, *cur;
> -	unsigned int rd_sector_order = get_order(PAGE_SIZE);
> +	unsigned int rd_sector_order = get_order(brd->brd_sector_size);

Surely max(0, brd->brd_sector_shift - PAGE_SHIFT) ?

> @@ -346,6 +353,25 @@ static int brd_alloc(int i)
>  		return -ENOMEM;
>  	brd->brd_number		= i;
>  	list_add_tail(&brd->brd_list, &brd_devices);
> +	brd->brd_sector_shift = ilog2(rd_blksize);
> +	if ((1ULL << brd->brd_sector_shift) != rd_blksize) {
> +		pr_err("rd_blksize %d is not supported\n", rd_blksize);

Are you trying to require power-of-two here?  We have is_power_of_2()
for that purpose.



* Re: [PATCH 4/7] brd: make sector size configurable
  2023-06-14 12:55   ` Matthew Wilcox
@ 2023-06-14 13:02     ` Hannes Reinecke
  0 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 13:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On 6/14/23 14:55, Matthew Wilcox wrote:
> On Wed, Jun 14, 2023 at 01:46:34PM +0200, Hannes Reinecke wrote:
>> @@ -43,9 +43,11 @@ struct brd_device {
>>   	 */
>>   	struct xarray	        brd_folios;
>>   	u64			brd_nr_folios;
>> +	unsigned int		brd_sector_shift;
>> +	unsigned int		brd_sector_size;
>>   };
>>   
>> -#define BRD_SECTOR_SHIFT(b) (PAGE_SHIFT - SECTOR_SHIFT)
>> +#define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - SECTOR_SHIFT)
>>   
>>   static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
>>   {
>> @@ -85,7 +87,7 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
>>   {
>>   	pgoff_t idx;
>>   	struct folio *folio, *cur;
>> -	unsigned int rd_sector_order = get_order(PAGE_SIZE);
>> +	unsigned int rd_sector_order = get_order(brd->brd_sector_size);
> 
> Surely max(0, brd->brd_sector_shift - PAGE_SHIFT) ?
> 
Errm. Possibly.

>> @@ -346,6 +353,25 @@ static int brd_alloc(int i)
>>   		return -ENOMEM;
>>   	brd->brd_number		= i;
>>   	list_add_tail(&brd->brd_list, &brd_devices);
>> +	brd->brd_sector_shift = ilog2(rd_blksize);
>> +	if ((1ULL << brd->brd_sector_shift) != rd_blksize) {
>> +		pr_err("rd_blksize %d is not supported\n", rd_blksize);
> 
> Are you trying to require power-of-two here?  We have is_power_of_2()
> for that purpose.
> 
Ah. So let's use that, then :-)

Cheers,

Hannes



* Re: [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages
  2023-06-14 12:50     ` Pankaj Raghav
@ 2023-06-14 13:03       ` Hannes Reinecke
  0 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 13:03 UTC (permalink / raw)
  To: Pankaj Raghav, Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On 6/14/23 14:50, Pankaj Raghav wrote:
>>>   
>>> -		/*
>>> -		 * It takes 3.4 seconds to remove 80GiB ramdisk.
>>> -		 * So, we need cond_resched to avoid stalling the CPU.
>>> -		 */
>>> -		cond_resched();
>>> +	xa_for_each(&brd->brd_pages, idx, page) {
>>> +		__free_page(page);
>>> +		cond_resched_rcu();
>>
>> This should be a regular cond_resched().  The body of the loop is run
>> without the RCU read lock held.  Surprised none of the bots have noticed
>> an unlock-underflow.  Perhaps they don't test brd ;-)
>>
>> With that fixed,
>>
>> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> 
> This patch is already queued up for 6.5 in Jens's tree.
> I will send this as a fix soon. Thanks.

Ah. Hence. I've been running off akpm's mm-unstable branch, which doesn't
have that patch (yet).

Cheers,

Hannes



* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (6 preceding siblings ...)
  2023-06-14 11:46 ` [PATCH 7/7] mm/readahead: align readahead down to " Hannes Reinecke
@ 2023-06-14 13:17 ` Hannes Reinecke
  2023-06-14 13:53   ` Matthew Wilcox
  2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
  2023-06-14 13:48 ` [PATCH 2/2] highmem: Add memcpy_from_folio() Matthew Wilcox (Oracle)
  9 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 13:17 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On 6/14/23 13:46, Hannes Reinecke wrote:
> Hi all,
> 
> now, that was easy.
> Thanks to willy and his recent patchset to support large folios in
> gfs2, it turns out that most of the work to support high-order folios
> for I/O is actually done.
> It only needs two rather obvious patches to allocate folios with
> the order derived from the mapping blocksize, and to adjust readahead
> to avoid reading off the end of the device.
> But with these two patches (and the patchset from hch to switch
> the block device over to iomap) (and the patchset from ritesh to
> support sub-blocksize iomap buffers) I can now do:
> 
> # modprobe brd rd_size=524288 rd_blksize=16384
> # mkfs.xfs -b size=16384 /dev/ram0
> 
> it still fails when trying to mount the device:
> 
> XFS (ram0): Cannot set_blocksize to 16384 on device ram0
> 
> but to my understanding this is being worked on.
> 
Turns out that was quite easy to fix (just remove the check in 
set_blocksize()), but now I get this:

SGI XFS with ACLs, security attributes, quota, no debug enabled
XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) 
or less will currently work.

Hmm. And btrfs doesn't fare better here:

BTRFS info (device ram0): using crc32c (crc32c-intel) checksum algorithm
BTRFS error (device ram0): sectorsize 16384 not yet supported for page 
size 4096
BTRFS error (device ram0): superblock contains fatal errors
BTRFS error (device ram0): open_ctree failed

But at least we _can_ test these filesystems now :-)

Cheers,

Hannes



* Re: [PATCH 2/7] brd: convert to folios
  2023-06-14 11:46 ` [PATCH 2/7] brd: convert to folios Hannes Reinecke
@ 2023-06-14 13:45   ` Matthew Wilcox
  2023-06-14 13:50     ` Hannes Reinecke
  0 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-14 13:45 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On Wed, Jun 14, 2023 at 01:46:32PM +0200, Hannes Reinecke wrote:
>  /*
> - * Each block ramdisk device has a xarray brd_pages of pages that stores
> - * the pages containing the block device's contents. A brd page's ->index is
> - * its offset in PAGE_SIZE units. This is similar to, but in no way connected
> - * with, the kernel's pagecache or buffer cache (which sit above our block
> - * device).
> + * Each block ramdisk device has a xarray of folios that stores the folios
> + * containing the block device's contents. A brd folio's ->index is its offset
> + * in PAGE_SIZE units. This is similar to, but in no way connected with,
> + * the kernel's pagecache or buffer cache (which sit above our block device).

Having read my way to the end of the series, I can now circle back and
say this comment is wrong.  The folio->index is its offset in PAGE_SIZE
units if the sector size is <= PAGE_SIZE, otherwise it's the offset in
sector size units.  This is _different from_ the pagecache which uses
PAGE_SIZE units and multi-index entries in the XArray.

> @@ -144,29 +143,29 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
>  static void copy_to_brd(struct brd_device *brd, const void *src,
>  			sector_t sector, size_t n)
>  {
> -	struct page *page;
> +	struct folio *folio;
>  	void *dst;
>  	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
>  	size_t copy;
>  
>  	copy = min_t(size_t, n, PAGE_SIZE - offset);
> -	page = brd_lookup_page(brd, sector);
> -	BUG_ON(!page);
> +	folio = brd_lookup_folio(brd, sector);
> +	BUG_ON(!folio);
>  
> -	dst = kmap_atomic(page);
> -	memcpy(dst + offset, src, copy);
> -	kunmap_atomic(dst);
> +	dst = kmap_local_folio(folio, offset);
> +	memcpy(dst, src, copy);
> +	kunmap_local(dst);

This should use memcpy_to_folio(), which doesn't exist yet.
Compile-tested patch incoming shortly ...

> +	folio = brd_lookup_folio(brd, sector);
> +	if (folio) {
> +		src = kmap_local_folio(folio, offset);
> +		memcpy(dst, src, copy);
> +		kunmap_local(src);

And this will need memcpy_from_folio(), patch for that incoming too.

> @@ -226,15 +225,15 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
>  			goto out;
>  	}
>  
> -	mem = kmap_atomic(page);
> +	mem = kmap_local_folio(folio, off);
>  	if (!op_is_write(opf)) {
> -		copy_from_brd(mem + off, brd, sector, len);
> -		flush_dcache_page(page);
> +		copy_from_brd(mem, brd, sector, len);
> +		flush_dcache_folio(folio);

Nngh.  This will need to be a more complex loop.  I don't think we can
do a simple abstraction here.  Perhaps you can base it on the two
patches you're about to see?



* [PATCH 1/2] highmem: Add memcpy_to_folio()
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (7 preceding siblings ...)
  2023-06-14 13:17 ` [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
@ 2023-06-14 13:48 ` Matthew Wilcox (Oracle)
  2023-06-14 18:38   ` kernel test robot
                     ` (2 more replies)
  2023-06-14 13:48 ` [PATCH 2/2] highmem: Add memcpy_from_folio() Matthew Wilcox (Oracle)
  9 siblings, 3 replies; 40+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-14 13:48 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox (Oracle),
	linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

This is the folio equivalent of memcpy_to_page(), but it handles large
highmem folios.  It may be a little too big to inline on systems that
have CONFIG_HIGHMEM enabled, but on the systems we actually care about,
almost all the code will be eliminated.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/highmem.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 4de1dbcd3ef6..ec39f544113d 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -507,6 +507,36 @@ static inline void folio_zero_range(struct folio *folio,
 	zero_user_segments(&folio->page, start, start + length, 0, 0);
 }
 
+/**
+ * memcpy_to_folio - Copy a range of bytes to a folio
+ * @folio: The folio to write to.
+ * @offset: The first byte in the folio to store to.
+ * @from: The memory to copy from.
+ * @len: The number of bytes to copy.
+ */
+static inline void memcpy_to_folio(struct folio *folio, size_t offset,
+		const char *from, size_t len)
+{
+	size_t n = len;
+
+	VM_BUG_ON(offset + len > folio_size(folio));
+
+	if (folio_test_highmem(folio))
+		n = min(len, PAGE_SIZE - offset_in_page(offset));
+	for (;;) {
+		char *to = kmap_local_folio(folio, offset);
+		memcpy(to, from, n);
+		kunmap_local(to);
+		if (!folio_test_highmem(folio) || n == len)
+			break;
+		from += n;
+		offset += n;
+		len -= n;
+		n = min(len, PAGE_SIZE);
+	}
+	flush_dcache_folio(folio);
+}
+
 static inline void put_and_unmap_page(struct page *page, void *addr)
 {
 	kunmap_local(addr);
-- 
2.39.2



* [PATCH 2/2] highmem: Add memcpy_from_folio()
  2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
                   ` (8 preceding siblings ...)
  2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
@ 2023-06-14 13:48 ` Matthew Wilcox (Oracle)
  9 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox (Oracle) @ 2023-06-14 13:48 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox (Oracle),
	linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

This is the folio equivalent of memcpy_from_page(), but it handles large
highmem folios.  It may be a little too big to inline on systems that
have CONFIG_HIGHMEM enabled, but on the systems we actually care about,
almost all the code will be eliminated.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/highmem.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index ec39f544113d..d47f4a09f2fa 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -536,6 +536,35 @@ static inline void memcpy_to_folio(struct folio *folio, size_t offset,
 	flush_dcache_folio(folio);
 }
 
+/**
+ * memcpy_from_folio - Copy a range of bytes from a folio
+ * @to: The memory to copy to.
+ * @folio: The folio to read from.
+ * @offset: The first byte in the folio to read.
+ * @len: The number of bytes to copy.
+ */
+static inline void memcpy_from_folio(char *to, struct folio *folio,
+		size_t offset, size_t len)
+{
+	size_t n = len;
+
+	VM_BUG_ON(offset + len > folio_size(folio));
+
+	if (folio_test_highmem(folio))
+		n = min(len, PAGE_SIZE - offset_in_page(offset));
+	for (;;) {
+		char *from = kmap_local_folio(folio, offset);
+		memcpy(to, from, n);
+		kunmap_local(from);
+		if (!folio_test_highmem(folio) || n == len)
+			break;
+		to += n;
+		offset += n;
+		len -= n;
+		n = min(len, PAGE_SIZE);
+	}
+}
+
 static inline void put_and_unmap_page(struct page *page, void *addr)
 {
 	kunmap_local(addr);
-- 
2.39.2



* Re: [PATCH 2/7] brd: convert to folios
  2023-06-14 13:45   ` Matthew Wilcox
@ 2023-06-14 13:50     ` Hannes Reinecke
  0 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 13:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On 6/14/23 15:45, Matthew Wilcox wrote:
> On Wed, Jun 14, 2023 at 01:46:32PM +0200, Hannes Reinecke wrote:
>>   /*
>> - * Each block ramdisk device has a xarray brd_pages of pages that stores
>> - * the pages containing the block device's contents. A brd page's ->index is
>> - * its offset in PAGE_SIZE units. This is similar to, but in no way connected
>> - * with, the kernel's pagecache or buffer cache (which sit above our block
>> - * device).
>> + * Each block ramdisk device has a xarray of folios that stores the folios
>> + * containing the block device's contents. A brd folio's ->index is its offset
>> + * in PAGE_SIZE units. This is similar to, but in no way connected with,
>> + * the kernel's pagecache or buffer cache (which sit above our block device).
> 
> Having read my way to the end of the series, I can now circle back and
> say this comment is wrong.  The folio->index is its offset in PAGE_SIZE
> units if the sector size is <= PAGE_SIZE, otherwise it's the offset in
> sector size units.  This is _different from_ the pagecache which uses
> PAGE_SIZE units and multi-index entries in the XArray.
> 
Hmm. I am aware that brd is using the ->index field in a different way,
but I thought I got it straightened out ...

>> @@ -144,29 +143,29 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
>>   static void copy_to_brd(struct brd_device *brd, const void *src,
>>   			sector_t sector, size_t n)
>>   {
>> -	struct page *page;
>> +	struct folio *folio;
>>   	void *dst;
>>   	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
>>   	size_t copy;
>>   
>>   	copy = min_t(size_t, n, PAGE_SIZE - offset);
>> -	page = brd_lookup_page(brd, sector);
>> -	BUG_ON(!page);
>> +	folio = brd_lookup_folio(brd, sector);
>> +	BUG_ON(!folio);
>>   
>> -	dst = kmap_atomic(page);
>> -	memcpy(dst + offset, src, copy);
>> -	kunmap_atomic(dst);
>> +	dst = kmap_local_folio(folio, offset);
>> +	memcpy(dst, src, copy);
>> +	kunmap_local(dst);
> 
> This should use memcpy_to_folio(), which doesn't exist yet.
> Compile-tested patch incoming shortly ...
> 
Ah. You live and learn.

>> +	folio = brd_lookup_folio(brd, sector);
>> +	if (folio) {
>> +		src = kmap_local_folio(folio, offset);
>> +		memcpy(dst, src, copy);
>> +		kunmap_local(src);
> 
> And this will need memcpy_from_folio(), patch for that incoming too.
> 
>> @@ -226,15 +225,15 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
>>   			goto out;
>>   	}
>>   
>> -	mem = kmap_atomic(page);
>> +	mem = kmap_local_folio(folio, off);
>>   	if (!op_is_write(opf)) {
>> -		copy_from_brd(mem + off, brd, sector, len);
>> -		flush_dcache_page(page);
>> +		copy_from_brd(mem, brd, sector, len);
>> +		flush_dcache_folio(folio);
> 
> Nngh.  This will need to be a more complex loop.  I don't think we can
> do a simple abstraction here.  Perhaps you can base it on the two
> patches you're about to see?
> 
Sure.
Might explain the strange crashes I've seen ...

Cheers,

Hannes



* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 13:17 ` [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
@ 2023-06-14 13:53   ` Matthew Wilcox
  2023-06-14 15:06     ` Hannes Reinecke
  0 siblings, 1 reply; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-14 13:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
> Turns out that was quite easy to fix (just remove the check in
> set_blocksize()), but now I get this:
> 
> SGI XFS with ACLs, security attributes, quota, no debug enabled
> XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
> less will currently work.

What happens if you just remove this hunk:

+++ b/fs/xfs/xfs_super.c
@@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
                goto out_free_sb;
        }

-       /*
-        * Until this is fixed only page-sized or smaller data blocks work.
-        */
-       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-               xfs_warn(mp,
-               "File system with blocksize %d bytes. "
-               "Only pagesize (%ld) or less will currently work.",
-                               mp->m_sb.sb_blocksize, PAGE_SIZE);
-               error = -ENOSYS;
-               goto out_free_sb;
-       }
-
        /* Ensure this filesystem fits in the page cache limits */
        if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
            xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {


* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 13:53   ` Matthew Wilcox
@ 2023-06-14 15:06     ` Hannes Reinecke
  2023-06-14 15:35       ` Hannes Reinecke
                         ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 15:06 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On 6/14/23 15:53, Matthew Wilcox wrote:
> On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
>> Turns out that was quite easy to fix (just remove the check in
>> set_blocksize()), but now I get this:
>>
>> SGI XFS with ACLs, security attributes, quota, no debug enabled
>> XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
>> less will currently work.
> 
> What happens if you just remove this hunk:
> 
> +++ b/fs/xfs/xfs_super.c
> @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
>                  goto out_free_sb;
>          }
> 
> -       /*
> -        * Until this is fixed only page-sized or smaller data blocks work.
> -        */
> -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> -               xfs_warn(mp,
> -               "File system with blocksize %d bytes. "
> -               "Only pagesize (%ld) or less will currently work.",
> -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
> -               error = -ENOSYS;
> -               goto out_free_sb;
> -       }
> -
>          /* Ensure this filesystem fits in the page cache limits */
>          if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
>              xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {

Whee! That works!

Rebased things with your memcpy_{to,from}_folio() patches, disabled that 
chunk, and:

# mount /dev/ram0 /mnt
XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
XFS (ram0): Ending clean mount
xfs filesystem being mounted at /mnt supports timestamps until 
2038-01-19 (0x7fffffff)
# umount /mnt
XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551

Great work, Matthew!

(Now I just need to check why copying data from NFS crashes ...)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 15:06     ` Hannes Reinecke
@ 2023-06-14 15:35       ` Hannes Reinecke
  2023-06-14 17:46         ` Matthew Wilcox
  2023-06-14 23:53       ` Dave Chinner
  2023-06-15  3:44       ` Dave Chinner
  2 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-14 15:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On 6/14/23 17:06, Hannes Reinecke wrote:
[ .. ]
> 
> Whee! That works!
> 
> Rebased things with your memcpy_{to,from}_folio() patches, disabled that 
> chunk, and:
> 
> # mount /dev/ram0 /mnt
> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> XFS (ram0): Ending clean mount
> xfs filesystem being mounted at /mnt supports timestamps until 
> 2038-01-19 (0x7fffffff)
> # umount /mnt
> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> 
> Great work, Matthew!
> 
> (Now I just need to check why copying data from NFS crashes ...)
> 
Hmm. And for that I'm hitting include/linux/pagemap.h:1250 pretty 
consistently; something's going haywire with readahead.

Matthew, are you sure that this one:

/**
 * readahead_length - The number of bytes in this readahead request.
 * @rac: The readahead request.
 */
static inline size_t readahead_length(struct readahead_control *rac)
{
         return rac->_nr_pages * PAGE_SIZE;
}
is tenable for large folios?
Especially as we have in mm/readahead.c:499

         ractl->_nr_pages += 1UL << order;

Hmm?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 15:35       ` Hannes Reinecke
@ 2023-06-14 17:46         ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-14 17:46 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: linux-fsdevel, linux-block, Andrew Morton, Christoph Hellwig,
	Luis Chamberlain

On Wed, Jun 14, 2023 at 05:35:02PM +0200, Hannes Reinecke wrote:
> Hmm. And for that I'm hitting include/linux/pagemap.h:1250 pretty
> consistently; something's going haywire with readahead.

Is that this one?

        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);

in __readahead_folio()?

> Matthew, are you sure that this one:
> 
> /**
> 
>  * readahead_length - The number of bytes in this readahead request.
> 
>  * @rac: The readahead request.
> 
>  */
> static inline size_t readahead_length(struct readahead_control *rac)
> {
>         return rac->_nr_pages * PAGE_SIZE;
> }
> 
> is tenable for large folios?
> Especially as we have in mm/readahead.c:499
> 
>         ractl->_nr_pages += 1UL << order;
> 
> Hmm?

Yes.  nr_pages really is the number of pages.

I'm somewhat surprised you're not hitting:

                VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio);

in __filemap_add_folio().  Maybe the filesystems in question always
access indices which are naturally aligned to the block size.
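[Editor's note: the two points above can be checked with a small userspace sketch. The structs and helpers below are simplified stand-ins for the kernel code, not the real definitions: `_nr_pages` is always a count of base pages, an order-N folio contributes `1UL << N` of them, and the `__filemap_add_folio()` check only accepts indices that are naturally aligned to the folio size.]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* simplified stand-in for the kernel's struct readahead_control */
struct readahead_control {
	unsigned long _nr_pages;	/* always counted in base pages */
};

/* as in include/linux/pagemap.h: bytes covered by the request */
static inline size_t readahead_length(struct readahead_control *rac)
{
	return rac->_nr_pages * PAGE_SIZE;
}

/* mm/readahead.c accounts an order-N folio as 1UL << N base pages */
static inline void add_folio(struct readahead_control *rac, unsigned int order)
{
	rac->_nr_pages += 1UL << order;
}

/* the alignment test from __filemap_add_folio() */
static inline int index_is_aligned(unsigned long index, unsigned int order)
{
	unsigned long nr = 1UL << order;	/* folio_nr_pages() */
	return (index & (nr - 1)) == 0;
}
```

With 4k base pages, adding one order-2 (16k) folio makes readahead_length()
report 16384 bytes, and such a folio may only be inserted at page-cache
indices 0, 4, 8, ... — which is why a filesystem that always issues
block-aligned indices never trips the VM_BUG_ON_FOLIO above.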

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] highmem: Add memcpy_to_folio()
  2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
@ 2023-06-14 18:38   ` kernel test robot
  2023-06-14 19:30   ` kernel test robot
  2023-06-15  5:58   ` Christoph Hellwig
  2 siblings, 0 replies; 40+ messages in thread
From: kernel test robot @ 2023-06-14 18:38 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Hannes Reinecke
  Cc: llvm, oe-kbuild-all, Matthew Wilcox (Oracle),
	linux-fsdevel, linux-block, Andrew Morton,
	Linux Memory Management List, Christoph Hellwig,
	Luis Chamberlain

Hi Matthew,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.4-rc6]
[cannot apply to next-20230614]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/highmem-Add-memcpy_from_folio/20230614-215150
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230614134853.1521439-1-willy%40infradead.org
patch subject: [PATCH 1/2] highmem: Add memcpy_to_folio()
config: hexagon-randconfig-r015-20230614 (https://download.01.org/0day-ci/archive/20230615/202306150241.f2mNsWXE-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        git remote add akpm-mm https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
        git fetch akpm-mm mm-everything
        git checkout akpm-mm/mm-everything
        b4 shazam https://lore.kernel.org/r/20230614134853.1521439-1-willy@infradead.org
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 O=build_dir ARCH=hexagon olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 O=build_dir ARCH=hexagon SHELL=/bin/bash drivers/nvme/target/

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306150241.f2mNsWXE-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/nvme/target/fc.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from drivers/nvme/target/fc.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from drivers/nvme/target/fc.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
   In file included from include/linux/highmem.h:12:
   In file included from include/linux/hardirq.h:11:
   In file included from ./arch/hexagon/include/generated/asm/hardirq.h:1:
   In file included from include/asm-generic/hardirq.h:17:
   In file included from include/linux/irq.h:20:
   In file included from include/linux/io.h:13:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   In file included from drivers/nvme/target/fc.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof ((1UL << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/nvme/target/fc.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof ((1UL << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   drivers/nvme/target/fc.c:176:1: warning: unused function 'nvmet_fc_iodnum' [-Wunused-function]
     176 | nvmet_fc_iodnum(struct nvmet_fc_ls_iod *iodptr)
         | ^
   drivers/nvme/target/fc.c:182:1: warning: unused function 'nvmet_fc_fodnum' [-Wunused-function]
     182 | nvmet_fc_fodnum(struct nvmet_fc_fcp_iod *fodptr)
         | ^
   10 warnings generated.
--
   In file included from drivers/nvme/target/loop.c:7:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:547:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     547 |         val = __raw_readb(PCI_IOBASE + addr);
         |                           ~~~~~~~~~~ ^
   include/asm-generic/io.h:560:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     560 |         val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:37:51: note: expanded from macro '__le16_to_cpu'
      37 | #define __le16_to_cpu(x) ((__force __u16)(__le16)(x))
         |                                                   ^
   In file included from drivers/nvme/target/loop.c:7:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:573:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     573 |         val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
         |                                                         ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/little_endian.h:35:51: note: expanded from macro '__le32_to_cpu'
      35 | #define __le32_to_cpu(x) ((__force __u32)(__le32)(x))
         |                                                   ^
   In file included from drivers/nvme/target/loop.c:7:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/hexagon/include/asm/io.h:334:
   include/asm-generic/io.h:584:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     584 |         __raw_writeb(value, PCI_IOBASE + addr);
         |                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:594:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     594 |         __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   include/asm-generic/io.h:604:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
     604 |         __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
         |                                                       ~~~~~~~~~~ ^
   In file included from drivers/nvme/target/loop.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof ((1UL << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/nvme/target/loop.c:8:
   In file included from include/linux/blk-mq.h:5:
   In file included from include/linux/blkdev.h:9:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof ((1UL << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   8 warnings generated.


vim +525 include/linux/highmem.h

   509	
   510	/**
   511	 * memcpy_to_folio - Copy a range of bytes to a folio
   512	 * @folio: The folio to write to.
   513	 * @offset: The first byte in the folio to store to.
   514	 * @from: The memory to copy from.
   515	 * @len: The number of bytes to copy.
   516	 */
   517	static inline void memcpy_to_folio(struct folio *folio, size_t offset,
   518			const char *from, size_t len)
   519	{
   520		size_t n = len;
   521	
   522		VM_BUG_ON(offset + len > folio_size(folio));
   523	
   524		if (folio_test_highmem(folio))
 > 525			n = min(len, PAGE_SIZE - offset_in_page(offset));
   526		for (;;) {
   527			char *to = kmap_local_folio(folio, offset);
   528			memcpy(to, from, n);
   529			kunmap_local(to);
   530			if (!folio_test_highmem(folio) || n == len)
   531				break;
   532			offset += n;
   533			len -= n;
 > 534			n = min(len, PAGE_SIZE);
   535		}
   536		flush_dcache_folio(folio);
   537	}
   538	
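[Editor's note: the flagged warning is the usual min() type-check firing on
32-bit targets, where size_t is 'unsigned int' while the PAGE_SIZE expression
is 'unsigned long'. The conventional fix is min_t(), which casts both sides to
one named type. The sketch below uses a simplified min_t macro, not the
kernel's side-effect-checked version from include/linux/minmax.h.]

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* simplified min_t: force both operands to one type before comparing,
 * which avoids -Wcompare-distinct-pointer-types from plain min() */
#define min_t(type, x, y) ((type)(x) < (type)(y) ? (type)(x) : (type)(y))

/* same computation as the flagged line 525, with explicit typing */
static size_t bytes_this_page(size_t offset, size_t len)
{
	return min_t(size_t, len, PAGE_SIZE - (offset & (PAGE_SIZE - 1)));
}
```

For example, copying 100 bytes starting at offset 4000 yields 96 bytes for
the first page (the remainder of that page) and the rest on the next
iteration, matching the loop structure of memcpy_to_folio() above.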

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] highmem: Add memcpy_to_folio()
  2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
  2023-06-14 18:38   ` kernel test robot
@ 2023-06-14 19:30   ` kernel test robot
  2023-06-15  5:58   ` Christoph Hellwig
  2 siblings, 0 replies; 40+ messages in thread
From: kernel test robot @ 2023-06-14 19:30 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle), Hannes Reinecke
  Cc: llvm, oe-kbuild-all, Matthew Wilcox (Oracle),
	linux-fsdevel, linux-block, Andrew Morton,
	Linux Memory Management List, Christoph Hellwig,
	Luis Chamberlain

Hi Matthew,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.4-rc6]
[cannot apply to next-20230614]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Matthew-Wilcox-Oracle/highmem-Add-memcpy_from_folio/20230614-215150
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20230614134853.1521439-1-willy%40infradead.org
patch subject: [PATCH 1/2] highmem: Add memcpy_to_folio()
config: arm-randconfig-r046-20230614 (https://download.01.org/0day-ci/archive/20230615/202306150314.BSvTy8oJ-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build):
        mkdir -p ~/bin
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        git remote add akpm-mm https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git
        git fetch akpm-mm mm-everything
        git checkout akpm-mm/mm-everything
        b4 shazam https://lore.kernel.org/r/20230614134853.1521439-1-willy@infradead.org
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 O=build_dir ARCH=arm olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang ~/bin/make.cross W=1 O=build_dir ARCH=arm SHELL=/bin/bash drivers/iio/light/

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202306150314.BSvTy8oJ-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/iio/light/rohm-bu27034.c:12:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/iio/light/rohm-bu27034.c:12:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   2 warnings generated.
--
   In file included from drivers/iio/light/pa12203001.c:14:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/iio/light/pa12203001.c:14:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   drivers/iio/light/pa12203001.c:457:36: warning: unused variable 'pa12203001_acpi_match' [-Wunused-const-variable]
     457 | static const struct acpi_device_id pa12203001_acpi_match[] = {
         |                                    ^
   3 warnings generated.
--
   In file included from drivers/iio/light/rpr0521.c:14:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/iio/light/rpr0521.c:14:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   drivers/iio/light/rpr0521.c:1105:36: warning: unused variable 'rpr0521_acpi_match' [-Wunused-const-variable]
    1105 | static const struct acpi_device_id rpr0521_acpi_match[] = {
         |                                    ^
   3 warnings generated.
--
   In file included from drivers/iio/light/stk3310.c:11:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/iio/light/stk3310.c:11:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   drivers/iio/light/stk3310.c:693:36: warning: unused variable 'stk3310_acpi_id' [-Wunused-const-variable]
     693 | static const struct acpi_device_id stk3310_acpi_id[] = {
         |                                    ^
   3 warnings generated.
--
   In file included from drivers/iio/light/us5182d.c:14:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/iio/light/us5182d.c:14:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   drivers/iio/light/us5182d.c:950:36: warning: unused variable 'us5182d_acpi_match' [-Wunused-const-variable]
     950 | static const struct acpi_device_id us5182d_acpi_match[] = {
         |                                    ^
   3 warnings generated.
--
   In file included from drivers/iio/light/ltr501.c:13:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:525:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - ((unsigned long)(offset) & ~(~((1 << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     525 |                 n = min(len, PAGE_SIZE - offset_in_page(offset));
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   In file included from drivers/iio/light/ltr501.c:13:
   In file included from include/linux/i2c.h:19:
   In file included from include/linux/regulator/consumer.h:35:
   In file included from include/linux/suspend.h:5:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:13:
   In file included from include/linux/blk_types.h:10:
   In file included from include/linux/bvec.h:10:
>> include/linux/highmem.h:534:7: warning: comparison of distinct pointer types ('typeof (len) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12)) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
     534 |                 n = min(len, PAGE_SIZE);
         |                     ^~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
      67 | #define min(x, y)       __careful_cmp(x, y, <)
         |                         ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
      36 |         __builtin_choose_expr(__safe_cmp(x, y), \
         |                               ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
      26 |                 (__typecheck(x, y) && __no_side_effects(x, y))
         |                  ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
      20 |         (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
         |                    ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   drivers/iio/light/ltr501.c:1611:36: warning: unused variable 'ltr_acpi_match' [-Wunused-const-variable]
    1611 | static const struct acpi_device_id ltr_acpi_match[] = {
         |                                    ^
   3 warnings generated.
..


vim +525 include/linux/highmem.h

   509	
   510	/**
   511	 * memcpy_to_folio - Copy a range of bytes to a folio
   512	 * @folio: The folio to write to.
   513	 * @offset: The first byte in the folio to store to.
   514	 * @from: The memory to copy from.
   515	 * @len: The number of bytes to copy.
   516	 */
   517	static inline void memcpy_to_folio(struct folio *folio, size_t offset,
   518			const char *from, size_t len)
   519	{
   520		size_t n = len;
   521	
   522		VM_BUG_ON(offset + len > folio_size(folio));
   523	
   524		if (folio_test_highmem(folio))
 > 525			n = min(len, PAGE_SIZE - offset_in_page(offset));
   526		for (;;) {
   527			char *to = kmap_local_folio(folio, offset);
   528			memcpy(to, from, n);
   529			kunmap_local(to);
   530			if (!folio_test_highmem(folio) || n == len)
   531				break;
   532			offset += n;
   533			len -= n;
 > 534			n = min(len, PAGE_SIZE);
   535		}
   536		flush_dcache_folio(folio);
   537	}
   538	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 15:06     ` Hannes Reinecke
  2023-06-14 15:35       ` Hannes Reinecke
@ 2023-06-14 23:53       ` Dave Chinner
  2023-06-15  6:21         ` Hannes Reinecke
  2023-06-15  3:44       ` Dave Chinner
  2 siblings, 1 reply; 40+ messages in thread
From: Dave Chinner @ 2023-06-14 23:53 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> On 6/14/23 15:53, Matthew Wilcox wrote:
> > On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
> > > Turns out that was quite easy to fix (just remove the check in
> > > set_blocksize()), but now I get this:
> > > 
> > > SGI XFS with ACLs, security attributes, quota, no debug enabled
> > > XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
> > > less will currently work.
> > 
> > What happens if you just remove this hunk:
> > 
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
> >                  goto out_free_sb;
> >          }
> > 
> > -       /*
> > -        * Until this is fixed only page-sized or smaller data blocks work.
> > -        */
> > -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > -               xfs_warn(mp,
> > -               "File system with blocksize %d bytes. "
> > -               "Only pagesize (%ld) or less will currently work.",
> > -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
> > -               error = -ENOSYS;
> > -               goto out_free_sb;
> > -       }
> > -
> >          /* Ensure this filesystem fits in the page cache limits */
> >          if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
> >              xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
> 
> Whee! That works!
> 
> Rebased things with your memcpy_{to,from}_folio() patches, disabled that
> chunk, and:
> 
> # mount /dev/ram0 /mnt
> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> XFS (ram0): Ending clean mount
> xfs filesystem being mounted at /mnt supports timestamps until 2038-01-19
> (0x7fffffff)
> # umount /mnt
> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551

Mounting the filesystem doesn't mean it works. XFS metadata has
always worked with bs > ps, and mounting the filesystem only does
metadata IO.

It's not until you start reading/writing user data that the
filesystem will start exercising the page cache....

> Great work, Matthew!
> 
> (Now I just need to check why copying data from NFS crashes ...)

.... and then we see it doesn't actually work. :)

Likely you also need the large folio support in the iomap write path
patches from Willy, plus whatever corner cases in iomap that still
have implicit dependencies on PAGE_SIZE need to be fixed (truncate,
invalidate, sub-block zeroing, etc may not be doing exactly the
right thing).

All you need to do now is run the BS > PS filesystems through a full
fstests pass (reflink + rmap enabled, auto group), and then we can
start on the real data integrity validation work. It'll need tens of
billions of fsx ops run on it, days of recoveryloop testing, days of
fsstress based exercise, etc before we can actually enable it in
XFS....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] brd: make sector size configurable
  2023-06-14 11:46 ` [PATCH 4/7] brd: make sector size configurable Hannes Reinecke
  2023-06-14 12:55   ` Matthew Wilcox
@ 2023-06-15  2:17   ` Dave Chinner
  2023-06-15  5:55     ` Christoph Hellwig
  2023-06-15  6:23     ` Hannes Reinecke
  1 sibling, 2 replies; 40+ messages in thread
From: Dave Chinner @ 2023-06-15  2:17 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On Wed, Jun 14, 2023 at 01:46:34PM +0200, Hannes Reinecke wrote:
> @@ -310,6 +312,10 @@ static int max_part = 1;
>  module_param(max_part, int, 0444);
>  MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
>  
> +static unsigned int rd_blksize = PAGE_SIZE;
> +module_param(rd_blksize, uint, 0444);
> +MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");

This needs CONFIG_BLK_DEV_RAM_BLOCK_SIZE to set the default size
for those of us who don't use modular kernels....
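Something along these lines would do it (a sketch only; the CONFIG_BLK_DEV_RAM_BLOCK_SIZE symbol is the suggestion above, not an existing Kconfig option):

```kconfig
config BLK_DEV_RAM_BLOCK_SIZE
	int "Default RAM disk block size (bytes)"
	depends on BLK_DEV_RAM
	default 4096
	help
	  Default block size for RAM disks, used when the rd_blksize
	  module parameter is not given (e.g. on built-in kernels).
```

with brd.c then initialising the parameter from it:

```c
static unsigned int rd_blksize = CONFIG_BLK_DEV_RAM_BLOCK_SIZE;
```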

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 15:06     ` Hannes Reinecke
  2023-06-14 15:35       ` Hannes Reinecke
  2023-06-14 23:53       ` Dave Chinner
@ 2023-06-15  3:44       ` Dave Chinner
  2 siblings, 0 replies; 40+ messages in thread
From: Dave Chinner @ 2023-06-15  3:44 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> On 6/14/23 15:53, Matthew Wilcox wrote:
> > On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
> > > Turns out that was quite easy to fix (just remove the check in
> > > set_blocksize()), but now I get this:
> > > 
> > > SGI XFS with ACLs, security attributes, quota, no debug enabled
> > > XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
> > > less will currently work.
> > 
> > What happens if you just remove this hunk:
> > 
> > +++ b/fs/xfs/xfs_super.c
> > @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
> >                  goto out_free_sb;
> >          }
> > 
> > -       /*
> > -        * Until this is fixed only page-sized or smaller data blocks work.
> > -        */
> > -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> > -               xfs_warn(mp,
> > -               "File system with blocksize %d bytes. "
> > -               "Only pagesize (%ld) or less will currently work.",
> > -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
> > -               error = -ENOSYS;
> > -               goto out_free_sb;
> > -       }
> > -
> >          /* Ensure this filesystem fits in the page cache limits */
> >          if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
> >              xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
> 
> Whee! That works!
> 
> Rebased things with your memcpy_{to,from}_folio() patches, disabled that
> chunk, and:
> 
> # mount /dev/ram0 /mnt

What is the output of mkfs.xfs?

> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> XFS (ram0): Ending clean mount
> xfs filesystem being mounted at /mnt supports timestamps until 2038-01-19
> (0x7fffffff)
> # umount /mnt
> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551

Nope. Not here.

Debug kernel builds assert fail at mount time with:

XFS: Assertion failed: PAGE_SHIFT >= sbp->sb_blocklog, file: fs/xfs/xfs_mount.c, line: 133

Because we do a check to ensure that the entire filesystem address
range can be indexed by the page cache. I suspect this is actually a
stale, left over check from the days we used the page cache for
indexing cached metadata, but with that sorted....

It fails here (8GB ram disk):

#mkfs.xfs -f -b size=64k /dev/ram0
meta-data=/dev/ram0              isize=512    agcount=4, agsize=32000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=128000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=1024, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
# mount /dev/ram0 /mnt/test
[   34.047433] XFS (ram0): Mounting V5 Filesystem 074579ae-9c33-447a-a336-8ea99cda87c3
[   34.053962] BUG: Bad rss-counter state mm:00000000b41e2cf6 type:MM_FILEPAGES val:11
[   34.054451] general protection fault, probably for non-canonical address 0x4002888237d00000: 0000 [#1] PREEMPT SMP
[   34.057426] psi: task underflow! cpu=8 t=2 tasks=[0 0 0 0] clear=4 set=0
[   34.065011] CPU: 2 PID: 3689 Comm: mount Not tainted 6.4.0-rc6-dgc+ #1832
[   34.068647] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[   34.073236] RIP: 0010:__submit_bio+0x9e/0x110
[   34.075124] Code: c3 e8 d6 c5 88 ff 48 8b 43 08 48 89 df 4c 8b 60 10 49 8b 44 24 60 ff 10 49 8b 5c 24 68 e8 3a 92 88 ff 48 8b 43 10 a8 03 75 56 <65> 48 8
[   34.084879] RSP: 0018:ffffc900045c3a10 EFLAGS: 00010246
[   34.087501] RAX: 4003000000000000 RBX: ffff8885c1568000 RCX: 0000000000000080
[   34.090455] RDX: ffff888805d8a900 RSI: 0000000000000286 RDI: ffffc900045c3ad8
[   34.093419] RBP: ffffc900045c3a20 R08: 0000000000000000 R09: 0000000000000000
[   34.096381] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88880124b000
[   34.099340] R13: ffff8885c1568000 R14: ffff888100620000 R15: 0000000000fa0000
[   34.102285] FS:  00007f1b86428840(0000) GS:ffff888237d00000(0000) knlGS:0000000000000000
[   34.105410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   34.107577] CR2: 00007f1b86364000 CR3: 000000010482d003 CR4: 0000000000060ee0
[   34.110114] Call Trace:
[   34.111031]  <TASK>
[   34.111782]  ? show_regs+0x61/0x70
[   34.112945]  ? die_addr+0x37/0x90
[   34.114080]  ? exc_general_protection+0x19e/0x3b0
[   34.115619]  ? asm_exc_general_protection+0x27/0x30
[   34.117162]  ? __submit_bio+0x9e/0x110
[   34.118355]  submit_bio_noacct_nocheck+0xf3/0x330
[   34.119819]  submit_bio_noacct+0x196/0x490
[   34.121042]  submit_bio+0x58/0x60
[   34.122045]  submit_bio_wait+0x70/0xd0
[   34.123178]  xfs_rw_bdev+0x188/0x1b0
[   34.124255]  xlog_do_io+0x95/0x170
[   34.125283]  xlog_bwrite+0x14/0x20
[   34.126310]  xlog_write_log_records+0x179/0x260
[   34.127637]  xlog_clear_stale_blocks+0xa5/0x1c0
[   34.128917]  xlog_find_tail+0x372/0x3b0
[   34.130011]  xlog_recover+0x2f/0x190
[   34.131041]  xfs_log_mount+0x1b8/0x350
[   34.132055]  xfs_mountfs+0x451/0x9a0
[   34.133019]  xfs_fs_fill_super+0x4d9/0x920
[   34.134113]  get_tree_bdev+0x16e/0x270
[   34.135130]  ? xfs_open_devices+0x230/0x230
[   34.136184]  xfs_fs_get_tree+0x15/0x20
[   34.137144]  vfs_get_tree+0x24/0xd0
[   34.138041]  path_mount+0x2fd/0xae0
[   34.138955]  ? putname+0x53/0x60
[   34.139749]  __x64_sys_mount+0x108/0x140
[   34.140698]  do_syscall_64+0x34/0x80
[   34.141574]  entry_SYSCALL_64_after_hwframe+0x63/0xcd


Hmmmm - that's logical sector aligned/sized IO that is failing like
this. The fs is using 64kB block size, 512 byte sector size.

So I went looking.

# blockdev --report /dev/ram0
RO    RA   SSZ   BSZ        StartSec            Size   Device
rw   256   512  4096               0      8388608000   /dev/ram0
#

Huh. sector size is fine, block size for the device isn't.

# cat /sys/block/ram0/queue/physical_block_size
65536
#

Yup, brd definitely picked up that it is supposed to be using 64kB
blocks.

# blockdev --setbsz 65536 /dev/ram0
blockdev: ioctl error on BLKBSZSET: Invalid argument
#

Huh.

<dig dig dig>

int set_blocksize(struct block_device *bdev, int size)
{
        /* Size must be a power of two, and between 512 and PAGE_SIZE */
        if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
                return -EINVAL;
	.....

Yeah, ok. The block device doesn't support 64kB block sizes. Lucky
that XFS uses this as its sector size:

# mkfs.xfs -f -b size=64k -s size=16k /dev/ram0
....
# mount /dev/ram0 /mnt/test
[  692.564375] XFS (ram0): Cannot set_blocksize to 16384 on device ram0
<mount fails>
#

Now expected. I wonder if the problem is 512 byte sector sizes....

# mkfs.xfs -f -s size=4k -b size=64k /dev/ram0
meta-data=/dev/ram0              isize=512    agcount=4, agsize=32000 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=128000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=1024, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
# mount /dev/ram0 /mnt/test
[  835.711473] XFS (ram0): Mounting V5 Filesystem 72743c95-1264-43cd-8867-1f2b2e30ba24
[  835.722700] XFS (ram0): Ending clean mount
#

Okay, there we go. The patchset appears to have some kind of problem
with the filesystem using the minimum logical sector size of 512
bytes on this modified driver. Setting sector size == PAGE_SIZE
allows the filesystem to mount, but brd should not break if logical
sector aligned/sized IO is done.

$ sudo xfs_io -f -d -c "pwrite -b 1M 0 1M" -c "pread -v 0 1M" /mnt/test/foo
wrote 1048576/1048576 bytes at offset 0
1 MiB, 1 ops; 0.0004 sec (2.266 GiB/sec and 2320.1856 ops/sec)
.....
000ffff0:  cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd  ................
read 1048576/1048576 bytes at offset 0
1 MiB, 16 ops; 0.9233 sec (1.083 MiB/sec and 17.3284 ops/sec)
$

Ok, direct IO works just fine.

$ xfs_io -c "pwrite -S 0xaa -b 1M 0 1M" -c "pread 0 1M -v" /mnt/test/foo
.....
000ffff0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  ................
read 1048576/1048576 bytes at offset 0
1 MiB, 16 ops; 0.7035 sec (1.421 MiB/sec and 22.7428 ops/sec)
$

Ok, well aligned buffered IO looks like it works.

Right, let's step it up and do some more complex stuff. Let's run a
basic fsx pass on the filesystem:


$ sudo ~/src/xfstests-dev/ltp/fsx -d /mnt/test/baz
Seed set to 1
main: filesystem does not support dedupe range, disabling!
main: filesystem does not support exchange range, disabling!
truncating to largest ever: 0x3aea7
2 trunc from 0x0 to 0x3aea7
3 copy  from 0x1a3d6 to 0x26608, (0xc232 bytes) at 0x2ea8c
Segmentation fault
$

And there's the boom, only 3 operations into the test. This is kinda
what I expected - getting fsx to run for billions of ops without
failure might take a while.

Huh, why did it say FIDEDUPERANGE was not supported - that's weird,
something is broken there, maybe the fsx test.

[ 1787.365339] ------------[ cut here ]------------
[ 1787.368623] kernel BUG at include/linux/pagemap.h:1248!
[ 1787.371488] invalid opcode: 0000 [#1] PREEMPT SMP
[ 1787.374814] CPU: 10 PID: 5153 Comm: fsx Not tainted 6.4.0-rc6-dgc+ #1832
[ 1787.377240] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 1787.383061] RIP: 0010:read_pages+0x11d/0x230
[ 1787.385268] Code: 4c 89 e7 e8 c5 14 ff ff f0 41 ff 4c 24 34 0f 85 55 ff ff ff 4c 89 e7 e8 61 1d 00 00 8b 73 24 8b 43 20 39 f0 0f 83 4d ff ff ff <0f> 0b 0
[ 1787.395078] RSP: 0018:ffffc90004113918 EFLAGS: 00010283
[ 1787.396636] RAX: 0000000000000001 RBX: ffffc90004113ab0 RCX: 0000000000000000
[ 1787.400357] RDX: 0000000000001000 RSI: 0000000000000010 RDI: ffffea0017421000
[ 1787.403989] RBP: ffffc90004113960 R08: 0000000000001000 R09: 0000000000000000
[ 1787.407915] R10: ffff8885d084a000 R11: ffffc900041137d8 R12: ffffffff822b2e60
[ 1787.411472] R13: 000000000000001b R14: ffffea0017421000 R15: ffff8885c0cc8318
[ 1787.415342] FS:  00007f96c86fcc40(0000) GS:ffff8885fed00000(0000) knlGS:0000000000000000
[ 1787.418404] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1787.420717] CR2: 000055ecf4289908 CR3: 00000005feceb001 CR4: 0000000000060ee0
[ 1787.423110] Call Trace:
[ 1787.424214]  <TASK>
[ 1787.425606]  ? show_regs+0x61/0x70
[ 1787.426568]  ? die+0x37/0x90
[ 1787.427584]  ? do_trap+0xec/0x100
[ 1787.428959]  ? do_error_trap+0x6c/0x90
[ 1787.430672]  ? read_pages+0x11d/0x230
[ 1787.432259]  ? exc_invalid_op+0x52/0x70
[ 1787.433312]  ? read_pages+0x11d/0x230
[ 1787.434544]  ? asm_exc_invalid_op+0x1b/0x20
[ 1787.436038]  ? read_pages+0x11d/0x230
[ 1787.437442]  ? read_pages+0x5c/0x230
[ 1787.438943]  page_cache_ra_unbounded+0x128/0x1b0
[ 1787.440419]  do_page_cache_ra+0x6c/0x70
[ 1787.441765]  ondemand_readahead+0x31f/0x350
[ 1787.443426]  page_cache_sync_ra+0x49/0x50
[ 1787.445070]  filemap_get_pages+0x10e/0x680
[ 1787.446259]  ? xfs_ilock+0xc1/0x220
[ 1787.447426]  filemap_read+0xed/0x380
[ 1787.448632]  ? kmem_cache_free+0x1f5/0x480
[ 1787.449926]  ? xfs_log_ticket_put+0x2f/0x60
[ 1787.451152]  ? xfs_inode_item_release+0x2e/0xa0
[ 1787.453128]  generic_file_read_iter+0xdb/0x160
[ 1787.454527]  xfs_file_buffered_read+0x54/0xd0
[ 1787.455894]  xfs_file_read_iter+0x74/0xe0
[ 1787.457544]  generic_file_splice_read+0x8c/0x150
[ 1787.460094]  do_splice_to+0x85/0xb0
[ 1787.461285]  splice_direct_to_actor+0xb3/0x210
[ 1787.462336]  ? pipe_to_sendpage+0xa0/0xa0
[ 1787.463287]  do_splice_direct+0x92/0xd0
[ 1787.464203]  vfs_copy_file_range+0x2af/0x560
[ 1787.465229]  __do_sys_copy_file_range+0xe3/0x1f0
[ 1787.466429]  __x64_sys_copy_file_range+0x24/0x30
[ 1787.468053]  do_syscall_64+0x34/0x80
[ 1787.469392]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

static inline struct folio *__readahead_folio(struct readahead_control *ractl)
{
        struct folio *folio;

>>>>>   BUG_ON(ractl->_batch_count > ractl->_nr_pages);
        ractl->_nr_pages -= ractl->_batch_count;
        ractl->_index += ractl->_batch_count;

....

So something is going wrong in the readahead path from a splice
operation from copy_file_range().

..... Wait, what?

Why is it splicing rather than doing a remap operation?  'cp
--reflink=always bar bar2' appears to work fine, so it's unexpected
that it's copying data rather than cloning extents. Something is
going wrong there...

.....

Ok, that's enough time spent on this right now. The BS > PS stuff in
this patchset doesn't allow filesystems to work correctly, and the
reasons for things going wrong are not obvious.

I suspect that this is going to take quite some work just to muscle
through all these whacky corner cases - fsx will find a lot of
them; you'll need to work through them until it runs without fail
for a couple of billion ops.

The patch I was using is below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: support block size > page size

From: Dave Chinner <dchinner@redhat.com>

Everything is supposed to work, so turn on the BOOM.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 drivers/block/brd.c |  2 +-
 fs/xfs/xfs_mount.c  |  4 +++-
 fs/xfs/xfs_super.c  | 12 ------------
 3 files changed, 4 insertions(+), 14 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a9f3c6591e75..e6e4f31bfcf5 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -314,7 +314,7 @@ static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
-static unsigned int rd_blksize = PAGE_SIZE;
+static unsigned int rd_blksize = 65536;
 module_param(rd_blksize, uint, 0444);
 MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
 
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index fb87ffb48f7f..921acd02787c 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -130,10 +130,12 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
 	ASSERT(sbp->sb_blocklog >= BBSHIFT);
 
 	/* Limited by ULONG_MAX of page cache index */
+	if (sbp->sb_blocklog > PAGE_SHIFT &&
+	    (nblocks << (sbp->sb_blocklog - PAGE_SHIFT) > ULONG_MAX))
+		return -EFBIG;
 	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
 		return -EFBIG;
 	return 0;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 4120bd1cba90..3c2fc203a5c0 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
-	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
-	}
-
 	/* Ensure this filesystem fits in the page cache limits */
 	if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
 	    xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] brd: make sector size configurable
  2023-06-15  2:17   ` Dave Chinner
@ 2023-06-15  5:55     ` Christoph Hellwig
  2023-06-15  6:33       ` Hannes Reinecke
  2023-06-15  6:23     ` Hannes Reinecke
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2023-06-15  5:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hannes Reinecke, Matthew Wilcox, linux-fsdevel, linux-block,
	Andrew Morton, Christoph Hellwig, Luis Chamberlain

On Thu, Jun 15, 2023 at 12:17:12PM +1000, Dave Chinner wrote:
> On Wed, Jun 14, 2023 at 01:46:34PM +0200, Hannes Reinecke wrote:
> > @@ -310,6 +312,10 @@ static int max_part = 1;
> >  module_param(max_part, int, 0444);
> >  MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
> >  
> > +static unsigned int rd_blksize = PAGE_SIZE;
> > +module_param(rd_blksize, uint, 0444);
> > +MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
> 
> This needs CONFIG_BLK_DEV_RAM_BLOCK_SIZE to set the default size
> for those of us who don't use modular kernels....

You can set module parameters on the kernel command line for built-in
code, e.g. brd.rd_blksize=8192

While we're at it, why that weird rd_ prefix for the parameter?


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] highmem: Add memcpy_to_folio()
  2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
  2023-06-14 18:38   ` kernel test robot
  2023-06-14 19:30   ` kernel test robot
@ 2023-06-15  5:58   ` Christoph Hellwig
  2023-06-15 12:16     ` Matthew Wilcox
  2 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2023-06-15  5:58 UTC (permalink / raw)
  To: Matthew Wilcox (Oracle)
  Cc: Hannes Reinecke, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On Wed, Jun 14, 2023 at 02:48:52PM +0100, Matthew Wilcox (Oracle) wrote:
> This is the folio equivalent of memcpy_to_page(), but it handles large
> highmem folios.  It may be a little too big to inline on systems that
> have CONFIG_HIGHMEM enabled but on systems we actually care about almost
> all the code will be eliminated.

I suspect the right thing is to have the trivial version without kmapping
for !HIGHMEM inline, and a separate version with the kmap loop out of
line for HIGHMEM builds.

Same for the next patch.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-14 23:53       ` Dave Chinner
@ 2023-06-15  6:21         ` Hannes Reinecke
  2023-06-15  8:51           ` Dave Chinner
  0 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-15  6:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On 6/15/23 01:53, Dave Chinner wrote:
> On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
>> On 6/14/23 15:53, Matthew Wilcox wrote:
>>> On Wed, Jun 14, 2023 at 03:17:25PM +0200, Hannes Reinecke wrote:
>>>> Turns out that was quite easy to fix (just remove the check in
>>>> set_blocksize()), but now I get this:
>>>>
>>>> SGI XFS with ACLs, security attributes, quota, no debug enabled
>>>> XFS (ram0): File system with blocksize 16384 bytes. Only pagesize (4096) or
>>>> less will currently work.
>>>
>>> What happens if you just remove this hunk:
>>>
>>> +++ b/fs/xfs/xfs_super.c
>>> @@ -1583,18 +1583,6 @@ xfs_fs_fill_super(
>>>                   goto out_free_sb;
>>>           }
>>>
>>> -       /*
>>> -        * Until this is fixed only page-sized or smaller data blocks work.
>>> -        */
>>> -       if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
>>> -               xfs_warn(mp,
>>> -               "File system with blocksize %d bytes. "
>>> -               "Only pagesize (%ld) or less will currently work.",
>>> -                               mp->m_sb.sb_blocksize, PAGE_SIZE);
>>> -               error = -ENOSYS;
>>> -               goto out_free_sb;
>>> -       }
>>> -
>>>           /* Ensure this filesystem fits in the page cache limits */
>>>           if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
>>>               xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
>>
>> Whee! That works!
>>
>> Rebased things with your memcpy_{to,from}_folio() patches, disabled that
>> chunk, and:
>>
>> # mount /dev/ram0 /mnt
>> XFS (ram0): Mounting V5 Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
>> XFS (ram0): Ending clean mount
>> xfs filesystem being mounted at /mnt supports timestamps until 2038-01-19
>> (0x7fffffff)
>> # umount /mnt
>> XFS (ram0): Unmounting Filesystem 5cd71ab5-2d11-4c18-97dd-71708f40e551
> 
> Mounting the filesystem doesn't mean it works. XFS metadata has
> laways worked with bs > ps, and mounting the filesystem only does
> metadata IO.
> 
> It's not until you start reading/writing user data that the
> filesystem will start exercising the page cache....
> 
>> Great work, Matthew!
>>
>> (Now I just need to check why copying data from NFS crashes ...)
> 
> .... and then we see it doesn't actually work. :)
> 
> Likely you also need the large folio support in the iomap write path
> patches from Willy, plus whatever corner cases in iomap that still
> have implicit dependencies on PAGE_SIZE need to be fixed (truncate,
> invalidate, sub-block zeroing, etc may not be doing exactly the
> right thing).
> 
These are built on top of the mm-unstable branch from akpm, which does 
include the iomap write path patches from Willy, so yes, I know.

> All you need to do now is run the BS > PS filesytems through a full
> fstests pass (reflink + rmap enabled, auto group), and then we can
> start on the real data integrity validation work. It'll need tens of
> billions of fsx ops run on it, days of recoveryloop testing, days of
> fstress based exercise, etc before we can actually enable it in
> XFS....
> 
Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
running tests and figure out corner cases (like NFS crashing on me :-).
With this patchset we now have some infrastructure in place making it
even _possible_ to run those tests.

Don't be so pessimistic ...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] brd: make sector size configurable
  2023-06-15  2:17   ` Dave Chinner
  2023-06-15  5:55     ` Christoph Hellwig
@ 2023-06-15  6:23     ` Hannes Reinecke
  1 sibling, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-15  6:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On 6/15/23 04:17, Dave Chinner wrote:
> On Wed, Jun 14, 2023 at 01:46:34PM +0200, Hannes Reinecke wrote:
>> @@ -310,6 +312,10 @@ static int max_part = 1;
>>   module_param(max_part, int, 0444);
>>   MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
>>   
>> +static unsigned int rd_blksize = PAGE_SIZE;
>> +module_param(rd_blksize, uint, 0444);
>> +MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
> 
> This needs CONFIG_BLK_DEV_RAM_BLOCK_SIZE to set the default size
> for those of us who don't use modular kernels....
Ok, will do.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/7] brd: make sector size configurable
  2023-06-15  5:55     ` Christoph Hellwig
@ 2023-06-15  6:33       ` Hannes Reinecke
  0 siblings, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-15  6:33 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Luis Chamberlain

On 6/15/23 07:55, Christoph Hellwig wrote:
> On Thu, Jun 15, 2023 at 12:17:12PM +1000, Dave Chinner wrote:
>> On Wed, Jun 14, 2023 at 01:46:34PM +0200, Hannes Reinecke wrote:
>>> @@ -310,6 +312,10 @@ static int max_part = 1;
>>>   module_param(max_part, int, 0444);
>>>   MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
>>>   
>>> +static unsigned int rd_blksize = PAGE_SIZE;
>>> +module_param(rd_blksize, uint, 0444);
>>> +MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
>>
>> This needs CONFIG_BLK_DEV_RAM_BLOCK_SIZE to set the default size
>> for those of us who don't use modular kernels....
> 
> You can set module parameters on the kernel command line for built-in
> code, e.g. brd.rd_blksize=8192
> 
> While we're at it, why that weird rd_ prefix for the parameter?
> 
Because that's what's used for all the existing parameters, too.

We can remove it, but then we either have inconsistent naming
(some parameters with 'rd_', others without) or we break existing setups.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-15  6:21         ` Hannes Reinecke
@ 2023-06-15  8:51           ` Dave Chinner
  2023-06-16 16:06             ` Kent Overstreet
  0 siblings, 1 reply; 40+ messages in thread
From: Dave Chinner @ 2023-06-15  8:51 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain

On Thu, Jun 15, 2023 at 08:21:10AM +0200, Hannes Reinecke wrote:
> On 6/15/23 01:53, Dave Chinner wrote:
> > On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> > All you need to do now is run the BS > PS filesytems through a full
> > fstests pass (reflink + rmap enabled, auto group), and then we can
> > start on the real data integrity validation work. It'll need tens of
> > billions of fsx ops run on it, days of recoveryloop testing, days of
> > fstress based exercise, etc before we can actually enable it in
> > XFS....
> > 
> Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
> running tests and figure out corner cases (like NFS crashing on me :-).
> With this patchset we now have some infrastructure in place making it
> even _possible_ to run those tests.

I got to this same point several years ago. You know, that patchset
that Luis went back to when he brought up this whole topic again?
That's right when I started running fsx, and I realised it
didn't cover FICLONERANGE, FIDEDUPERANGE and copy_file_range().

Yep, that's when we first realised we had -zero- test coverage of
those operations. Darrick and I spent the next *3 months* pretty
much rewriting the VFS level of those operations and fixing all the
other bugs in the implementations, just so we could validate they
worked correctly on BS <= PS.

But by then Willy had started working over iomap and filemap for
folios, and the bs > PS patches were completely bitrotted and needed
rewriting from scratch. Which I now didn't have time to do....

So today is deja vu all over again: the first time I run fsx on
a new 64kB BS on 4KB PS implementation it hangs doing something
-really weird- and unexpected in copy_file_range(). It shouldn't
even be in the splice code doing a physical data copy.  So something
went wrong in ->remap_file_range(), or maybe in the truncate before
it, before it bugged out over out of range readahead in the page
cache...

I got only 3 fsx ops in today, and at least three bugs have already
manifested themselves....

> Don't be so pessimistic ...

I'm not pessimistic. I'm being realistic. I'm speaking from
experience. Not just as a Linux filesystem engineer who has had to
run this fsx-based data integrity validation process from the ground
up multiple times in the past decade, but also as an Irix OS
engineer that spent many, many hours digging out nasty, subtle bs > ps
data corruption bugs of the Irix buffer/page cache.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/2] highmem: Add memcpy_to_folio()
  2023-06-15  5:58   ` Christoph Hellwig
@ 2023-06-15 12:16     ` Matthew Wilcox
  0 siblings, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-15 12:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Hannes Reinecke, linux-fsdevel, linux-block, Andrew Morton,
	Luis Chamberlain

On Thu, Jun 15, 2023 at 07:58:31AM +0200, Christoph Hellwig wrote:
> On Wed, Jun 14, 2023 at 02:48:52PM +0100, Matthew Wilcox (Oracle) wrote:
> > This is the folio equivalent of memcpy_to_page(), but it handles large
> > highmem folios.  It may be a little too big to inline on systems that
> > have CONFIG_HIGHMEM enabled but on systems we actually care about almost
> > all the code will be eliminated.
> 
> I suspect the right thing is to have the trivial version without kmapping
> for !HIGHMEM inline, and a separate version with the kmap loop out of
> line for HIGHMEM builds.
> 
> Same for the next patch.

Yeah, that's what I did to zero_user_segments().  As time goes by,
I'm starting to care about performance of CONFIG_HIGHMEM systems less
and less.



^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 0/7] RFC: high-order folio support for I/O
  2023-06-15  8:51           ` Dave Chinner
@ 2023-06-16 16:06             ` Kent Overstreet
  0 siblings, 0 replies; 40+ messages in thread
From: Kent Overstreet @ 2023-06-16 16:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hannes Reinecke, Matthew Wilcox, linux-fsdevel, linux-block,
	Andrew Morton, Christoph Hellwig, Luis Chamberlain

On Thu, Jun 15, 2023 at 06:51:35PM +1000, Dave Chinner wrote:
> On Thu, Jun 15, 2023 at 08:21:10AM +0200, Hannes Reinecke wrote:
> > On 6/15/23 01:53, Dave Chinner wrote:
> > > On Wed, Jun 14, 2023 at 05:06:14PM +0200, Hannes Reinecke wrote:
> > > All you need to do now is run the BS > PS filesytems through a full
> > > fstests pass (reflink + rmap enabled, auto group), and then we can
> > > start on the real data integrity validation work. It'll need tens of
> > > billions of fsx ops run on it, days of recoveryloop testing, days of
> > > fstress based exercise, etc before we can actually enable it in
> > > XFS....
> > > 
> > Hey, c'mon. I do know _that_. All I'm saying is that now we can _start_
> > running tests and figure out corner cases (like NFS crashing on me :-).
> > With this patchset we now have some infrastructure in place making it
> > even _possible_ to run those tests.
> 
> I got to this same point several years ago. You know, that patchset
> that Luis went back to when he brought up this whole topic again?
> That's right when I started running fsx, and I realised it
> didn't cover FICLONERANGE, FIDEDUPERANGE and copy_file_range().
> 
> Yep, that's when we first realised we had -zero- test coverage of
> those operations. Darrick and I spent the next *3 months* pretty
> much rewriting the VFS level of those operations and fixing all the
> other bugs in the implementations, just so we could validate they
> worked correctly on BS <= PS.

code coverage analysis...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
       [not found]   ` <CGME20230619080901eucas1p224e67aa31866d2ad8d259b2209c2db67@eucas1p2.samsung.com>
@ 2023-06-19  8:08     ` Pankaj Raghav
  2023-06-19  8:42       ` Hannes Reinecke
  0 siblings, 1 reply; 40+ messages in thread
From: Pankaj Raghav @ 2023-06-19  8:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain, p.raghav, gost.dev

[-- Attachment #1: Type: text/plain, Size: 2490 bytes --]

Hi Hannes,
On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
> The mapping has an underlying blocksize (by virtue of
> mapping->host->i_blkbits), so if the mapping blocksize
> is larger than the pagesize we should allocate folios
> in the correct order.
> 
Network filesystems such as 9pfs set blkbits to the maximum amount of
data they want to transfer, leading to unnecessary memory pressure as
we will try to allocate higher-order folios (order 5 in my setup).
Isn't it better for each filesystem to request the minimum folio order
it needs for its page cache early on? Block devices can do the same
for their block cache.

I have a prototype along those lines and I will post it soon. This is
also something willy indicated before in a mailing list conversation.

> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
> diff --git a/mm/readahead.c b/mm/readahead.c
> index 47afbca1d122..031935b78af7 100644
> --- a/mm/readahead.c
> +++ b/mm/readahead.c
> @@ -245,7 +245,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			continue;
>  		}
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
>  		if (!folio)
>  			break;
>  		if (filemap_add_folio(mapping, folio, index + i,

Did you turn on CONFIG_DEBUG_VM while testing? I don't think we are
incrementing the counter in this function correctly as this function
assumes order 0. We might need something like this:

-               ractl->_nr_pages++;
+               ractl->_nr_pages += folio_nr_pages(folio);
+               i += folio_nr_pages(folio) - 1;
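The mismatch can be seen in a toy model (Python; a deliberately simplified sketch of the accounting, not the kernel code): if the loop bumps _nr_pages by one per folio while allocating order-N folios, the counter ends up short of the number of pages actually added, which is exactly the inconsistency the BUG_ON in __readahead_folio() guards against.

```python
def simulate_nr_pages(nr_to_read, folio_order, count_pages):
    """Populate a readahead window and return the final _nr_pages."""
    nr_pages = 0
    index = 0
    folio_pages = 1 << folio_order
    while index < nr_to_read:
        # correct accounting counts pages, buggy accounting counts folios
        nr_pages += folio_pages if count_pages else 1
        index += folio_pages
    return nr_pages

# order-2 (4-page) folios over a 16-page window:
assert simulate_nr_pages(16, 2, count_pages=True) == 16   # pages added
assert simulate_nr_pages(16, 2, count_pages=False) == 4   # counter is short
```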
> @@ -806,7 +806,7 @@ void readahead_expand(struct readahead_control *ractl,
>  		if (folio && !xa_is_value(folio))
>  			return; /* Folio apparently present */
>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
>  		if (!folio)
>  			return;
>  		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> @@ -833,7 +833,7 @@ void readahead_expand(struct readahead_control *ractl,
>  		if (folio && !xa_is_value(folio))
>  			return; /* Folio apparently present */
Same here:
-               ractl->_nr_pages++;
+               ractl->_nr_pages += folio_nr_pages(folio);

>  
> -		folio = filemap_alloc_folio(gfp_mask, 0);
> +		folio = filemap_alloc_folio(gfp_mask, mapping_get_order(mapping));
>  		if (!folio)
>  			return;
>  		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
> -- 
> 2.35.3
> 




^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
  2023-06-19  8:08     ` Pankaj Raghav
@ 2023-06-19  8:42       ` Hannes Reinecke
  2023-06-19 22:57         ` Dave Chinner
  0 siblings, 1 reply; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-19  8:42 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Matthew Wilcox, linux-fsdevel, linux-block, Andrew Morton,
	Christoph Hellwig, Luis Chamberlain, gost.dev

On 6/19/23 10:08, Pankaj Raghav wrote:
> Hi Hannes,
> On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
>> The mapping has an underlying blocksize (by virtue of
>> mapping->host->i_blkbits), so if the mapping blocksize
>> is larger than the pagesize we should allocate folios
>> in the correct order.
>>
> Network filesystems such as 9pfs set blkbits to the maximum amount of
> data they want to transfer, leading to unnecessary memory pressure as
> we will try to allocate higher-order folios (order 5 in my setup).
> Isn't it better for each filesystem to request the minimum folio order
> it needs for its page cache early on? Block devices can do the same
> for their block cache.
> 
> I have a prototype along those lines and I will post it soon. This is
> also something willy indicated before in a mailing list conversation.
> 
Well; I _thought_ that's why we had things like optimal I/O size and
maximal I/O size. But these seem to be relegated to request queue limits,
so I guess they're not available from 'struct block_device' or 'struct
gendisk'.

So I've been thinking of adding a flag somewhere (possibly in
'struct address_space') to indicate that blkbits is a hard limit
and not just an advisory thing.

But indeed, I've seen this with NFS, too, which insists on setting 
blkbits to something like 8.

Cheers,

Hannes


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
  2023-06-19  8:42       ` Hannes Reinecke
@ 2023-06-19 22:57         ` Dave Chinner
  2023-06-20  0:00           ` Matthew Wilcox
  2023-06-20  5:57           ` Hannes Reinecke
  0 siblings, 2 replies; 40+ messages in thread
From: Dave Chinner @ 2023-06-19 22:57 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Pankaj Raghav, Matthew Wilcox, linux-fsdevel, linux-block,
	Andrew Morton, Christoph Hellwig, Luis Chamberlain, gost.dev

On Mon, Jun 19, 2023 at 10:42:38AM +0200, Hannes Reinecke wrote:
> On 6/19/23 10:08, Pankaj Raghav wrote:
> > Hi Hannes,
> > On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
> > > The mapping has an underlying blocksize (by virtue of
> > > mapping->host->i_blkbits), so if the mapping blocksize
> > > is larger than the pagesize we should allocate folios
> > > in the correct order.
> > > 
> > Network filesystems such as 9pfs set blkbits to the maximum amount of
> > data they want to transfer, leading to unnecessary memory pressure as
> > we will try to allocate higher-order folios (order 5 in my setup).
> > Isn't it better for each filesystem to request the minimum folio order
> > it needs for its page cache early on? Block devices can do the same
> > for their block cache.

Folio size is not a "filesystem wide" thing - it's a per-inode
configuration. We can have inodes within a filesystem that have
different "block" sizes. A prime example of this is XFS directories
- they can have 64kB block sizes on 4kB block size filesystem.

Another example is extent size hints in XFS data files - they
trigger aligned allocation-around similar to using large folios in
the page cache for small writes. Effectively this gives data files a
"block size" of the extent size hint regardless of the filesystem
block size.

Hence in future we might want different sizes of folios for
different types of inodes and so whatever we do we need to support
per-inode folio size configuration for the inode mapping tree.

> > I have a prototype along those lines and I will post it soon. This is
> > also something willy indicated before in a mailing list conversation.
> > 
> Well; I _thought_ that's why we had things like optimal I/O size and
> maximal I/O size. But these seem to be relegated to request queue limits,
> so I guess they're not available from 'struct block_device' or 'struct
> gendisk'.

Yes, those are block device constructs to enable block device based
filesystems to be laid out best for the given block device. They
don't exist for non-block-based filesystems like network
filesystems...

> So I've been thinking of adding a flag somewhere (possibly in
> 'struct address_space') to indicate that blkbits is a hard limit
> and not just an advisory thing.

This still relies on interpreting inode->i_blkbits repeatedly at
runtime in some way, in mm code that really has no business looking
at filesystem block sizes.

What is needed is a field in the mapping that defines the
folio order that all folios allocated for the page cache must be
aligned/sized to, to allow them to be inserted into the mapping.

This means the minimum folio order and alignment are maintained
entirely by the mapping (e.g. it allows truncate to do the right
thing), and the filesystem/device side code does not need to do
anything special (except support large folios) to ensure that the
page cache always contains folios that are block sized and aligned.
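That invariant can be stated as a small predicate (Python; a simplified sketch under the assumption that large folios are naturally aligned to their own size):

```python
def valid_insert(index, order, min_order):
    """May a folio of this order be inserted at this page-cache index?"""
    # at least the mapping's minimum order, and naturally aligned
    return order >= min_order and index % (1 << order) == 0

# with a 16kB block size on 4kB pages, min_order = 2:
assert valid_insert(0, 2, min_order=2)        # aligned, right size
assert not valid_insert(1, 2, min_order=2)    # misaligned
assert not valid_insert(4, 0, min_order=2)    # smaller than the block
```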

We already have mapping_set_large_folios() that we use at
inode/mapping instantiation time to enable large folios in the page
cache for that mapping. What we need is a new
mapping_set_large_folio_order() API to enable the filesystem/device
to set the base folio order for the mapping tree at instantiation
time, and for all the page cache instantiation code to align/size to
the order stored in the mapping...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
  2023-06-19 22:57         ` Dave Chinner
@ 2023-06-20  0:00           ` Matthew Wilcox
  2023-06-20  5:57           ` Hannes Reinecke
  1 sibling, 0 replies; 40+ messages in thread
From: Matthew Wilcox @ 2023-06-20  0:00 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Hannes Reinecke, Pankaj Raghav, linux-fsdevel, linux-block,
	Andrew Morton, Christoph Hellwig, Luis Chamberlain, gost.dev

[-- Attachment #1: Type: text/plain, Size: 296 bytes --]

On Tue, Jun 20, 2023 at 08:57:48AM +1000, Dave Chinner wrote:
> What is needed is a field in the mapping that defines the
> folio order that all folios allocated for the page cache must be
> aligned/sized to, to allow them to be inserted into the mapping.

Attached patch from December 2020 ;-)

[-- Attachment #2: 0001-fs-Allow-fine-grained-control-of-folio-sizes.patch --]
[-- Type: text/plain, Size: 3421 bytes --]

From 1aeee696f4d322af5f34544e39fc00006c399fb8 Mon Sep 17 00:00:00 2001
From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Date: Tue, 15 Dec 2020 10:57:34 -0500
Subject: [PATCH] fs: Allow fine-grained control of folio sizes

Some filesystems want to be able to limit the maximum size of folios,
and some want to be able to ensure that folios are at least a certain
size.  Add mapping_set_folio_orders() to allow this level of control
(although it is not yet honoured).

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 include/linux/pagemap.h | 41 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 37 insertions(+), 4 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cad81db32e61..9cbb8bdbaee7 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -198,9 +198,15 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13,
+	/* 8-17 are used for FOLIO_ORDER */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK	0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0003e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -290,6 +296,29 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/**
+ * mapping_set_folio_orders() - Set the range of folio sizes supported.
+ * @mapping: The file.
+ * @min: Minimum folio order (between 0-31 inclusive).
+ * @max: Maximum folio order (between 0-31 inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which sizes of folio the VFS can use to cache the contents
+ * of the file.  This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ * 
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+		unsigned int min, unsigned int max)
+{
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			(min << AS_FOLIO_ORDER_MIN) |
+			(max << AS_FOLIO_ORDER_MAX);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
  * @mapping: The file.
@@ -303,7 +332,12 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_orders(mapping, 0, 31);
+}
+
+static inline unsigned mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
 }
 
 /*
@@ -312,8 +346,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return mapping_max_folio_order(mapping) > 0;
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
-- 
2.35.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize
  2023-06-19 22:57         ` Dave Chinner
  2023-06-20  0:00           ` Matthew Wilcox
@ 2023-06-20  5:57           ` Hannes Reinecke
  1 sibling, 0 replies; 40+ messages in thread
From: Hannes Reinecke @ 2023-06-20  5:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Pankaj Raghav, Matthew Wilcox, linux-fsdevel, linux-block,
	Andrew Morton, Christoph Hellwig, Luis Chamberlain, gost.dev

On 6/20/23 00:57, Dave Chinner wrote:
> On Mon, Jun 19, 2023 at 10:42:38AM +0200, Hannes Reinecke wrote:
>> On 6/19/23 10:08, Pankaj Raghav wrote:
>>> Hi Hannes,
>>> On Wed, Jun 14, 2023 at 01:46:36PM +0200, Hannes Reinecke wrote:
>>>> The mapping has an underlying blocksize (by virtue of
>>>> mapping->host->i_blkbits), so if the mapping blocksize
>>>> is larger than the pagesize we should allocate folios
>>>> in the correct order.
>>>>
>>> Network filesystems such as 9pfs set blkbits to the maximum amount of
>>> data they want to transfer, leading to unnecessary memory pressure as we
>>> will try to allocate higher-order folios (order 5 in my setup). Isn't it better
>>> for each filesystem to request the minimum folio order it needs for its
>>> page cache early on? Block devices can do the same for its block cache.
> 
> Folio size is not a "filesystem wide" thing - it's a per-inode
> configuration. We can have inodes within a filesystem that have
> different "block" sizes. A prime example of this is XFS directories
> - they can have 64kB block sizes on 4kB block size filesystem.
> 
> Another example is extent size hints in XFS data files - they
> trigger aligned allocation-around similar to using large folios in
> the page cache for small writes. Effectively this gives data files a
> "block size" of the extent size hint regardless of the filesystem
> block size.
> 
> Hence in future we might want different sizes of folios for
> different types of inodes and so whatever we do we need to support
> per-inode folio size configuration for the inode mapping tree.
> 
Sure. Using some mapping tree configuration was what I had in mind, too.

>>> I have a prototype along those lines and I will post it soon. This is
>>> also something willy indicated before in a mailing list conversation.
>>>
>> Well; I _thought_ that's why we had things like optimal I/O size and
>> maximal I/O size. But these seem to be relegated to request queue limits,
>> so I guess they're not available from 'struct block_device' or 'struct
>> gendisk'.
> 
> Yes, those are block device constructs to enable block device based
> filesystems to be laid out best for the given block device. They
> don't exist for non-block-based filesystems like network
> filesystems...
> 
>> So I've been thinking of adding a flag somewhere (possibly in
>> 'struct address_space') to indicate that blkbits is a hard limit
>> and not just an advisory thing.
> 
> This still relies on interpreting inode->i_blkbits repeatedly at
> runtime in some way, in mm code that really has no business looking
> at filesystem block sizes.
> 
> What is needed is a field in the mapping that defines the
> folio order that all folios allocated for the page cache must be
> aligned/sized to in order to allow them to be inserted into the mapping.
> 
> This means the minimum folio order and alignment is maintained
> entirely by the mapping (e.g. it allows truncate to do the right
> thing), and the filesystem/device side code does not need to do
> anything special (except support large folios) to ensure that the
> page cache always contains folios that are block sized and aligned.
> 
> We already have mapping_set_large_folios() that we use at
> inode/mapping instantiation time to enable large folios in the page
> cache for that mapping. What we need is a new
> mapping_set_large_folio_order() API to enable the filesystem/device
> to set the base folio order for the mapping tree at instantiation
> time, and for all the page cache instantiation code to align/size to
> the order stored in the mapping...
> 
Having a mapping_set_large_folio_order() (or maybe just
mapping_set_folio_order(), since as soon as order > 0 we will have large
folios anyway ...) looks like a good idea.
I'll give it a go.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2023-06-20  5:57 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-14 11:46 [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
2023-06-14 11:46 ` [PATCH 1/7] brd: use XArray instead of radix-tree to index backing pages Hannes Reinecke
2023-06-14 12:45   ` Matthew Wilcox
2023-06-14 12:50     ` Pankaj Raghav
2023-06-14 13:03       ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 2/7] brd: convert to folios Hannes Reinecke
2023-06-14 13:45   ` Matthew Wilcox
2023-06-14 13:50     ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 3/7] brd: abstract page_size conventions Hannes Reinecke
2023-06-14 11:46 ` [PATCH 4/7] brd: make sector size configurable Hannes Reinecke
2023-06-14 12:55   ` Matthew Wilcox
2023-06-14 13:02     ` Hannes Reinecke
2023-06-15  2:17   ` Dave Chinner
2023-06-15  5:55     ` Christoph Hellwig
2023-06-15  6:33       ` Hannes Reinecke
2023-06-15  6:23     ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 5/7] brd: make logical " Hannes Reinecke
2023-06-14 11:46 ` [PATCH 6/7] mm/filemap: allocate folios with mapping blocksize Hannes Reinecke
     [not found]   ` <CGME20230619080901eucas1p224e67aa31866d2ad8d259b2209c2db67@eucas1p2.samsung.com>
2023-06-19  8:08     ` Pankaj Raghav
2023-06-19  8:42       ` Hannes Reinecke
2023-06-19 22:57         ` Dave Chinner
2023-06-20  0:00           ` Matthew Wilcox
2023-06-20  5:57           ` Hannes Reinecke
2023-06-14 11:46 ` [PATCH 7/7] mm/readahead: align readahead down to " Hannes Reinecke
2023-06-14 13:17 ` [PATCH 0/7] RFC: high-order folio support for I/O Hannes Reinecke
2023-06-14 13:53   ` Matthew Wilcox
2023-06-14 15:06     ` Hannes Reinecke
2023-06-14 15:35       ` Hannes Reinecke
2023-06-14 17:46         ` Matthew Wilcox
2023-06-14 23:53       ` Dave Chinner
2023-06-15  6:21         ` Hannes Reinecke
2023-06-15  8:51           ` Dave Chinner
2023-06-16 16:06             ` Kent Overstreet
2023-06-15  3:44       ` Dave Chinner
2023-06-14 13:48 ` [PATCH 1/2] highmem: Add memcpy_to_folio() Matthew Wilcox (Oracle)
2023-06-14 18:38   ` kernel test robot
2023-06-14 19:30   ` kernel test robot
2023-06-15  5:58   ` Christoph Hellwig
2023-06-15 12:16     ` Matthew Wilcox
2023-06-14 13:48 ` [PATCH 2/2] highmem: Add memcpy_from_folio() Matthew Wilcox (Oracle)
