All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 00/18] block: update buffer_head for Large-block I/O
@ 2023-09-18 11:04 Hannes Reinecke
  2023-09-18 11:04 ` [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded() Hannes Reinecke
                   ` (17 more replies)
  0 siblings, 18 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Hi all,

to make life a little bit more interesting, here is an alternative
patchset to enable large-block I/O for buffer_head. Based on top of
current linus master, no special tweaking required.
And yes, I am _very_ much aware of hch's work to disable buffer_head for iomap.

Key point here is that we can enable large block I/O for buffer heads
if we always consider the page reference in 'struct buffer_head' to be
a folio, and always calculcate the size of the page reference with using
folio_size(). That way we keep the assumptions that each page (or, in our
context, each folio) has a pointer (or a list of pointers) to a struct
buffer_head, and each buffer_head has a pointer to exactly one page/folio.
Then it's just a matter of auditing the page cache to always looks at the
folio and not the page, and the work's pretty much done.

Famous last words.

Patchset also contains an update to 'brd' to enable large block sizes.
For testing please do:

# modprobe brd rd_size=524288 rd_blksize=16384 rd_logical_blksize=16384
# mkfs.xfs -b size=16384

As usual, comments and reviews are welcome.

Hannes Reinecke (16):
  mm/readahead: rework loop in page_cache_ra_unbounded()
  fs/mpage: use blocks_per_folio instead of blocks_per_page
  block/buffer_head: introduce block_{index_to_sector,sector_to_index}
  fs/buffer.c: use accessor function to translate page index to sectors
  fs/mpage: use accessor function to translate page index to sectors
  mm/filemap: allocate folios with mapping order preference
  mm/readahead: allocate folios with mapping order preference
  fs/buffer: use mapping order in grow_dev_page()
  block/bdev: lift restrictions on supported blocksize
  block/bdev: enable large folio support for large logical block sizes
  brd: convert to folios
  brd: abstract page_size conventions
  brd: use memcpy_{to,from}_folio()
  brd: make sector size configurable
  brd: make logical sector size configurable
  xfs: remove check for block sizes smaller than PAGE_SIZE

Matthew Wilcox (Oracle) (1):
  fs: Allow fine-grained control of folio sizes

Pankaj Raghav (1):
  nvme: enable logical block size > PAGE_SIZE

 block/bdev.c                |  15 ++-
 drivers/block/brd.c         | 256 ++++++++++++++++++++----------------
 drivers/nvme/host/core.c    |   2 +-
 fs/buffer.c                 |  20 +--
 fs/mpage.c                  |  54 ++++----
 fs/xfs/xfs_super.c          |  12 --
 include/linux/buffer_head.h |  17 +++
 include/linux/pagemap.h     |  48 ++++++-
 mm/filemap.c                |  20 ++-
 mm/readahead.c              |  52 +++++---
 10 files changed, 305 insertions(+), 191 deletions(-)

-- 
2.35.3


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded()
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
       [not found]   ` <CGME20230920115645eucas1p1c8ed9bf515c4532b3e6995f8078a863b@eucas1p1.samsung.com>
  2023-09-18 11:04 ` [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Rework the loop in page_cache_ra_unbounded() to advance with
the number of pages in a folio instead of just one page at a time.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 mm/readahead.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index e815c114de21..40a5f1f65281 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -208,7 +208,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	unsigned long index = readahead_index(ractl);
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
-	unsigned long i;
+	unsigned long i = 0;
 
 	/*
 	 * Partway through the readahead operation, we will have added
@@ -226,7 +226,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	for (i = 0; i < nr_to_read; i++) {
+	do {
 		struct folio *folio = xa_load(&mapping->i_pages, index + i);
 
 		if (folio && !xa_is_value(folio)) {
@@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			 * not worth getting one just for that.
 			 */
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 
@@ -251,15 +251,16 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 					gfp_mask) < 0) {
 			folio_put(folio);
 			read_pages(ractl);
-			ractl->_index++;
-			i = ractl->_index + ractl->_nr_pages - index - 1;
+			ractl->_index += folio_nr_pages(folio);
+			i = ractl->_index + ractl->_nr_pages - index;
 			continue;
 		}
 		if (i == nr_to_read - lookahead_size)
 			folio_set_readahead(folio);
 		ractl->_workingset |= folio_test_workingset(folio);
-		ractl->_nr_pages++;
-	}
+		ractl->_nr_pages += folio_nr_pages(folio);
+		i += folio_nr_pages(folio);
+	} while (i < nr_to_read);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the folio is not
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
  2023-09-18 11:04 ` [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded() Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
  2023-09-18 13:15   ` Matthew Wilcox
  2023-09-18 11:04 ` [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index} Hannes Reinecke
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Convert mpage to folios and associate the number of blocks with
a folio and not a page.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/mpage.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index 242e213ee064..c9d9fdadb500 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -114,12 +114,12 @@ static void map_buffer_to_folio(struct folio *folio, struct buffer_head *bh,
 		 * don't make any buffers if there is only one buffer on
 		 * the folio and the folio just needs to be set up to date
 		 */
-		if (inode->i_blkbits == PAGE_SHIFT &&
+		if (inode->i_blkbits == folio_shift(folio) &&
 		    buffer_uptodate(bh)) {
 			folio_mark_uptodate(folio);
 			return;
 		}
-		create_empty_buffers(&folio->page, i_blocksize(inode), 0);
+		folio_create_empty_buffers(folio, i_blocksize(inode), 0);
 		head = folio_buffers(folio);
 	}
 
@@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	struct folio *folio = args->folio;
 	struct inode *inode = folio->mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
 	const unsigned blocksize = 1 << blkbits;
 	struct buffer_head *map_bh = &args->map_bh;
 	sector_t block_in_file;
@@ -169,7 +169,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	sector_t last_block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
 	unsigned page_block;
-	unsigned first_hole = blocks_per_page;
+	unsigned first_hole = blocks_per_folio;
 	struct block_device *bdev = NULL;
 	int length;
 	int fully_mapped = 1;
@@ -190,7 +190,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 		goto confused;
 
 	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
-	last_block = block_in_file + args->nr_pages * blocks_per_page;
+	last_block = block_in_file + args->nr_pages * blocks_per_folio;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
 		last_block = last_block_in_file;
@@ -211,7 +211,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 				clear_buffer_mapped(map_bh);
 				break;
 			}
-			if (page_block == blocks_per_page)
+			if (page_block == blocks_per_folio)
 				break;
 			blocks[page_block] = map_bh->b_blocknr + map_offset +
 						relative_block;
@@ -225,7 +225,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	 * Then do more get_blocks calls until we are done with this folio.
 	 */
 	map_bh->b_folio = folio;
-	while (page_block < blocks_per_page) {
+	while (page_block < blocks_per_folio) {
 		map_bh->b_state = 0;
 		map_bh->b_size = 0;
 
@@ -238,7 +238,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 
 		if (!buffer_mapped(map_bh)) {
 			fully_mapped = 0;
-			if (first_hole == blocks_per_page)
+			if (first_hole == blocks_per_folio)
 				first_hole = page_block;
 			page_block++;
 			block_in_file++;
@@ -256,7 +256,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			goto confused;
 		}
 	
-		if (first_hole != blocks_per_page)
+		if (first_hole != blocks_per_folio)
 			goto confused;		/* hole -> non-hole */
 
 		/* Contiguous blocks? */
@@ -267,7 +267,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 			if (relative_block == nblocks) {
 				clear_buffer_mapped(map_bh);
 				break;
-			} else if (page_block == blocks_per_page)
+			} else if (page_block == blocks_per_folio)
 				break;
 			blocks[page_block] = map_bh->b_blocknr+relative_block;
 			page_block++;
@@ -276,7 +276,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 		bdev = map_bh->b_bdev;
 	}
 
-	if (first_hole != blocks_per_page) {
+	if (first_hole != blocks_per_folio) {
 		folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
 		if (first_hole == 0) {
 			folio_mark_uptodate(folio);
@@ -311,10 +311,10 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	relative_block = block_in_file - args->first_logical_block;
 	nblocks = map_bh->b_size >> blkbits;
 	if ((buffer_boundary(map_bh) && relative_block == nblocks) ||
-	    (first_hole != blocks_per_page))
+	    (first_hole != blocks_per_folio))
 		args->bio = mpage_bio_submit_read(args->bio);
 	else
-		args->last_block_in_bio = blocks[blocks_per_page - 1];
+		args->last_block_in_bio = blocks[blocks_per_folio - 1];
 out:
 	return args->bio;
 
@@ -393,7 +393,7 @@ int mpage_read_folio(struct folio *folio, get_block_t get_block)
 {
 	struct mpage_readpage_args args = {
 		.folio = folio,
-		.nr_pages = 1,
+		.nr_pages = folio_nr_pages(folio),
 		.get_block = get_block,
 	};
 
@@ -474,12 +474,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	struct address_space *mapping = folio->mapping;
 	struct inode *inode = mapping->host;
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
+	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
 	sector_t last_block;
 	sector_t block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
 	unsigned page_block;
-	unsigned first_unmapped = blocks_per_page;
+	unsigned first_unmapped = blocks_per_folio;
 	struct block_device *bdev = NULL;
 	int boundary = 0;
 	sector_t boundary_block = 0;
@@ -504,12 +504,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 				 */
 				if (buffer_dirty(bh))
 					goto confused;
-				if (first_unmapped == blocks_per_page)
+				if (first_unmapped == blocks_per_folio)
 					first_unmapped = page_block;
 				continue;
 			}
 
-			if (first_unmapped != blocks_per_page)
+			if (first_unmapped != blocks_per_folio)
 				goto confused;	/* hole -> non-hole */
 
 			if (!buffer_dirty(bh) || !buffer_uptodate(bh))
@@ -552,7 +552,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 		goto page_is_mapped;
 	last_block = (i_size - 1) >> blkbits;
 	map_bh.b_folio = folio;
-	for (page_block = 0; page_block < blocks_per_page; ) {
+	for (page_block = 0; page_block < blocks_per_folio; ) {
 
 		map_bh.b_state = 0;
 		map_bh.b_size = 1 << blkbits;
@@ -631,14 +631,14 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	BUG_ON(folio_test_writeback(folio));
 	folio_start_writeback(folio);
 	folio_unlock(folio);
-	if (boundary || (first_unmapped != blocks_per_page)) {
+	if (boundary || (first_unmapped != blocks_per_folio)) {
 		bio = mpage_bio_submit_write(bio);
 		if (boundary_block) {
 			write_boundary_block(boundary_bdev,
 					boundary_block, 1 << blkbits);
 		}
 	} else {
-		mpd->last_block_in_bio = blocks[blocks_per_page - 1];
+		mpd->last_block_in_bio = blocks[blocks_per_folio - 1];
 	}
 	goto out;
 
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index}
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
  2023-09-18 11:04 ` [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded() Hannes Reinecke
  2023-09-18 11:04 ` [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
  2023-09-18 16:36   ` Matthew Wilcox
  2023-09-18 11:04 ` [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors Hannes Reinecke
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Introduce accessor functions block_index_to_sector() and block_sector_to_index()
to convert the page index into the corresponding sector and vice versa.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/buffer_head.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 4ede47649a81..55a3032f8375 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -277,6 +277,7 @@ int generic_cont_expand_simple(struct inode *inode, loff_t size);
 void block_commit_write(struct page *page, unsigned int from, unsigned int to);
 int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 				get_block_t get_block);
+
 sector_t generic_block_bmap(struct address_space *, sector_t, get_block_t *);
 int block_truncate_page(struct address_space *, loff_t, get_block_t *);
 
@@ -449,6 +450,22 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
 
 bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
 
+static inline sector_t block_index_to_sector(pgoff_t index, unsigned int blkbits)
+{
+	if (PAGE_SHIFT < blkbits)
+		return (sector_t)index >> (blkbits - PAGE_SHIFT);
+	else
+		return (sector_t)index << (PAGE_SHIFT - blkbits);
+}
+
+static inline pgoff_t block_sector_to_index(sector_t block, unsigned int blkbits)
+{
+	if (PAGE_SHIFT < blkbits)
+		return (pgoff_t)block << (blkbits - PAGE_SHIFT);
+	else
+		return (pgoff_t)block >> (PAGE_SHIFT - blkbits);
+}
+
 #ifdef CONFIG_BUFFER_HEAD
 
 void buffer_init(void);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (2 preceding siblings ...)
  2023-09-18 11:04 ` [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index} Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
  2023-10-20 19:37   ` Matthew Wilcox
  2023-09-18 11:04 ` [PATCH 05/18] fs/mpage: " Hannes Reinecke
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Use accessor functions block_index_to_sector() and block_sector_to_index()
to translate the page index into the block sector and vice versa.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/buffer.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 2379564e5aea..66895432d91f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -199,7 +199,7 @@ __find_get_block_slow(struct block_device *bdev, sector_t block)
 	int all_mapped = 1;
 	static DEFINE_RATELIMIT_STATE(last_warned, HZ, 1);
 
-	index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+	index = block_sector_to_index(block, bd_inode->i_blkbits);
 	folio = __filemap_get_folio(bd_mapping, index, FGP_ACCESSED, 0);
 	if (IS_ERR(folio))
 		goto out;
@@ -1702,13 +1702,13 @@ void clean_bdev_aliases(struct block_device *bdev, sector_t block, sector_t len)
 	struct inode *bd_inode = bdev->bd_inode;
 	struct address_space *bd_mapping = bd_inode->i_mapping;
 	struct folio_batch fbatch;
-	pgoff_t index = block >> (PAGE_SHIFT - bd_inode->i_blkbits);
+	pgoff_t index = block_sector_to_index(block, bd_inode->i_blkbits);
 	pgoff_t end;
 	int i, count;
 	struct buffer_head *bh;
 	struct buffer_head *head;
 
-	end = (block + len - 1) >> (PAGE_SHIFT - bd_inode->i_blkbits);
+	end = block_sector_to_index(block + len - 1, bd_inode->i_blkbits);
 	folio_batch_init(&fbatch);
 	while (filemap_get_folios(bd_mapping, &index, end, &fbatch)) {
 		count = folio_batch_count(&fbatch);
@@ -1835,7 +1835,7 @@ int __block_write_full_folio(struct inode *inode, struct folio *folio,
 	blocksize = bh->b_size;
 	bbits = block_size_bits(blocksize);
 
-	block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+	block = block_index_to_sector(folio->index, bbits);
 	last_block = (i_size_read(inode) - 1) >> bbits;
 
 	/*
@@ -2087,7 +2087,7 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 	blocksize = head->b_size;
 	bbits = block_size_bits(blocksize);
 
-	block = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+	block = block_index_to_sector(folio->index, bbits);
 
 	for(bh = head, block_start = 0; bh != head || !block_start;
 	    block++, block_start=block_end, bh = bh->b_this_page) {
@@ -2369,7 +2369,7 @@ int block_read_full_folio(struct folio *folio, get_block_t *get_block)
 	blocksize = head->b_size;
 	bbits = block_size_bits(blocksize);
 
-	iblock = (sector_t)folio->index << (PAGE_SHIFT - bbits);
+	iblock = block_index_to_sector(folio->index, bbits);
 	lblock = (limit+blocksize-1) >> bbits;
 	bh = head;
 	nr = 0;
@@ -2657,7 +2657,7 @@ int block_truncate_page(struct address_space *mapping,
 		return 0;
 
 	length = blocksize - length;
-	iblock = (sector_t)index << (PAGE_SHIFT - inode->i_blkbits);
+	iblock = block_index_to_sector(index, inode->i_blkbits);
 	
 	folio = filemap_grab_folio(mapping, index);
 	if (IS_ERR(folio))
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 05/18] fs/mpage: use accessor function to translate page index to sectors
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (3 preceding siblings ...)
  2023-09-18 11:04 ` [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
  2023-09-18 11:04 ` [PATCH 06/18] fs: Allow fine-grained control of folio sizes Hannes Reinecke
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Use accessor functions to translate between page index and block sectors
and ensure the resulting buffer size is calculated correctly.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/mpage.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/mpage.c b/fs/mpage.c
index c9d9fdadb500..26460afd829a 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -168,7 +168,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	sector_t last_block;
 	sector_t last_block_in_file;
 	sector_t blocks[MAX_BUF_PER_PAGE];
-	unsigned page_block;
+	unsigned num_blocks, page_block;
 	unsigned first_hole = blocks_per_folio;
 	struct block_device *bdev = NULL;
 	int length;
@@ -189,8 +189,12 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	if (folio_buffers(folio))
 		goto confused;
 
-	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
-	last_block = block_in_file + args->nr_pages * blocks_per_folio;
+	block_in_file = block_index_to_sector(folio->index, blkbits);
+	if (blkbits > PAGE_SHIFT)
+		num_blocks = args->nr_pages >> (blkbits - PAGE_SHIFT);
+	else
+		num_blocks = args->nr_pages * blocks_per_folio;
+	last_block = block_in_file + num_blocks;
 	last_block_in_file = (i_size_read(inode) + blocksize - 1) >> blkbits;
 	if (last_block > last_block_in_file)
 		last_block = last_block_in_file;
@@ -277,7 +281,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
 	}
 
 	if (first_hole != blocks_per_folio) {
-		folio_zero_segment(folio, first_hole << blkbits, PAGE_SIZE);
+		folio_zero_segment(folio, first_hole << blkbits, folio_size(folio));
 		if (first_hole == 0) {
 			folio_mark_uptodate(folio);
 			folio_unlock(folio);
@@ -543,7 +547,7 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
 	 * The page has no buffers: map it to disk
 	 */
 	BUG_ON(!folio_test_uptodate(folio));
-	block_in_file = (sector_t)folio->index << (PAGE_SHIFT - blkbits);
+	block_in_file = block_index_to_sector(folio->index, blkbits);
 	/*
 	 * Whole page beyond EOF? Skip allocating blocks to avoid leaking
 	 * space.
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 06/18] fs: Allow fine-grained control of folio sizes
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (4 preceding siblings ...)
  2023-09-18 11:04 ` [PATCH 05/18] fs/mpage: " Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
  2023-09-18 12:29   ` [lkp] [+550 bytes kernel size regression] [i386-tinyconfig] [8558b2228d] " kernel test robot
  2023-09-18 11:04 ` [PATCH 07/18] mm/filemap: allocate folios with mapping order preference Hannes Reinecke
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Some filesystems want to be able to limit the maximum size of folios,
and some want to be able to ensure that folios are at least a certain
size.  Add mapping_set_folio_orders() to allow this level of control
(although it is not yet honoured).

[Pankaj]: added mapping_min_folio_order()

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
---
 include/linux/pagemap.h | 48 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 43 insertions(+), 5 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 351c3b7f93a1..129d62a891be 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -202,10 +202,16 @@ enum mapping_flags {
 	AS_EXITING	= 4, 	/* final truncate in progress */
 	/* writeback related tags are not used */
 	AS_NO_WRITEBACK_TAGS = 5,
-	AS_LARGE_FOLIO_SUPPORT = 6,
-	AS_RELEASE_ALWAYS,	/* Call ->release_folio(), even if no private data */
+	AS_RELEASE_ALWAYS = 6,	/* Call ->release_folio(), even if no private data */
+	AS_FOLIO_ORDER_MIN = 8,
+	AS_FOLIO_ORDER_MAX = 13,
+	/* 8-17 are used for FOLIO_ORDER */
 };
 
+#define AS_FOLIO_ORDER_MIN_MASK	0x00001f00
+#define AS_FOLIO_ORDER_MAX_MASK 0x0002e000
+#define AS_FOLIO_ORDER_MASK (AS_FOLIO_ORDER_MIN_MASK | AS_FOLIO_ORDER_MAX_MASK)
+
 /**
  * mapping_set_error - record a writeback error in the address_space
  * @mapping: the mapping in which an error should be set
@@ -310,6 +316,29 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
 	m->gfp_mask = mask;
 }
 
+/**
+ * mapping_set_folio_orders() - Set the range of folio sizes supported.
+ * @mapping: The file.
+ * @min: Minimum folio order (between 0-31 inclusive).
+ * @max: Maximum folio order (between 0-31 inclusive).
+ *
+ * The filesystem should call this function in its inode constructor to
+ * indicate which sizes of folio the VFS can use to cache the contents
+ * of the file.  This should only be used if the filesystem needs special
+ * handling of folio sizes (ie there is something the core cannot know).
+ * Do not tune it based on, eg, i_size.
+ *
+ * Context: This should not be called while the inode is active as it
+ * is non-atomic.
+ */
+static inline void mapping_set_folio_orders(struct address_space *mapping,
+		unsigned int min, unsigned int max)
+{
+	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
+			(min << AS_FOLIO_ORDER_MIN) |
+			(max << AS_FOLIO_ORDER_MAX);
+}
+
 /**
  * mapping_set_large_folios() - Indicate the file supports large folios.
  * @mapping: The file.
@@ -323,7 +352,17 @@ static inline void mapping_set_gfp_mask(struct address_space *m, gfp_t mask)
  */
 static inline void mapping_set_large_folios(struct address_space *mapping)
 {
-	__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	mapping_set_folio_orders(mapping, 0, 31);
+}
+
+static inline unsigned mapping_max_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MAX_MASK) >> AS_FOLIO_ORDER_MAX;
+}
+
+static inline unsigned mapping_min_folio_order(struct address_space *mapping)
+{
+	return (mapping->flags & AS_FOLIO_ORDER_MIN_MASK) >> AS_FOLIO_ORDER_MIN;
 }
 
 /*
@@ -332,8 +371,7 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
  */
 static inline bool mapping_large_folio_support(struct address_space *mapping)
 {
-	return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
-		test_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+	return mapping_max_folio_order(mapping) > 0;
 }
 
 static inline int filemap_nr_thps(struct address_space *mapping)
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 07/18] mm/filemap: allocate folios with mapping order preference
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (5 preceding siblings ...)
  2023-09-18 11:04 ` [PATCH 06/18] fs: Allow fine-grained control of folio sizes Hannes Reinecke
@ 2023-09-18 11:04 ` Hannes Reinecke
  2023-09-18 13:41   ` Matthew Wilcox
  2023-09-18 11:05 ` [PATCH 08/18] mm/readahead: " Hannes Reinecke
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:04 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Use mapping_min_folio_order() when calling filemap_alloc_folio()
to allocate folios with the order specified by the mapping.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 mm/filemap.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 582f5317ff71..98c3737644d5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -507,9 +507,14 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
 	pgoff_t end = end_byte >> PAGE_SHIFT;
 	struct folio_batch fbatch;
 	unsigned nr_folios;
+	unsigned int order = mapping_min_folio_order(mapping);
 
 	folio_batch_init(&fbatch);
 
+	if (order) {
+		index = ALIGN_DOWN(index, 1 << order);
+		end = ALIGN_DOWN(end, 1 << order);
+	}
 	while (index <= end) {
 		unsigned i;
 
@@ -2482,7 +2487,8 @@ static int filemap_create_folio(struct file *file,
 	struct folio *folio;
 	int error;
 
-	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
+	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
+				    mapping_min_folio_order(mapping));
 	if (!folio)
 		return -ENOMEM;
 
@@ -2542,9 +2548,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 	pgoff_t last_index;
 	struct folio *folio;
 	int err = 0;
+	unsigned int order = mapping_min_folio_order(mapping);
 
 	/* "last_index" is the index of the page beyond the end of the read */
 	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
+	if (order) {
+		/* Align with folio order */
+		WARN_ON(index % 1 << order);
+		index = ALIGN_DOWN(index, 1 << order);
+		last_index = ALIGN(last_index, 1 << order);
+	}
 retry:
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -2561,7 +2574,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
 		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
 			return -EAGAIN;
 		err = filemap_create_folio(filp, mapping,
-				iocb->ki_pos >> PAGE_SHIFT, fbatch);
+				index, fbatch);
 		if (err == AOP_TRUNCATED_PAGE)
 			goto retry;
 		return err;
@@ -3676,7 +3689,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
 repeat:
 	folio = filemap_get_folio(mapping, index);
 	if (IS_ERR(folio)) {
-		folio = filemap_alloc_folio(gfp, 0);
+		folio = filemap_alloc_folio(gfp,
+				mapping_min_folio_order(mapping));
 		if (!folio)
 			return ERR_PTR(-ENOMEM);
 		err = filemap_add_folio(mapping, folio, index, gfp);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 08/18] mm/readahead: allocate folios with mapping order preference
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (6 preceding siblings ...)
  2023-09-18 11:04 ` [PATCH 07/18] mm/filemap: allocate folios with mapping order preference Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 13:11   ` kernel test robot
  2023-09-18 20:46   ` kernel test robot
  2023-09-18 11:05 ` [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page() Hannes Reinecke
                   ` (9 subsequent siblings)
  17 siblings, 2 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Use mapping_get_folio_order() when calling filemap_alloc_folio()
to allocate folios with the order specified by the mapping.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 mm/readahead.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 40a5f1f65281..0466a2bdb80a 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -244,7 +244,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
 			continue;
 		}
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask,
+				mapping_min_folio_order(mapping));
 		if (!folio)
 			break;
 		if (filemap_add_folio(mapping, folio, index + i,
@@ -311,6 +312,8 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
 	unsigned long max_pages, index;
+	unsigned int order = mapping_min_folio_order(mapping);
+	unsigned int min_pages = 1 << order;
 
 	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
 		return;
@@ -320,6 +323,10 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	 * be up to the optimal hardware IO size
 	 */
 	index = readahead_index(ractl);
+	if (order) {
+		WARN_ON(index & (min_pages - 1));
+		index = ALIGN_DOWN(index, min_pages);
+	}
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
 	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
 	while (nr_to_read) {
@@ -327,6 +334,8 @@ void force_page_cache_ra(struct readahead_control *ractl,
 
 		if (this_chunk > nr_to_read)
 			this_chunk = nr_to_read;
+		if (this_chunk < min_pages)
+			this_chunk = min_pages;
 		ractl->_index = index;
 		do_page_cache_ra(ractl, this_chunk, 0);
 
@@ -597,8 +606,8 @@ static void ondemand_readahead(struct readahead_control *ractl,
 		pgoff_t start;
 
 		rcu_read_lock();
-		start = page_cache_next_miss(ractl->mapping, index + 1,
-				max_pages);
+		start = page_cache_next_miss(ractl->mapping,
+				index + folio_nr_pages(folio), max_pages);
 		rcu_read_unlock();
 
 		if (!start || start - index > max_pages)
@@ -782,18 +791,20 @@ void readahead_expand(struct readahead_control *ractl,
 	struct file_ra_state *ra = ractl->ra;
 	pgoff_t new_index, new_nr_pages;
 	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+	unsigned int order = mapping_min_folio_order(mapping);
+	unsigned int min_nr_pages = 1 << order;
 
-	new_index = new_start / PAGE_SIZE;
+	new_index = new_start / (min_nr_pages * PAGE_SIZE);
 
 	/* Expand the leading edge downwards */
 	while (ractl->_index > new_index) {
-		unsigned long index = ractl->_index - 1;
+		unsigned long index = ractl->_index - min_nr_pages;
 		struct folio *folio = xa_load(&mapping->i_pages, index);
 
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, order);
 		if (!folio)
 			return;
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
@@ -805,12 +816,12 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
 		ractl->_index = folio->index;
 	}
 
 	new_len += new_start - readahead_pos(ractl);
-	new_nr_pages = DIV_ROUND_UP(new_len, PAGE_SIZE);
+	new_nr_pages = DIV_ROUND_UP(new_len, min_nr_pages * PAGE_SIZE);
 
 	/* Expand the trailing edge upwards */
 	while (ractl->_nr_pages < new_nr_pages) {
@@ -820,7 +831,7 @@ void readahead_expand(struct readahead_control *ractl,
 		if (folio && !xa_is_value(folio))
 			return; /* Folio apparently present */
 
-		folio = filemap_alloc_folio(gfp_mask, 0);
+		folio = filemap_alloc_folio(gfp_mask, order);
 		if (!folio)
 			return;
 		if (filemap_add_folio(mapping, folio, index, gfp_mask) < 0) {
@@ -832,10 +843,10 @@ void readahead_expand(struct readahead_control *ractl,
 			ractl->_workingset = true;
 			psi_memstall_enter(&ractl->_pflags);
 		}
-		ractl->_nr_pages++;
+		ractl->_nr_pages += folio_nr_pages(folio);
 		if (ra) {
-			ra->size++;
-			ra->async_size++;
+			ra->size += folio_nr_pages(folio);
+			ra->async_size += folio_nr_pages(folio);
 		}
 	}
 }
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page()
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (7 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 08/18] mm/readahead: " Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 14:00   ` Matthew Wilcox
  2023-09-18 11:05 ` [PATCH 10/18] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Use the correct mapping order in grow_dev_page() to ensure folios
are created with the correct order.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/buffer.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 66895432d91f..a219b79c57ad 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1044,6 +1044,7 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	sector_t end_block;
 	int ret = 0;
 	gfp_t gfp_mask;
+	fgf_t fgf = FGP_LOCK | FGP_ACCESSED | FGP_CREAT;
 
 	gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
 
@@ -1054,9 +1055,8 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	 * code knows what it's doing.
 	 */
 	gfp_mask |= __GFP_NOFAIL;
-
-	folio = __filemap_get_folio(inode->i_mapping, index,
-			FGP_LOCK | FGP_ACCESSED | FGP_CREAT, gfp_mask);
+	fgf |= fgf_set_order(mapping_min_folio_order(inode->i_mapping));
+	folio = __filemap_get_folio(inode->i_mapping, index, fgf, gfp_mask);
 
 	bh = folio_buffers(folio);
 	if (bh) {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 10/18] block/bdev: lift restrictions on supported blocksize
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (8 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page() Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 11/18] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

We now can support blocksizes larger than PAGE_SIZE, so lift
the restriction.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/bdev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d4..adbcf7af0b56 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -137,8 +137,8 @@ static void set_init_blocksize(struct block_device *bdev)
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	/* Size must be a power of two, and between 512 and PAGE_SIZE */
-	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
+	/* Size must be a power of two, and larger than 512 */
+	if (size < 512 || !is_power_of_2(size))
 		return -EINVAL;
 
 	/* Size cannot be smaller than the size supported by the device */
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 11/18] block/bdev: enable large folio support for large logical block sizes
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (9 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 10/18] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 12/18] brd: convert to folios Hannes Reinecke
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Enable large folio support when the logical block size is larger
than PAGE_SIZE.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/bdev.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/block/bdev.c b/block/bdev.c
index adbcf7af0b56..d5198743401a 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -133,6 +133,9 @@ static void set_init_blocksize(struct block_device *bdev)
 		bsize <<= 1;
 	}
 	bdev->bd_inode->i_blkbits = blksize_bits(bsize);
+	if (bsize > PAGE_SIZE)
+		mapping_set_folio_orders(bdev->bd_inode->i_mapping,
+					 get_order(bsize), MAX_ORDER);
 }
 
 int set_blocksize(struct block_device *bdev, int size)
@@ -149,6 +152,9 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (bdev->bd_inode->i_blkbits != blksize_bits(size)) {
 		sync_blockdev(bdev);
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+		if (size > PAGE_SIZE)
+			mapping_set_folio_orders(bdev->bd_inode->i_mapping,
+						 get_order(size), MAX_ORDER);
 		kill_bdev(bdev);
 	}
 	return 0;
@@ -425,6 +431,11 @@ void bdev_set_nr_sectors(struct block_device *bdev, sector_t sectors)
 
 void bdev_add(struct block_device *bdev, dev_t dev)
 {
+	unsigned int bsize = bdev_logical_block_size(bdev);
+
+	if (bsize > PAGE_SIZE)
+		mapping_set_folio_orders(bdev->bd_inode->i_mapping,
+					 get_order(bsize), MAX_ORDER);
 	bdev->bd_dev = dev;
 	bdev->bd_inode->i_rdev = dev;
 	bdev->bd_inode->i_ino = dev;
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 12/18] brd: convert to folios
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (10 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 11/18] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 13/18] brd: abstract page_size conventions Hannes Reinecke
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Convert the driver to work on folios instead of pages.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 150 ++++++++++++++++++++++----------------------
 1 file changed, 74 insertions(+), 76 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 970bd6ff38c4..9aa7511abc33 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -28,11 +28,10 @@
 #include <linux/uaccess.h>
 
 /*
- * Each block ramdisk device has a xarray brd_pages of pages that stores
- * the pages containing the block device's contents. A brd page's ->index is
- * its offset in PAGE_SIZE units. This is similar to, but in no way connected
- * with, the kernel's pagecache or buffer cache (which sit above our block
- * device).
+ * Each block ramdisk device has a xarray of folios that stores the folios
+ * containing the block device's contents. A brd folio's ->index is its offset
+ * in PAGE_SIZE units. This is similar to, but in no way connected with,
+ * the kernel's pagecache or buffer cache (which sit above our block device).
  */
 struct brd_device {
 	int			brd_number;
@@ -40,81 +39,81 @@ struct brd_device {
 	struct list_head	brd_list;
 
 	/*
-	 * Backing store of pages. This is the contents of the block device.
+	 * Backing store of folios. This is the contents of the block device.
 	 */
-	struct xarray	        brd_pages;
-	u64			brd_nr_pages;
+	struct xarray	        brd_folios;
+	u64			brd_nr_folios;
 };
 
 /*
- * Look up and return a brd's page for a given sector.
+ * Look up and return a brd's folio for a given sector.
  */
-static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
+static struct folio *brd_lookup_folio(struct brd_device *brd, sector_t sector)
 {
 	pgoff_t idx;
-	struct page *page;
+	struct folio *folio;
 
-	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to page index */
-	page = xa_load(&brd->brd_pages, idx);
+	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to folio index */
+	folio = xa_load(&brd->brd_folios, idx);
 
-	BUG_ON(page && page->index != idx);
+	BUG_ON(folio && folio->index != idx);
 
-	return page;
+	return folio;
 }
 
 /*
- * Insert a new page for a given sector, if one does not already exist.
+ * Insert a new folio for a given sector, if one does not already exist.
  */
-static int brd_insert_page(struct brd_device *brd, sector_t sector, gfp_t gfp)
+static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
-	struct page *page, *cur;
+	struct folio *folio, *cur;
 	int ret = 0;
 
-	page = brd_lookup_page(brd, sector);
-	if (page)
+	folio = brd_lookup_folio(brd, sector);
+	if (folio)
 		return 0;
 
-	page = alloc_page(gfp | __GFP_ZERO | __GFP_HIGHMEM);
-	if (!page)
+	folio = folio_alloc(gfp | __GFP_ZERO | __GFP_HIGHMEM, 0);
+	if (!folio)
 		return -ENOMEM;
 
-	xa_lock(&brd->brd_pages);
+	xa_lock(&brd->brd_folios);
 
 	idx = sector >> PAGE_SECTORS_SHIFT;
-	page->index = idx;
+	folio->index = idx;
 
-	cur = __xa_cmpxchg(&brd->brd_pages, idx, NULL, page, gfp);
+	cur = __xa_cmpxchg(&brd->brd_folios, idx, NULL, folio, gfp);
 
 	if (unlikely(cur)) {
-		__free_page(page);
+		folio_put(folio);
 		ret = xa_err(cur);
 		if (!ret && (cur->index != idx))
 			ret = -EIO;
 	} else {
-		brd->brd_nr_pages++;
+		brd->brd_nr_folios++;
 	}
 
-	xa_unlock(&brd->brd_pages);
+	xa_unlock(&brd->brd_folios);
 
 	return ret;
 }
 
 /*
- * Free all backing store pages and xarray. This must only be called when
+ * Free all backing store folios and xarray. This must only be called when
  * there are no other users of the device.
  */
-static void brd_free_pages(struct brd_device *brd)
+static void brd_free_folios(struct brd_device *brd)
 {
-	struct page *page;
+	struct folio *folio;
 	pgoff_t idx;
 
-	xa_for_each(&brd->brd_pages, idx, page) {
-		__free_page(page);
+	xa_for_each(&brd->brd_folios, idx, folio) {
+		folio_put(folio);
 		cond_resched();
 	}
 
-	xa_destroy(&brd->brd_pages);
+	xa_destroy(&brd->brd_folios);
 }
 
 /*
@@ -128,12 +127,12 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 	int ret;
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	ret = brd_insert_page(brd, sector, gfp);
+	ret = brd_insert_folio(brd, sector, gfp);
 	if (ret)
 		return ret;
 	if (copy < n) {
 		sector += copy >> SECTOR_SHIFT;
-		ret = brd_insert_page(brd, sector, gfp);
+		ret = brd_insert_folio(brd, sector, gfp);
 	}
 	return ret;
 }
@@ -144,29 +143,29 @@ static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 static void copy_to_brd(struct brd_device *brd, const void *src,
 			sector_t sector, size_t n)
 {
-	struct page *page;
+	struct folio *folio;
 	void *dst;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	page = brd_lookup_page(brd, sector);
-	BUG_ON(!page);
+	folio = brd_lookup_folio(brd, sector);
+	BUG_ON(!folio);
 
-	dst = kmap_atomic(page);
-	memcpy(dst + offset, src, copy);
-	kunmap_atomic(dst);
+	dst = kmap_local_folio(folio, offset);
+	memcpy(dst, src, copy);
+	kunmap_local(dst);
 
 	if (copy < n) {
 		src += copy;
 		sector += copy >> SECTOR_SHIFT;
 		copy = n - copy;
-		page = brd_lookup_page(brd, sector);
-		BUG_ON(!page);
+		folio = brd_lookup_folio(brd, sector);
+		BUG_ON(!folio);
 
-		dst = kmap_atomic(page);
+		dst = kmap_local_folio(folio, 0);
 		memcpy(dst, src, copy);
-		kunmap_atomic(dst);
+		kunmap_local(dst);
 	}
 }
 
@@ -176,17 +175,17 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 static void copy_from_brd(void *dst, struct brd_device *brd,
 			sector_t sector, size_t n)
 {
-	struct page *page;
+	struct folio *folio;
 	void *src;
 	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
 	size_t copy;
 
 	copy = min_t(size_t, n, PAGE_SIZE - offset);
-	page = brd_lookup_page(brd, sector);
-	if (page) {
-		src = kmap_atomic(page);
-		memcpy(dst, src + offset, copy);
-		kunmap_atomic(src);
+	folio = brd_lookup_folio(brd, sector);
+	if (folio) {
+		src = kmap_local_folio(folio, offset);
+		memcpy(dst, src, copy);
+		kunmap_local(src);
 	} else
 		memset(dst, 0, copy);
 
@@ -194,20 +193,20 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 		dst += copy;
 		sector += copy >> SECTOR_SHIFT;
 		copy = n - copy;
-		page = brd_lookup_page(brd, sector);
-		if (page) {
-			src = kmap_atomic(page);
+		folio = brd_lookup_folio(brd, sector);
+		if (folio) {
+			src = kmap_local_folio(folio, 0);
 			memcpy(dst, src, copy);
-			kunmap_atomic(src);
+			kunmap_local(src);
 		} else
 			memset(dst, 0, copy);
 	}
 }
 
 /*
- * Process a single bvec of a bio.
+ * Process a single folio of a bio.
  */
-static int brd_do_bvec(struct brd_device *brd, struct page *page,
+static int brd_do_folio(struct brd_device *brd, struct folio *folio,
 			unsigned int len, unsigned int off, blk_opf_t opf,
 			sector_t sector)
 {
@@ -217,7 +216,7 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
 	if (op_is_write(opf)) {
 		/*
 		 * Must use NOIO because we don't want to recurse back into the
-		 * block or filesystem layers from page reclaim.
+		 * block or filesystem layers from folio reclaim.
 		 */
 		gfp_t gfp = opf & REQ_NOWAIT ? GFP_NOWAIT : GFP_NOIO;
 
@@ -226,15 +225,15 @@ static int brd_do_bvec(struct brd_device *brd, struct page *page,
 			goto out;
 	}
 
-	mem = kmap_atomic(page);
+	mem = kmap_local_folio(folio, off);
 	if (!op_is_write(opf)) {
-		copy_from_brd(mem + off, brd, sector, len);
-		flush_dcache_page(page);
+		copy_from_brd(mem, brd, sector, len);
+		flush_dcache_folio(folio);
 	} else {
-		flush_dcache_page(page);
-		copy_to_brd(brd, mem + off, sector, len);
+		flush_dcache_folio(folio);
+		copy_to_brd(brd, mem, sector, len);
 	}
-	kunmap_atomic(mem);
+	kunmap_local(mem);
 
 out:
 	return err;
@@ -244,19 +243,18 @@ static void brd_submit_bio(struct bio *bio)
 {
 	struct brd_device *brd = bio->bi_bdev->bd_disk->private_data;
 	sector_t sector = bio->bi_iter.bi_sector;
-	struct bio_vec bvec;
-	struct bvec_iter iter;
+	struct folio_iter iter;
 
-	bio_for_each_segment(bvec, bio, iter) {
-		unsigned int len = bvec.bv_len;
+	bio_for_each_folio_all(iter, bio) {
+		unsigned int len = iter.length;
 		int err;
 
 		/* Don't support un-aligned buffer */
-		WARN_ON_ONCE((bvec.bv_offset & (SECTOR_SIZE - 1)) ||
+		WARN_ON_ONCE((iter.offset & (SECTOR_SIZE - 1)) ||
 				(len & (SECTOR_SIZE - 1)));
 
-		err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
-				  bio->bi_opf, sector);
+		err = brd_do_folio(brd, iter.folio, len, iter.offset,
+				   bio->bi_opf, sector);
 		if (err) {
 			if (err == -ENOMEM && bio->bi_opf & REQ_NOWAIT) {
 				bio_wouldblock_error(bio);
@@ -328,12 +326,12 @@ static int brd_alloc(int i)
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
 
-	xa_init(&brd->brd_pages);
+	xa_init(&brd->brd_folios);
 
 	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
 	if (!IS_ERR_OR_NULL(brd_debugfs_dir))
 		debugfs_create_u64(buf, 0444, brd_debugfs_dir,
-				&brd->brd_nr_pages);
+				&brd->brd_nr_folios);
 
 	disk = brd->brd_disk = blk_alloc_disk(NUMA_NO_NODE);
 	if (!disk)
@@ -388,7 +386,7 @@ static void brd_cleanup(void)
 	list_for_each_entry_safe(brd, next, &brd_devices, brd_list) {
 		del_gendisk(brd->brd_disk);
 		put_disk(brd->brd_disk);
-		brd_free_pages(brd);
+		brd_free_folios(brd);
 		list_del(&brd->brd_list);
 		kfree(brd);
 	}
@@ -419,7 +417,7 @@ static int __init brd_init(void)
 
 	brd_check_and_reset_par();
 
-	brd_debugfs_dir = debugfs_create_dir("ramdisk_pages", NULL);
+	brd_debugfs_dir = debugfs_create_dir("ramdisk_folios", NULL);
 
 	for (i = 0; i < rd_nr; i++) {
 		err = brd_alloc(i);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 13/18] brd: abstract page_size conventions
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (11 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 12/18] brd: convert to folios Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 14/18] brd: use memcpy_{to,from}_folio() Hannes Reinecke
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

In preparation for changing the block sizes abstract away references
to PAGE_SIZE and friends.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 9aa7511abc33..d6595b1a22e8 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -45,6 +45,23 @@ struct brd_device {
 	u64			brd_nr_folios;
 };
 
+#define BRD_SECTOR_SHIFT(b) (PAGE_SHIFT - SECTOR_SHIFT)
+
+static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
+{
+	pgoff_t idx;
+
+	idx = sector >> BRD_SECTOR_SHIFT(brd);
+	return idx;
+}
+
+static int brd_sector_offset(struct brd_device *brd, sector_t sector)
+{
+	unsigned int rd_sector_mask = (1 << BRD_SECTOR_SHIFT(brd)) - 1;
+
+	return ((unsigned int)sector & rd_sector_mask) << SECTOR_SHIFT;
+}
+
 /*
  * Look up and return a brd's folio for a given sector.
  */
@@ -53,7 +70,7 @@ static struct folio *brd_lookup_folio(struct brd_device *brd, sector_t sector)
 	pgoff_t idx;
 	struct folio *folio;
 
-	idx = sector >> PAGE_SECTORS_SHIFT; /* sector to folio index */
+	idx = brd_sector_index(brd, sector); /* sector to folio index */
 	folio = xa_load(&brd->brd_folios, idx);
 
 	BUG_ON(folio && folio->index != idx);
@@ -68,19 +85,20 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
 	struct folio *folio, *cur;
+	unsigned int rd_sector_order = get_order(PAGE_SIZE);
 	int ret = 0;
 
 	folio = brd_lookup_folio(brd, sector);
 	if (folio)
 		return 0;
 
-	folio = folio_alloc(gfp | __GFP_ZERO | __GFP_HIGHMEM, 0);
+	folio = folio_alloc(gfp | __GFP_ZERO | __GFP_HIGHMEM, rd_sector_order);
 	if (!folio)
 		return -ENOMEM;
 
 	xa_lock(&brd->brd_folios);
 
-	idx = sector >> PAGE_SECTORS_SHIFT;
+	idx = brd_sector_index(brd, sector);
 	folio->index = idx;
 
 	cur = __xa_cmpxchg(&brd->brd_folios, idx, NULL, folio, gfp);
@@ -122,11 +140,12 @@ static void brd_free_folios(struct brd_device *brd)
 static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 			     gfp_t gfp)
 {
-	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 	int ret;
 
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, rd_sector_size - offset);
 	ret = brd_insert_folio(brd, sector, gfp);
 	if (ret)
 		return ret;
@@ -145,10 +164,11 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 {
 	struct folio *folio;
 	void *dst;
-	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, rd_sector_size - offset);
 	folio = brd_lookup_folio(brd, sector);
 	BUG_ON(!folio);
 
@@ -177,10 +197,11 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 {
 	struct folio *folio;
 	void *src;
-	unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
+	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
-	copy = min_t(size_t, n, PAGE_SIZE - offset);
+	copy = min_t(size_t, n, rd_sector_size - offset);
 	folio = brd_lookup_folio(brd, sector);
 	if (folio) {
 		src = kmap_local_folio(folio, offset);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 14/18] brd: use memcpy_{to,from}_folio()
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (12 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 13/18] brd: abstract page_size conventions Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 15/18] brd: make sector size configurable Hannes Reinecke
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Simplify copy routines by using memcpy_to_folio()/memcpy_from_folio().

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 39 ++++-----------------------------------
 1 file changed, 4 insertions(+), 35 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index d6595b1a22e8..90e1b6c4fbc8 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -163,7 +163,6 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 			sector_t sector, size_t n)
 {
 	struct folio *folio;
-	void *dst;
 	unsigned int rd_sector_size = PAGE_SIZE;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
@@ -172,21 +171,7 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 	folio = brd_lookup_folio(brd, sector);
 	BUG_ON(!folio);
 
-	dst = kmap_local_folio(folio, offset);
-	memcpy(dst, src, copy);
-	kunmap_local(dst);
-
-	if (copy < n) {
-		src += copy;
-		sector += copy >> SECTOR_SHIFT;
-		copy = n - copy;
-		folio = brd_lookup_folio(brd, sector);
-		BUG_ON(!folio);
-
-		dst = kmap_local_folio(folio, 0);
-		memcpy(dst, src, copy);
-		kunmap_local(dst);
-	}
+	memcpy_to_folio(folio, offset, src, copy);
 }
 
 /*
@@ -196,32 +181,16 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 			sector_t sector, size_t n)
 {
 	struct folio *folio;
-	void *src;
 	unsigned int rd_sector_size = PAGE_SIZE;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
 	copy = min_t(size_t, n, rd_sector_size - offset);
 	folio = brd_lookup_folio(brd, sector);
-	if (folio) {
-		src = kmap_local_folio(folio, offset);
-		memcpy(dst, src, copy);
-		kunmap_local(src);
-	} else
+	if (folio)
+		memcpy_from_folio(dst, folio, offset, copy);
+	else
 		memset(dst, 0, copy);
-
-	if (copy < n) {
-		dst += copy;
-		sector += copy >> SECTOR_SHIFT;
-		copy = n - copy;
-		folio = brd_lookup_folio(brd, sector);
-		if (folio) {
-			src = kmap_local_folio(folio, 0);
-			memcpy(dst, src, copy);
-			kunmap_local(src);
-		} else
-			memset(dst, 0, copy);
-	}
 }
 
 /*
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 15/18] brd: make sector size configurable
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (13 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 14/18] brd: use memcpy_{to,from}_folio() Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 16/18] brd: make logical " Hannes Reinecke
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Add a module option 'rd_blksize' to allow the user to change
the sector size of the RAM disks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 50 +++++++++++++++++++++++++++++++--------------
 1 file changed, 35 insertions(+), 15 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 90e1b6c4fbc8..0c5f3dbbb77c 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -30,7 +30,7 @@
 /*
  * Each block ramdisk device has a xarray of folios that stores the folios
  * containing the block device's contents. A brd folio's ->index is its offset
- * in PAGE_SIZE units. This is similar to, but in no way connected with,
+ * in brd_sector_size units. This is similar to, but in no way connected with,
  * the kernel's pagecache or buffer cache (which sit above our block device).
  */
 struct brd_device {
@@ -43,9 +43,11 @@ struct brd_device {
 	 */
 	struct xarray	        brd_folios;
 	u64			brd_nr_folios;
+	unsigned int		brd_sector_shift;
+	unsigned int		brd_sector_size;
 };
 
-#define BRD_SECTOR_SHIFT(b) (PAGE_SHIFT - SECTOR_SHIFT)
+#define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - SECTOR_SHIFT)
 
 static pgoff_t brd_sector_index(struct brd_device *brd, sector_t sector)
 {
@@ -85,7 +87,7 @@ static int brd_insert_folio(struct brd_device *brd, sector_t sector, gfp_t gfp)
 {
 	pgoff_t idx;
 	struct folio *folio, *cur;
-	unsigned int rd_sector_order = get_order(PAGE_SIZE);
+	unsigned int rd_sector_order = get_order(brd->brd_sector_size);
 	int ret = 0;
 
 	folio = brd_lookup_folio(brd, sector);
@@ -140,7 +142,7 @@ static void brd_free_folios(struct brd_device *brd)
 static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, size_t n,
 			     gfp_t gfp)
 {
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 	int ret;
@@ -163,7 +165,7 @@ static void copy_to_brd(struct brd_device *brd, const void *src,
 			sector_t sector, size_t n)
 {
 	struct folio *folio;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
@@ -181,7 +183,7 @@ static void copy_from_brd(void *dst, struct brd_device *brd,
 			sector_t sector, size_t n)
 {
 	struct folio *folio;
-	unsigned int rd_sector_size = PAGE_SIZE;
+	unsigned int rd_sector_size = brd->brd_sector_size;
 	unsigned int offset = brd_sector_offset(brd, sector);
 	size_t copy;
 
@@ -279,6 +281,10 @@ static int max_part = 1;
 module_param(max_part, int, 0444);
 MODULE_PARM_DESC(max_part, "Num Minors to reserve between devices");
 
+static unsigned int rd_blksize = PAGE_SIZE;
+module_param(rd_blksize, uint, 0444);
+MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -305,6 +311,7 @@ static int brd_alloc(int i)
 	struct brd_device *brd;
 	struct gendisk *disk;
 	char buf[DISK_NAME_LEN];
+	unsigned int rd_max_sectors;
 	int err = -ENOMEM;
 
 	list_for_each_entry(brd, &brd_devices, brd_list)
@@ -315,6 +322,25 @@ static int brd_alloc(int i)
 		return -ENOMEM;
 	brd->brd_number		= i;
 	list_add_tail(&brd->brd_list, &brd_devices);
+	if (!is_power_of_2(rd_blksize)) {
+		pr_err("rd_blksize %d is not supported\n", rd_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	if (rd_blksize < SECTOR_SIZE) {
+		pr_err("rd_blksize must be at least 512 bytes\n");
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	/* We can't allocate more than MAX_ORDER pages */
+	rd_max_sectors = (1ULL << MAX_ORDER) << PAGE_SECTORS_SHIFT;
+	if ((rd_blksize >> SECTOR_SHIFT) > rd_max_sectors) {
+		pr_err("rd_blocksize too large\n");
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	brd->brd_sector_shift = ilog2(rd_blksize);
+	brd->brd_sector_size = rd_blksize;
 
 	xa_init(&brd->brd_folios);
 
@@ -334,15 +360,9 @@ static int brd_alloc(int i)
 	disk->private_data	= brd;
 	strscpy(disk->disk_name, buf, DISK_NAME_LEN);
 	set_capacity(disk, rd_size * 2);
-	
-	/*
-	 * This is so fdisk will align partitions on 4k, because of
-	 * direct_access API needing 4k alignment, returning a PFN
-	 * (This is only a problem on very small devices <= 4M,
-	 *  otherwise fdisk will align on 1M. Regardless this call
-	 *  is harmless)
-	 */
-	blk_queue_physical_block_size(disk->queue, PAGE_SIZE);
+
+	blk_queue_physical_block_size(disk->queue, rd_blksize);
+	blk_queue_max_hw_sectors(disk->queue, rd_max_sectors);
 
 	/* Tell the block layer that this is not a rotational device */
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, disk->queue);
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 16/18] brd: make logical sector size configurable
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (14 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 15/18] brd: make sector size configurable Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 17/18] xfs: remove check for block sizes smaller than PAGE_SIZE Hannes Reinecke
  2023-09-18 11:05 ` [PATCH 18/18] nvme: enable logical block size > PAGE_SIZE Hannes Reinecke
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

Add a module option 'rd_logical_blksize' to allow the user to change
the logical sector size of the RAM disks.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 drivers/block/brd.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index 0c5f3dbbb77c..e2e364255902 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -45,6 +45,8 @@ struct brd_device {
 	u64			brd_nr_folios;
 	unsigned int		brd_sector_shift;
 	unsigned int		brd_sector_size;
+	unsigned int		brd_logical_sector_shift;
+	unsigned int		brd_logical_sector_size;
 };
 
 #define BRD_SECTOR_SHIFT(b) ((b)->brd_sector_shift - SECTOR_SHIFT)
@@ -242,8 +244,8 @@ static void brd_submit_bio(struct bio *bio)
 		int err;
 
 		/* Don't support un-aligned buffer */
-		WARN_ON_ONCE((iter.offset & (SECTOR_SIZE - 1)) ||
-				(len & (SECTOR_SIZE - 1)));
+		WARN_ON_ONCE((iter.offset & (brd->brd_logical_sector_size - 1)) ||
+				(len & (brd->brd_logical_sector_size - 1)));
 
 		err = brd_do_folio(brd, iter.folio, len, iter.offset,
 				   bio->bi_opf, sector);
@@ -285,6 +287,10 @@ static unsigned int rd_blksize = PAGE_SIZE;
 module_param(rd_blksize, uint, 0444);
 MODULE_PARM_DESC(rd_blksize, "Blocksize of each RAM disk in bytes.");
 
+static unsigned int rd_logical_blksize = SECTOR_SIZE;
+module_param(rd_logical_blksize, uint, 0444);
+MODULE_PARM_DESC(rd_logical_blksize, "Logical blocksize of each RAM disk in bytes.");
+
 MODULE_LICENSE("GPL");
 MODULE_ALIAS_BLOCKDEV_MAJOR(RAMDISK_MAJOR);
 MODULE_ALIAS("rd");
@@ -342,6 +348,21 @@ static int brd_alloc(int i)
 	brd->brd_sector_shift = ilog2(rd_blksize);
 	brd->brd_sector_size = rd_blksize;
 
+	if (!is_power_of_2(rd_logical_blksize)) {
+		pr_err("rd_logical_blksize %d is not supported\n",
+		       rd_logical_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	if (rd_logical_blksize > rd_blksize) {
+		pr_err("rd_logical_blksize %d larger than rd_blksize %d\n",
+		       rd_logical_blksize, rd_blksize);
+		err = -EINVAL;
+		goto out_free_dev;
+	}
+	brd->brd_logical_sector_shift = ilog2(rd_logical_blksize);
+	brd->brd_logical_sector_size = rd_logical_blksize;
+
 	xa_init(&brd->brd_folios);
 
 	snprintf(buf, DISK_NAME_LEN, "ram%d", i);
@@ -362,6 +383,7 @@ static int brd_alloc(int i)
 	set_capacity(disk, rd_size * 2);
 
 	blk_queue_physical_block_size(disk->queue, rd_blksize);
+	blk_queue_logical_block_size(disk->queue, rd_logical_blksize);
 	blk_queue_max_hw_sectors(disk->queue, rd_max_sectors);
 
 	/* Tell the block layer that this is not a rotational device */
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 17/18] xfs: remove check for block sizes smaller than PAGE_SIZE
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (15 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 16/18] brd: make logical " Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  2023-09-20  2:13   ` Dave Chinner
  2023-09-18 11:05 ` [PATCH 18/18] nvme: enable logical block size > PAGE_SIZE Hannes Reinecke
  17 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel, Hannes Reinecke

We now support block sizes larger than PAGE_SIZE, so this
check is pointless.

Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 fs/xfs/xfs_super.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 1f77014c6e1a..67dcdd4dcf2d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1651,18 +1651,6 @@ xfs_fs_fill_super(
 		goto out_free_sb;
 	}
 
-	/*
-	 * Until this is fixed only page-sized or smaller data blocks work.
-	 */
-	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
-		xfs_warn(mp,
-		"File system with blocksize %d bytes. "
-		"Only pagesize (%ld) or less will currently work.",
-				mp->m_sb.sb_blocksize, PAGE_SIZE);
-		error = -ENOSYS;
-		goto out_free_sb;
-	}
-
 	/* Ensure this filesystem fits in the page cache limits */
 	if (xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_dblocks) ||
 	    xfs_sb_validate_fsb_count(&mp->m_sb, mp->m_sb.sb_rblocks)) {
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 18/18] nvme: enable logical block size > PAGE_SIZE
  2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
                   ` (16 preceding siblings ...)
  2023-09-18 11:05 ` [PATCH 17/18] xfs: remove check for block sizes smaller than PAGE_SIZE Hannes Reinecke
@ 2023-09-18 11:05 ` Hannes Reinecke
  17 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 11:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

From: Pankaj Raghav <p.raghav@samsung.com>

Don't set the capacity to zero for when logical block size > PAGE_SIZE
as the block device with iomap aops support allocating block cache with
a minimum folio order.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f3a01b79148c..e72f1b1ee27a 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1889,7 +1889,7 @@ static void nvme_update_disk_info(struct gendisk *disk,
 	 * The block layer can't support LBA sizes larger than the page size
 	 * yet, so catch this early and don't allow block I/O.
 	 */
-	if (ns->lba_shift > PAGE_SHIFT) {
+	if ((ns->lba_shift > PAGE_SHIFT) && IS_ENABLED(CONFIG_BUFFER_HEAD)) {
 		capacity = 0;
 		bs = (1 << 9);
 	}
-- 
2.35.3


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [lkp] [+550 bytes kernel size regression] [i386-tinyconfig] [8558b2228d] fs: Allow fine-grained control of folio sizes
  2023-09-18 11:04 ` [PATCH 06/18] fs: Allow fine-grained control of folio sizes Hannes Reinecke
@ 2023-09-18 12:29   ` kernel test robot
  0 siblings, 0 replies; 39+ messages in thread
From: kernel test robot @ 2023-09-18 12:29 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: oe-kbuild-all, lkp, Josh Triplett


FYI, we noticed a +550 bytes kernel size regression due to commit:

commit: 8558b2228d415844dc8bb42d48a08aa85678fef9 (fs: Allow fine-grained control of folio sizes)
url: https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/mm-readahead-rework-loop-in-page_cache_ra_unbounded/20230918-190732
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/all/20230918110510.66470-7-hare@suse.de/
patch subject: [PATCH 06/18] fs: Allow fine-grained control of folio sizes


Details as below (size data is obtained by `nm --size-sort vmlinux`):

4cf60235: fs/mpage: use accessor function to translate page index to sectors
8558b222: fs: Allow fine-grained control of folio sizes

+------------------------------+----------+----------+-------+
|            symbol            | 4cf60235 | 8558b222 | delta |
+------------------------------+----------+----------+-------+
| nm.T.page_cache_ra_order     | 11       | 447      | 436   |
| bzImage                      | 522176   | 522560   | 384   |
| nm.T.__filemap_get_folio     | 338      | 442      | 104   |
| nm.T.readahead_expand        | 351      | 355      | 4     |
| nm.t.do_read_cache_folio     | 251      | 253      | 2     |
| nm.T.page_cache_ra_unbounded | 290      | 292      | 2     |
| nm.t.filemap_get_pages       | 699      | 701      | 2     |
+------------------------------+----------+----------+-------+



Thanks



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/18] mm/readahead: allocate folios with mapping order preference
  2023-09-18 11:05 ` [PATCH 08/18] mm/readahead: " Hannes Reinecke
@ 2023-09-18 13:11   ` kernel test robot
  2023-09-18 20:46   ` kernel test robot
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2023-09-18 13:11 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox
  Cc: oe-kbuild-all, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	Pankaj Raghav, linux-block, linux-fsdevel, Hannes Reinecke

Hi Hannes,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.6-rc2 next-20230918]
[cannot apply to xfs-linux/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/mm-readahead-rework-loop-in-page_cache_ra_unbounded/20230918-190732
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20230918110510.66470-9-hare%40suse.de
patch subject: [PATCH 08/18] mm/readahead: allocate folios with mapping order preference
config: i386-tinyconfig (https://download.01.org/0day-ci/archive/20230918/202309182020.nSchrF93-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230918/202309182020.nSchrF93-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309182020.nSchrF93-lkp@intel.com/

All errors (new ones prefixed by >>):

   ld: mm/readahead.o: in function `readahead_expand':
>> readahead.c:(.text+0x17b): undefined reference to `__divdi3'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page
  2023-09-18 11:04 ` [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
@ 2023-09-18 13:15   ` Matthew Wilcox
  2023-09-18 17:45     ` Hannes Reinecke
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-09-18 13:15 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 01:04:54PM +0200, Hannes Reinecke wrote:
> @@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  	struct folio *folio = args->folio;
>  	struct inode *inode = folio->mapping->host;
>  	const unsigned blkbits = inode->i_blkbits;
> -	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
> +	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>  	const unsigned blocksize = 1 << blkbits;
>  	struct buffer_head *map_bh = &args->map_bh;
>  	sector_t block_in_file;
> @@ -169,7 +169,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>  	sector_t last_block_in_file;
>  	sector_t blocks[MAX_BUF_PER_PAGE];
>  	unsigned page_block;
> -	unsigned first_hole = blocks_per_page;
> +	unsigned first_hole = blocks_per_folio;
>  	struct block_device *bdev = NULL;
>  	int length;
>  	int fully_mapped = 1;

I feel like we need an assertion that blocks_per_folio <=
MAX_BUF_PER_PAGE.  Otherwise this function runs off the end of the
'blocks' array.

Or (and I tried to do this once before getting bogged down), change this
function to not need the blocks array.  We need (first_block, length),
because this function will punt to 'confused' if theree is an on-disk
discontiguity.

ie this line needs to change:

-		if (page_block && blocks[page_block-1] != map_bh->b_blocknr-1)
+		if (page_block == 0)
+			first_block = map_bh->b_blocknr;
+		else if (first_block + page_block != map_bh->b_blocknr)
			goto confused;

> @@ -474,12 +474,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
>  	struct address_space *mapping = folio->mapping;
>  	struct inode *inode = mapping->host;
>  	const unsigned blkbits = inode->i_blkbits;
> -	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
> +	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>  	sector_t last_block;
>  	sector_t block_in_file;
>  	sector_t blocks[MAX_BUF_PER_PAGE];
>  	unsigned page_block;
> -	unsigned first_unmapped = blocks_per_page;
> +	unsigned first_unmapped = blocks_per_folio;
>  	struct block_device *bdev = NULL;
>  	int boundary = 0;
>  	sector_t boundary_block = 0;

Similarly, this function needss an assert.  Or remove blocks[].


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/18] mm/filemap: allocate folios with mapping order preference
  2023-09-18 11:04 ` [PATCH 07/18] mm/filemap: allocate folios with mapping order preference Hannes Reinecke
@ 2023-09-18 13:41   ` Matthew Wilcox
  2023-09-18 17:34     ` Hannes Reinecke
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-09-18 13:41 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 01:04:59PM +0200, Hannes Reinecke wrote:
> +++ b/mm/filemap.c
> @@ -507,9 +507,14 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
>  	pgoff_t end = end_byte >> PAGE_SHIFT;
>  	struct folio_batch fbatch;
>  	unsigned nr_folios;
> +	unsigned int order = mapping_min_folio_order(mapping);
>  
>  	folio_batch_init(&fbatch);
>  
> +	if (order) {
> +		index = ALIGN_DOWN(index, 1 << order);
> +		end = ALIGN_DOWN(end, 1 << order);
> +	}
>  	while (index <= end) {
>  		unsigned i;
>  

I don't understand why this function needs to change at all.
filemap_get_folios_tag() should return any folios which overlap
(index, end).  And aligning 'end' _down_ certainly sets off alarm bells
for me.  We surely would need to align _up_.  Except i don't think we
need to do anything to this function.

> @@ -2482,7 +2487,8 @@ static int filemap_create_folio(struct file *file,
>  	struct folio *folio;
>  	int error;
>  
> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
> +				    mapping_min_folio_order(mapping));
>  	if (!folio)
>  		return -ENOMEM;
>  

Surely we need to align 'index' here?

> @@ -2542,9 +2548,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  	pgoff_t last_index;
>  	struct folio *folio;
>  	int err = 0;
> +	unsigned int order = mapping_min_folio_order(mapping);
>  
>  	/* "last_index" is the index of the page beyond the end of the read */
>  	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
> +	if (order) {
> +		/* Align with folio order */
> +		WARN_ON(index % 1 << order);
> +		index = ALIGN_DOWN(index, 1 << order);
> +		last_index = ALIGN(last_index, 1 << order);
> +	}

Not sure I see the point of this.  filemap_get_read_batch() returns any
folio which contains 'index'.

>  retry:
>  	if (fatal_signal_pending(current))
>  		return -EINTR;
> @@ -2561,7 +2574,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>  		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>  			return -EAGAIN;
>  		err = filemap_create_folio(filp, mapping,
> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
> +				index, fbatch);

... ah, you align index here.  I wonder if we wouldn't be better passing
iocb->ki_pos to filemap_create_folio() to emphasise that the caller
can't assume anything about the alignment/size of the folio.

> @@ -3676,7 +3689,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>  repeat:
>  	folio = filemap_get_folio(mapping, index);
>  	if (IS_ERR(folio)) {
> -		folio = filemap_alloc_folio(gfp, 0);
> +		folio = filemap_alloc_folio(gfp,
> +				mapping_min_folio_order(mapping));
>  		if (!folio)
>  			return ERR_PTR(-ENOMEM);
>  		err = filemap_add_folio(mapping, folio, index, gfp);

This needs to align index.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page()
  2023-09-18 11:05 ` [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page() Hannes Reinecke
@ 2023-09-18 14:00   ` Matthew Wilcox
  2023-09-18 17:38     ` Hannes Reinecke
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-09-18 14:00 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 01:05:01PM +0200, Hannes Reinecke wrote:
> Use the correct mapping order in grow_dev_page() to ensure folios
> are created with the correct order.

I see why you did this, but I think it's fragile.  __filemap_get_folio()
will happily decrease 'order' if memory allocation fails.  I think
__filemap_get_folio() needs to become aware of the minimum folio
order for this mapping, and then we don't need this patch.

Overall, I like bits of this patchset and I like bits of Pankaj's ;-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index}
  2023-09-18 11:04 ` [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index} Hannes Reinecke
@ 2023-09-18 16:36   ` Matthew Wilcox
  2023-09-18 17:42     ` Hannes Reinecke
  0 siblings, 1 reply; 39+ messages in thread
From: Matthew Wilcox @ 2023-09-18 16:36 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 01:04:55PM +0200, Hannes Reinecke wrote:
> @@ -449,6 +450,22 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
>  
>  bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
>  
> +static inline sector_t block_index_to_sector(pgoff_t index, unsigned int blkbits)
> +{
> +	if (PAGE_SHIFT < blkbits)
> +		return (sector_t)index >> (blkbits - PAGE_SHIFT);
> +	else
> +		return (sector_t)index << (PAGE_SHIFT - blkbits);
> +}

Is this actually more efficient than ...

	loff_t pos = (loff_t)index * PAGE_SIZE;
	return pos >> blkbits;

It feels like we're going to be doing this a lot, so we should find out
what's actually faster.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 07/18] mm/filemap: allocate folios with mapping order preference
  2023-09-18 13:41   ` Matthew Wilcox
@ 2023-09-18 17:34     ` Hannes Reinecke
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 17:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On 9/18/23 15:41, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 01:04:59PM +0200, Hannes Reinecke wrote:
>> +++ b/mm/filemap.c
>> @@ -507,9 +507,14 @@ static void __filemap_fdatawait_range(struct address_space *mapping,
>>   	pgoff_t end = end_byte >> PAGE_SHIFT;
>>   	struct folio_batch fbatch;
>>   	unsigned nr_folios;
>> +	unsigned int order = mapping_min_folio_order(mapping);
>>   
>>   	folio_batch_init(&fbatch);
>>   
>> +	if (order) {
>> +		index = ALIGN_DOWN(index, 1 << order);
>> +		end = ALIGN_DOWN(end, 1 << order);
>> +	}
>>   	while (index <= end) {
>>   		unsigned i;
>>   
> 
> I don't understand why this function needs to change at all.
> filemap_get_folios_tag() should return any folios which overlap
> (index, end).  And aligning 'end' _down_ certainly sets off alarm bells
> for me.  We surely would need to align _up_.  Except i don't think we
> need to do anything to this function.
> 
Because 'end' is the _last_ valid index, not the index at which the 
iteration stops (cf 'index <= end') here. And as the index remains in 4k 
units we need to align both, index and end, to the nearest folio.

>> @@ -2482,7 +2487,8 @@ static int filemap_create_folio(struct file *file,
>>   	struct folio *folio;
>>   	int error;
>>   
>> -	folio = filemap_alloc_folio(mapping_gfp_mask(mapping), 0);
>> +	folio = filemap_alloc_folio(mapping_gfp_mask(mapping),
>> +				    mapping_min_folio_order(mapping));
>>   	if (!folio)
>>   		return -ENOMEM;
>>   
> 
> Surely we need to align 'index' here?
> 
Surely.

>> @@ -2542,9 +2548,16 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>>   	pgoff_t last_index;
>>   	struct folio *folio;
>>   	int err = 0;
>> +	unsigned int order = mapping_min_folio_order(mapping);
>>   
>>   	/* "last_index" is the index of the page beyond the end of the read */
>>   	last_index = DIV_ROUND_UP(iocb->ki_pos + count, PAGE_SIZE);
>> +	if (order) {
>> +		/* Align with folio order */
>> +		WARN_ON(index % 1 << order);
>> +		index = ALIGN_DOWN(index, 1 << order);
>> +		last_index = ALIGN(last_index, 1 << order);
>> +	}
> 
> Not sure I see the point of this.  filemap_get_read_batch() returns any
> folio which contains 'index'.
> 
Does it? Cool. Then of course we don't need to align the index here.

>>   retry:
>>   	if (fatal_signal_pending(current))
>>   		return -EINTR;
>> @@ -2561,7 +2574,7 @@ static int filemap_get_pages(struct kiocb *iocb, size_t count,
>>   		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_WAITQ))
>>   			return -EAGAIN;
>>   		err = filemap_create_folio(filp, mapping,
>> -				iocb->ki_pos >> PAGE_SHIFT, fbatch);
>> +				index, fbatch);
> 
> ... ah, you align index here.  I wonder if we wouldn't be better passing
> iocb->ki_pos to filemap_create_folio() to emphasise that the caller
> can't assume anything about the alignment/size of the folio.
> 
I can check if that makes a difference.

>> @@ -3676,7 +3689,8 @@ static struct folio *do_read_cache_folio(struct address_space *mapping,
>>   repeat:
>>   	folio = filemap_get_folio(mapping, index);
>>   	if (IS_ERR(folio)) {
>> -		folio = filemap_alloc_folio(gfp, 0);
>> +		folio = filemap_alloc_folio(gfp,
>> +				mapping_min_folio_order(mapping));
>>   		if (!folio)
>>   			return ERR_PTR(-ENOMEM);
>>   		err = filemap_add_folio(mapping, folio, index, gfp);
> 
> This needs to align index.

Why, but of course. Will check.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page()
  2023-09-18 14:00   ` Matthew Wilcox
@ 2023-09-18 17:38     ` Hannes Reinecke
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 17:38 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On 9/18/23 16:00, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 01:05:01PM +0200, Hannes Reinecke wrote:
>> Use the correct mapping order in grow_dev_page() to ensure folios
>> are created with the correct order.
> 
> I see why you did this, but I think it's fragile.  __filemap_get_folio()
> will happily decrease 'order' if memory allocation fails.  I think
> __filemap_get_folio() needs to become aware of the minimum folio
> order for this mapping, and then we don't need this patch.
> 
> Overall, I like bits of this patchset and I like bits of Pankaj's ;-)

To be expected. It's basically parallel development, me and Pankaj 
working independently and arriving at different patchsets.
Will see next week at ALPSS if we can merge them into something sensible.

And 'grow_dev_page()' was really done by audit, and I'm not sure if my 
tests even exercised this particular codepath. So yeah, I'm with you.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index}
  2023-09-18 16:36   ` Matthew Wilcox
@ 2023-09-18 17:42     ` Hannes Reinecke
  2023-09-18 21:01       ` Matthew Wilcox
  0 siblings, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 17:42 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On 9/18/23 18:36, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 01:04:55PM +0200, Hannes Reinecke wrote:
>> @@ -449,6 +450,22 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
>>   
>>   bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
>>   
>> +static inline sector_t block_index_to_sector(pgoff_t index, unsigned int blkbits)
>> +{
>> +	if (PAGE_SHIFT < blkbits)
>> +		return (sector_t)index >> (blkbits - PAGE_SHIFT);
>> +	else
>> +		return (sector_t)index << (PAGE_SHIFT - blkbits);
>> +}
> 
> Is this actually more efficient than ...
> 
> 	loff_t pos = (loff_t)index * PAGE_SIZE;
> 	return pos >> blkbits;
> 
> It feels like we're going to be doing this a lot, so we should find out
> what's actually faster.
> 
I fear that's my numerical computation background chiming in again.
One always tries to worry about numerical stability, and increasing a 
number always risks of running into an overflow.
But yeah, I guess your version is simpler, and we can always lean onto 
the compiler folks to have the compiler arrive at the same assembler 
code than my version.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page
  2023-09-18 13:15   ` Matthew Wilcox
@ 2023-09-18 17:45     ` Hannes Reinecke
  0 siblings, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-18 17:45 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On 9/18/23 15:15, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 01:04:54PM +0200, Hannes Reinecke wrote:
>> @@ -161,7 +161,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>>   	struct folio *folio = args->folio;
>>   	struct inode *inode = folio->mapping->host;
>>   	const unsigned blkbits = inode->i_blkbits;
>> -	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
>> +	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>>   	const unsigned blocksize = 1 << blkbits;
>>   	struct buffer_head *map_bh = &args->map_bh;
>>   	sector_t block_in_file;
>> @@ -169,7 +169,7 @@ static struct bio *do_mpage_readpage(struct mpage_readpage_args *args)
>>   	sector_t last_block_in_file;
>>   	sector_t blocks[MAX_BUF_PER_PAGE];
>>   	unsigned page_block;
>> -	unsigned first_hole = blocks_per_page;
>> +	unsigned first_hole = blocks_per_folio;
>>   	struct block_device *bdev = NULL;
>>   	int length;
>>   	int fully_mapped = 1;
> 
> I feel like we need an assertion that blocks_per_folio <=
> MAX_BUF_PER_PAGE.  Otherwise this function runs off the end of the
> 'blocks' array.
> 
> Or (and I tried to do this once before getting bogged down), change this
> function to not need the blocks array.  We need (first_block, length),
> because this function will punt to 'confused' if theree is an on-disk
> discontiguity.
> 
> ie this line needs to change:
> 
> -		if (page_block && blocks[page_block-1] != map_bh->b_blocknr-1)
> +		if (page_block == 0)
> +			first_block = map_bh->b_blocknr;
> +		else if (first_block + page_block != map_bh->b_blocknr)
> 			goto confused;
> 
>> @@ -474,12 +474,12 @@ static int __mpage_writepage(struct folio *folio, struct writeback_control *wbc,
>>   	struct address_space *mapping = folio->mapping;
>>   	struct inode *inode = mapping->host;
>>   	const unsigned blkbits = inode->i_blkbits;
>> -	const unsigned blocks_per_page = PAGE_SIZE >> blkbits;
>> +	const unsigned blocks_per_folio = folio_size(folio) >> blkbits;
>>   	sector_t last_block;
>>   	sector_t block_in_file;
>>   	sector_t blocks[MAX_BUF_PER_PAGE];
>>   	unsigned page_block;
>> -	unsigned first_unmapped = blocks_per_page;
>> +	unsigned first_unmapped = blocks_per_folio;
>>   	struct block_device *bdev = NULL;
>>   	int boundary = 0;
>>   	sector_t boundary_block = 0;
> 
> Similarly, this function needss an assert.  Or remove blocks[].
> 
Will have a look. Just tried the obvious conversion as I was still 
trying to get the thing to work at that point. So bear with me.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 08/18] mm/readahead: allocate folios with mapping order preference
  2023-09-18 11:05 ` [PATCH 08/18] mm/readahead: " Hannes Reinecke
  2023-09-18 13:11   ` kernel test robot
@ 2023-09-18 20:46   ` kernel test robot
  1 sibling, 0 replies; 39+ messages in thread
From: kernel test robot @ 2023-09-18 20:46 UTC (permalink / raw)
  To: Hannes Reinecke, Matthew Wilcox
  Cc: oe-kbuild-all, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	Pankaj Raghav, linux-block, linux-fsdevel, Hannes Reinecke

Hi Hannes,

kernel test robot noticed the following build errors:

[auto build test ERROR on axboe-block/for-next]
[also build test ERROR on akpm-mm/mm-everything linus/master v6.6-rc2 next-20230918]
[cannot apply to xfs-linux/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Hannes-Reinecke/mm-readahead-rework-loop-in-page_cache_ra_unbounded/20230918-190732
base:   https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-next
patch link:    https://lore.kernel.org/r/20230918110510.66470-9-hare%40suse.de
patch subject: [PATCH 08/18] mm/readahead: allocate folios with mapping order preference
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20230919/202309190435.X4RxeTJO-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20230919/202309190435.X4RxeTJO-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202309190435.X4RxeTJO-lkp@intel.com/

All errors (new ones prefixed by >>):

   arm-linux-gnueabi-ld: mm/readahead.o: in function `readahead_expand':
>> readahead.c:(.text+0x360): undefined reference to `__aeabi_ldivmod'

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index}
  2023-09-18 17:42     ` Hannes Reinecke
@ 2023-09-18 21:01       ` Matthew Wilcox
  0 siblings, 0 replies; 39+ messages in thread
From: Matthew Wilcox @ 2023-09-18 21:01 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 07:42:51PM +0200, Hannes Reinecke wrote:
> On 9/18/23 18:36, Matthew Wilcox wrote:
> > On Mon, Sep 18, 2023 at 01:04:55PM +0200, Hannes Reinecke wrote:
> > > @@ -449,6 +450,22 @@ __bread(struct block_device *bdev, sector_t block, unsigned size)
> > >   bool block_dirty_folio(struct address_space *mapping, struct folio *folio);
> > > +static inline sector_t block_index_to_sector(pgoff_t index, unsigned int blkbits)
> > > +{
> > > +	if (PAGE_SHIFT < blkbits)
> > > +		return (sector_t)index >> (blkbits - PAGE_SHIFT);
> > > +	else
> > > +		return (sector_t)index << (PAGE_SHIFT - blkbits);
> > > +}
> > 
> > Is this actually more efficient than ...
> > 
> > 	loff_t pos = (loff_t)index * PAGE_SIZE;
> > 	return pos >> blkbits;
> > 
> > It feels like we're going to be doing this a lot, so we should find out
> > what's actually faster.
> > 
> I fear that's my numerical computation background chiming in again.
> One always tries to worry about numerical stability, and increasing a number
> always risks of running into an overflow.
> But yeah, I guess your version is simpler, and we can always lean onto the
> compiler folks to have the compiler arrive at the same assembler code than
> my version.

I actually don't mind the additional complexity -- if it's faster.
Yours is a conditional, two subtractions and two shifts (dependent on
the result of the subtractions).  Mine is two shifts, the second
dependent on the first.

I would say mine is safe because we're talking about a file (or a bdev).
By definition, the byte offset into one of those fits into an loff_t,
although maybe not an unsigned long.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 17/18] xfs: remove check for block sizes smaller than PAGE_SIZE
  2023-09-18 11:05 ` [PATCH 17/18] xfs: remove check for block sizes smaller than PAGE_SIZE Hannes Reinecke
@ 2023-09-20  2:13   ` Dave Chinner
  0 siblings, 0 replies; 39+ messages in thread
From: Dave Chinner @ 2023-09-20  2:13 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	Pankaj Raghav, linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 01:05:09PM +0200, Hannes Reinecke wrote:
> We now support block sizes larger than PAGE_SIZE, so this
> check is pointless.
> 
> Signed-off-by: Hannes Reinecke <hare@suse.de>
> ---
>  fs/xfs/xfs_super.c | 12 ------------
>  1 file changed, 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 1f77014c6e1a..67dcdd4dcf2d 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1651,18 +1651,6 @@ xfs_fs_fill_super(
>  		goto out_free_sb;
>  	}
>  
> -	/*
> -	 * Until this is fixed only page-sized or smaller data blocks work.
> -	 */
> -	if (mp->m_sb.sb_blocksize > PAGE_SIZE) {
> -		xfs_warn(mp,
> -		"File system with blocksize %d bytes. "
> -		"Only pagesize (%ld) or less will currently work.",
> -				mp->m_sb.sb_blocksize, PAGE_SIZE);
> -		error = -ENOSYS;
> -		goto out_free_sb;
> -	}

This really needs to be replaced with an EXPERIMENTAL warning -
we're not going to support these LBS configurations until we are
sure it doesn't eat data.

Anyway, smoke tests....

# mkfs.xfs -f -b size=64k /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=32768 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=131072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=1024, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
# mount /dev/pmem0 /mnt/test

Message from syslogd@test3 at Sep 20 11:23:32 ...
 kernel:[   73.521819] XFS: Assertion failed: PAGE_SHIFT >= sbp->sb_blocklog, file: fs/xfs/xfs_mount.c, line: 134

Message from syslogd@test3 at Sep 20 11:23:32 ...
 kernel:[   73.521819] XFS: Assertion failed: PAGE_SHIFT >= sbp->sb_blocklog, file: fs/xfs/xfs_mount.c, line: 134
Segmentation fault
#

Looks like this hasn't been tested with CONFIG_XFS_DEBUG=y. If
that's the case, I expect that none of this actually works... :/

I've attached a patch at the end of the email that allows XFS
filesystems to mount with debug enabled.

Next problem, filesystem created with a 32kB sector size:

# mkfs.xfs -f -b size=64k -s size=32k /dev/pmem0
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=32768 blks
         =                       sectsz=32768 attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=0
data     =                       bsize=65536  blocks=131072, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=65536  ascii-ci=0, ftype=1
log      =internal log           bsize=65536  blocks=2709, version=2
         =                       sectsz=32768 sunit=1 blks, lazy-count=1
realtime =none                   extsz=65536  blocks=0, rtextents=0
#

and then running xfs_db on it to change the UUID:

# xfs_admin -U generate /dev/pmem0

Results in a kernel panic:

[  132.151886] XFS (pmem0): Mounting V5 Filesystem 3d96f860-2aa2-4e50-970c-134508b7954a
[  132.161673] XFS (pmem0): Ending clean mount
[  175.824015] XFS (pmem0): Unmounting Filesystem 3d96f860-2aa2-4e50-970c-134508b7954a
[  185.759251] Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: do_mpage_readpage+0x7e5/0x7f0
[  185.766632] CPU: 1 PID: 4383 Comm: xfs_db Not tainted 6.6.0-rc2-dgc+ #1903
[  185.771882] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  185.778827] Call Trace:
[  185.780706]  <TASK>
[  185.782318]  dump_stack_lvl+0x37/0x50
[  185.785069]  dump_stack+0x10/0x20
[  185.787557]  panic+0x15b/0x300
[  185.789763]  ? do_mpage_readpage+0x7e5/0x7f0
[  185.792705]  __stack_chk_fail+0x14/0x20
[  185.795183]  do_mpage_readpage+0x7e5/0x7f0
[  185.797826]  ? blkdev_write_begin+0x30/0x30
[  185.800500]  ? blkdev_readahead+0x15/0x20
[  185.802894]  ? read_pages+0x5c/0x230
[  185.805023]  ? page_cache_ra_order+0x2ae/0x310
[  185.807538]  ? ondemand_readahead+0x1f1/0x3a0
[  185.809899]  ? page_cache_async_ra+0x26/0x30
[  185.812175]  ? filemap_get_pages+0x540/0x6d0
[  185.814327]  ? _copy_to_iter+0x65/0x4c0
[  185.816283]  ? filemap_read+0xfc/0x3a0
[  185.818086]  ? __fsnotify_parent+0x107/0x340
[  185.820142]  ? __might_sleep+0x42/0x70
[  185.821854]  ? blkdev_read_iter+0x6d/0x150
[  185.823697]  ? vfs_read+0x1b1/0x300
[  185.825307]  ? __x64_sys_pread64+0x8f/0xc0
[  185.827159]  ? irqentry_exit+0x33/0x40
[  185.828858]  ? do_syscall_64+0x35/0x80
[  185.830485]  ? entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  185.832677]  </TASK>
[  185.834068] Kernel Offset: disabled
[  185.835510] ---[ end Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: do_mpage_readpage+0x7e5/0x7f0 ]---

Probably related to the block device having a 32kB block size set on
the pmem device by xfs_db when it only really has a 4kB sector size....

Anyway, back to just using 64k fsb, 4k sector.  generic/001 ASSERT
fails immediately with:

[  111.785796] run fstests generic/001 at 2023-09-20 11:50:19
[  113.346797] XFS: Assertion failed: imap.br_startblock != DELAYSTARTBLOCK, file: fs/xfs/xfs_reflink.c, line: 1392
[  113.352512] ------------[ cut here ]------------
[  113.354793] kernel BUG at fs/xfs/xfs_message.c:102!
[  113.358444] invalid opcode: 0000 [#1] PREEMPT SMP
[  113.360769] CPU: 8 PID: 7581 Comm: cp Not tainted 6.6.0-rc2-dgc+ #1903
[  113.364183] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[  113.369784] RIP: 0010:assfail+0x35/0x40
[  113.372038] Code: c9 48 c7 c2 00 d8 6c 82 48 89 e5 48 89 f1 48 89 fe 48 c7 c7 d8 d3 60 82 e8 a8 fd ff ff 80 3d d1 36 ec 02 00 75 04 0f 0b 5d c3 <0f> 0b 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 63 f6 49 89
[  113.384178] RSP: 0018:ffffc9000962bca0 EFLAGS: 00010202
[  113.387467] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 000000007fffffff
[  113.392326] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff8260d3d8
[  113.397235] RBP: ffffc9000962bca0 R08: 0000000000000000 R09: 000000000000000a
[  113.401449] R10: 000000000000000a R11: 0fffffffffffffff R12: 0000000000000000
[  113.406312] R13: 00000000ffffff8b R14: 0000000000000000 R15: ffff88810de00f00
[  113.410562] FS:  00007f4f6219b500(0000) GS:ffff8885fec00000(0000) knlGS:0000000000000000
[  113.415506] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  113.418688] CR2: 0000564951bfcf50 CR3: 00000005cd6b8005 CR4: 0000000000060ee0
[  113.422825] Call Trace:
[  113.424302]  <TASK>
[  113.425566]  ? show_regs+0x61/0x70
[  113.427444]  ? die+0x37/0x90
[  113.429175]  ? do_trap+0xec/0x100
[  113.431016]  ? do_error_trap+0x6c/0x90
[  113.432951]  ? assfail+0x35/0x40
[  113.434677]  ? exc_invalid_op+0x52/0x70
[  113.436755]  ? assfail+0x35/0x40
[  113.438659]  ? asm_exc_invalid_op+0x1b/0x20
[  113.440864]  ? assfail+0x35/0x40
[  113.442671]  xfs_reflink_remap_blocks+0x197/0x350
[  113.445278]  xfs_file_remap_range+0xf3/0x320
[  113.447504]  do_clone_file_range+0xfe/0x2b0
[  113.449689]  vfs_clone_file_range+0x3f/0x150
[  113.452080]  ioctl_file_clone+0x52/0xa0
[  113.453600]  do_vfs_ioctl+0x485/0x8d0
[  113.455054]  ? selinux_file_ioctl+0x96/0x120
[  113.456637]  ? selinux_file_ioctl+0x96/0x120
[  113.458213]  __x64_sys_ioctl+0x73/0xd0
[  113.459598]  do_syscall_64+0x35/0x80
[  113.460840]  entry_SYSCALL_64_after_hwframe+0x63/0xcd

Problems with unexpected extent types in the reflink remap code.

This is due to the reflink operation finding a delalloc extent where
it should be finding a real extent. this implies the
xfs_flush_unmap_range() call in xfs_reflink_remap_prep() didn't
flush the full data range it was supposed to.
xfs_flush_unmap_range() is supposed to round the range out to:

	rounding = max_t(xfs_off_t, mp->m_sb.sb_blocksize, PAGE_SIZE);

the block size for these cases, so it is LBS aware. So maybe there's
a problem with filemap_write_and_wait_range() and/or
truncate_pagecache_range() when dealing with LBS enabled page cache?

So, yeah, debug XFS builds tell us straight away that important stuff
does not appear to be not working correctly. Debug really needs to
be enabled, otherwise silent data corruption situations like this
can go unnoticed by fstests tests...

-Dave
-- 
Dave Chinner
david@fromorbit.com


xfs: fix page cache size validation for LBS configuations

From: Dave Chinner <dchinner@redhat.com>

The page cache can index larger filesystem sizes on 32 bit systems
if large block sizes are enabled. Fix this code to take these
configurations into account.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_mount.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 0a0fd19573d8..35dfcbc1e576 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -131,11 +131,15 @@ xfs_sb_validate_fsb_count(
 	xfs_sb_t	*sbp,
 	uint64_t	nblocks)
 {
-	ASSERT(PAGE_SHIFT >= sbp->sb_blocklog);
-	ASSERT(sbp->sb_blocklog >= BBSHIFT);
+	ASSERT(sbp->sb_blocklog <= XFS_MAX_BLOCKSIZE_LOG);
+	ASSERT(sbp->sb_blocklog >= XFS_MIN_BLOCKSIZE_LOG);
 
 	/* Limited by ULONG_MAX of page cache index */
-	if (nblocks >> (PAGE_SHIFT - sbp->sb_blocklog) > ULONG_MAX)
+	if (PAGE_SHIFT >= sbp->sb_blocklog)
+		nblocks >>= (PAGE_SHIFT - sbp->sb_blocklog);
+	else
+		nblocks <<= (sbp->sb_blocklog - PAGE_SHIFT);
+	if (nblocks > ULONG_MAX)
 		return -EFBIG;
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded()
       [not found]   ` <CGME20230920115645eucas1p1c8ed9bf515c4532b3e6995f8078a863b@eucas1p1.samsung.com>
@ 2023-09-20 11:56     ` Pankaj Raghav
  2023-09-20 14:13       ` Hannes Reinecke
  2023-09-20 14:18       ` Matthew Wilcox
  0 siblings, 2 replies; 39+ messages in thread
From: Pankaj Raghav @ 2023-09-20 11:56 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel, p.raghav

On Mon, Sep 18, 2023 at 01:04:53PM +0200, Hannes Reinecke wrote:
>  		if (folio && !xa_is_value(folio)) {
> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>  			 * not worth getting one just for that.
>  			 */
>  			read_pages(ractl);
> -			ractl->_index++;
> -			i = ractl->_index + ractl->_nr_pages - index - 1;
> +			ractl->_index += folio_nr_pages(folio);
> +			i = ractl->_index + ractl->_nr_pages - index;
I am not entirely sure if this is correct.

The above if condition only verifies if a folio is in the page cache but
doesn't tell if it is uptodate. But we are advancing the ractl->index
past this folio irrespective of that.

Am I missing something?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded()
  2023-09-20 11:56     ` Pankaj Raghav
@ 2023-09-20 14:13       ` Hannes Reinecke
  2023-09-21  9:06         ` Pankaj Raghav
  2023-09-20 14:18       ` Matthew Wilcox
  1 sibling, 1 reply; 39+ messages in thread
From: Hannes Reinecke @ 2023-09-20 14:13 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Matthew Wilcox, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel

On 9/20/23 13:56, Pankaj Raghav wrote:
> On Mon, Sep 18, 2023 at 01:04:53PM +0200, Hannes Reinecke wrote:
>>   		if (folio && !xa_is_value(folio)) {
>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>   			 * not worth getting one just for that.
>>   			 */
>>   			read_pages(ractl);
>> -			ractl->_index++;
>> -			i = ractl->_index + ractl->_nr_pages - index - 1;
>> +			ractl->_index += folio_nr_pages(folio);
>> +			i = ractl->_index + ractl->_nr_pages - index;
> I am not entirely sure if this is correct.
> 
> The above if condition only verifies if a folio is in the page cache but
> doesn't tell if it is uptodate. But we are advancing the ractl->index
> past this folio irrespective of that.
> 
> Am I missing something?

Confused. Which condition?
I'm not changing the condition at all, just changing the way how the 'i' 
index is calculated during the loop (ie getting rid of the weird 
decrement and increment on 'i' during the loop).
Please clarify.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded()
  2023-09-20 11:56     ` Pankaj Raghav
  2023-09-20 14:13       ` Hannes Reinecke
@ 2023-09-20 14:18       ` Matthew Wilcox
  1 sibling, 0 replies; 39+ messages in thread
From: Matthew Wilcox @ 2023-09-20 14:18 UTC (permalink / raw)
  To: Pankaj Raghav
  Cc: Hannes Reinecke, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel

On Wed, Sep 20, 2023 at 01:56:43PM +0200, Pankaj Raghav wrote:
> On Mon, Sep 18, 2023 at 01:04:53PM +0200, Hannes Reinecke wrote:
> >  		if (folio && !xa_is_value(folio)) {
> > @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
> >  			 * not worth getting one just for that.
> >  			 */
> >  			read_pages(ractl);
> > -			ractl->_index++;
> > -			i = ractl->_index + ractl->_nr_pages - index - 1;
> > +			ractl->_index += folio_nr_pages(folio);
> > +			i = ractl->_index + ractl->_nr_pages - index;
> I am not entirely sure if this is correct.
> 
> The above if condition only verifies if a folio is in the page cache but
> doesn't tell if it is uptodate. But we are advancing the ractl->index
> past this folio irrespective of that.
> 
> Am I missing something?

How readahead works?

Readahead is for the optimistic case where nothing has gone wrong;
we just don't have anything in the page cache yet.

If there's a !uptodate folio in the page cache, there are two
possibilities.  The most likely is that we have two threads accessing this
file at the same time.  If so, we should stop and allow the other thread
to do the readahead it has decided to do.  The less likely scenario is
that we had an error doing a previous readahead.  If that happened, we
should not try reading it again in readahead; we should let the thread
retry the read when it actually tries to access the folio.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded()
  2023-09-20 14:13       ` Hannes Reinecke
@ 2023-09-21  9:06         ` Pankaj Raghav
  0 siblings, 0 replies; 39+ messages in thread
From: Pankaj Raghav @ 2023-09-21  9:06 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Matthew Wilcox, Luis Chamberlain, Christoph Hellwig, Jens Axboe,
	linux-block, linux-fsdevel

On 2023-09-20 16:13, Hannes Reinecke wrote:
> On 9/20/23 13:56, Pankaj Raghav wrote:
>> On Mon, Sep 18, 2023 at 01:04:53PM +0200, Hannes Reinecke wrote:
>>>           if (folio && !xa_is_value(folio)) {
>>> @@ -239,8 +239,8 @@ void page_cache_ra_unbounded(struct readahead_control *ractl,
>>>                * not worth getting one just for that.
>>>                */
>>>               read_pages(ractl);
>>> -            ractl->_index++;
>>> -            i = ractl->_index + ractl->_nr_pages - index - 1;
>>> +            ractl->_index += folio_nr_pages(folio);
>>> +            i = ractl->_index + ractl->_nr_pages - index;
>> I am not entirely sure if this is correct.
>>
>> The above if condition only verifies if a folio is in the page cache but
>> doesn't tell if it is uptodate. But we are advancing the ractl->index
>> past this folio irrespective of that.
>>
>> Am I missing something?
> 
> Confused. Which condition?
> I'm not changing the condition at all, just changing the way how the 'i' index is calculated during
> the loop (ie getting rid of the weird decrement and increment on 'i' during the loop).
> Please clarify.
> 

I made a mistake of pointing out the wrong line in my reply. I was concerned about the increment to
the ractl->_index:

if (folio && !xa_is_value(folio)) {
....
  read_pages(ractl);
  ractl->_index += folio_nr_pages(folio); // This increment to the ractl->_index
...
}

But I think I was missing this point, as willy explained in his reply:

If there's a !uptodate folio in the page cache, <snip>. If that happened, we
should not try reading it **again in readahead**; we should let the thread
retry the read when it actually tries to access the folio.

Plus your changes optimizes the code path for a large folio where we increment the index by 1 each
time instead of moving the index directly to the next folio in the page cache.

Please ignore the noise!

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors
  2023-09-18 11:04 ` [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors Hannes Reinecke
@ 2023-10-20 19:37   ` Matthew Wilcox
  2023-10-21  5:08     ` Matthew Wilcox
  2023-10-23  5:03     ` Hannes Reinecke
  0 siblings, 2 replies; 39+ messages in thread
From: Matthew Wilcox @ 2023-10-20 19:37 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Mon, Sep 18, 2023 at 01:04:56PM +0200, Hannes Reinecke wrote:
> Use accessor functions block_index_to_sector() and block_sector_to_index()
> to translate the page index into the block sector and vice versa.

You missed two in grow_dev_page() (which I just happened upon):

        bh = folio_buffers(folio);
        if (bh) {
                if (bh->b_size == size) {
                        end_block = folio_init_buffers(folio, bdev,
                                        (sector_t)index << sizebits, size);
                        goto done;
                }
...
        spin_lock(&inode->i_mapping->private_lock);
        link_dev_buffers(folio, bh);
        end_block = folio_init_buffers(folio, bdev,
                        (sector_t)index << sizebits, size);

Can UBSAN be of help here?  It should catch shifting by a negative
amount.  That sizebits is calculated in grow_buffers:

        sizebits = PAGE_SHIFT - __ffs(size);


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors
  2023-10-20 19:37   ` Matthew Wilcox
@ 2023-10-21  5:08     ` Matthew Wilcox
  2023-10-23  5:03     ` Hannes Reinecke
  1 sibling, 0 replies; 39+ messages in thread
From: Matthew Wilcox @ 2023-10-21  5:08 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On Fri, Oct 20, 2023 at 08:37:26PM +0100, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 01:04:56PM +0200, Hannes Reinecke wrote:
> > Use accessor functions block_index_to_sector() and block_sector_to_index()
> > to translate the page index into the block sector and vice versa.
> 
> You missed two in grow_dev_page() (which I just happened upon):

I have fixes here.  The key part of the first patch is:

 static sector_t folio_init_buffers(struct folio *folio,
-               struct block_device *bdev, sector_t block, int size)
+               struct block_device *bdev, int size)
 {
        struct buffer_head *head = folio_buffers(folio);
        struct buffer_head *bh = head;
        bool uptodate = folio_test_uptodate(folio);
+       sector_t block = folio_pos(folio) / size;
        sector_t end_block = blkdev_max_block(bdev, size);

(and then there's the cruft of removing the arguments from
folio_init_buffers)

The second patch is:

 static bool grow_buffers(struct block_device *bdev, sector_t block,
                unsigned size, gfp_t gfp)
 {
-       pgoff_t index;
-       int sizebits;
-
-       sizebits = PAGE_SHIFT - __ffs(size);
-       index = block >> sizebits;
+       loff_t pos;
[...]
-       if (unlikely(index != block >> sizebits)) {
+       if (check_mul_overflow(block, size, &pos) || pos > MAX_LFS_FILESIZE) {

I'll send a proper patch series tomorrow once the fstests are done
running.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors
  2023-10-20 19:37   ` Matthew Wilcox
  2023-10-21  5:08     ` Matthew Wilcox
@ 2023-10-23  5:03     ` Hannes Reinecke
  1 sibling, 0 replies; 39+ messages in thread
From: Hannes Reinecke @ 2023-10-23  5:03 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, Christoph Hellwig, Jens Axboe, Pankaj Raghav,
	linux-block, linux-fsdevel

On 10/20/23 21:37, Matthew Wilcox wrote:
> On Mon, Sep 18, 2023 at 01:04:56PM +0200, Hannes Reinecke wrote:
>> Use accessor functions block_index_to_sector() and block_sector_to_index()
>> to translate the page index into the block sector and vice versa.
> 
> You missed two in grow_dev_page() (which I just happened upon):
> 
>          bh = folio_buffers(folio);
>          if (bh) {
>                  if (bh->b_size == size) {
>                          end_block = folio_init_buffers(folio, bdev,
>                                          (sector_t)index << sizebits, size);
>                          goto done;
>                  }
> ...
>          spin_lock(&inode->i_mapping->private_lock);
>          link_dev_buffers(folio, bh);
>          end_block = folio_init_buffers(folio, bdev,
>                          (sector_t)index << sizebits, size);
> 
> Can UBSAN be of help here?  It should catch shifting by a negative
> amount.  That sizebits is calculated in grow_buffers:
> 
>          sizebits = PAGE_SHIFT - __ffs(size);
> 
Ah, thanks. I'm currently working with Luis and Pankay to get a combined 
patchset; I'll make sure to have that included.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman


^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2023-10-23  5:03 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-18 11:04 [RFC PATCH 00/18] block: update buffer_head for Large-block I/O Hannes Reinecke
2023-09-18 11:04 ` [PATCH 01/18] mm/readahead: rework loop in page_cache_ra_unbounded() Hannes Reinecke
     [not found]   ` <CGME20230920115645eucas1p1c8ed9bf515c4532b3e6995f8078a863b@eucas1p1.samsung.com>
2023-09-20 11:56     ` Pankaj Raghav
2023-09-20 14:13       ` Hannes Reinecke
2023-09-21  9:06         ` Pankaj Raghav
2023-09-20 14:18       ` Matthew Wilcox
2023-09-18 11:04 ` [PATCH 02/18] fs/mpage: use blocks_per_folio instead of blocks_per_page Hannes Reinecke
2023-09-18 13:15   ` Matthew Wilcox
2023-09-18 17:45     ` Hannes Reinecke
2023-09-18 11:04 ` [PATCH 03/18] block/buffer_head: introduce block_{index_to_sector,sector_to_index} Hannes Reinecke
2023-09-18 16:36   ` Matthew Wilcox
2023-09-18 17:42     ` Hannes Reinecke
2023-09-18 21:01       ` Matthew Wilcox
2023-09-18 11:04 ` [PATCH 04/18] fs/buffer.c: use accessor function to translate page index to sectors Hannes Reinecke
2023-10-20 19:37   ` Matthew Wilcox
2023-10-21  5:08     ` Matthew Wilcox
2023-10-23  5:03     ` Hannes Reinecke
2023-09-18 11:04 ` [PATCH 05/18] fs/mpage: " Hannes Reinecke
2023-09-18 11:04 ` [PATCH 06/18] fs: Allow fine-grained control of folio sizes Hannes Reinecke
2023-09-18 12:29   ` [lkp] [+550 bytes kernel size regression] [i386-tinyconfig] [8558b2228d] " kernel test robot
2023-09-18 11:04 ` [PATCH 07/18] mm/filemap: allocate folios with mapping order preference Hannes Reinecke
2023-09-18 13:41   ` Matthew Wilcox
2023-09-18 17:34     ` Hannes Reinecke
2023-09-18 11:05 ` [PATCH 08/18] mm/readahead: " Hannes Reinecke
2023-09-18 13:11   ` kernel test robot
2023-09-18 20:46   ` kernel test robot
2023-09-18 11:05 ` [PATCH 09/18] fs/buffer: use mapping order in grow_dev_page() Hannes Reinecke
2023-09-18 14:00   ` Matthew Wilcox
2023-09-18 17:38     ` Hannes Reinecke
2023-09-18 11:05 ` [PATCH 10/18] block/bdev: lift restrictions on supported blocksize Hannes Reinecke
2023-09-18 11:05 ` [PATCH 11/18] block/bdev: enable large folio support for large logical block sizes Hannes Reinecke
2023-09-18 11:05 ` [PATCH 12/18] brd: convert to folios Hannes Reinecke
2023-09-18 11:05 ` [PATCH 13/18] brd: abstract page_size conventions Hannes Reinecke
2023-09-18 11:05 ` [PATCH 14/18] brd: use memcpy_{to,from}_folio() Hannes Reinecke
2023-09-18 11:05 ` [PATCH 15/18] brd: make sector size configurable Hannes Reinecke
2023-09-18 11:05 ` [PATCH 16/18] brd: make logical " Hannes Reinecke
2023-09-18 11:05 ` [PATCH 17/18] xfs: remove check for block sizes smaller than PAGE_SIZE Hannes Reinecke
2023-09-20  2:13   ` Dave Chinner
2023-09-18 11:05 ` [PATCH 18/18] nvme: enable logical block size > PAGE_SIZE Hannes Reinecke

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.