All of lore.kernel.org
* [RFC v2 0/8] add support for blocksize > PAGE_SIZE
@ 2023-05-26  7:55 Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 1/8] page_flags: add is_folio_hwpoison() Luis Chamberlain
                   ` (10 more replies)
  0 siblings, 11 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.
Why would you want this? It helps us experiment with higher order folio
use with fs APIs, and helps us test corner cases which would likely need
to be accounted for sooner or later if and when filesystems enable
support for this. Better to review early and burn early than to continue
in the wrong direction, so I am looking for early feedback.

I have other patches to convert shmem_file_read_iter() to folios too but that
is not yet working. In the swap world the next thing to look at would be to
convert swap_cluster_readahead() to folios.

As mentioned at LSFMM, if folks want to experiment with anything related to
Large Block Sizes (LBS) I've been trying to stash related patches into a
dedicated large-block tree which carries as many nuggets as we have and can
collect. Much of this is obviously work in progress, so don't try it unless
you want your systems to blow up. But in case you do, you can use my
large-block-20230525 branch [0]. Similarly you can also use kdevops [1] with
CONFIG_QEMU_ENABLE_EXTRA_DRIVE_LARGEIO support to get everything set up with
just:

  make
  make bringup
  make linux

Changes on this v2:

  o the block size has been modified to a block order after Matthew Wilcox's
    suggestion. This truly makes a huge difference in making this code
    much easier to read and maintain.
  o At Pankaj Raghav's suggestion I've put together a helper for
    poison flags and so this now introduces that as is_folio_hwpoison().
  o cleaned up the nits / debug code as pointed out by Matthew Wilcox
  o clarified that the max block size we support is computed from MAX_ORDER,
    and for x86_64 this is 8 MiB.
  o Tested up to a 4 MiB block size with a basic test; nothing blew up
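
As a sanity check on that cap, here is a small user-space sketch. The
MAX_ORDER = 11 and PAGE_SHIFT = 12 values are assumed x86_64 defaults
used for illustration only; in the kernel they come from the
architecture headers, and the function name is invented.

```c
#include <assert.h>

/* Assumed x86_64 values for illustration; in the kernel these come
 * from <linux/mmzone.h> and <asm/page.h>. */
#define MAX_ORDER	11
#define PAGE_SHIFT	12

/* The series caps the block order at MAX_ORDER + PAGE_SHIFT, so the
 * largest supported block size is 2^(MAX_ORDER + PAGE_SHIFT) bytes,
 * which is 8 MiB with the values above. */
static unsigned long shmem_max_block_size(void)
{
	return 1UL << (MAX_ORDER + PAGE_SHIFT);
}
```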

Future work:

  o shmem_file_read_iter()
  o extend struct address_space with order and use that instead
    of our own block order. We may still need to have our own block order,
    we'll need to see.
  o swap_cluster_readahead() and friends converted over to folios
  o test this well

[0] https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=large-block-20230525
[1] https://github.com/linux-kdevops/kdevops

Luis Chamberlain (8):
  page_flags: add is_folio_hwpoison()
  shmem: convert to use is_folio_hwpoison()
  shmem: account for high order folios
  shmem: add helpers to get block size
  shmem: account for larger blocks sizes for shmem_default_max_blocks()
  shmem: consider block size in shmem_default_max_inodes()
  shmem: add high order page support
  shmem: add support to customize block size order

 include/linux/page-flags.h |   7 ++
 include/linux/shmem_fs.h   |   3 +
 mm/shmem.c                 | 139 +++++++++++++++++++++++++++++--------
 3 files changed, 119 insertions(+), 30 deletions(-)

-- 
2.39.2

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [RFC v2 1/8] page_flags: add is_folio_hwpoison()
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26 13:51   ` Matthew Wilcox
  2023-05-26  7:55 ` [RFC v2 2/8] shmem: convert to use is_folio_hwpoison() Luis Chamberlain
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

Provide a helper similar to is_page_hwpoison() for folios. It tests the
poison flag on the head page and, if the folio is large, whether any
page in the folio has the poison flag set.
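
As a rough user-space model of the intended semantics (the struct and
its fields below are invented for illustration; the real helper operates
on struct folio and its page flags):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a folio, for illustrating the check only. */
struct folio_model {
	bool head_hwpoison;	/* poison flag set on the head page */
	bool large;		/* folio spans more than one page */
	bool has_hwpoisoned;	/* some subpage of a large folio is poisoned */
};

/* A folio counts as poisoned if its head page is poisoned, or if it
 * is a large folio and any of its subpages is poisoned. */
static bool is_folio_hwpoison_model(const struct folio_model *f)
{
	return f->head_hwpoison || (f->large && f->has_hwpoisoned);
}
```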

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/page-flags.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c68d67b832f..4d5f395edf03 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -883,6 +883,13 @@ static inline bool is_page_hwpoison(struct page *page)
 	return PageHuge(page) && PageHWPoison(compound_head(page));
 }
 
+static inline bool is_folio_hwpoison(struct folio *folio)
+{
+       if (folio_test_hwpoison(folio))
+               return true;
+       return folio_test_large(folio) && folio_test_has_hwpoisoned(folio);
+}
+
 /*
  * For pages that are never mapped to userspace (and aren't PageSlab),
  * page_type may be used.  Because it is initialised to -1, we invert the
-- 
2.39.2



* [RFC v2 2/8] shmem: convert to use is_folio_hwpoison()
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 1/8] page_flags: add is_folio_hwpoison() Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26 14:32   ` Matthew Wilcox
  2023-05-26  7:55 ` [RFC v2 3/8] shmem: account for high order folios Luis Chamberlain
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

The PageHWPoison() call can be converted over to the respective folio
call is_folio_hwpoison(). This introduces no functional changes.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 351803415ad2..a947f2678a39 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3360,7 +3360,7 @@ static const char *shmem_get_link(struct dentry *dentry,
 		folio = filemap_get_folio(inode->i_mapping, 0);
 		if (IS_ERR(folio))
 			return ERR_PTR(-ECHILD);
-		if (PageHWPoison(folio_page(folio, 0)) ||
+		if (is_folio_hwpoison(folio) ||
 		    !folio_test_uptodate(folio)) {
 			folio_put(folio);
 			return ERR_PTR(-ECHILD);
@@ -3371,7 +3371,7 @@ static const char *shmem_get_link(struct dentry *dentry,
 			return ERR_PTR(error);
 		if (!folio)
 			return ERR_PTR(-ECHILD);
-		if (PageHWPoison(folio_page(folio, 0))) {
+		if (is_folio_hwpoison(folio)) {
 			folio_unlock(folio);
 			folio_put(folio);
 			return ERR_PTR(-ECHILD);
@@ -4548,7 +4548,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 		return &folio->page;
 
 	page = folio_file_page(folio, index);
-	if (PageHWPoison(page)) {
+	if (is_folio_hwpoison(folio)) {
 		folio_put(folio);
 		return ERR_PTR(-EIO);
 	}
-- 
2.39.2



* [RFC v2 3/8] shmem: account for high order folios
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 1/8] page_flags: add is_folio_hwpoison() Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 2/8] shmem: convert to use is_folio_hwpoison() Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 4/8] shmem: add helpers to get block size Luis Chamberlain
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

shmem uses the shmem_inode_info members alloced and swapped to account
for allocated pages and swapped pages. In preparation for high order
folios, adjust the accounting to use folio_nr_pages().

This should produce no functional changes yet, as higher order
folios are not yet used or supported in shmem.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 34 ++++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index a947f2678a39..7bea4c5cb83a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -803,15 +803,15 @@ unsigned long shmem_partial_swap_usage(struct address_space *mapping,
 						pgoff_t start, pgoff_t end)
 {
 	XA_STATE(xas, &mapping->i_pages, start);
-	struct page *page;
+	struct folio *folio;
 	unsigned long swapped = 0;
 
 	rcu_read_lock();
-	xas_for_each(&xas, page, end - 1) {
-		if (xas_retry(&xas, page))
+	xas_for_each(&xas, folio, end - 1) {
+		if (xas_retry(&xas, folio))
 			continue;
-		if (xa_is_value(page))
-			swapped++;
+		if (xa_is_value(folio))
+			swapped += (folio_nr_pages(folio));
 
 		if (need_resched()) {
 			xas_pause(&xas);
@@ -938,10 +938,12 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			folio = fbatch.folios[i];
 
 			if (xa_is_value(folio)) {
+				long swaps_freed;
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += !shmem_free_swap(mapping,
-							indices[i], folio);
+				swaps_freed = folio_nr_pages(folio);
+				if (!shmem_free_swap(mapping, indices[i], folio))
+					nr_swaps_freed += swaps_freed;
 				continue;
 			}
 
@@ -1007,14 +1009,16 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			folio = fbatch.folios[i];
 
 			if (xa_is_value(folio)) {
+				long swaps_freed;
 				if (unfalloc)
 					continue;
+				swaps_freed = folio_nr_pages(folio);
 				if (shmem_free_swap(mapping, indices[i], folio)) {
 					/* Swap was replaced by page: retry */
 					index = indices[i];
 					break;
 				}
-				nr_swaps_freed++;
+				nr_swaps_freed += swaps_freed;
 				continue;
 			}
 
@@ -1445,7 +1449,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 			NULL) == 0) {
 		spin_lock_irq(&info->lock);
 		shmem_recalc_inode(inode);
-		info->swapped++;
+		info->swapped += folio_nr_pages(folio);
 		spin_unlock_irq(&info->lock);
 
 		swap_shmem_alloc(swap);
@@ -1720,6 +1724,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	swp_entry_t swapin_error;
 	void *old;
+	long num_swap_pages;
 
 	swapin_error = make_swapin_error_entry();
 	old = xa_cmpxchg_irq(&mapping->i_pages, index,
@@ -1729,6 +1734,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 		return;
 
 	folio_wait_writeback(folio);
+	num_swap_pages = folio_nr_pages(folio);
 	delete_from_swap_cache(folio);
 	spin_lock_irq(&info->lock);
 	/*
@@ -1736,8 +1742,8 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * be 0 when inode is released and thus trigger WARN_ON(inode->i_blocks) in
 	 * shmem_evict_inode.
 	 */
-	info->alloced--;
-	info->swapped--;
+	info->alloced -= num_swap_pages;
+	info->swapped -= num_swap_pages;
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 	swap_free(swap);
@@ -1827,7 +1833,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 		goto failed;
 
 	spin_lock_irq(&info->lock);
-	info->swapped--;
+	info->swapped -= folio_nr_pages(folio);
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
@@ -2542,8 +2548,8 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 		goto out_delete_from_cache;
 
 	spin_lock_irq(&info->lock);
-	info->alloced++;
-	inode->i_blocks += PAGE_SECTORS;
+	info->alloced += folio_nr_pages(folio);
+	inode->i_blocks += PAGE_SECTORS << folio_order(folio);
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-- 
2.39.2



* [RFC v2 4/8] shmem: add helpers to get block size
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (2 preceding siblings ...)
  2023-05-26  7:55 ` [RFC v2 3/8] shmem: account for high order folios Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 5/8] shmem: account for larger blocks sizes for shmem_default_max_blocks() Luis Chamberlain
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

Stash the block size in struct shmem_sb_info as a block_order member
when CONFIG_TMPFS is enabled, but keep the current static value for now,
and use helpers to get the block size. This will make the subsequent
change easier to read.

The static value for block order is PAGE_SHIFT and so the default block
size is PAGE_SIZE.

The struct super_block member s_blocksize_bits holds log2 of the block
size, and so it will match the shmem_sb_info block_order.

This commit introduces no functional changes other than extending the
struct shmem_sb_info with the block_order.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 include/linux/shmem_fs.h |  3 +++
 mm/shmem.c               | 34 +++++++++++++++++++++++++++++++---
 2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9029abd29b1c..2d0a4311fdbf 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -36,6 +36,9 @@ struct shmem_inode_info {
 #define SHMEM_FL_INHERITED		(FS_NODUMP_FL | FS_NOATIME_FL)
 
 struct shmem_sb_info {
+#ifdef CONFIG_TMPFS
+	unsigned char block_order;
+#endif
 	unsigned long max_blocks;   /* How many blocks are allowed */
 	struct percpu_counter used_blocks;  /* How many are allocated */
 	unsigned long max_inodes;   /* How many inodes are allowed */
diff --git a/mm/shmem.c b/mm/shmem.c
index 7bea4c5cb83a..c124997f8d93 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -122,7 +122,22 @@ struct shmem_options {
 #define SHMEM_SEEN_NOSWAP 16
 };
 
+static u64 shmem_default_block_order(void)
+{
+	return PAGE_SHIFT;
+}
+
 #ifdef CONFIG_TMPFS
+static u64 shmem_block_order(struct shmem_sb_info *sbinfo)
+{
+	return sbinfo->block_order;
+}
+
+static u64 shmem_sb_blocksize(struct shmem_sb_info *sbinfo)
+{
+	return 1UL << sbinfo->block_order;
+}
+
 static unsigned long shmem_default_max_blocks(void)
 {
 	return totalram_pages() / 2;
@@ -134,6 +149,17 @@ static unsigned long shmem_default_max_inodes(void)
 
 	return min(nr_pages - totalhigh_pages(), nr_pages / 2);
 }
+#else
+static u64 shmem_block_order(struct shmem_sb_info *sbinfo)
+{
+	return PAGE_SHIFT;
+}
+
+static u64 shmem_sb_blocksize(struct shmem_sb_info *sbinfo)
+{
+	return PAGE_SIZE;
+}
+
 #endif
 
 static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
@@ -3062,7 +3088,7 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 	struct shmem_sb_info *sbinfo = SHMEM_SB(dentry->d_sb);
 
 	buf->f_type = TMPFS_MAGIC;
-	buf->f_bsize = PAGE_SIZE;
+	buf->f_bsize = shmem_sb_blocksize(sbinfo);
 	buf->f_namelen = NAME_MAX;
 	if (sbinfo->max_blocks) {
 		buf->f_blocks = sbinfo->max_blocks;
@@ -3972,6 +3998,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	}
 	sb->s_export_op = &shmem_export_ops;
 	sb->s_flags |= SB_NOSEC | SB_I_VERSION;
+	sbinfo->block_order = shmem_default_block_order();
 #else
 	sb->s_flags |= SB_NOUSER;
 #endif
@@ -3997,8 +4024,9 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	INIT_LIST_HEAD(&sbinfo->shrinklist);
 
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
-	sb->s_blocksize = PAGE_SIZE;
-	sb->s_blocksize_bits = PAGE_SHIFT;
+	sb->s_blocksize = shmem_sb_blocksize(sbinfo);
+	sb->s_blocksize_bits = shmem_block_order(sbinfo);
+	WARN_ON_ONCE(sb->s_blocksize_bits != PAGE_SHIFT);
 	sb->s_magic = TMPFS_MAGIC;
 	sb->s_op = &shmem_ops;
 	sb->s_time_gran = 1;
-- 
2.39.2



* [RFC v2 5/8] shmem: account for larger blocks sizes for shmem_default_max_blocks()
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (3 preceding siblings ...)
  2023-05-26  7:55 ` [RFC v2 4/8] shmem: add helpers to get block size Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 6/8] shmem: consider block size in shmem_default_max_inodes() Luis Chamberlain
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

If we end up supporting a larger block size than PAGE_SIZE the
calculations in shmem_default_max_blocks() need to be modified to take
into account the fact that multiple pages would be required for a
single block.

Today the max number of blocks is computed based on the fact that we
will by default use half of the available memory and each block is of
PAGE_SIZE.

And so we end up with:

totalram_pages() / 2

That's because blocksize == PAGE_SIZE. When blocksize > PAGE_SIZE
we need to consider how many blocks fit into totalram_pages() first,
then just divide by 2. This ends up being:

totalram_pages * PAGE_SIZE / blocksize / 2
totalram_pages * 2^PAGE_SHIFT / 2^bbits / 2
totalram_pages * 2^(PAGE_SHIFT - bbits - 1)

We know bbits > PAGE_SHIFT so we'll end up with a negative
power of 2. 2^(-some_val). We can factor the -1 out by changing
this to a division of power of 2 and flipping the values for
the signs:

-1 * (PAGE_SHIFT - bbits -1) = (-PAGE_SHIFT + bbits + 1)
                             = (bbits - PAGE_SHIFT + 1)

And so we end up with:

totalram_pages / 2^(bbits - PAGE_SHIFT + 1)

The bbits is just the block order.
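
The arithmetic above can be mirrored in user space as a quick check.
This is only a sketch assuming 4 KiB pages; the function name and the
sample totalram value are illustrative, not part of the patch:

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* User-space mirror of the patch's shmem_default_max_blocks() logic:
 * with blocksize == PAGE_SIZE the default is half of RAM in pages;
 * with a larger block order, divide instead by 2^(bbits - PAGE_SHIFT + 1). */
static unsigned long default_max_blocks(unsigned long totalram_pages,
					unsigned char block_order)
{
	if (block_order == PAGE_SHIFT)
		return totalram_pages / 2;
	return totalram_pages >> (block_order - PAGE_SHIFT + 1);
}
```

For example, with 2^20 pages (4 GiB of RAM) and 64 KiB blocks
(order 16), this yields 2^20 >> 5 = 32768 blocks, matching
totalram_pages * PAGE_SIZE / blocksize / 2.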

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index c124997f8d93..179fde04f57f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -138,9 +138,11 @@ static u64 shmem_sb_blocksize(struct shmem_sb_info *sbinfo)
 	return 1UL << sbinfo->block_order;
 }
 
-static unsigned long shmem_default_max_blocks(void)
+static unsigned long shmem_default_max_blocks(unsigned char block_order)
 {
-	return totalram_pages() / 2;
+	if (block_order == shmem_default_block_order())
+		return totalram_pages() / 2;
+	return totalram_pages() >> (block_order - PAGE_SHIFT + 1);
 }
 
 static unsigned long shmem_default_max_inodes(void)
@@ -3905,7 +3907,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(root->d_sb);
 
-	if (sbinfo->max_blocks != shmem_default_max_blocks())
+	if (sbinfo->max_blocks != shmem_default_max_blocks(shmem_default_block_order()))
 		seq_printf(seq, ",size=%luk",
 			sbinfo->max_blocks << (PAGE_SHIFT - 10));
 	if (sbinfo->max_inodes != shmem_default_max_inodes())
@@ -3987,7 +3989,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	 */
 	if (!(sb->s_flags & SB_KERNMOUNT)) {
 		if (!(ctx->seen & SHMEM_SEEN_BLOCKS))
-			ctx->blocks = shmem_default_max_blocks();
+			ctx->blocks = shmem_default_max_blocks(shmem_default_block_order());
 		if (!(ctx->seen & SHMEM_SEEN_INODES))
 			ctx->inodes = shmem_default_max_inodes();
 		if (!(ctx->seen & SHMEM_SEEN_INUMS))
-- 
2.39.2



* [RFC v2 6/8] shmem: consider block size in shmem_default_max_inodes()
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (4 preceding siblings ...)
  2023-05-26  7:55 ` [RFC v2 5/8] shmem: account for larger blocks sizes for shmem_default_max_blocks() Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 7/8] shmem: add high order page support Luis Chamberlain
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

Today the max number of inodes is computed assuming the smallest
possible inode uses just one block of size PAGE_SIZE. The max number of
inodes therefore depends on the block size, and if we want to support
higher block sizes we end up with fewer inodes.

Account for this in the computation of the max number of inodes.

If the blocksize is greater than PAGE_SIZE, we simply take the number
of usable pages, multiply by the page size and divide by the blocksize.

This produces no functional changes right now as we don't support
larger block sizes yet.
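
A user-space sketch of that scaling, assuming 4 KiB pages (names and
sample values are illustrative only):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Mirror of the patch's shmem_default_max_inodes(): scale the
 * page-based inode budget down by the number of pages a single
 * block now spans. */
static unsigned long default_max_inodes(unsigned long pages_for_inodes,
					unsigned char block_order)
{
	if (block_order == PAGE_SHIFT)
		return pages_for_inodes;
	return pages_for_inodes >> (block_order - PAGE_SHIFT);
}
```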

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 179fde04f57f..d347a5ba49f1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -145,11 +145,14 @@ static unsigned long shmem_default_max_blocks(unsigned char block_order)
 	return totalram_pages() >> (block_order - PAGE_SHIFT + 1);
 }
 
-static unsigned long shmem_default_max_inodes(void)
+static unsigned long shmem_default_max_inodes(unsigned char block_order)
 {
 	unsigned long nr_pages = totalram_pages();
+	unsigned long pages_for_inodes = min(nr_pages - totalhigh_pages(), nr_pages / 2);
 
-	return min(nr_pages - totalhigh_pages(), nr_pages / 2);
+	if (block_order == shmem_default_block_order())
+		return pages_for_inodes;
+	return pages_for_inodes >> (block_order - PAGE_SHIFT);
 }
 #else
 static u64 shmem_block_order(struct shmem_sb_info *sbinfo)
@@ -3910,7 +3913,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	if (sbinfo->max_blocks != shmem_default_max_blocks(shmem_default_block_order()))
 		seq_printf(seq, ",size=%luk",
 			sbinfo->max_blocks << (PAGE_SHIFT - 10));
-	if (sbinfo->max_inodes != shmem_default_max_inodes())
+	if (sbinfo->max_inodes != shmem_default_max_inodes(shmem_default_block_order()))
 		seq_printf(seq, ",nr_inodes=%lu", sbinfo->max_inodes);
 	if (sbinfo->mode != (0777 | S_ISVTX))
 		seq_printf(seq, ",mode=%03ho", sbinfo->mode);
@@ -3991,7 +3994,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 		if (!(ctx->seen & SHMEM_SEEN_BLOCKS))
 			ctx->blocks = shmem_default_max_blocks(shmem_default_block_order());
 		if (!(ctx->seen & SHMEM_SEEN_INODES))
-			ctx->inodes = shmem_default_max_inodes();
+			ctx->inodes = shmem_default_max_inodes(shmem_default_block_order());
 		if (!(ctx->seen & SHMEM_SEEN_INUMS))
 			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
 		sbinfo->noswap = ctx->noswap;
-- 
2.39.2



* [RFC v2 7/8] shmem: add high order page support
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (5 preceding siblings ...)
  2023-05-26  7:55 ` [RFC v2 6/8] shmem: consider block size in shmem_default_max_inodes() Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26  7:55 ` [RFC v2 8/8] shmem: add support to customize block size order Luis Chamberlain
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

To support high order block sizes we want to use high order folios so
that the larger block can be treated atomically. Add support for this
to tmpfs mounts.

Right now this produces no functional changes, since we only allow a
single block size matching PAGE_SIZE, and so the order is always 0.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d347a5ba49f1..080864949fe5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1623,9 +1623,15 @@ static struct folio *shmem_alloc_folio(gfp_t gfp,
 {
 	struct vm_area_struct pvma;
 	struct folio *folio;
+	struct inode *inode = &info->vfs_inode;
+	struct super_block *i_sb = inode->i_sb;
+	int order = 0;
+
+	if (!(i_sb->s_flags & SB_KERNMOUNT))
+		order = i_sb->s_blocksize_bits - PAGE_SHIFT;
 
 	shmem_pseudo_vma_init(&pvma, info, index);
-	folio = vma_alloc_folio(gfp, 0, &pvma, 0, false);
+	folio = vma_alloc_folio(gfp, order, &pvma, 0, false);
 	shmem_pseudo_vma_destroy(&pvma);
 
 	return folio;
-- 
2.39.2



* [RFC v2 8/8] shmem: add support to customize block size order
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (6 preceding siblings ...)
  2023-05-26  7:55 ` [RFC v2 7/8] shmem: add high order page support Luis Chamberlain
@ 2023-05-26  7:55 ` Luis Chamberlain
  2023-05-26  8:07 ` [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  7:55 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, mcgrof, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

This allows tmpfs mounts to use a custom block size order. We only
allow block sizes of at least PAGE_SIZE, and these must also be a
power of two multiple of PAGE_SIZE. To simplify these requirements
and the math we take the block order (log2 of the block size) directly.
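
The range check on the new mount option can be sketched in user space as
follows. The bounds mirror the patch's Opt_border validation; the
MAX_ORDER and PAGE_SHIFT values are assumed x86_64 defaults and the
function name is invented:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12	/* assumed x86_64 values for illustration */
#define MAX_ORDER  11

/* A border (block order) is valid if it is at least PAGE_SHIFT and at
 * most MAX_ORDER + PAGE_SHIFT, i.e. 8 MiB blocks with these values. */
static bool border_valid(unsigned int block_order)
{
	return block_order >= PAGE_SHIFT &&
	       block_order <= MAX_ORDER + PAGE_SHIFT;
}
```

So border=13 (8 KiB) and border=16 (64 KiB), as used in the test
transcripts below, both pass, while anything below PAGE_SHIFT or above
order 23 is rejected.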

Only simple tests have been run so far:

mkdir -p /data-tmpfs/
time for i in $(seq 1 1000000); do echo $i >> /root/ordered.txt; done

real    0m21.392s
user    0m8.077s
sys     0m13.098s

du -h /root/ordered.txt
6.6M    /root/ordered.txt

sha1sum /root/ordered.txt
2dcc06b7ca3b7dd8b5626af83c1be3cb08ddc76c  /root/ordered.txt

stat /root/ordered.txt
  File: /root/ordered.txt
  Size: 6888896         Blocks: 13456      IO Block: 4096   regular file
Device: 254,1   Inode: 655717      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-04-21 19:34:20.709869093 +0000
Modify: 2023-04-21 19:34:43.833900042 +0000
Change: 2023-04-21 19:34:43.833900042 +0000
 Birth: 2023-04-21 19:34:20.709869093 +0000

8 KiB block size:

sha1sum /root/ordered.txt
mount -t tmpfs            -o size=10M,border=13 -o noswap tmpfs /data-tmpfs/
cp /root/ordered.txt /data-tmpfs/
sha1sum /data-tmpfs/ordered.txt
stat /data-tmpfs/ordered.txt
2dcc06b7ca3b7dd8b5626af83c1be3cb08ddc76c  /root/ordered.txt
2dcc06b7ca3b7dd8b5626af83c1be3cb08ddc76c  /data-tmpfs/ordered.txt
  File: /data-tmpfs/ordered.txt
  Size: 6888896         Blocks: 13456      IO Block: 8192   regular file
Device: 0,42    Inode: 2           Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-04-21 19:31:16.078390405 +0000
Modify: 2023-04-21 19:31:16.070391363 +0000
Change: 2023-04-21 19:31:16.070391363 +0000
 Birth: 2023-04-21 19:31:16.034395676 +0000

64 KiB block size:

sha1sum /root/ordered.txt
mount -t tmpfs            -o size=10M,border=16 -o noswap tmpfs /data-tmpfs/
cp /root/ordered.txt /data-tmpfs/; sha1sum /data-tmpfs/ordered.txt
stat /data-tmpfs/ordered.txt
2dcc06b7ca3b7dd8b5626af83c1be3cb08ddc76c  /root/ordered.txt
2dcc06b7ca3b7dd8b5626af83c1be3cb08ddc76c  /data-tmpfs/ordered.txt
  File: /data-tmpfs/ordered.txt
  Size: 6888896         Blocks: 13568      IO Block: 65536  regular file
Device: 0,42    Inode: 2           Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-04-21 19:32:14.669796970 +0000
Modify: 2023-04-21 19:32:14.661796959 +0000
Change: 2023-04-21 19:32:14.661796959 +0000
 Birth: 2023-04-21 19:32:14.649796944 +0000

4 MiB works too.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 44 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 080864949fe5..777e953df62e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,11 +115,13 @@ struct shmem_options {
 	int huge;
 	int seen;
 	bool noswap;
+	unsigned char block_order;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
 #define SHMEM_SEEN_INUMS 8
 #define SHMEM_SEEN_NOSWAP 16
+#define SHMEM_SEEN_BLOCKORDER 32
 };
 
 static u64 shmem_default_block_order(void)
@@ -3661,6 +3663,7 @@ enum shmem_param {
 	Opt_inode32,
 	Opt_inode64,
 	Opt_noswap,
+	Opt_border,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3683,6 +3686,7 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_flag  ("inode32",	Opt_inode32),
 	fsparam_flag  ("inode64",	Opt_inode64),
 	fsparam_flag  ("noswap",	Opt_noswap),
+	fsparam_u32   ("border",	Opt_border),
 	{}
 };
 
@@ -3709,7 +3713,15 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		}
 		if (*rest)
 			goto bad_value;
-		ctx->blocks = DIV_ROUND_UP(size, PAGE_SIZE);
+		if (!(ctx->seen & SHMEM_SEEN_BLOCKORDER) ||
+		    ctx->block_order == shmem_default_block_order())
+			ctx->blocks = DIV_ROUND_UP(size, PAGE_SIZE);
+		else {
+			if (size < (1UL << ctx->block_order) ||
+			    size % (1UL << ctx->block_order) != 0)
+				goto bad_value;
+			ctx->blocks = size >> ctx->block_order;
+		}
 		ctx->seen |= SHMEM_SEEN_BLOCKS;
 		break;
 	case Opt_nr_blocks:
@@ -3774,6 +3786,19 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 		ctx->noswap = true;
 		ctx->seen |= SHMEM_SEEN_NOSWAP;
 		break;
+	case Opt_border:
+		ctx->block_order = result.uint_32;
+		ctx->seen |= SHMEM_SEEN_BLOCKORDER;
+		if (ctx->block_order < PAGE_SHIFT)
+			goto bad_value;
+		/*
+		 * We cap this to allow a block to be at least allowed to
+		 * be allocated using the buddy allocator. That's MAX_ORDER
+		 * pages. So 8 MiB on x86_64.
+		 */
+		if (ctx->block_order > (MAX_ORDER + PAGE_SHIFT))
+			goto bad_value;
+		break;
 	}
 	return 0;
 
@@ -3845,6 +3870,12 @@ static int shmem_reconfigure(struct fs_context *fc)
 	raw_spin_lock(&sbinfo->stat_lock);
 	inodes = sbinfo->max_inodes - sbinfo->free_inodes;
 
+	if (ctx->seen & SHMEM_SEEN_BLOCKORDER) {
+		if (ctx->block_order != shmem_block_order(sbinfo)) {
+			err = "Cannot modify block order on remount";
+			goto out;
+		}
+	}
 	if ((ctx->seen & SHMEM_SEEN_BLOCKS) && ctx->blocks) {
 		if (!sbinfo->max_blocks) {
 			err = "Cannot retroactively limit size";
@@ -3960,6 +3991,8 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	shmem_show_mpol(seq, sbinfo->mpol);
 	if (sbinfo->noswap)
 		seq_printf(seq, ",noswap");
+	if (shmem_block_order(sbinfo) != shmem_default_block_order())
+		seq_printf(seq, ",border=%llu", shmem_block_order(sbinfo));
 	return 0;
 }
 
@@ -3997,10 +4030,12 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	 * but the internal instance is left unlimited.
 	 */
 	if (!(sb->s_flags & SB_KERNMOUNT)) {
+		if (!(ctx->seen & SHMEM_SEEN_BLOCKORDER))
+			ctx->block_order = shmem_default_block_order();
 		if (!(ctx->seen & SHMEM_SEEN_BLOCKS))
-			ctx->blocks = shmem_default_max_blocks(shmem_default_block_order());
+			ctx->blocks = shmem_default_max_blocks(ctx->block_order);
 		if (!(ctx->seen & SHMEM_SEEN_INODES))
-			ctx->inodes = shmem_default_max_inodes(shmem_default_block_order());
+			ctx->inodes = shmem_default_max_inodes(ctx->block_order);
 		if (!(ctx->seen & SHMEM_SEEN_INUMS))
 			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
 		sbinfo->noswap = ctx->noswap;
@@ -4009,7 +4044,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	}
 	sb->s_export_op = &shmem_export_ops;
 	sb->s_flags |= SB_NOSEC | SB_I_VERSION;
-	sbinfo->block_order = shmem_default_block_order();
+	sbinfo->block_order = ctx->block_order;
 #else
 	sb->s_flags |= SB_NOUSER;
 #endif
@@ -4037,7 +4072,6 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = shmem_sb_blocksize(sbinfo);
 	sb->s_blocksize_bits = shmem_block_order(sbinfo);
-	WARN_ON_ONCE(sb->s_blocksize_bits != PAGE_SHIFT);
 	sb->s_magic = TMPFS_MAGIC;
 	sb->s_op = &shmem_ops;
 	sb->s_time_gran = 1;
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (7 preceding siblings ...)
  2023-05-26  7:55 ` [RFC v2 8/8] shmem: add support to customize block size order Luis Chamberlain
@ 2023-05-26  8:07 ` Luis Chamberlain
  2023-05-26  8:14 ` Christoph Hellwig
  2023-05-26 13:54 ` Matthew Wilcox
  10 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  8:07 UTC (permalink / raw)
  To: hughd, akpm, willy, brauner, djwong
  Cc: p.raghav, da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> Future work:
> 
>   o shmem_file_read_iter()

And as for this, here is what I'm up to, but for the life of me I can't
figure out why I end up with an extra empty line at the end of my test
with this, the same simple test as described in the patch "shmem: add
support to customize block size order".

I end up with:

root@iomap ~ # ./run.sh 
2dcc06b7ca3b7dd8b5626af83c1be3cb08ddc76c  /root/ordered.txt
a0466a798f2d967c143f0f716c344660dc360f78  /data-tmpfs/ordered.txt
  File: /data-tmpfs/ordered.txt
    Size: 6888896         Blocks: 16384      IO Block: 4194304 regular file
    Device: 0,44    Inode: 2           Links: 1
    Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
    Access: 2023-05-26 01:06:15.566330524 -0700
    Modify: 2023-05-26 01:06:15.554330477 -0700
    Change: 2023-05-26 01:06:15.554330477 -0700
     Birth: 2023-05-26 01:06:15.534330399 -0700

root@iomap ~ # diff -u /root/ordered.txt /data-tmpfs/ordered.txt 
--- /root/ordered.txt   2023-05-25 16:50:53.755019418 -0700
+++ /data-tmpfs/ordered.txt     2023-05-26 01:06:15.554330477 -0700
@@ -999998,3 +999998,4 @@
 999998
 999999
 1000000
+
\ No newline at end of file

root@iomap ~ # cat run.sh 
#!/bin/bash

# time for i in $(seq 1 1000000); do echo $i >> /root/ordered.txt; done

sha1sum /root/ordered.txt
mount -t tmpfs -o size=8M,border=22 -o noswap tmpfs /data-tmpfs/
cp /root/ordered.txt /data-tmpfs/
sha1sum /data-tmpfs/ordered.txt
stat /data-tmpfs/ordered.txt

From 61008f03217b1524da317928885ef68a67abc773 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Wed, 19 Apr 2023 20:42:54 -0700
Subject: [PATCH] shmem: convert shmem_file_read_iter() to folios

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 mm/shmem.c | 74 +++++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 56 insertions(+), 18 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 777e953df62e..2d3512f6dd30 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2431,6 +2431,10 @@ static struct inode *shmem_get_inode(struct mnt_idmap *idmap, struct super_block
 		inode->i_ino = ino;
 		inode_init_owner(idmap, inode, dir, mode);
 		inode->i_blocks = 0;
+		if (sb->s_flags & SB_KERNMOUNT)
+			inode->i_blkbits = PAGE_SHIFT;
+		else
+			inode->i_blkbits = sb->s_blocksize_bits;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
 		inode->i_generation = get_random_u32();
 		info = SHMEM_I(inode);
@@ -2676,19 +2680,42 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file_inode(file);
 	struct address_space *mapping = inode->i_mapping;
+	struct super_block *sb = inode->i_sb;
+	u64 bsize = i_blocksize(inode);
 	pgoff_t index;
 	unsigned long offset;
 	int error = 0;
 	ssize_t retval = 0;
 	loff_t *ppos = &iocb->ki_pos;
 
+	/*
+	 * Although our index is page specific, we can read a blocksize at a
+	 * time as we use a folio per block.
+	 */
 	index = *ppos >> PAGE_SHIFT;
-	offset = *ppos & ~PAGE_MASK;
+
+	/*
+	 * We're going to read one blocksize-sized folio at a time.
+	 *
+	 * The offset is the position within the folio that we are
+	 * currently reading from. It starts at the offset within the
+	 * first folio where we were asked to begin the read, and is
+	 * advanced by the number of bytes read from each folio. Once
+	 * the first folio has been read, offset becomes 0, as each
+	 * subsequent folio is read from its start, a full blocksize
+	 * at a time, until we're done.
+	 */
+	offset = *ppos & (bsize - 1);
 
 	for (;;) {
 		struct folio *folio = NULL;
-		struct page *page = NULL;
 		pgoff_t end_index;
+		/*
+		 * nr is the number of bytes we can read from this folio,
+		 * which depends on the configured blocksize. On the last
+		 * folio it is the amount of data in that folio that is
+		 * valid to read for this inode.
+		 */
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
 
@@ -2696,7 +2723,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		if (index > end_index)
 			break;
 		if (index == end_index) {
-			nr = i_size & ~PAGE_MASK;
+			nr = i_size & (bsize - 1);
 			if (nr <= offset)
 				break;
 		}
@@ -2709,9 +2736,7 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		}
 		if (folio) {
 			folio_unlock(folio);
-
-			page = folio_file_page(folio, index);
-			if (PageHWPoison(page)) {
+			if (is_folio_hwpoison(folio)) {
 				folio_put(folio);
 				error = -EIO;
 				break;
@@ -2722,49 +2747,56 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		 * We must evaluate after, since reads (unlike writes)
 		 * are called without i_rwsem protection against truncate
 		 */
-		nr = PAGE_SIZE;
+		nr = bsize;
+		WARN_ON(!(sb->s_flags & SB_KERNMOUNT) && folio && bsize != folio_size(folio));
 		i_size = i_size_read(inode);
 		end_index = i_size >> PAGE_SHIFT;
 		if (index == end_index) {
-			nr = i_size & ~PAGE_MASK;
+			nr = i_size & (bsize - 1);
 			if (nr <= offset) {
 				if (folio)
 					folio_put(folio);
 				break;
 			}
 		}
+
+		/*
+		 * On the first folio the number of bytes we can read is
+		 * blocksize - offset. On subsequent reads we can read a
+		 * full blocksize at a time, until iov_iter_count(to) == 0.
+		 */
 		nr -= offset;
 
 		if (folio) {
 			/*
-			 * If users can be writing to this page using arbitrary
+			 * If users can be writing to this folio using arbitrary
 			 * virtual addresses, take care about potential aliasing
-			 * before reading the page on the kernel side.
+			 * before reading the folio on the kernel side.
 			 */
 			if (mapping_writably_mapped(mapping))
-				flush_dcache_page(page);
+				flush_dcache_folio(folio);
 			/*
-			 * Mark the page accessed if we read the beginning.
+			 * Mark the folio accessed if we read the beginning.
 			 */
 			if (!offset)
 				folio_mark_accessed(folio);
 			/*
-			 * Ok, we have the page, and it's up-to-date, so
+			 * Ok, we have the folio, and it's up-to-date, so
 			 * now we can copy it to user space...
 			 */
-			ret = copy_page_to_iter(page, offset, nr, to);
+			ret = copy_folio_to_iter(folio, offset, nr, to);
 			folio_put(folio);
 
 		} else if (user_backed_iter(to)) {
 			/*
 			 * Copy to user tends to be so well optimized, but
 			 * clear_user() not so much, that it is noticeably
-			 * faster to copy the zero page instead of clearing.
+			 * faster to copy the zero folio instead of clearing.
 			 */
-			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
+			ret = copy_folio_to_iter(page_folio(ZERO_PAGE(0)), offset, nr, to);
 		} else {
 			/*
-			 * But submitting the same page twice in a row to
+			 * But submitting the same folio twice in a row to
 			 * splice() - or others? - can result in confusion:
 			 * so don't attempt that optimization on pipes etc.
 			 */
@@ -2773,8 +2805,14 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 		retval += ret;
 		offset += ret;
+
+		/*
+		 * Since we use one folio per block, after the first
+		 * (possibly partial) read at offset, this will advance
+		 * a full blocksize at a time.
+		 */
 		index += offset >> PAGE_SHIFT;
-		offset &= ~PAGE_MASK;
+		offset &= (bsize - 1);
 
 		if (!iov_iter_count(to))
 			break;
-- 
2.39.2



* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (8 preceding siblings ...)
  2023-05-26  8:07 ` [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
@ 2023-05-26  8:14 ` Christoph Hellwig
  2023-05-26  8:18   ` Luis Chamberlain
  2023-05-26 13:54 ` Matthew Wilcox
  10 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:14 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, willy, brauner, djwong, p.raghav, da.gomez,
	rohan.puri, rpuri.linux, a.manzanares, dave, yosryahmed,
	keescook, hare, kbusch, patches, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.

The concept of a block size doesn't make any sense for tmpfs.   What
are you actually trying to do here?



* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26  8:14 ` Christoph Hellwig
@ 2023-05-26  8:18   ` Luis Chamberlain
  2023-05-26  8:28     ` Christoph Hellwig
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  8:18 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hughd, akpm, willy, brauner, djwong, p.raghav, da.gomez,
	rohan.puri, rpuri.linux, a.manzanares, dave, yosryahmed,
	keescook, hare, kbusch, patches, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Fri, May 26, 2023 at 01:14:55AM -0700, Christoph Hellwig wrote:
> On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> > This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.
> 
> The concept of a block size doesn't make any sense for tmpfs.   What
> are you actually trying to do here?

More of helping to test high order folios for tmpfs. Swap for instance
would be one thing we could use to test.

  Luis


* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26  8:18   ` Luis Chamberlain
@ 2023-05-26  8:28     ` Christoph Hellwig
  2023-05-26  8:35       ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2023-05-26  8:28 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: Christoph Hellwig, hughd, akpm, willy, brauner, djwong, p.raghav,
	da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, kbusch, patches, linux-block,
	linux-fsdevel, linux-mm, linux-kernel

On Fri, May 26, 2023 at 01:18:19AM -0700, Luis Chamberlain wrote:
> On Fri, May 26, 2023 at 01:14:55AM -0700, Christoph Hellwig wrote:
> > On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> > > This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.
> > 
> > The concept of a block size doesn't make any sense for tmpfs.   What
> > are you actually trying to do here?
> 
> More of helping to test high order folios for tmpfs. Swap for instance
> would be one thing we could use to test.

I'm still not sure where the concept of a block size would come in here.


* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26  8:28     ` Christoph Hellwig
@ 2023-05-26  8:35       ` Luis Chamberlain
  0 siblings, 0 replies; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26  8:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: hughd, akpm, willy, brauner, djwong, p.raghav, da.gomez,
	rohan.puri, rpuri.linux, a.manzanares, dave, yosryahmed,
	keescook, hare, kbusch, patches, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Fri, May 26, 2023 at 01:28:03AM -0700, Christoph Hellwig wrote:
> On Fri, May 26, 2023 at 01:18:19AM -0700, Luis Chamberlain wrote:
> > On Fri, May 26, 2023 at 01:14:55AM -0700, Christoph Hellwig wrote:
> > > On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> > > > This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.
> > > 
> > > The concept of a block size doesn't make any sense for tmpfs.   What
> > > are you actually trying to do here?
> > 
> > More of helping to test high order folios for tmpfs. Swap for instance
> > would be one thing we could use to test.
> 
> I'm still not sure where the concept of a block size would come in here.

From a filesystem perspective that's what we call it today as well, and
tmpfs implements a simple one; indeed this is really just high order
folio support. The language of a blocksize predates my patches, with
sb->s_blocksize and sb->s_blocksize_bits, and even shmem_statfs()'s
buf->f_bsize.

I understand we should move sb->s_blocksize to the block_device
and use a page order on the address_space, but we can't do away
with the existing usage immediately.

  Luis


* Re: [RFC v2 1/8] page_flags: add is_folio_hwpoison()
  2023-05-26  7:55 ` [RFC v2 1/8] page_flags: add is_folio_hwpoison() Luis Chamberlain
@ 2023-05-26 13:51   ` Matthew Wilcox
  2023-05-26 15:40     ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2023-05-26 13:51 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 12:55:45AM -0700, Luis Chamberlain wrote:
> Provide a helper similar to is_page_hwpoison() for folios
> which tests the first head and if the folio is large any page in
> the folio is tested for the poison flag.

But it's not "is poison".  it's "contains poison".  So how about
folio_contains_hwpoison() as a name?

But what do you really want to know here?  In the Glorious Future,
individual pages get their memdesc pointer set to be a hwpoison
pointer.  Are we going to need to retain a bit in every memdesc to
say whether one of the pages in the memdesc has been poisoned?

Or can we get away with just testing individual pages as we look at
them?



* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26  7:55 [RFC v2 0/8] add support for blocksize > PAGE_SIZE Luis Chamberlain
                   ` (9 preceding siblings ...)
  2023-05-26  8:14 ` Christoph Hellwig
@ 2023-05-26 13:54 ` Matthew Wilcox
  2023-05-26 17:33   ` Luis Chamberlain
  10 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2023-05-26 13:54 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.
> Why would you want this? It helps us experiment with higher order folio uses
> with fs APIS and helps us test out corner cases which would likely need
> to be accounted for sooner or later if and when filesystems enable support
> for this. Better review early and burn early than continue on in the wrong
> direction so looking for early feedback.

I think this is entirely the wrong direction to go in.

You're coming at this from a block layer perspective, and we have two
ways of doing large block devices -- qemu nvme and brd.  tmpfs should
be like other filesystems and opportunistically use folios of whatever
size makes sense.

Don't add a mount option to specify what size folios to use.



* Re: [RFC v2 2/8] shmem: convert to use is_folio_hwpoison()
  2023-05-26  7:55 ` [RFC v2 2/8] shmem: convert to use is_folio_hwpoison() Luis Chamberlain
@ 2023-05-26 14:32   ` Matthew Wilcox
  2023-05-26 17:41     ` Luis Chamberlain
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2023-05-26 14:32 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 12:55:46AM -0700, Luis Chamberlain wrote:
> The PageHWPoison() call can be converted over to the respective folio
> call is_folio_hwpoison(). This introduces no functional changes.

Yes, it very much does!

> @@ -4548,7 +4548,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  		return &folio->page;
>  
>  	page = folio_file_page(folio, index);
> -	if (PageHWPoison(page)) {
> +	if (is_folio_hwpoison(folio)) {
>  		folio_put(folio);

Imagine you have an order-9 folio and one of the pages in it gets
HWPoison.  Before, you can read the other 511 pages in the folio.
After your patch, you can't read any of them.  You've effectively
increased the blast radius of any hwerror, and I don't think that's an
acceptable change.


* Re: [RFC v2 1/8] page_flags: add is_folio_hwpoison()
  2023-05-26 13:51   ` Matthew Wilcox
@ 2023-05-26 15:40     ` Keith Busch
  0 siblings, 0 replies; 22+ messages in thread
From: Keith Busch @ 2023-05-26 15:40 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Luis Chamberlain, hughd, akpm, brauner, djwong, p.raghav,
	da.gomez, rohan.puri, rpuri.linux, a.manzanares, dave,
	yosryahmed, keescook, hare, patches, linux-block, linux-fsdevel,
	linux-mm, linux-kernel

On Fri, May 26, 2023 at 02:51:34PM +0100, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 12:55:45AM -0700, Luis Chamberlain wrote:
> > Provide a helper similar to is_page_hwpoison() for folios
> > which tests the first head and if the folio is large any page in
> > the folio is tested for the poison flag.
> 
> But it's not "is poison".  it's "contains poison".  So how about
> folio_contains_hwpoison() as a name?

Would a smaller change in tense to "is poisoned" also work? I think
that's mostly synonymous to "contains poison".


* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26 13:54 ` Matthew Wilcox
@ 2023-05-26 17:33   ` Luis Chamberlain
  2023-05-26 18:43     ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26 17:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 02:54:12PM +0100, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 12:55:44AM -0700, Luis Chamberlain wrote:
> > This is an initial attempt to add support for block size > PAGE_SIZE for tmpfs.
> > Why would you want this? It helps us experiment with higher order folio uses
> > with fs APIS and helps us test out corner cases which would likely need
> > to be accounted for sooner or later if and when filesystems enable support
> > for this. Better review early and burn early than continue on in the wrong
> > direction so looking for early feedback.
> 
> I think this is entirely the wrong direction to go in.

Any recommendations for alternative directions?

> You're coming at this from a block layer perspective, and we have two
> ways of doing large block devices -- qemu nvme and brd.  tmpfs should
> be like other filesystems and opportunistically use folios of whatever
> size makes sense.

I figured the backing block size would be a good reason to use high
order folios for filesystems, and this mimics that through the super
block's block size. Even if usage of the block size were moved to
the block device and tmpfs used a page order instead, what other
alternatives were you thinking of?

  Luis


* Re: [RFC v2 2/8] shmem: convert to use is_folio_hwpoison()
  2023-05-26 14:32   ` Matthew Wilcox
@ 2023-05-26 17:41     ` Luis Chamberlain
  2023-05-26 18:41       ` Matthew Wilcox
  0 siblings, 1 reply; 22+ messages in thread
From: Luis Chamberlain @ 2023-05-26 17:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 03:32:54PM +0100, Matthew Wilcox wrote:
> On Fri, May 26, 2023 at 12:55:46AM -0700, Luis Chamberlain wrote:
> > The PageHWPoison() call can be converted over to the respective folio
> > call is_folio_hwpoison(). This introduces no functional changes.
> 
> Yes, it very much does!
> 
> > @@ -4548,7 +4548,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> >  		return &folio->page;
> >  
> >  	page = folio_file_page(folio, index);
> > -	if (PageHWPoison(page)) {
> > +	if (is_folio_hwpoison(folio)) {
> >  		folio_put(folio);
> 
> Imagine you have an order-9 folio and one of the pages in it gets
> HWPoison.  Before, you can read the other 511 pages in the folio.

But before we didn't use high order folios for reads on tmpfs?

But I get the idea.

> After your patch, you can't read any of them.  You've effectively
> increased the blast radius of any hwerror, and I don't think that's an
> acceptable change.

I see, thanks! Will fix if we move forward with this.

  Luis


* Re: [RFC v2 2/8] shmem: convert to use is_folio_hwpoison()
  2023-05-26 17:41     ` Luis Chamberlain
@ 2023-05-26 18:41       ` Matthew Wilcox
  0 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2023-05-26 18:41 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 10:41:00AM -0700, Luis Chamberlain wrote:
> On Fri, May 26, 2023 at 03:32:54PM +0100, Matthew Wilcox wrote:
> > On Fri, May 26, 2023 at 12:55:46AM -0700, Luis Chamberlain wrote:
> > > The PageHWPoison() call can be converted over to the respective folio
> > > call is_folio_hwpoison(). This introduces no functional changes.
> > 
> > Yes, it very much does!
> > 
> > > @@ -4548,7 +4548,7 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> > >  		return &folio->page;
> > >  
> > >  	page = folio_file_page(folio, index);
> > > -	if (PageHWPoison(page)) {
> > > +	if (is_folio_hwpoison(folio)) {
> > >  		folio_put(folio);
> > 
> > Imagine you have an order-9 folio and one of the pages in it gets
> > HWPoison.  Before, you can read the other 511 pages in the folio.
> 
> But before we didn't use high order folios for reads on tmpfs?

Sure we did!  If SHMEM_HUGE_ALWAYS is set, we can see reads of THPs
(order-9 folios) in this path.



* Re: [RFC v2 0/8] add support for blocksize > PAGE_SIZE
  2023-05-26 17:33   ` Luis Chamberlain
@ 2023-05-26 18:43     ` Matthew Wilcox
  0 siblings, 0 replies; 22+ messages in thread
From: Matthew Wilcox @ 2023-05-26 18:43 UTC (permalink / raw)
  To: Luis Chamberlain
  Cc: hughd, akpm, brauner, djwong, p.raghav, da.gomez, rohan.puri,
	rpuri.linux, a.manzanares, dave, yosryahmed, keescook, hare,
	kbusch, patches, linux-block, linux-fsdevel, linux-mm,
	linux-kernel

On Fri, May 26, 2023 at 10:33:53AM -0700, Luis Chamberlain wrote:
> On Fri, May 26, 2023 at 02:54:12PM +0100, Matthew Wilcox wrote:
> > You're coming at this from a block layer perspective, and we have two
> > ways of doing large block devices -- qemu nvme and brd.  tmpfs should
> > be like other filesystems and opportunistically use folios of whatever
> > size makes sense.
> 
> I figured the backing block size would be a good reason to use high
> order folios for filesystems, and this mimicks that through the super
> block block size. Although usage of the block size would be moved to
> the block device and tmpfs use an page order, what other alternatives
> were you thinking?

Use the readahead code like other filesystems to determine what size of
folios to allocate.  Also use the size of writes to determine what size
of folio to allocate, as in this patchset:

https://lore.kernel.org/linux-fsdevel/20230520163603.1794256-1-willy@infradead.org/
