All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/2] tmpfs: inode: Reduce risk of inum overflow
@ 2020-07-13 17:28 Chris Down
  2020-07-13 17:28 ` [PATCH v7 1/2] tmpfs: Per-superblock i_ino support Chris Down
  2020-07-13 17:28 ` [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
  0 siblings, 2 replies; 12+ messages in thread
From: Chris Down @ 2020-07-13 17:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Al Viro, Matthew Wilcox, Amir Goldstein,
	Jeff Layton, Johannes Weiner, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
On affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even
having as many as 150 duplicated inode numbers with differing file
content.

This causes actual, tangible problems in production. For example, we
have complaints from those working on remote caches that their
application is reporting cache corruptions because it uses (device,
inodenum) to establish the identity of a particular cache object, but
because it's not unique any more, the application refuses to continue
and reports cache corruption. Even worse, sometimes applications may not
even detect the corruption but may continue anyway, causing phantom and
hard to debug behaviour.

In general, userspace applications expect that (device, inodenum) should
be enough to be uniquely point to one inode, which seems fair enough.
One might also need to check the generation, but in this case:

1. That's not currently exposed to userspace
   (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
   same inode number on one device.

In order to mitigate this, we take a two-pronged approach:

1. Moving inum generation from being global to per-sb for tmpfs. This
   itself allows some reduction in i_ino churn. This works on both 64-
   and 32- bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
   64-bit ino_t only: we allow users to mount tmpfs with a new inode64
   option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.

You can see how this compares to previous related patches which didn't
implement this per-superblock:

- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/

Changes since v6:

- Fix misalignment in percpu batching, thanks Matthew.

Chris Down (2):
  tmpfs: Per-superblock i_ino support
  tmpfs: Support 64-bit inums per-sb

 Documentation/filesystems/tmpfs.rst |  11 +++
 fs/Kconfig                          |  15 ++++
 include/linux/fs.h                  |  15 ++++
 include/linux/shmem_fs.h            |   3 +
 mm/shmem.c                          | 127 ++++++++++++++++++++++++++--
 5 files changed, 166 insertions(+), 5 deletions(-)

-- 
2.27.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v7 1/2] tmpfs: Per-superblock i_ino support
  2020-07-13 17:28 [PATCH v7 0/2] tmpfs: inode: Reduce risk of inum overflow Chris Down
@ 2020-07-13 17:28 ` Chris Down
  2020-08-01 19:22     ` Hugh Dickins
  2020-07-13 17:28 ` [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Down @ 2020-07-13 17:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Al Viro, Matthew Wilcox, Amir Goldstein,
	Jeff Layton, Johannes Weiner, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

get_next_ino has a number of problems:

- It uses and returns a uint, which is susceptible to become overflowed
  if a lot of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
  means it's not that difficult to cause inode number wraparounds on a
  single device, which can result in having multiple distinct inodes
  with the same inode number.

This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size
per-device, for example, allowing users to choose whether to use 32- or
64-bit inodes for each tmpfs mount. This is implemented in the next
commit.

For internal shmem mounts which may be less tolerant to spinlock delays,
we implement a percpu batching scheme which only takes the stat_lock at
each batch boundary.

Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 include/linux/fs.h       | 15 +++++++++
 include/linux/shmem_fs.h |  2 ++
 mm/shmem.c               | 66 +++++++++++++++++++++++++++++++++++++---
 3 files changed, 78 insertions(+), 5 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index f15848899945..b70b334f8e16 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2961,6 +2961,21 @@ extern void discard_new_inode(struct inode *);
 extern unsigned int get_next_ino(void);
 extern void evict_inodes(struct super_block *sb);
 
+/*
+ * Userspace may rely on the the inode number being non-zero. For example, glibc
+ * simply ignores files with zero i_ino in unlink() and other places.
+ *
+ * As an additional complication, if userspace was compiled with
+ * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the
+ * lower 32 bits, so we need to check that those aren't zero explicitly. With
+ * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but
+ * better safe than sorry.
+ */
+static inline bool is_zero_ino(ino_t ino)
+{
+	return (u32)ino == 0;
+}
+
 extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
 extern void clear_inode(struct inode *);
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 7a35a6901221..eb628696ec66 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -36,6 +36,8 @@ struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	ino_t next_ino;		    /* The next per-sb inode number to use */
+	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
diff --git a/mm/shmem.c b/mm/shmem.c
index a0dbe62f8042..0ae250b4da28 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -260,18 +260,67 @@ bool vma_is_shmem(struct vm_area_struct *vma)
 static LIST_HEAD(shmem_swaplist);
 static DEFINE_MUTEX(shmem_swaplist_mutex);
 
-static int shmem_reserve_inode(struct super_block *sb)
+/*
+ * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and
+ * produces a novel ino for the newly allocated inode.
+ *
+ * It may also be called when making a hard link to permit the space needed by
+ * each dentry. However, in that case, no new inode number is needed since that
+ * internally draws from another pool of inode numbers (currently global
+ * get_next_ino()). This case is indicated by passing NULL as inop.
+ */
+#define SHMEM_INO_BATCH 1024
+static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
-	if (sbinfo->max_inodes) {
+	ino_t ino;
+
+	if (!(sb->s_flags & SB_KERNMOUNT)) {
 		spin_lock(&sbinfo->stat_lock);
 		if (!sbinfo->free_inodes) {
 			spin_unlock(&sbinfo->stat_lock);
 			return -ENOSPC;
 		}
 		sbinfo->free_inodes--;
+		if (inop) {
+			ino = sbinfo->next_ino++;
+			if (unlikely(is_zero_ino(ino)))
+				ino = sbinfo->next_ino++;
+			if (unlikely(ino > UINT_MAX)) {
+				/*
+				 * Emulate get_next_ino uint wraparound for
+				 * compatibility
+				 */
+				ino = 1;
+			}
+			*inop = ino;
+		}
 		spin_unlock(&sbinfo->stat_lock);
+	} else if (inop) {
+		/*
+		 * __shmem_file_setup, one of our callers, is lock-free: it
+		 * doesn't hold stat_lock in shmem_reserve_inode since
+		 * max_inodes is always 0, and is called from potentially
+		 * unknown contexts. As such, use a per-cpu batched allocator
+		 * which doesn't require the per-sb stat_lock unless we are at
+		 * the batch boundary.
+		 */
+		ino_t *next_ino;
+		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
+		ino = *next_ino;
+		if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
+			spin_lock(&sbinfo->stat_lock);
+			ino = sbinfo->next_ino;
+			sbinfo->next_ino += SHMEM_INO_BATCH;
+			spin_unlock(&sbinfo->stat_lock);
+			if (unlikely(is_zero_ino(ino)))
+				ino++;
+		}
+		*inop = ino;
+		*next_ino = ++ino;
+		put_cpu();
 	}
+
 	return 0;
 }
 
@@ -2222,13 +2271,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	struct inode *inode;
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
+	ino_t ino;
 
-	if (shmem_reserve_inode(sb))
+	if (shmem_reserve_inode(sb, &ino))
 		return NULL;
 
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		inode->i_ino = ino;
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -2932,7 +2982,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
 	 * first link must skip that, to get the accounting right.
 	 */
 	if (inode->i_nlink) {
-		ret = shmem_reserve_inode(inode->i_sb);
+		ret = shmem_reserve_inode(inode->i_sb, NULL);
 		if (ret)
 			goto out;
 	}
@@ -3584,6 +3634,7 @@ static void shmem_put_super(struct super_block *sb)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 
+	free_percpu(sbinfo->ino_batch);
 	percpu_counter_destroy(&sbinfo->used_blocks);
 	mpol_put(sbinfo->mpol);
 	kfree(sbinfo);
@@ -3626,6 +3677,11 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 #endif
 	sbinfo->max_blocks = ctx->blocks;
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
+	if (sb->s_flags & SB_KERNMOUNT) {
+		sbinfo->ino_batch = alloc_percpu(ino_t);
+		if (!sbinfo->ino_batch)
+			goto failed;
+	}
 	sbinfo->uid = ctx->uid;
 	sbinfo->gid = ctx->gid;
 	sbinfo->mode = ctx->mode;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb
  2020-07-13 17:28 [PATCH v7 0/2] tmpfs: inode: Reduce risk of inum overflow Chris Down
  2020-07-13 17:28 ` [PATCH v7 1/2] tmpfs: Per-superblock i_ino support Chris Down
@ 2020-07-13 17:28 ` Chris Down
  2020-08-01 20:41     ` Hugh Dickins
  1 sibling, 1 reply; 12+ messages in thread
From: Chris Down @ 2020-07-13 17:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, Al Viro, Matthew Wilcox, Amir Goldstein,
	Jeff Layton, Johannes Weiner, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:

1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y

The inode64 and inode32 names are used based on existing precedent from
XFS.

Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 Documentation/filesystems/tmpfs.rst | 11 +++++
 fs/Kconfig                          | 15 +++++++
 include/linux/shmem_fs.h            |  1 +
 mm/shmem.c                          | 65 ++++++++++++++++++++++++++++-
 4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 4e95929301a5..47b84ddaa8bb 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -150,6 +150,15 @@ These options do not have any effect on remount. You can change these
 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
 
 
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+inode64   Use 64-bit inode numbers
+inode32   Use 32-bit inode numbers
+
+On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
+not legal and will produce an error at mount time.
+
 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
 will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
@@ -161,3 +170,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
    Hugh Dickins, 4 June 2007
 :Updated:
    KOSAKI Motohiro, 16 Mar 2010
+Updated:
+   Chris Down, 13 July 2020
diff --git a/fs/Kconfig b/fs/Kconfig
index ff257b81fde5..64d530ba42f6 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -229,6 +229,21 @@ config TMPFS_XATTR
 
 	  If unsure, say N.
 
+config TMPFS_INODE64
+	bool "Use 64-bit ino_t by default in tmpfs"
+	depends on TMPFS && 64BIT
+	default n
+	help
+	  tmpfs has historically used only inode numbers as wide as an unsigned
+	  int. In some cases this can cause wraparound, potentially resulting in
+	  multiple files with the same inode number on a single device. This option
+	  makes tmpfs use the full width of ino_t by default, similarly to the
+	  inode64 mount option.
+
+	  To override this default, use the inode32 or inode64 mount options.
+
+	  If unsure, say N.
+
 config HUGETLBFS
 	bool "HugeTLB file system support"
 	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index eb628696ec66..a5a5d1d4d7b1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -36,6 +36,7 @@ struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	bool full_inums;	    /* If i_ino should be uint or ino_t */
 	ino_t next_ino;		    /* The next per-sb inode number to use */
 	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
diff --git a/mm/shmem.c b/mm/shmem.c
index 0ae250b4da28..f3126ad7ba3d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -114,11 +114,13 @@ struct shmem_options {
 	kuid_t uid;
 	kgid_t gid;
 	umode_t mode;
+	bool full_inums;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
+#define SHMEM_SEEN_INUMS 8
 };
 
 #ifdef CONFIG_TMPFS
@@ -286,12 +288,17 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
 			ino = sbinfo->next_ino++;
 			if (unlikely(is_zero_ino(ino)))
 				ino = sbinfo->next_ino++;
-			if (unlikely(ino > UINT_MAX)) {
+			if (unlikely(!sbinfo->full_inums &&
+				     ino > UINT_MAX)) {
 				/*
 				 * Emulate get_next_ino uint wraparound for
 				 * compatibility
 				 */
-				ino = 1;
+				if (IS_ENABLED(CONFIG_64BIT))
+					pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
+						__func__, MINOR(sb->s_dev));
+				sbinfo->next_ino = 1;
+				ino = sbinfo->next_ino++;
 			}
 			*inop = ino;
 		}
@@ -304,6 +311,10 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
 		 * unknown contexts. As such, use a per-cpu batched allocator
 		 * which doesn't require the per-sb stat_lock unless we are at
 		 * the batch boundary.
+		 *
+		 * We don't need to worry about inode{32,64} since SB_KERNMOUNT
+		 * shmem mounts are not exposed to userspace, so we don't need
+		 * to worry about things like glibc compatibility.
 		 */
 		ino_t *next_ino;
 		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
@@ -3397,6 +3408,8 @@ enum shmem_param {
 	Opt_nr_inodes,
 	Opt_size,
 	Opt_uid,
+	Opt_inode32,
+	Opt_inode64,
 };
 
 static const struct constant_table shmem_param_enums_huge[] = {
@@ -3416,6 +3429,8 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
 	fsparam_string("nr_inodes",	Opt_nr_inodes),
 	fsparam_string("size",		Opt_size),
 	fsparam_u32   ("uid",		Opt_uid),
+	fsparam_flag  ("inode32",	Opt_inode32),
+	fsparam_flag  ("inode64",	Opt_inode64),
 	{}
 };
 
@@ -3487,6 +3502,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 			break;
 		}
 		goto unsupported_parameter;
+	case Opt_inode32:
+		ctx->full_inums = false;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
+	case Opt_inode64:
+		if (sizeof(ino_t) < 8) {
+			return invalfc(fc,
+				       "Cannot use inode64 with <64bit inums in kernel\n");
+		}
+		ctx->full_inums = true;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
 	}
 	return 0;
 
@@ -3578,8 +3605,16 @@ static int shmem_reconfigure(struct fs_context *fc)
 		}
 	}
 
+	if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
+	    sbinfo->next_ino > UINT_MAX) {
+		err = "Current inum too high to switch to 32-bit inums";
+		goto out;
+	}
+
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
+	if (ctx->seen & SHMEM_SEEN_INUMS)
+		sbinfo->full_inums = ctx->full_inums;
 	if (ctx->seen & SHMEM_SEEN_BLOCKS)
 		sbinfo->max_blocks  = ctx->blocks;
 	if (ctx->seen & SHMEM_SEEN_INODES) {
@@ -3619,6 +3654,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u",
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
+
+	/*
+	 * Showing inode{64,32} might be useful even if it's the system default,
+	 * since then people don't have to resort to checking both here and
+	 * /proc/config.gz to confirm 64-bit inums were successfully applied
+	 * (which may not even exist if IKCONFIG_PROC isn't enabled).
+	 *
+	 * We hide it when inode64 isn't the default and we are using 32-bit
+	 * inodes, since that probably just means the feature isn't even under
+	 * consideration.
+	 *
+	 * As such:
+	 *
+	 *                     +-----------------+-----------------+
+	 *                     | TMPFS_INODE64=y | TMPFS_INODE64=n |
+	 *  +------------------+-----------------+-----------------+
+	 *  | full_inums=true  | show            | show            |
+	 *  | full_inums=false | show            | hide            |
+	 *  +------------------+-----------------+-----------------+
+	 *
+	 */
+	if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
+		seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
 	if (sbinfo->huge)
@@ -3667,6 +3725,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 			ctx->blocks = shmem_default_max_blocks();
 		if (!(ctx->seen & SHMEM_SEEN_INODES))
 			ctx->inodes = shmem_default_max_inodes();
+		if (!(ctx->seen & SHMEM_SEEN_INUMS))
+			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
 	} else {
 		sb->s_flags |= SB_NOUSER;
 	}
@@ -3684,6 +3744,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	}
 	sbinfo->uid = ctx->uid;
 	sbinfo->gid = ctx->gid;
+	sbinfo->full_inums = ctx->full_inums;
 	sbinfo->mode = ctx->mode;
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] tmpfs: Per-superblock i_ino support
  2020-07-13 17:28 ` [PATCH v7 1/2] tmpfs: Per-superblock i_ino support Chris Down
@ 2020-08-01 19:22     ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-01 19:22 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Hugh Dickins, Al Viro, Matthew Wilcox,
	Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Mon, 13 Jul 2020, Chris Down wrote:

> get_next_ino has a number of problems:
> 
> - It uses and returns a uint, which is susceptible to become overflowed
>   if a lot of volatile inodes that use get_next_ino are created.
> - It's global, with no specificity per-sb or even per-filesystem. This
>   means it's not that difficult to cause inode number wraparounds on a
>   single device, which can result in having multiple distinct inodes
>   with the same inode number.
> 
> This patch adds a per-superblock counter that mitigates the second case.
> This design also allows us to later have a specific i_ino size
> per-device, for example, allowing users to choose whether to use 32- or
> 64-bit inodes for each tmpfs mount. This is implemented in the next
> commit.
> 
> For internal shmem mounts which may be less tolerant to spinlock delays,
> we implement a percpu batching scheme which only takes the stat_lock at
> each batch boundary.
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

Thanks for coming back and completing this, Chris.
Some comments below, nothing to detract from that Ack, more notes to
myself: things I might change slightly when I get back here later on.
I'm glad to see Andrew pulled your 0/2 text into this 1/2.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  include/linux/fs.h       | 15 +++++++++
>  include/linux/shmem_fs.h |  2 ++
>  mm/shmem.c               | 66 +++++++++++++++++++++++++++++++++++++---
>  3 files changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f15848899945..b70b334f8e16 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2961,6 +2961,21 @@ extern void discard_new_inode(struct inode *);
>  extern unsigned int get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>  
> +/*
> + * Userspace may rely on the the inode number being non-zero. For example, glibc
> + * simply ignores files with zero i_ino in unlink() and other places.
> + *
> + * As an additional complication, if userspace was compiled with
> + * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the
> + * lower 32 bits, so we need to check that those aren't zero explicitly. With
> + * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but
> + * better safe than sorry.
> + */
> +static inline bool is_zero_ino(ino_t ino)
> +{
> +	return (u32)ino == 0;
> +}

Hmm, okay. The value of this unnecessary, and inaccurately named,
wrapper is in the great comment above it: which then suffers a bit
from being hidden away in a header file.  I'd understand its placing
better if you had also changed get_next_ino() to use it.

And I have another reason for wondering whether this function is
the right thing to abstract out: 1 is also somewhat special, being
the ino of the root.  It seems a little unfortunate, when we recycle
through the 32-bit space, to reuse the one ino that is certain to be
still in use (maybe all the others get "rm -rf"ed every day).  But
I haven't yet decided whether that's worth bothering about at all.

> +
>  extern void __iget(struct inode * inode);
>  extern void iget_failed(struct inode *);
>  extern void clear_inode(struct inode *);
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 7a35a6901221..eb628696ec66 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -36,6 +36,8 @@ struct shmem_sb_info {
>  	unsigned char huge;	    /* Whether to try for hugepages */
>  	kuid_t uid;		    /* Mount uid for root directory */
>  	kgid_t gid;		    /* Mount gid for root directory */
> +	ino_t next_ino;		    /* The next per-sb inode number to use */
> +	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
>  	struct mempolicy *mpol;     /* default memory policy for mappings */
>  	spinlock_t shrinklist_lock;   /* Protects shrinklist */
>  	struct list_head shrinklist;  /* List of shinkable inodes */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a0dbe62f8042..0ae250b4da28 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -260,18 +260,67 @@ bool vma_is_shmem(struct vm_area_struct *vma)
>  static LIST_HEAD(shmem_swaplist);
>  static DEFINE_MUTEX(shmem_swaplist_mutex);
>  
> -static int shmem_reserve_inode(struct super_block *sb)
> +/*
> + * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and
> + * produces a novel ino for the newly allocated inode.
> + *
> + * It may also be called when making a hard link to permit the space needed by
> + * each dentry. However, in that case, no new inode number is needed since that
> + * internally draws from another pool of inode numbers (currently global
> + * get_next_ino()). This case is indicated by passing NULL as inop.
> + */
> +#define SHMEM_INO_BATCH 1024
> +static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
>  {
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
> -	if (sbinfo->max_inodes) {
> +	ino_t ino;
> +
> +	if (!(sb->s_flags & SB_KERNMOUNT)) {

I was disappointed to find that this is still decided by SB_KERNMOUNT
instead of max_inodes, as we had discussed previously.  But it may just
be me who sees that as a regression (taking stat_lock when minimizing
stat_lock was the mounter's choice), so I'll accept responsibility to
follow that up later (and in no great hurry).  And I may discover why
you didn't make the change - remount with different options might
well turn out to be a pain, and may entail some compromises.

>  		spin_lock(&sbinfo->stat_lock);
>  		if (!sbinfo->free_inodes) {
>  			spin_unlock(&sbinfo->stat_lock);
>  			return -ENOSPC;
>  		}
>  		sbinfo->free_inodes--;
> +		if (inop) {
> +			ino = sbinfo->next_ino++;
> +			if (unlikely(is_zero_ino(ino)))
> +				ino = sbinfo->next_ino++;
> +			if (unlikely(ino > UINT_MAX)) {
> +				/*
> +				 * Emulate get_next_ino uint wraparound for
> +				 * compatibility
> +				 */
> +				ino = 1;
> +			}
> +			*inop = ino;
> +		}
>  		spin_unlock(&sbinfo->stat_lock);
> +	} else if (inop) {
> +		/*
> +		 * __shmem_file_setup, one of our callers, is lock-free: it
> +		 * doesn't hold stat_lock in shmem_reserve_inode since
> +		 * max_inodes is always 0, and is called from potentially
> +		 * unknown contexts. As such, use a per-cpu batched allocator
> +		 * which doesn't require the per-sb stat_lock unless we are at
> +		 * the batch boundary.
> +		 */
> +		ino_t *next_ino;
> +		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
> +		ino = *next_ino;
> +		if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
> +			spin_lock(&sbinfo->stat_lock);
> +			ino = sbinfo->next_ino;
> +			sbinfo->next_ino += SHMEM_INO_BATCH;
> +			spin_unlock(&sbinfo->stat_lock);
> +			if (unlikely(is_zero_ino(ino)))
> +				ino++;
> +		}
> +		*inop = ino;
> +		*next_ino = ++ino;
> +		put_cpu();
>  	}
> +
>  	return 0;
>  }
>  
> @@ -2222,13 +2271,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>  	struct inode *inode;
>  	struct shmem_inode_info *info;
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
> +	ino_t ino;
>  
> -	if (shmem_reserve_inode(sb))
> +	if (shmem_reserve_inode(sb, &ino))
>  		return NULL;
>  
>  	inode = new_inode(sb);
>  	if (inode) {
> -		inode->i_ino = get_next_ino();
> +		inode->i_ino = ino;
>  		inode_init_owner(inode, dir, mode);
>  		inode->i_blocks = 0;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> @@ -2932,7 +2982,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
>  	 * first link must skip that, to get the accounting right.
>  	 */
>  	if (inode->i_nlink) {
> -		ret = shmem_reserve_inode(inode->i_sb);
> +		ret = shmem_reserve_inode(inode->i_sb, NULL);
>  		if (ret)
>  			goto out;
>  	}
> @@ -3584,6 +3634,7 @@ static void shmem_put_super(struct super_block *sb)
>  {
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
>  
> +	free_percpu(sbinfo->ino_batch);
>  	percpu_counter_destroy(&sbinfo->used_blocks);
>  	mpol_put(sbinfo->mpol);
>  	kfree(sbinfo);
> @@ -3626,6 +3677,11 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  #endif
>  	sbinfo->max_blocks = ctx->blocks;
>  	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
> +	if (sb->s_flags & SB_KERNMOUNT) {
> +		sbinfo->ino_batch = alloc_percpu(ino_t);
> +		if (!sbinfo->ino_batch)
> +			goto failed;
> +	}
>  	sbinfo->uid = ctx->uid;
>  	sbinfo->gid = ctx->gid;
>  	sbinfo->mode = ctx->mode;
> -- 
> 2.27.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 1/2] tmpfs: Per-superblock i_ino support
@ 2020-08-01 19:22     ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-01 19:22 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Hugh Dickins, Al Viro, Matthew Wilcox,
	Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Mon, 13 Jul 2020, Chris Down wrote:

> get_next_ino has a number of problems:
> 
> - It uses and returns a uint, which is susceptible to become overflowed
>   if a lot of volatile inodes that use get_next_ino are created.
> - It's global, with no specificity per-sb or even per-filesystem. This
>   means it's not that difficult to cause inode number wraparounds on a
>   single device, which can result in having multiple distinct inodes
>   with the same inode number.
> 
> This patch adds a per-superblock counter that mitigates the second case.
> This design also allows us to later have a specific i_ino size
> per-device, for example, allowing users to choose whether to use 32- or
> 64-bit inodes for each tmpfs mount. This is implemented in the next
> commit.
> 
> For internal shmem mounts which may be less tolerant to spinlock delays,
> we implement a percpu batching scheme which only takes the stat_lock at
> each batch boundary.
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

Thanks for coming back and completing this, Chris.
Some comments below, nothing to detract from that Ack, more notes to
myself: things I might change slightly when I get back here later on.
I'm glad to see Andrew pulled your 0/2 text into this 1/2.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  include/linux/fs.h       | 15 +++++++++
>  include/linux/shmem_fs.h |  2 ++
>  mm/shmem.c               | 66 +++++++++++++++++++++++++++++++++++++---
>  3 files changed, 78 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f15848899945..b70b334f8e16 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2961,6 +2961,21 @@ extern void discard_new_inode(struct inode *);
>  extern unsigned int get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>  
> +/*
> + * Userspace may rely on the the inode number being non-zero. For example, glibc
> + * simply ignores files with zero i_ino in unlink() and other places.
> + *
> + * As an additional complication, if userspace was compiled with
> + * _FILE_OFFSET_BITS=32 on a 64-bit kernel we'll only end up reading out the
> + * lower 32 bits, so we need to check that those aren't zero explicitly. With
> + * _FILE_OFFSET_BITS=64, this may cause some harmless false-negatives, but
> + * better safe than sorry.
> + */
> +static inline bool is_zero_ino(ino_t ino)
> +{
> +	return (u32)ino == 0;
> +}

Hmm, okay. The value of this unnecessary, and inaccurately named,
wrapper is in the great comment above it: which then suffers a bit
from being hidden away in a header file.  I'd understand its placing
better if you had also changed get_next_ino() to use it.

And I have another reason for wondering whether this function is
the right thing to abstract out: 1 is also somewhat special, being
the ino of the root.  It seems a little unfortunate, when we recycle
through the 32-bit space, to reuse the one ino that is certain to be
still in use (maybe all the others get "rm -rf"ed every day).  But
I haven't yet decided whether that's worth bothering about at all.

> +
>  extern void __iget(struct inode * inode);
>  extern void iget_failed(struct inode *);
>  extern void clear_inode(struct inode *);
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 7a35a6901221..eb628696ec66 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -36,6 +36,8 @@ struct shmem_sb_info {
>  	unsigned char huge;	    /* Whether to try for hugepages */
>  	kuid_t uid;		    /* Mount uid for root directory */
>  	kgid_t gid;		    /* Mount gid for root directory */
> +	ino_t next_ino;		    /* The next per-sb inode number to use */
> +	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
>  	struct mempolicy *mpol;     /* default memory policy for mappings */
>  	spinlock_t shrinklist_lock;   /* Protects shrinklist */
>  	struct list_head shrinklist;  /* List of shinkable inodes */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a0dbe62f8042..0ae250b4da28 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -260,18 +260,67 @@ bool vma_is_shmem(struct vm_area_struct *vma)
>  static LIST_HEAD(shmem_swaplist);
>  static DEFINE_MUTEX(shmem_swaplist_mutex);
>  
> -static int shmem_reserve_inode(struct super_block *sb)
> +/*
> + * shmem_reserve_inode() performs bookkeeping to reserve a shmem inode, and
> + * produces a novel ino for the newly allocated inode.
> + *
> + * It may also be called when making a hard link to permit the space needed by
> + * each dentry. However, in that case, no new inode number is needed since that
> + * internally draws from another pool of inode numbers (currently global
> + * get_next_ino()). This case is indicated by passing NULL as inop.
> + */
> +#define SHMEM_INO_BATCH 1024
> +static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
>  {
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
> -	if (sbinfo->max_inodes) {
> +	ino_t ino;
> +
> +	if (!(sb->s_flags & SB_KERNMOUNT)) {

I was disappointed to find that this is still decided by SB_KERNMOUNT
instead of max_inodes, as we had discussed previously.  But it may just
be me who sees that as a regression (taking stat_lock when minimizing
stat_lock was the mounter's choice), so I'll accept responsibility to
follow that up later (and in no great hurry).  And I may discover why
you didn't make the change - remount with different options might
well turn out to be a pain, and may entail some compromises.

>  		spin_lock(&sbinfo->stat_lock);
>  		if (!sbinfo->free_inodes) {
>  			spin_unlock(&sbinfo->stat_lock);
>  			return -ENOSPC;
>  		}
>  		sbinfo->free_inodes--;
> +		if (inop) {
> +			ino = sbinfo->next_ino++;
> +			if (unlikely(is_zero_ino(ino)))
> +				ino = sbinfo->next_ino++;
> +			if (unlikely(ino > UINT_MAX)) {
> +				/*
> +				 * Emulate get_next_ino uint wraparound for
> +				 * compatibility
> +				 */
> +				ino = 1;
> +			}
> +			*inop = ino;
> +		}
>  		spin_unlock(&sbinfo->stat_lock);
> +	} else if (inop) {
> +		/*
> +		 * __shmem_file_setup, one of our callers, is lock-free: it
> +		 * doesn't hold stat_lock in shmem_reserve_inode since
> +		 * max_inodes is always 0, and is called from potentially
> +		 * unknown contexts. As such, use a per-cpu batched allocator
> +		 * which doesn't require the per-sb stat_lock unless we are at
> +		 * the batch boundary.
> +		 */
> +		ino_t *next_ino;
> +		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
> +		ino = *next_ino;
> +		if (unlikely(ino % SHMEM_INO_BATCH == 0)) {
> +			spin_lock(&sbinfo->stat_lock);
> +			ino = sbinfo->next_ino;
> +			sbinfo->next_ino += SHMEM_INO_BATCH;
> +			spin_unlock(&sbinfo->stat_lock);
> +			if (unlikely(is_zero_ino(ino)))
> +				ino++;
> +		}
> +		*inop = ino;
> +		*next_ino = ++ino;
> +		put_cpu();
>  	}
> +
>  	return 0;
>  }
>  
> @@ -2222,13 +2271,14 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>  	struct inode *inode;
>  	struct shmem_inode_info *info;
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
> +	ino_t ino;
>  
> -	if (shmem_reserve_inode(sb))
> +	if (shmem_reserve_inode(sb, &ino))
>  		return NULL;
>  
>  	inode = new_inode(sb);
>  	if (inode) {
> -		inode->i_ino = get_next_ino();
> +		inode->i_ino = ino;
>  		inode_init_owner(inode, dir, mode);
>  		inode->i_blocks = 0;
>  		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> @@ -2932,7 +2982,7 @@ static int shmem_link(struct dentry *old_dentry, struct inode *dir, struct dentr
>  	 * first link must skip that, to get the accounting right.
>  	 */
>  	if (inode->i_nlink) {
> -		ret = shmem_reserve_inode(inode->i_sb);
> +		ret = shmem_reserve_inode(inode->i_sb, NULL);
>  		if (ret)
>  			goto out;
>  	}
> @@ -3584,6 +3634,7 @@ static void shmem_put_super(struct super_block *sb)
>  {
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
>  
> +	free_percpu(sbinfo->ino_batch);
>  	percpu_counter_destroy(&sbinfo->used_blocks);
>  	mpol_put(sbinfo->mpol);
>  	kfree(sbinfo);
> @@ -3626,6 +3677,11 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  #endif
>  	sbinfo->max_blocks = ctx->blocks;
>  	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
> +	if (sb->s_flags & SB_KERNMOUNT) {
> +		sbinfo->ino_batch = alloc_percpu(ino_t);
> +		if (!sbinfo->ino_batch)
> +			goto failed;
> +	}
>  	sbinfo->uid = ctx->uid;
>  	sbinfo->gid = ctx->gid;
>  	sbinfo->mode = ctx->mode;
> -- 
> 2.27.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb
  2020-07-13 17:28 ` [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
@ 2020-08-01 20:41     ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-01 20:41 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Hugh Dickins, Al Viro, Matthew Wilcox,
	Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Mon, 13 Jul 2020, Chris Down wrote:

> The default is still set to inode32 for backwards compatibility, but
> system administrators can opt in to the new 64-bit inode numbers by
> either:
> 
> 1. Passing inode64 on the command line when mounting, or
> 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
> 
> The inode64 and inode32 names are used based on existing precedent from
> XFS.
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

Again, thanks, and comments below, but nothing to override tha Ack.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  Documentation/filesystems/tmpfs.rst | 11 +++++
>  fs/Kconfig                          | 15 +++++++
>  include/linux/shmem_fs.h            |  1 +
>  mm/shmem.c                          | 65 ++++++++++++++++++++++++++++-
>  4 files changed, 90 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 4e95929301a5..47b84ddaa8bb 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -150,6 +150,15 @@ These options do not have any effect on remount. You can change these
>  parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
>  
>  
> +tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
> +numbers:
> +
> +inode64   Use 64-bit inode numbers
> +inode32   Use 32-bit inode numbers
> +
> +On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
> +not legal and will produce an error at mount time.
> +

I did originally want to move this up higher, between the discussion
of nr_inodes and mpol; but its placing here follows where you placed
it in /proc/mounts, and I end up agreeing with that, so let's leave
where you've placed it.

But this is a description of a new pair of mount options, nothing to
do with the paragraph below, so one more blank line is needed.

Except that tmpfs.txt has now been replaced by tmpfs.rst, with some
====ing too.  We'd better add that.  I'm unfamiliar with, and not
prepared for .rst - getting Doc right has become a job for specialists.
I'll send a -fix.patch of how how I think it should look to Andrew and
Randy Dunlap, hoping Randy can confirm whether it ends up right.

>  So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
>  will give you tmpfs instance on /mytmpfs which can allocate 10GB
>  RAM/SWAP in 10240 inodes and it is only accessible by root.
> @@ -161,3 +170,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
>     Hugh Dickins, 4 June 2007
>  :Updated:
>     KOSAKI Motohiro, 16 Mar 2010
> +Updated:

I bet that should say ":Updated:" now.

> +   Chris Down, 13 July 2020
> diff --git a/fs/Kconfig b/fs/Kconfig
> index ff257b81fde5..64d530ba42f6 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -229,6 +229,21 @@ config TMPFS_XATTR
>  
>  	  If unsure, say N.
>  
> +config TMPFS_INODE64
> +	bool "Use 64-bit ino_t by default in tmpfs"
> +	depends on TMPFS && 64BIT
> +	default n

Okay: I haven't looked up what our last position was on which way the
default should be, but I do recall that we were reluctantly realizing
that we couldn't safely be as radical as we had hoped.

As to "default n" being the default and unnecessary: I don't mind
seeing it there explicitly, let's give the work to whoever wants to
delete that line.

> +	help
> +	  tmpfs has historically used only inode numbers as wide as an unsigned
> +	  int. In some cases this can cause wraparound, potentially resulting in
> +	  multiple files with the same inode number on a single device. This option
> +	  makes tmpfs use the full width of ino_t by default, similarly to the
> +	  inode64 mount option.

I've just realized that this, and the inode64 Documentation, make no
mention of why you might not want to enable it: it sounds like such a
good thing, with the documentation on why not in include/linux/fs.h.
I'd better add some text to these in the -fix.patch.

> +
> +	  To override this default, use the inode32 or inode64 mount options.
> +
> +	  If unsure, say N.
> +
>  config HUGETLBFS
>  	bool "HugeTLB file system support"
>  	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index eb628696ec66..a5a5d1d4d7b1 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -36,6 +36,7 @@ struct shmem_sb_info {
>  	unsigned char huge;	    /* Whether to try for hugepages */
>  	kuid_t uid;		    /* Mount uid for root directory */
>  	kgid_t gid;		    /* Mount gid for root directory */
> +	bool full_inums;	    /* If i_ino should be uint or ino_t */

Fine, but don't be surprised if I change that eventually,
maybe I'll make it a bit in a bitmask: I've nothing against bools
generally, but like to know the size of my structure fields,
and never sure what size a bool ends up with.

>  	ino_t next_ino;		    /* The next per-sb inode number to use */
>  	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
>  	struct mempolicy *mpol;     /* default memory policy for mappings */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0ae250b4da28..f3126ad7ba3d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -114,11 +114,13 @@ struct shmem_options {
>  	kuid_t uid;
>  	kgid_t gid;
>  	umode_t mode;
> +	bool full_inums;
>  	int huge;
>  	int seen;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
> +#define SHMEM_SEEN_INUMS 8
>  };
>  
>  #ifdef CONFIG_TMPFS
> @@ -286,12 +288,17 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
>  			ino = sbinfo->next_ino++;
>  			if (unlikely(is_zero_ino(ino)))
>  				ino = sbinfo->next_ino++;
> -			if (unlikely(ino > UINT_MAX)) {
> +			if (unlikely(!sbinfo->full_inums &&
> +				     ino > UINT_MAX)) {

And don't be surprised if I put that on one line one day :)

>  				/*
>  				 * Emulate get_next_ino uint wraparound for
>  				 * compatibility
>  				 */
> -				ino = 1;
> +				if (IS_ENABLED(CONFIG_64BIT))
> +					pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
> +						__func__, MINOR(sb->s_dev));

I like that you've added a warning; but it comes too late for a remount,
doesn't it?  We could do a countdown - "Only 1 billion more inode numbers
until overflow!" - that would be fun :) But a multiplicity of warnings
would be irritating to some and unnoticed by others.  I don't know.

> +				sbinfo->next_ino = 1;
> +				ino = sbinfo->next_ino++;
>  			}
>  			*inop = ino;
>  		}
> @@ -304,6 +311,10 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
>  		 * unknown contexts. As such, use a per-cpu batched allocator
>  		 * which doesn't require the per-sb stat_lock unless we are at
>  		 * the batch boundary.
> +		 *
> +		 * We don't need to worry about inode{32,64} since SB_KERNMOUNT
> +		 * shmem mounts are not exposed to userspace, so we don't need
> +		 * to worry about things like glibc compatibility.
>  		 */
>  		ino_t *next_ino;
>  		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
> @@ -3397,6 +3408,8 @@ enum shmem_param {
>  	Opt_nr_inodes,
>  	Opt_size,
>  	Opt_uid,
> +	Opt_inode32,
> +	Opt_inode64,
>  };
>  
>  static const struct constant_table shmem_param_enums_huge[] = {
> @@ -3416,6 +3429,8 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
>  	fsparam_string("nr_inodes",	Opt_nr_inodes),
>  	fsparam_string("size",		Opt_size),
>  	fsparam_u32   ("uid",		Opt_uid),
> +	fsparam_flag  ("inode32",	Opt_inode32),
> +	fsparam_flag  ("inode64",	Opt_inode64),
>  	{}
>  };
>  
> @@ -3487,6 +3502,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  			break;
>  		}
>  		goto unsupported_parameter;
> +	case Opt_inode32:
> +		ctx->full_inums = false;
> +		ctx->seen |= SHMEM_SEEN_INUMS;
> +		break;
> +	case Opt_inode64:
> +		if (sizeof(ino_t) < 8) {
> +			return invalfc(fc,
> +				       "Cannot use inode64 with <64bit inums in kernel\n");
> +		}
> +		ctx->full_inums = true;
> +		ctx->seen |= SHMEM_SEEN_INUMS;
> +		break;
>  	}
>  	return 0;
>  
> @@ -3578,8 +3605,16 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		}
>  	}
>  
> +	if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
> +	    sbinfo->next_ino > UINT_MAX) {
> +		err = "Current inum too high to switch to 32-bit inums";
> +		goto out;
> +	}
> +
>  	if (ctx->seen & SHMEM_SEEN_HUGE)
>  		sbinfo->huge = ctx->huge;
> +	if (ctx->seen & SHMEM_SEEN_INUMS)
> +		sbinfo->full_inums = ctx->full_inums;
>  	if (ctx->seen & SHMEM_SEEN_BLOCKS)
>  		sbinfo->max_blocks  = ctx->blocks;
>  	if (ctx->seen & SHMEM_SEEN_INODES) {
> @@ -3619,6 +3654,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>  	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
>  		seq_printf(seq, ",gid=%u",
>  				from_kgid_munged(&init_user_ns, sbinfo->gid));
> +
> +	/*
> +	 * Showing inode{64,32} might be useful even if it's the system default,
> +	 * since then people don't have to resort to checking both here and
> +	 * /proc/config.gz to confirm 64-bit inums were successfully applied
> +	 * (which may not even exist if IKCONFIG_PROC isn't enabled).
> +	 *
> +	 * We hide it when inode64 isn't the default and we are using 32-bit
> +	 * inodes, since that probably just means the feature isn't even under
> +	 * consideration.
> +	 *
> +	 * As such:
> +	 *
> +	 *                     +-----------------+-----------------+
> +	 *                     | TMPFS_INODE64=y | TMPFS_INODE64=n |
> +	 *  +------------------+-----------------+-----------------+
> +	 *  | full_inums=true  | show            | show            |
> +	 *  | full_inums=false | show            | hide            |
> +	 *  +------------------+-----------------+-----------------+
> +	 *
> +	 */

Thanks for spelling this out: it may be the most contentious part
of the patch.  I don't disagree with your decision, but I can imagine
it causing annoyance.  Let's start off with it as you decided, but be
ready to be persuaded differently.  I have a growing inclination
towards the 64-bit kernel showing ",inode32" but never ",inode64".

> +	if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
> +		seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
>  	if (sbinfo->huge)
> @@ -3667,6 +3725,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  			ctx->blocks = shmem_default_max_blocks();
>  		if (!(ctx->seen & SHMEM_SEEN_INODES))
>  			ctx->inodes = shmem_default_max_inodes();
> +		if (!(ctx->seen & SHMEM_SEEN_INUMS))
> +			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
>  	} else {
>  		sb->s_flags |= SB_NOUSER;
>  	}
> @@ -3684,6 +3744,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  	}
>  	sbinfo->uid = ctx->uid;
>  	sbinfo->gid = ctx->gid;
> +	sbinfo->full_inums = ctx->full_inums;
>  	sbinfo->mode = ctx->mode;
>  	sbinfo->huge = ctx->huge;
>  	sbinfo->mpol = ctx->mpol;
> -- 
> 2.27.0

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb
@ 2020-08-01 20:41     ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-01 20:41 UTC (permalink / raw)
  To: Chris Down
  Cc: Andrew Morton, Hugh Dickins, Al Viro, Matthew Wilcox,
	Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Mon, 13 Jul 2020, Chris Down wrote:

> The default is still set to inode32 for backwards compatibility, but
> system administrators can opt in to the new 64-bit inode numbers by
> either:
> 
> 1. Passing inode64 on the command line when mounting, or
> 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
> 
> The inode64 and inode32 names are used based on existing precedent from
> XFS.
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Reviewed-by: Amir Goldstein <amir73il@gmail.com>
> Cc: Hugh Dickins <hughd@google.com>

Acked-by: Hugh Dickins <hughd@google.com>

Again, thanks, and comments below, but nothing to override tha Ack.

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-mm@kvack.org
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  Documentation/filesystems/tmpfs.rst | 11 +++++
>  fs/Kconfig                          | 15 +++++++
>  include/linux/shmem_fs.h            |  1 +
>  mm/shmem.c                          | 65 ++++++++++++++++++++++++++++-
>  4 files changed, 90 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
> index 4e95929301a5..47b84ddaa8bb 100644
> --- a/Documentation/filesystems/tmpfs.rst
> +++ b/Documentation/filesystems/tmpfs.rst
> @@ -150,6 +150,15 @@ These options do not have any effect on remount. You can change these
>  parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
>  
>  
> +tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
> +numbers:
> +
> +inode64   Use 64-bit inode numbers
> +inode32   Use 32-bit inode numbers
> +
> +On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
> +not legal and will produce an error at mount time.
> +

I did originally want to move this up higher, between the discussion
of nr_inodes and mpol; but its placing here follows where you placed
it in /proc/mounts, and I end up agreeing with that, so let's leave
where you've placed it.

But this is a description of a new pair of mount options, nothing to
do with the paragraph below, so one more blank line is needed.

Except that tmpfs.txt has now been replaced by tmpfs.rst, with some
====ing too.  We'd better add that.  I'm unfamiliar with, and not
prepared for .rst - getting Doc right has become a job for specialists.
I'll send a -fix.patch of how how I think it should look to Andrew and
Randy Dunlap, hoping Randy can confirm whether it ends up right.

>  So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
>  will give you tmpfs instance on /mytmpfs which can allocate 10GB
>  RAM/SWAP in 10240 inodes and it is only accessible by root.
> @@ -161,3 +170,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
>     Hugh Dickins, 4 June 2007
>  :Updated:
>     KOSAKI Motohiro, 16 Mar 2010
> +Updated:

I bet that should say ":Updated:" now.

> +   Chris Down, 13 July 2020
> diff --git a/fs/Kconfig b/fs/Kconfig
> index ff257b81fde5..64d530ba42f6 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -229,6 +229,21 @@ config TMPFS_XATTR
>  
>  	  If unsure, say N.
>  
> +config TMPFS_INODE64
> +	bool "Use 64-bit ino_t by default in tmpfs"
> +	depends on TMPFS && 64BIT
> +	default n

Okay: I haven't looked up what our last position was on which way the
default should be, but I do recall that we were reluctantly realizing
that we couldn't safely be as radical as we had hoped.

As to "default n" being the default and unnecessary: I don't mind
seeing it there explicitly, let's give the work to whoever wants to
delete that line.

> +	help
> +	  tmpfs has historically used only inode numbers as wide as an unsigned
> +	  int. In some cases this can cause wraparound, potentially resulting in
> +	  multiple files with the same inode number on a single device. This option
> +	  makes tmpfs use the full width of ino_t by default, similarly to the
> +	  inode64 mount option.

I've just realized that this, and the inode64 Documentation, make no
mention of why you might not want to enable it: it sounds like such a
good thing, with the documentation on why not in include/linux/fs.h.
I'd better add some text to these in the -fix.patch.

> +
> +	  To override this default, use the inode32 or inode64 mount options.
> +
> +	  If unsure, say N.
> +
>  config HUGETLBFS
>  	bool "HugeTLB file system support"
>  	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index eb628696ec66..a5a5d1d4d7b1 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -36,6 +36,7 @@ struct shmem_sb_info {
>  	unsigned char huge;	    /* Whether to try for hugepages */
>  	kuid_t uid;		    /* Mount uid for root directory */
>  	kgid_t gid;		    /* Mount gid for root directory */
> +	bool full_inums;	    /* If i_ino should be uint or ino_t */

Fine, but don't be surprised if I change that eventually,
maybe I'll make it a bit in a bitmask: I've nothing against bools
generally, but like to know the size of my structure fields,
and never sure what size a bool ends up with.

>  	ino_t next_ino;		    /* The next per-sb inode number to use */
>  	ino_t __percpu *ino_batch;  /* The next per-cpu inode number to use */
>  	struct mempolicy *mpol;     /* default memory policy for mappings */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0ae250b4da28..f3126ad7ba3d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -114,11 +114,13 @@ struct shmem_options {
>  	kuid_t uid;
>  	kgid_t gid;
>  	umode_t mode;
> +	bool full_inums;
>  	int huge;
>  	int seen;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
> +#define SHMEM_SEEN_INUMS 8
>  };
>  
>  #ifdef CONFIG_TMPFS
> @@ -286,12 +288,17 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
>  			ino = sbinfo->next_ino++;
>  			if (unlikely(is_zero_ino(ino)))
>  				ino = sbinfo->next_ino++;
> -			if (unlikely(ino > UINT_MAX)) {
> +			if (unlikely(!sbinfo->full_inums &&
> +				     ino > UINT_MAX)) {

And don't be surprised if I put that on one line one day :)

>  				/*
>  				 * Emulate get_next_ino uint wraparound for
>  				 * compatibility
>  				 */
> -				ino = 1;
> +				if (IS_ENABLED(CONFIG_64BIT))
> +					pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
> +						__func__, MINOR(sb->s_dev));

I like that you've added a warning; but it comes too late for a remount,
doesn't it?  We could do a countdown - "Only 1 billion more inode numbers
until overflow!" - that would be fun :) But a multiplicity of warnings
would be irritating to some and unnoticed by others.  I don't know.

> +				sbinfo->next_ino = 1;
> +				ino = sbinfo->next_ino++;
>  			}
>  			*inop = ino;
>  		}
> @@ -304,6 +311,10 @@ static int shmem_reserve_inode(struct super_block *sb, ino_t *inop)
>  		 * unknown contexts. As such, use a per-cpu batched allocator
>  		 * which doesn't require the per-sb stat_lock unless we are at
>  		 * the batch boundary.
> +		 *
> +		 * We don't need to worry about inode{32,64} since SB_KERNMOUNT
> +		 * shmem mounts are not exposed to userspace, so we don't need
> +		 * to worry about things like glibc compatibility.
>  		 */
>  		ino_t *next_ino;
>  		next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu());
> @@ -3397,6 +3408,8 @@ enum shmem_param {
>  	Opt_nr_inodes,
>  	Opt_size,
>  	Opt_uid,
> +	Opt_inode32,
> +	Opt_inode64,
>  };
>  
>  static const struct constant_table shmem_param_enums_huge[] = {
> @@ -3416,6 +3429,8 @@ const struct fs_parameter_spec shmem_fs_parameters[] = {
>  	fsparam_string("nr_inodes",	Opt_nr_inodes),
>  	fsparam_string("size",		Opt_size),
>  	fsparam_u32   ("uid",		Opt_uid),
> +	fsparam_flag  ("inode32",	Opt_inode32),
> +	fsparam_flag  ("inode64",	Opt_inode64),
>  	{}
>  };
>  
> @@ -3487,6 +3502,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>  			break;
>  		}
>  		goto unsupported_parameter;
> +	case Opt_inode32:
> +		ctx->full_inums = false;
> +		ctx->seen |= SHMEM_SEEN_INUMS;
> +		break;
> +	case Opt_inode64:
> +		if (sizeof(ino_t) < 8) {
> +			return invalfc(fc,
> +				       "Cannot use inode64 with <64bit inums in kernel\n");
> +		}
> +		ctx->full_inums = true;
> +		ctx->seen |= SHMEM_SEEN_INUMS;
> +		break;
>  	}
>  	return 0;
>  
> @@ -3578,8 +3605,16 @@ static int shmem_reconfigure(struct fs_context *fc)
>  		}
>  	}
>  
> +	if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
> +	    sbinfo->next_ino > UINT_MAX) {
> +		err = "Current inum too high to switch to 32-bit inums";
> +		goto out;
> +	}
> +
>  	if (ctx->seen & SHMEM_SEEN_HUGE)
>  		sbinfo->huge = ctx->huge;
> +	if (ctx->seen & SHMEM_SEEN_INUMS)
> +		sbinfo->full_inums = ctx->full_inums;
>  	if (ctx->seen & SHMEM_SEEN_BLOCKS)
>  		sbinfo->max_blocks  = ctx->blocks;
>  	if (ctx->seen & SHMEM_SEEN_INODES) {
> @@ -3619,6 +3654,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>  	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
>  		seq_printf(seq, ",gid=%u",
>  				from_kgid_munged(&init_user_ns, sbinfo->gid));
> +
> +	/*
> +	 * Showing inode{64,32} might be useful even if it's the system default,
> +	 * since then people don't have to resort to checking both here and
> +	 * /proc/config.gz to confirm 64-bit inums were successfully applied
> +	 * (which may not even exist if IKCONFIG_PROC isn't enabled).
> +	 *
> +	 * We hide it when inode64 isn't the default and we are using 32-bit
> +	 * inodes, since that probably just means the feature isn't even under
> +	 * consideration.
> +	 *
> +	 * As such:
> +	 *
> +	 *                     +-----------------+-----------------+
> +	 *                     | TMPFS_INODE64=y | TMPFS_INODE64=n |
> +	 *  +------------------+-----------------+-----------------+
> +	 *  | full_inums=true  | show            | show            |
> +	 *  | full_inums=false | show            | hide            |
> +	 *  +------------------+-----------------+-----------------+
> +	 *
> +	 */

Thanks for spelling this out: it may be the most contentious part
of the patch.  I don't disagree with your decision, but I can imagine
it causing annoyance.  Let's start off with it as you decided, but be
ready to be persuaded differently.  I have a growing inclination
towards the 64-bit kernel showing ",inode32" but never ",inode64".

> +	if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
> +		seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
>  	if (sbinfo->huge)
> @@ -3667,6 +3725,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  			ctx->blocks = shmem_default_max_blocks();
>  		if (!(ctx->seen & SHMEM_SEEN_INODES))
>  			ctx->inodes = shmem_default_max_inodes();
> +		if (!(ctx->seen & SHMEM_SEEN_INUMS))
> +			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
>  	} else {
>  		sb->s_flags |= SB_NOUSER;
>  	}
> @@ -3684,6 +3744,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  	}
>  	sbinfo->uid = ctx->uid;
>  	sbinfo->gid = ctx->gid;
> +	sbinfo->full_inums = ctx->full_inums;
>  	sbinfo->mode = ctx->mode;
>  	sbinfo->huge = ctx->huge;
>  	sbinfo->mpol = ctx->mpol;
> -- 
> 2.27.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mmotm] tmpfs: support 64-bit inums per-sb fix
  2020-08-01 20:41     ` Hugh Dickins
@ 2020-08-02  2:37       ` Hugh Dickins
  -1 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-02  2:37 UTC (permalink / raw)
  To: Andrew Morton, Chris Down, Randy Dunlap
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Hugh Dickins,
	Jeff Layton, Johannes Weiner, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

Expanded Chris's Documentation and Kconfig help on tmpfs inode64.
TMPFS_INODE64 still there, still default N, but writing down its very
limited limitation does make me wonder again if we want the option.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
Andrew, please fold into tmpfs-support-64-bit-inums-per-sb.patch later.

Randy, you're very active on Documentation and linux-next: may I ask you
please to try applying this patch to latest, and see if tmpfs.rst comes
out looking right to you?  I'm an old dog still stuck in the days of
tmpfs.txt, hoping to avoid new tricks for a while.  Thanks!  (Bonus
points if you can explain what the "::" on line 122 is about. I started
out reading Documentation/doc-guide/sphinx.rst, but... got diverted.
Perhaps I should ask Mauro or Jon, but turning for help first to you.)

 Documentation/filesystems/tmpfs.rst |   13 ++++++++++---
 fs/Kconfig                          |   16 +++++++++++-----
 2 files changed, 21 insertions(+), 8 deletions(-)

--- mmotm/Documentation/filesystems/tmpfs.rst	2020-07-27 18:54:51.116524795 -0700
+++ linux/Documentation/filesystems/tmpfs.rst	2020-08-01 18:37:07.719713987 -0700
@@ -153,11 +153,18 @@ parameters with chmod(1), chown(1) and c
 tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
 numbers:
 
+=======   ========================
 inode64   Use 64-bit inode numbers
 inode32   Use 32-bit inode numbers
+=======   ========================
+
+On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time.
+On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default.  inode64 avoids the
+possibility of multiple files with the same inode number on a single device;
+but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached -
+if a long-lived tmpfs is accessed by 32-bit applications so ancient that
+opening a file larger than 2GiB fails with EINVAL.
 
-On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
-not legal and will produce an error at mount time.
 
 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
 will give you tmpfs instance on /mytmpfs which can allocate 10GB
@@ -170,5 +177,5 @@ RAM/SWAP in 10240 inodes and it is only
    Hugh Dickins, 4 June 2007
 :Updated:
    KOSAKI Motohiro, 16 Mar 2010
-Updated:
+:Updated:
    Chris Down, 13 July 2020
--- mmotm/fs/Kconfig	2020-07-27 18:54:59.384550639 -0700
+++ linux/fs/Kconfig	2020-08-01 18:11:33.749236321 -0700
@@ -223,12 +223,18 @@ config TMPFS_INODE64
 	default n
 	help
 	  tmpfs has historically used only inode numbers as wide as an unsigned
-	  int. In some cases this can cause wraparound, potentially resulting in
-	  multiple files with the same inode number on a single device. This option
-	  makes tmpfs use the full width of ino_t by default, similarly to the
-	  inode64 mount option.
+	  int. In some cases this can cause wraparound, potentially resulting
+	  in multiple files with the same inode number on a single device. This
+	  option makes tmpfs use the full width of ino_t by default, without
+	  needing to specify the inode64 option when mounting.
 
-	  To override this default, use the inode32 or inode64 mount options.
+	  But if a long-lived tmpfs is to be accessed by 32-bit applications so
+	  ancient that opening a file larger than 2GiB fails with EINVAL, then
+	  the INODE64 config option and inode64 mount option risk operations
+	  failing with EOVERFLOW once 33-bit inode numbers are reached.
+
+	  To override this configured default, use the inode32 or inode64
+	  option when mounting.
 
 	  If unsure, say N.
 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mmotm] tmpfs: support 64-bit inums per-sb fix
@ 2020-08-02  2:37       ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-02  2:37 UTC (permalink / raw)
  To: Andrew Morton, Chris Down, Randy Dunlap
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Hugh Dickins,
	Jeff Layton, Johannes Weiner, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

Expanded Chris's Documentation and Kconfig help on tmpfs inode64.
TMPFS_INODE64 still there, still default N, but writing down its very
limited limitation does make me wonder again if we want the option.

Signed-off-by: Hugh Dickins <hughd@google.com>
---
Andrew, please fold into tmpfs-support-64-bit-inums-per-sb.patch later.

Randy, you're very active on Documentation and linux-next: may I ask you
please to try applying this patch to latest, and see if tmpfs.rst comes
out looking right to you?  I'm an old dog still stuck in the days of
tmpfs.txt, hoping to avoid new tricks for a while.  Thanks!  (Bonus
points if you can explain what the "::" on line 122 is about. I started
out reading Documentation/doc-guide/sphinx.rst, but... got diverted.
Perhaps I should ask Mauro or Jon, but turning for help first to you.)

 Documentation/filesystems/tmpfs.rst |   13 ++++++++++---
 fs/Kconfig                          |   16 +++++++++++-----
 2 files changed, 21 insertions(+), 8 deletions(-)

--- mmotm/Documentation/filesystems/tmpfs.rst	2020-07-27 18:54:51.116524795 -0700
+++ linux/Documentation/filesystems/tmpfs.rst	2020-08-01 18:37:07.719713987 -0700
@@ -153,11 +153,18 @@ parameters with chmod(1), chown(1) and c
 tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
 numbers:
 
+=======   ========================
 inode64   Use 64-bit inode numbers
 inode32   Use 32-bit inode numbers
+=======   ========================
+
+On a 32-bit kernel, inode32 is implicit, and inode64 is refused at mount time.
+On a 64-bit kernel, CONFIG_TMPFS_INODE64 sets the default.  inode64 avoids the
+possibility of multiple files with the same inode number on a single device;
+but risks glibc failing with EOVERFLOW once 33-bit inode numbers are reached -
+if a long-lived tmpfs is accessed by 32-bit applications so ancient that
+opening a file larger than 2GiB fails with EINVAL.
 
-On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
-not legal and will produce an error at mount time.
 
 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
 will give you tmpfs instance on /mytmpfs which can allocate 10GB
@@ -170,5 +177,5 @@ RAM/SWAP in 10240 inodes and it is only
    Hugh Dickins, 4 June 2007
 :Updated:
    KOSAKI Motohiro, 16 Mar 2010
-Updated:
+:Updated:
    Chris Down, 13 July 2020
--- mmotm/fs/Kconfig	2020-07-27 18:54:59.384550639 -0700
+++ linux/fs/Kconfig	2020-08-01 18:11:33.749236321 -0700
@@ -223,12 +223,18 @@ config TMPFS_INODE64
 	default n
 	help
 	  tmpfs has historically used only inode numbers as wide as an unsigned
-	  int. In some cases this can cause wraparound, potentially resulting in
-	  multiple files with the same inode number on a single device. This option
-	  makes tmpfs use the full width of ino_t by default, similarly to the
-	  inode64 mount option.
+	  int. In some cases this can cause wraparound, potentially resulting
+	  in multiple files with the same inode number on a single device. This
+	  option makes tmpfs use the full width of ino_t by default, without
+	  needing to specify the inode64 option when mounting.
 
-	  To override this default, use the inode32 or inode64 mount options.
+	  But if a long-lived tmpfs is to be accessed by 32-bit applications so
+	  ancient that opening a file larger than 2GiB fails with EINVAL, then
+	  the INODE64 config option and inode64 mount option risk operations
+	  failing with EOVERFLOW once 33-bit inode numbers are reached.
+
+	  To override this configured default, use the inode32 or inode64
+	  option when mounting.
 
 	  If unsure, say N.
 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mmotm] tmpfs: support 64-bit inums per-sb fix
  2020-08-02  2:37       ` Hugh Dickins
  (?)
@ 2020-08-02  3:05       ` Randy Dunlap
  2020-08-02  5:25           ` Hugh Dickins
  -1 siblings, 1 reply; 12+ messages in thread
From: Randy Dunlap @ 2020-08-02  3:05 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton, Chris Down
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-mm, linux-fsdevel,
	linux-kernel, kernel-team

On 8/1/20 7:37 PM, Hugh Dickins wrote:
> Expanded Chris's Documentation and Kconfig help on tmpfs inode64.
> TMPFS_INODE64 still there, still default N, but writing down its very
> limited limitation does make me wonder again if we want the option.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> Andrew, please fold into tmpfs-support-64-bit-inums-per-sb.patch later.
> 
> Randy, you're very active on Documentation and linux-next: may I ask you
> please to try applying this patch to latest, and see if tmpfs.rst comes
> out looking right to you?  I'm an old dog still stuck in the days of

Hi Hugh,
It looks fine.

> tmpfs.txt, hoping to avoid new tricks for a while.  Thanks!  (Bonus
> points if you can explain what the "::" on line 122 is about. I started
> out reading Documentation/doc-guide/sphinx.rst, but... got diverted.
> Perhaps I should ask Mauro or Jon, but turning for help first to you.)

That's the correct file. Around line 216, it says:

* For inserting fixed width text blocks (for code examples, use case
  examples, etc.), use ``::`` for anything that doesn't really benefit
  from syntax highlighting, especially short snippets. Use
  ``.. code-block:: <language>`` for longer code blocks that benefit
  from highlighting. For a short snippet of code embedded in the text, use \`\`.


so it's just for a (short) code example block, fixed font...


> 
>  Documentation/filesystems/tmpfs.rst |   13 ++++++++++---
>  fs/Kconfig                          |   16 +++++++++++-----
>  2 files changed, 21 insertions(+), 8 deletions(-)
> 
> --- mmotm/Documentation/filesystems/tmpfs.rst	2020-07-27 18:54:51.116524795 -0700
> +++ linux/Documentation/filesystems/tmpfs.rst	2020-08-01 18:37:07.719713987 -0700



cheers.
-- 
~Randy


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mmotm] tmpfs: support 64-bit inums per-sb fix
  2020-08-02  3:05       ` Randy Dunlap
@ 2020-08-02  5:25           ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-02  5:25 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Hugh Dickins, Andrew Morton, Chris Down, Al Viro, Matthew Wilcox,
	Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Sat, 1 Aug 2020, Randy Dunlap wrote:
> On 8/1/20 7:37 PM, Hugh Dickins wrote:
> > Expanded Chris's Documentation and Kconfig help on tmpfs inode64.
> > TMPFS_INODE64 still there, still default N, but writing down its very
> > limited limitation does make me wonder again if we want the option.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > Andrew, please fold into tmpfs-support-64-bit-inums-per-sb.patch later.
> > 
> > Randy, you're very active on Documentation and linux-next: may I ask you
> > please to try applying this patch to latest, and see if tmpfs.rst comes
> > out looking right to you?  I'm an old dog still stuck in the days of
> 
> Hi Hugh,
> It looks fine.

Thank you so much, Randy.

> 
> > tmpfs.txt, hoping to avoid new tricks for a while.  Thanks!  (Bonus
> > points if you can explain what the "::" on line 122 is about. I started
> > out reading Documentation/doc-guide/sphinx.rst, but... got diverted.
> > Perhaps I should ask Mauro or Jon, but turning for help first to you.)
> 
> That's the correct file. Around line 216, it says:
> 
> * For inserting fixed width text blocks (for code examples, use case
>   examples, etc.), use ``::`` for anything that doesn't really benefit
>   from syntax highlighting, especially short snippets. Use
>   ``.. code-block:: <language>`` for longer code blocks that benefit
>   from highlighting. For a short snippet of code embedded in the text, use \`\`.
> 
> 
> so it's just for a (short) code example block, fixed font...

Bonus points awarded, thanks...ish. I'll have to look around for more
examples of where that's done, and I think it'll only make real sense
to me, when I'm further along, producing the proper output, then seeing
how bad something looks without the "::".

Thanks again,
Hugh

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mmotm] tmpfs: support 64-bit inums per-sb fix
@ 2020-08-02  5:25           ` Hugh Dickins
  0 siblings, 0 replies; 12+ messages in thread
From: Hugh Dickins @ 2020-08-02  5:25 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: Hugh Dickins, Andrew Morton, Chris Down, Al Viro, Matthew Wilcox,
	Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo,
	linux-mm, linux-fsdevel, linux-kernel, kernel-team

On Sat, 1 Aug 2020, Randy Dunlap wrote:
> On 8/1/20 7:37 PM, Hugh Dickins wrote:
> > Expanded Chris's Documentation and Kconfig help on tmpfs inode64.
> > TMPFS_INODE64 still there, still default N, but writing down its very
> > limited limitation does make me wonder again if we want the option.
> > 
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > Andrew, please fold into tmpfs-support-64-bit-inums-per-sb.patch later.
> > 
> > Randy, you're very active on Documentation and linux-next: may I ask you
> > please to try applying this patch to latest, and see if tmpfs.rst comes
> > out looking right to you?  I'm an old dog still stuck in the days of
> 
> Hi Hugh,
> It looks fine.

Thank you so much, Randy.

> 
> > tmpfs.txt, hoping to avoid new tricks for a while.  Thanks!  (Bonus
> > points if you can explain what the "::" on line 122 is about. I started
> > out reading Documentation/doc-guide/sphinx.rst, but... got diverted.
> > Perhaps I should ask Mauro or Jon, but turning for help first to you.)
> 
> That's the correct file. Around line 216, it says:
> 
> * For inserting fixed width text blocks (for code examples, use case
>   examples, etc.), use ``::`` for anything that doesn't really benefit
>   from syntax highlighting, especially short snippets. Use
>   ``.. code-block:: <language>`` for longer code blocks that benefit
>   from highlighting. For a short snippet of code embedded in the text, use \`\`.
> 
> 
> so it's just for a (short) code example block, fixed font...

Bonus points awarded, thanks...ish. I'll have to look around for more
examples of where that's done, and I think it'll only make real sense
to me, when I'm further along, producing the proper output, then seeing
how bad something looks without the "::".

Thanks again,
Hugh


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-08-02  5:25 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-13 17:28 [PATCH v7 0/2] tmpfs: inode: Reduce risk of inum overflow Chris Down
2020-07-13 17:28 ` [PATCH v7 1/2] tmpfs: Per-superblock i_ino support Chris Down
2020-08-01 19:22   ` Hugh Dickins
2020-08-01 19:22     ` Hugh Dickins
2020-07-13 17:28 ` [PATCH v7 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
2020-08-01 20:41   ` Hugh Dickins
2020-08-01 20:41     ` Hugh Dickins
2020-08-02  2:37     ` [PATCH mmotm] tmpfs: support 64-bit inums per-sb fix Hugh Dickins
2020-08-02  2:37       ` Hugh Dickins
2020-08-02  3:05       ` Randy Dunlap
2020-08-02  5:25         ` Hugh Dickins
2020-08-02  5:25           ` Hugh Dickins

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.