* [PATCH v3 0/2] fs: inode: shmem: Reduce risk of inum overflow
@ 2020-01-03 17:30 Chris Down
  2020-01-03 17:30 ` [PATCH v3 1/2] tmpfs: Add per-superblock i_ino support Chris Down
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Chris Down @ 2020-01-03 17:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
On affected tiers, in excess of 10% of hosts show multiple files with
different content and the same inode number, with some servers even
having as many as 150 duplicated inode numbers with differing file
content.

This causes actual, tangible problems in production. For example, we
have complaints from those working on remote caches that their
application is reporting cache corruptions because it uses (device,
inodenum) to establish the identity of a particular cache object, but
because it's not unique any more, the application refuses to continue
and reports cache corruption. Even worse, sometimes applications may not
even detect the corruption but may continue anyway, causing phantom and
hard to debug behaviour.
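
To illustrate (a minimal userspace sketch, not taken from the application in
question; the paths and helper name are made up):

#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical example: treat two paths as the same object iff both
 * st_dev and st_ino match. Once inode numbers wrap on a single tmpfs
 * device, this can return true for two distinct live files.
 */
static int same_object(const char *a, const char *b)
{
	struct stat sa, sb;

	if (stat(a, &sa) || stat(b, &sb))
		return 0;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

int main(void)
{
	printf("same: %d\n", same_object("/dev/shm/obj1", "/dev/shm/obj2"));
	return 0;
}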

In general, userspace applications expect that (device, inodenum) should
be enough to uniquely point to one inode, which seems fair enough.
One might also need to check the generation, but in this case:

1. That's not currently exposed to userspace
   (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
   same inode number on one device.

In order to mitigate this, we take a two-pronged approach:

1. Moving inum generation from being global to per-sb for tmpfs. This
   itself allows some reduction in i_ino churn. This works on both 64-
   and 32-bit machines.
2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
   64-bit ino_t only: we allow users to mount tmpfs with a new inode64
   option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64; a
   usage sketch follows below.
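
The sketch (the mount point is made up; inode64 requires a 64-bit ino_t, as
enforced in patch 2):

#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical example: mount a tmpfs instance with the proposed inode64
 * option so i_ino uses the full width of ino_t.
 */
int main(void)
{
	if (mount("tmpfs", "/mnt/mytmpfs", "tmpfs", 0, "inode64"))
		perror("mount");
	return 0;
}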

Chris Down (2):
  tmpfs: Add per-superblock i_ino support
  tmpfs: Support 64-bit inums per-sb

 Documentation/filesystems/tmpfs.txt | 11 ++++
 fs/Kconfig                          | 15 +++++
 include/linux/shmem_fs.h            |  2 +
 mm/shmem.c                          | 97 ++++++++++++++++++++++++++++-
 4 files changed, 124 insertions(+), 1 deletion(-)

-- 
2.24.1



* [PATCH v3 1/2] tmpfs: Add per-superblock i_ino support
  2020-01-03 17:30 [PATCH v3 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down
@ 2020-01-03 17:30 ` Chris Down
  2020-01-04 19:09   ` Amir Goldstein
  2020-01-03 17:30 ` [PATCH v3 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
  2020-01-04 21:16   ` Amir Goldstein
  2 siblings, 1 reply; 8+ messages in thread
From: Chris Down @ 2020-01-03 17:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

get_next_ino has a number of problems:

- It uses and returns a uint, which is susceptible to overflow
  if a lot of volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This
  means it's not that difficult to cause inode number wraparounds on a
  single device, which can result in having multiple distinct inodes
  with the same inode number.

This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size
per-device, for example, allowing users to choose whether to use 32- or
64-bit inodes for each tmpfs mount. This is implemented in the next
commit.

Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 include/linux/shmem_fs.h |  1 +
 mm/shmem.c               | 33 ++++++++++++++++++++++++++++++++-
 2 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index de8e4b71e3ba..7fac91f490dc 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@ struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	ino_t next_ino;		    /* The next per-sb inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
diff --git a/mm/shmem.c b/mm/shmem.c
index 8793e8cc1a48..638b1e30625f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2236,6 +2236,15 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+/*
+ * shmem_get_inode - reserve, allocate, and initialise a new inode
+ *
+ * If SB_KERNMOUNT, we use the per-sb inode allocator to avoid wraparound.
+ * Otherwise, we use get_next_ino, which is global.
+ *
+ * If max_inodes is greater than 0 (ie. non-SB_KERNMOUNT), we may have to grab
+ * the per-sb stat_lock.
+ */
 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
 				     umode_t mode, dev_t dev, unsigned long flags)
 {
@@ -2248,7 +2257,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		if (sb->s_flags & SB_KERNMOUNT) {
+			/*
+			 * __shmem_file_setup, one of our callers, is lock-free:
+			 * it doesn't hold stat_lock in shmem_reserve_inode
+			 * since max_inodes is always 0, and is called from
+			 * potentially unknown contexts. As such, use the global
+			 * allocator which doesn't require the per-sb stat_lock.
+			 */
+			inode->i_ino = get_next_ino();
+		} else {
+			spin_lock(&sbinfo->stat_lock);
+			if (unlikely(sbinfo->next_ino > UINT_MAX)) {
+				/*
+				 * Emulate get_next_ino uint wraparound for
+				 * compatibility
+				 */
+				sbinfo->next_ino = 1;
+			}
+			inode->i_ino = sbinfo->next_ino++;
+			spin_unlock(&sbinfo->stat_lock);
+		}
+
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -3662,6 +3692,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 #else
 	sb->s_flags |= SB_NOUSER;
 #endif
+	sbinfo->next_ino = 1;
 	sbinfo->max_blocks = ctx->blocks;
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
 	sbinfo->uid = ctx->uid;
-- 
2.24.1



* [PATCH v3 2/2] tmpfs: Support 64-bit inums per-sb
  2020-01-03 17:30 [PATCH v3 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down
  2020-01-03 17:30 ` [PATCH v3 1/2] tmpfs: Add per-superblock i_ino support Chris Down
@ 2020-01-03 17:30 ` Chris Down
  2020-01-04 19:15   ` Amir Goldstein
  2020-01-04 21:16   ` Amir Goldstein
  2 siblings, 1 reply; 8+ messages in thread
From: Chris Down @ 2020-01-03 17:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

The default is still set to inode32 for backwards compatibility, but
system administrators can opt in to the new 64-bit inode numbers by
either:

1. Passing inode64 on the command line when mounting, or
2. Configuring the kernel with CONFIG_TMPFS_INODE64=y

Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 Documentation/filesystems/tmpfs.txt | 11 +++++
 fs/Kconfig                          | 15 +++++++
 include/linux/shmem_fs.h            |  1 +
 mm/shmem.c                          | 66 ++++++++++++++++++++++++++++-
 4 files changed, 92 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 5ecbc03e6b2f..203e12a684c9 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -136,6 +136,15 @@ These options do not have any effect on remount. You can change these
 parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
 
 
+tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
+numbers:
+
+inode64   Use 64-bit inode numbers
+inode32   Use 32-bit inode numbers
+
+On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
+not legal and will produce an error at mount time.
+
 So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
 will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.
@@ -147,3 +156,5 @@ Updated:
    Hugh Dickins, 4 June 2007
 Updated:
    KOSAKI Motohiro, 16 Mar 2010
+Updated:
+   Chris Down, 2 Jan 2020
diff --git a/fs/Kconfig b/fs/Kconfig
index 7b623e9fc1b0..e74f127df22a 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -199,6 +199,21 @@ config TMPFS_XATTR
 
 	  If unsure, say N.
 
+config TMPFS_INODE64
+	bool "Use 64-bit ino_t by default in tmpfs"
+	depends on TMPFS && 64BIT
+	default n
+	help
+	  tmpfs has historically used only inode numbers as wide as an unsigned
+	  int. In some cases this can cause wraparound, potentially resulting in
+	  multiple files with the same inode number on a single device. This option
+	  makes tmpfs use the full width of ino_t by default, similarly to the
+	  inode64 mount option.
+
+	  To override this default, use the inode32 or inode64 mount options.
+
+	  If unsure, say N.
+
 config HUGETLBFS
 	bool "HugeTLB file system support"
 	depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 7fac91f490dc..8925eb774518 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@ struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	bool full_inums;	    /* If i_ino should be uint or ino_t */
 	ino_t next_ino;		    /* The next per-sb inode number to use */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
diff --git a/mm/shmem.c b/mm/shmem.c
index 638b1e30625f..a6b43593e852 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,11 +115,13 @@ struct shmem_options {
 	kuid_t uid;
 	kgid_t gid;
 	umode_t mode;
+	bool full_inums;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
+#define SHMEM_SEEN_INUMS 8
 };
 
 #ifdef CONFIG_TMPFS
@@ -2264,15 +2266,24 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 			 * since max_inodes is always 0, and is called from
 			 * potentially unknown contexts. As such, use the global
 			 * allocator which doesn't require the per-sb stat_lock.
+			 *
+			 * No special behaviour is needed for
+			 * sbinfo->full_inums, because it's not possible to
+			 * manually set on callers of this type, and
+			 * CONFIG_TMPFS_INODE64 only applies to user-visible
+			 * mounts.
 			 */
 			inode->i_ino = get_next_ino();
 		} else {
 			spin_lock(&sbinfo->stat_lock);
-			if (unlikely(sbinfo->next_ino > UINT_MAX)) {
+			if (unlikely(!sbinfo->full_inums &&
+				     sbinfo->next_ino > UINT_MAX)) {
 				/*
 				 * Emulate get_next_ino uint wraparound for
 				 * compatibility
 				 */
+				pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
+					__func__, MINOR(sb->s_dev));
 				sbinfo->next_ino = 1;
 			}
 			inode->i_ino = sbinfo->next_ino++;
@@ -3409,6 +3420,8 @@ enum shmem_param {
 	Opt_nr_inodes,
 	Opt_size,
 	Opt_uid,
+	Opt_inode32,
+	Opt_inode64,
 };
 
 static const struct fs_parameter_spec shmem_param_specs[] = {
@@ -3420,6 +3433,8 @@ static const struct fs_parameter_spec shmem_param_specs[] = {
 	fsparam_string("nr_inodes",	Opt_nr_inodes),
 	fsparam_string("size",		Opt_size),
 	fsparam_u32   ("uid",		Opt_uid),
+	fsparam_flag  ("inode32",	Opt_inode32),
+	fsparam_flag  ("inode64",	Opt_inode64),
 	{}
 };
 
@@ -3444,6 +3459,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	unsigned long long size;
 	char *rest;
 	int opt;
+	const char *err;
 
 	opt = fs_parse(fc, &shmem_fs_parameters, param, &result);
 	if (opt < 0)
@@ -3505,6 +3521,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 			break;
 		}
 		goto unsupported_parameter;
+	case Opt_inode32:
+		ctx->full_inums = false;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
+	case Opt_inode64:
+		if (sizeof(ino_t) < 8) {
+			err = "Cannot use inode64 with <64bit inums in kernel";
+			goto err_msg;
+		}
+		ctx->full_inums = true;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
 	}
 	return 0;
 
@@ -3512,6 +3540,8 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	return invalf(fc, "tmpfs: Unsupported parameter '%s'", param->key);
 bad_value:
 	return invalf(fc, "tmpfs: Bad value for '%s'", param->key);
+err_msg:
+	return invalf(fc, "tmpfs: %s", err);
 }
 
 static int shmem_parse_options(struct fs_context *fc, void *data)
@@ -3596,8 +3626,16 @@ static int shmem_reconfigure(struct fs_context *fc)
 		}
 	}
 
+	if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
+	    sbinfo->next_ino > UINT_MAX) {
+		err = "Current inum too high to switch to 32-bit inums";
+		goto out;
+	}
+
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
+	if (ctx->seen & SHMEM_SEEN_INUMS)
+		sbinfo->full_inums = ctx->full_inums;
 	if (ctx->seen & SHMEM_SEEN_BLOCKS)
 		sbinfo->max_blocks  = ctx->blocks;
 	if (ctx->seen & SHMEM_SEEN_INODES) {
@@ -3637,6 +3675,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u",
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
+
+	/*
+	 * Showing inode{64,32} might be useful even if it's the system default,
+	 * since then people don't have to resort to checking both here and
+	 * /proc/config.gz to confirm 64-bit inums were successfully applied
+	 * (which may not even exist if IKCONFIG_PROC isn't enabled).
+	 *
+	 * We hide it when inode64 isn't the default and we are using 32-bit
+	 * inodes, since that probably just means the feature isn't even under
+	 * consideration.
+	 *
+	 * As such:
+	 *
+	 *                     +-----------------+-----------------+
+	 *                     | TMPFS_INODE64=y | TMPFS_INODE64=n |
+	 *  +------------------+-----------------+-----------------+
+	 *  | full_inums=true  | show            | show            |
+	 *  | full_inums=false | show            | hide            |
+	 *  +------------------+-----------------+-----------------+
+	 *
+	 */
+	if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
+		seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
 	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
 	if (sbinfo->huge)
@@ -3684,6 +3745,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 			ctx->blocks = shmem_default_max_blocks();
 		if (!(ctx->seen & SHMEM_SEEN_INODES))
 			ctx->inodes = shmem_default_max_inodes();
+		if (!(ctx->seen & SHMEM_SEEN_INUMS))
+			ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
 	} else {
 		sb->s_flags |= SB_NOUSER;
 	}
@@ -3697,6 +3760,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
 	sbinfo->uid = ctx->uid;
 	sbinfo->gid = ctx->gid;
+	sbinfo->full_inums = ctx->full_inums;
 	sbinfo->mode = ctx->mode;
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
-- 
2.24.1



* Re: [PATCH v3 1/2] tmpfs: Add per-superblock i_ino support
  2020-01-03 17:30 ` [PATCH v3 1/2] tmpfs: Add per-superblock i_ino support Chris Down
@ 2020-01-04 19:09   ` Amir Goldstein
  2020-01-05 11:28     ` Chris Down
  0 siblings, 1 reply; 8+ messages in thread
From: Amir Goldstein @ 2020-01-04 19:09 UTC (permalink / raw)
  To: Chris Down
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

On Fri, Jan 3, 2020 at 7:30 PM Chris Down <chris@chrisdown.name> wrote:
>
> get_next_ino has a number of problems:
>
> - It uses and returns a uint, which is susceptible to overflow
>   if a lot of volatile inodes that use get_next_ino are created.
> - It's global, with no specificity per-sb or even per-filesystem. This
>   means it's not that difficult to cause inode number wraparounds on a
>   single device, which can result in having multiple distinct inodes
>   with the same inode number.
>
> This patch adds a per-superblock counter that mitigates the second case.
> This design also allows us to later have a specific i_ino size
> per-device, for example, allowing users to choose whether to use 32- or
> 64-bit inodes for each tmpfs mount. This is implemented in the next
> commit.
>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---

Some nits. When fixed you may add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>


>  include/linux/shmem_fs.h |  1 +
>  mm/shmem.c               | 33 ++++++++++++++++++++++++++++++++-
>  2 files changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index de8e4b71e3ba..7fac91f490dc 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -35,6 +35,7 @@ struct shmem_sb_info {
>         unsigned char huge;         /* Whether to try for hugepages */
>         kuid_t uid;                 /* Mount uid for root directory */
>         kgid_t gid;                 /* Mount gid for root directory */
> +       ino_t next_ino;             /* The next per-sb inode number to use */
>         struct mempolicy *mpol;     /* default memory policy for mappings */
>         spinlock_t shrinklist_lock;   /* Protects shrinklist */
>         struct list_head shrinklist;  /* List of shinkable inodes */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 8793e8cc1a48..638b1e30625f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2236,6 +2236,15 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>         return 0;
>  }
>
> +/*
> + * shmem_get_inode - reserve, allocate, and initialise a new inode
> + *
> + * If SB_KERNMOUNT, we use the per-sb inode allocator to avoid wraparound.
> + * Otherwise, we use get_next_ino, which is global.

It's the other way around.
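
i.e. shmem_get_inode() takes get_next_ino() for SB_KERNMOUNT and the per-sb
counter otherwise, so the two clauses should be swapped. As a suggestion only
(feel free to reword), something like:

/*
 * shmem_get_inode - reserve, allocate, and initialise a new inode
 *
 * If SB_KERNMOUNT, we use the global get_next_ino allocator, since the
 * per-sb counter needs the per-sb stat_lock and kernel-internal mounts
 * may be called from unknown contexts. Otherwise, we use the per-sb
 * allocator, which mitigates inum collisions on a single device.
 */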

> + *
> + * If max_inodes is greater than 0 (ie. non-SB_KERNMOUNT), we may have to grab
> + * the per-sb stat_lock.

It's not a "may", it's a certainty, but I don't see what this comment adds
in this context.
The comment about stat_lock below seems enough to me.

> + */
>  static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
>                                      umode_t mode, dev_t dev, unsigned long flags)
>  {
> @@ -2248,7 +2257,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>
>         inode = new_inode(sb);
>         if (inode) {
> -               inode->i_ino = get_next_ino();
> +               if (sb->s_flags & SB_KERNMOUNT) {
> +                       /*
> +                        * __shmem_file_setup, one of our callers, is lock-free:
> +                        * it doesn't hold stat_lock in shmem_reserve_inode
> +                        * since max_inodes is always 0, and is called from
> +                        * potentially unknown contexts. As such, use the global
> +                        * allocator which doesn't require the per-sb stat_lock.
> +                        */
> +                       inode->i_ino = get_next_ino();
> +               } else {
> +                       spin_lock(&sbinfo->stat_lock);
> +                       if (unlikely(sbinfo->next_ino > UINT_MAX)) {
> +                               /*
> +                                * Emulate get_next_ino uint wraparound for
> +                                * compatibility
> +                                */
> +                               sbinfo->next_ino = 1;
> +                       }
> +                       inode->i_ino = sbinfo->next_ino++;
> +                       spin_unlock(&sbinfo->stat_lock);
> +               }
> +
>                 inode_init_owner(inode, dir, mode);
>                 inode->i_blocks = 0;
>                 inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> @@ -3662,6 +3692,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>  #else
>         sb->s_flags |= SB_NOUSER;
>  #endif
> +       sbinfo->next_ino = 1;
>         sbinfo->max_blocks = ctx->blocks;
>         sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
>         sbinfo->uid = ctx->uid;
> --
> 2.24.1
>


* Re: [PATCH v3 2/2] tmpfs: Support 64-bit inums per-sb
  2020-01-03 17:30 ` [PATCH v3 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
@ 2020-01-04 19:15   ` Amir Goldstein
  0 siblings, 0 replies; 8+ messages in thread
From: Amir Goldstein @ 2020-01-04 19:15 UTC (permalink / raw)
  To: Chris Down
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

On Fri, Jan 3, 2020 at 7:30 PM Chris Down <chris@chrisdown.name> wrote:
>
> The default is still set to inode32 for backwards compatibility, but
> system administrators can opt in to the new 64-bit inode numbers by
> either:
>
> 1. Passing inode64 on the command line when mounting, or
> 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y
>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---

You may want to add that the inode32|inode64 mount options take after the
mount options of the same name, with a similar meaning, for xfs mounts.

You may add:
Reviewed-by: Amir Goldstein <amir73il@gmail.com>

>  Documentation/filesystems/tmpfs.txt | 11 +++++
>  fs/Kconfig                          | 15 +++++++
>  include/linux/shmem_fs.h            |  1 +
>  mm/shmem.c                          | 66 ++++++++++++++++++++++++++++-
>  4 files changed, 92 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
> index 5ecbc03e6b2f..203e12a684c9 100644
> --- a/Documentation/filesystems/tmpfs.txt
> +++ b/Documentation/filesystems/tmpfs.txt
> @@ -136,6 +136,15 @@ These options do not have any effect on remount. You can change these
>  parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem.
>
>
> +tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode
> +numbers:
> +
> +inode64   Use 64-bit inode numbers
> +inode32   Use 32-bit inode numbers
> +
> +On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is
> +not legal and will produce an error at mount time.
> +
>  So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
>  will give you tmpfs instance on /mytmpfs which can allocate 10GB
>  RAM/SWAP in 10240 inodes and it is only accessible by root.
> @@ -147,3 +156,5 @@ Updated:
>     Hugh Dickins, 4 June 2007
>  Updated:
>     KOSAKI Motohiro, 16 Mar 2010
> +Updated:
> +   Chris Down, 2 Jan 2020
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 7b623e9fc1b0..e74f127df22a 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -199,6 +199,21 @@ config TMPFS_XATTR
>
>           If unsure, say N.
>
> +config TMPFS_INODE64
> +       bool "Use 64-bit ino_t by default in tmpfs"
> +       depends on TMPFS && 64BIT
> +       default n
> +       help
> +         tmpfs has historically used only inode numbers as wide as an unsigned
> +         int. In some cases this can cause wraparound, potentially resulting in
> +         multiple files with the same inode number on a single device. This option
> +         makes tmpfs use the full width of ino_t by default, similarly to the
> +         inode64 mount option.
> +
> +         To override this default, use the inode32 or inode64 mount options.
> +
> +         If unsure, say N.
> +
>  config HUGETLBFS
>         bool "HugeTLB file system support"
>         depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 7fac91f490dc..8925eb774518 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -35,6 +35,7 @@ struct shmem_sb_info {
>         unsigned char huge;         /* Whether to try for hugepages */
>         kuid_t uid;                 /* Mount uid for root directory */
>         kgid_t gid;                 /* Mount gid for root directory */
> +       bool full_inums;            /* If i_ino should be uint or ino_t */
>         ino_t next_ino;             /* The next per-sb inode number to use */
>         struct mempolicy *mpol;     /* default memory policy for mappings */
>         spinlock_t shrinklist_lock;   /* Protects shrinklist */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 638b1e30625f..a6b43593e852 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -115,11 +115,13 @@ struct shmem_options {
>         kuid_t uid;
>         kgid_t gid;
>         umode_t mode;
> +       bool full_inums;
>         int huge;
>         int seen;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
> +#define SHMEM_SEEN_INUMS 8
>  };
>
>  #ifdef CONFIG_TMPFS
> @@ -2264,15 +2266,24 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>                          * since max_inodes is always 0, and is called from
>                          * potentially unknown contexts. As such, use the global
>                          * allocator which doesn't require the per-sb stat_lock.
> +                        *
> +                        * No special behaviour is needed for
> +                        * sbinfo->full_inums, because it's not possible to
> +                        * manually set on callers of this type, and
> +                        * CONFIG_TMPFS_INODE64 only applies to user-visible
> +                        * mounts.
>                          */
>                         inode->i_ino = get_next_ino();
>                 } else {
>                         spin_lock(&sbinfo->stat_lock);
> -                       if (unlikely(sbinfo->next_ino > UINT_MAX)) {
> +                       if (unlikely(!sbinfo->full_inums &&
> +                                    sbinfo->next_ino > UINT_MAX)) {
>                                 /*
>                                  * Emulate get_next_ino uint wraparound for
>                                  * compatibility
>                                  */
> +                               pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n",
> +                                       __func__, MINOR(sb->s_dev));
>                                 sbinfo->next_ino = 1;
>                         }
>                         inode->i_ino = sbinfo->next_ino++;
> @@ -3409,6 +3420,8 @@ enum shmem_param {
>         Opt_nr_inodes,
>         Opt_size,
>         Opt_uid,
> +       Opt_inode32,
> +       Opt_inode64,
>  };
>
>  static const struct fs_parameter_spec shmem_param_specs[] = {
> @@ -3420,6 +3433,8 @@ static const struct fs_parameter_spec shmem_param_specs[] = {
>         fsparam_string("nr_inodes",     Opt_nr_inodes),
>         fsparam_string("size",          Opt_size),
>         fsparam_u32   ("uid",           Opt_uid),
> +       fsparam_flag  ("inode32",       Opt_inode32),
> +       fsparam_flag  ("inode64",       Opt_inode64),
>         {}
>  };
>
> @@ -3444,6 +3459,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>         unsigned long long size;
>         char *rest;
>         int opt;
> +       const char *err;
>
>         opt = fs_parse(fc, &shmem_fs_parameters, param, &result);
>         if (opt < 0)
> @@ -3505,6 +3521,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>                         break;
>                 }
>                 goto unsupported_parameter;
> +       case Opt_inode32:
> +               ctx->full_inums = false;
> +               ctx->seen |= SHMEM_SEEN_INUMS;
> +               break;
> +       case Opt_inode64:
> +               if (sizeof(ino_t) < 8) {
> +                       err = "Cannot use inode64 with <64bit inums in kernel";
> +                       goto err_msg;
> +               }
> +               ctx->full_inums = true;
> +               ctx->seen |= SHMEM_SEEN_INUMS;
> +               break;
>         }
>         return 0;
>
> @@ -3512,6 +3540,8 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
>         return invalf(fc, "tmpfs: Unsupported parameter '%s'", param->key);
>  bad_value:
>         return invalf(fc, "tmpfs: Bad value for '%s'", param->key);
> +err_msg:
> +       return invalf(fc, "tmpfs: %s", err);
>  }
>
>  static int shmem_parse_options(struct fs_context *fc, void *data)
> @@ -3596,8 +3626,16 @@ static int shmem_reconfigure(struct fs_context *fc)
>                 }
>         }
>
> +       if ((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums &&
> +           sbinfo->next_ino > UINT_MAX) {
> +               err = "Current inum too high to switch to 32-bit inums";
> +               goto out;
> +       }
> +
>         if (ctx->seen & SHMEM_SEEN_HUGE)
>                 sbinfo->huge = ctx->huge;
> +       if (ctx->seen & SHMEM_SEEN_INUMS)
> +               sbinfo->full_inums = ctx->full_inums;
>         if (ctx->seen & SHMEM_SEEN_BLOCKS)
>                 sbinfo->max_blocks  = ctx->blocks;
>         if (ctx->seen & SHMEM_SEEN_INODES) {
> @@ -3637,6 +3675,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>         if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
>                 seq_printf(seq, ",gid=%u",
>                                 from_kgid_munged(&init_user_ns, sbinfo->gid));
> +
> +       /*
> +        * Showing inode{64,32} might be useful even if it's the system default,
> +        * since then people don't have to resort to checking both here and
> +        * /proc/config.gz to confirm 64-bit inums were successfully applied
> +        * (which may not even exist if IKCONFIG_PROC isn't enabled).
> +        *
> +        * We hide it when inode64 isn't the default and we are using 32-bit
> +        * inodes, since that probably just means the feature isn't even under
> +        * consideration.
> +        *
> +        * As such:
> +        *
> +        *                     +-----------------+-----------------+
> +        *                     | TMPFS_INODE64=y | TMPFS_INODE64=n |
> +        *  +------------------+-----------------+-----------------+
> +        *  | full_inums=true  | show            | show            |
> +        *  | full_inums=false | show            | hide            |
> +        *  +------------------+-----------------+-----------------+
> +        *
> +        */
> +       if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums)
> +               seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 64 : 32));
>  #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>         /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
>         if (sbinfo->huge)
> @@ -3684,6 +3745,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>                         ctx->blocks = shmem_default_max_blocks();
>                 if (!(ctx->seen & SHMEM_SEEN_INODES))
>                         ctx->inodes = shmem_default_max_inodes();
> +               if (!(ctx->seen & SHMEM_SEEN_INUMS))
> +                       ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64);
>         } else {
>                 sb->s_flags |= SB_NOUSER;
>         }
> @@ -3697,6 +3760,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
>         sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
>         sbinfo->uid = ctx->uid;
>         sbinfo->gid = ctx->gid;
> +       sbinfo->full_inums = ctx->full_inums;
>         sbinfo->mode = ctx->mode;
>         sbinfo->huge = ctx->huge;
>         sbinfo->mpol = ctx->mpol;
> --
> 2.24.1
>


* Re: [PATCH v3 0/2] fs: inode: shmem: Reduce risk of inum overflow
  2020-01-03 17:30 [PATCH v3 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down
@ 2020-01-04 21:16   ` Amir Goldstein
  2020-01-03 17:30 ` [PATCH v3 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
  2020-01-04 21:16   ` Amir Goldstein
  2 siblings, 0 replies; 8+ messages in thread
From: Amir Goldstein @ 2020-01-04 21:16 UTC (permalink / raw)
  To: Chris Down
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team,
	Hugh Dickins, zhengbin (A),
	Linux MM, Andrew Morton

On Fri, Jan 3, 2020 at 7:30 PM Chris Down <chris@chrisdown.name> wrote:
>
> In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
> On affected tiers, in excess of 10% of hosts show multiple files with
> different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but
> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to uniquely point to one inode, which seems fair enough.
> One might also need to check the generation, but in this case:
>
> 1. That's not currently exposed to userspace
>    (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
> 2. Even with generation, there shouldn't be two live inodes with the
>    same inode number on one device.
>
> In order to mitigate this, we take a two-pronged approach:
>
> 1. Moving inum generation from being global to per-sb for tmpfs. This
>    itself allows some reduction in i_ino churn. This works on both 64-
>    and 32- bit machines.
> 2. Adding inode{64,32} for tmpfs. This fix is supported on machines with
>    64-bit ino_t only: we allow users to mount tmpfs with a new inode64
>    option that uses the full width of ino_t, or CONFIG_TMPFS_INODE64.
>
> Chris Down (2):
>   tmpfs: Add per-superblock i_ino support
>   tmpfs: Support 64-bit inums per-sb
>
>  Documentation/filesystems/tmpfs.txt | 11 ++++
>  fs/Kconfig                          | 15 +++++
>  include/linux/shmem_fs.h            |  2 +
>  mm/shmem.c                          | 97 ++++++++++++++++++++++++++++-
>  4 files changed, 124 insertions(+), 1 deletion(-)
>

CC tmpfs maintainer, linux-mm and Andrew Morton, who is the one sending
most of the tmpfs patches to Linus.

Also worth mentioning these previous attempts by zhengbin, which tried to
address the same problem without the per-sb ino counter approach:

https://patchwork.kernel.org/patch/11254001/
https://patchwork.kernel.org/patch/11023915/

Thanks,
Amir.


* Re: [PATCH v3 1/2] tmpfs: Add per-superblock i_ino support
  2020-01-04 19:09   ` Amir Goldstein
@ 2020-01-05 11:28     ` Chris Down
  0 siblings, 0 replies; 8+ messages in thread
From: Chris Down @ 2020-01-05 11:28 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

Amir Goldstein writes:
>Some nits. When fixed you may add:
>Reviewed-by: Amir Goldstein <amir73il@gmail.com>

Thanks! I'll fix these and the comments on the other patches, and send v4 here
with linux-mm and the tmpfs maintainer on cc. :-)
