linux-fsdevel.vger.kernel.org archive mirror
* [PATCH 0/3] fs: inode: shmem: Reduce risk of inum overflow
@ 2019-12-27 14:30 Chris Down
  2019-12-27 14:30 ` [PATCH 1/3] fs: inode: Recycle volatile inode numbers from private slabs Chris Down
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Chris Down @ 2019-12-27 14:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs.
On affected tiers, in excess of 10% of hosts show multiple files with
different content but the same inode number, and some servers have as
many as 150 such duplicated inode numbers.

This causes actual, tangible problems in production. For example, we
have complaints from those working on remote caches that their
application reports cache corruption: it uses (device, inodenum) to
establish the identity of a particular cache object, and when that pair
is no longer unique, the application refuses to continue. Even worse,
an application may not detect the collision at all and carry on
regardless, causing phantom, hard-to-debug behaviour.

In general, userspace applications expect that (device, inodenum) is
enough to uniquely identify one inode, which seems fair enough.
One might also need to check the generation, but in this case:

1. That's not currently exposed to userspace
   (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the
   same inode number on one device.

In order to mitigate this, we take a two-pronged approach:

1. A mitigation that works for both 32- and 64-bit inode numbers: we
   reuse inode numbers from recycled slab objects where possible (ie.
   where the filesystem uses its own private inode slab instead of the
   shared one), allowing us to significantly reduce the risk of 32-bit
   wraparound.
2. A fix that works only on machines with a 64-bit ino_t: we allow
   users to mount tmpfs with a new inode64 option that uses the full
   width of ino_t. Other filesystems can also use get_next_ino_full for
   similar behaviour if desired.

Chris Down (3):
  fs: inode: Recycle volatile inode numbers from private slabs
  fs: inode: Add API to retrieve global next ino using full ino_t width
  shmem: Add support for using full width of ino_t

 fs/hugetlbfs/inode.c     |  4 +++-
 fs/inode.c               | 44 +++++++++++++++++++++++++++++++++++++---
 include/linux/fs.h       |  1 +
 include/linux/shmem_fs.h |  1 +
 mm/shmem.c               | 41 ++++++++++++++++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 5 deletions(-)

-- 
2.24.1



* [PATCH 1/3] fs: inode: Recycle volatile inode numbers from private slabs
  2019-12-27 14:30 [PATCH 0/3] fs: inode: shmem: Reduce risk of inum overflow Chris Down
@ 2019-12-27 14:30 ` Chris Down
  2019-12-27 14:30 ` [PATCH 2/3] fs: inode: Add API to retrieve global next ino using full ino_t width Chris Down
  2019-12-27 14:30 ` [PATCH 3/3] shmem: Add support for using full width of ino_t Chris Down
  2 siblings, 0 replies; 8+ messages in thread
From: Chris Down @ 2019-12-27 14:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

One limitation of this approach is that slab recycling is currently
only per-memcg. This means workloads which heavily exercise
get_next_ino within the same memcg are most likely to benefit, rather
than those thrashing it across a wide range of cgroups. Depending on
the workload, I've seen recycle rates from 10% to 50%, which seems like
a reasonable win with no significant increase in code complexity,
although it of course doesn't fix the problem entirely.

Signed-off-by: Chris Down <chris@chrisdown.name>
Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 fs/hugetlbfs/inode.c | 4 +++-
 fs/inode.c           | 5 +++++
 mm/shmem.c           | 4 +++-
 3 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d5c2a3158610..7b8fc84299c8 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -732,7 +732,9 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		/* Recycle to avoid 32-bit wraparound where possible */
+		if (!inode->i_ino)
+			inode->i_ino = get_next_ino();
 		inode->i_mode = S_IFDIR | ctx->mode;
 		inode->i_uid = ctx->uid;
 		inode->i_gid = ctx->gid;
diff --git a/fs/inode.c b/fs/inode.c
index aff2b5831168..255a4ae81b65 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -880,6 +880,11 @@ static struct inode *find_inode_fast(struct super_block *sb,
 #define LAST_INO_BATCH 1024
 static DEFINE_PER_CPU(unsigned int, last_ino);
 
+/*
+ * As get_next_ino returns a type with a small width (typically 32 bits),
+ * consider reusing inode numbers in your filesystem if you have a private inode
+ * cache in order to reduce the risk of wraparound.
+ */
 unsigned int get_next_ino(void)
 {
 	unsigned int *p = &get_cpu_var(last_ino);
diff --git a/mm/shmem.c b/mm/shmem.c
index 165fa6332993..ff041cb15550 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2247,7 +2247,9 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		/* Recycle to avoid 32-bit wraparound where possible */
+		if (!inode->i_ino)
+			inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
-- 
2.24.1



* [PATCH 2/3] fs: inode: Add API to retrieve global next ino using full ino_t width
  2019-12-27 14:30 [PATCH 0/3] fs: inode: shmem: Reduce risk of inum overflow Chris Down
  2019-12-27 14:30 ` [PATCH 1/3] fs: inode: Recycle volatile inode numbers from private slabs Chris Down
@ 2019-12-27 14:30 ` Chris Down
  2019-12-27 15:32   ` Amir Goldstein
  2019-12-27 14:30 ` [PATCH 3/3] shmem: Add support for using full width of ino_t Chris Down
  2 siblings, 1 reply; 8+ messages in thread
From: Chris Down @ 2019-12-27 14:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

We can't just wholesale replace get_next_ino with a version using ino_t
and 64-bit atomics, for a couple of reasons:

1. This may break some 32-bit userspace applications on a 64-bit kernel
   which cannot handle a 64-bit ino_t -- see the comment above
   get_next_ino;
2. Some code inside the kernel already makes use of the high bits of
   ino_t. For example, overlayfs' xino feature uses them to merge
   inode numbers and fsid indexes to form a new identifier.

As such, we need to make using the full width of ino_t an option for
users, not a requirement.

This will later be used to power inode64 in tmpfs, and can be used
elsewhere for other filesystems as desired.

Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 fs/inode.c         | 41 +++++++++++++++++++++++++++++++++++++----
 include/linux/fs.h |  1 +
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 255a4ae81b65..1d96ad8b71f6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -878,16 +878,17 @@ static struct inode *find_inode_fast(struct super_block *sb,
  * here to attempt to avoid that.
  */
 #define LAST_INO_BATCH 1024
-static DEFINE_PER_CPU(unsigned int, last_ino);
+static DEFINE_PER_CPU(unsigned int, last_ino_ui);
 
 /*
  * As get_next_ino returns a type with a small width (typically 32 bits),
  * consider reusing inode numbers in your filesystem if you have a private inode
- * cache in order to reduce the risk of wraparound.
+ * cache in order to reduce the risk of wraparound, or consider providing the
+ * option to use get_next_ino_full instead.
  */
 unsigned int get_next_ino(void)
 {
-	unsigned int *p = &get_cpu_var(last_ino);
+	unsigned int *p = &get_cpu_var(last_ino_ui);
 	unsigned int res = *p;
 
 #ifdef CONFIG_SMP
@@ -904,11 +905,43 @@ unsigned int get_next_ino(void)
 	if (unlikely(!res))
 		res++;
 	*p = res;
-	put_cpu_var(last_ino);
+	put_cpu_var(last_ino_ui);
 	return res;
 }
 EXPORT_SYMBOL(get_next_ino);
 
+static DEFINE_PER_CPU(ino_t, last_ino_full);
+
+ino_t get_next_ino_full(void)
+{
+	ino_t *p = &get_cpu_var(last_ino_full);
+	ino_t res = *p;
+
+#ifdef CONFIG_SMP
+	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
+		static atomic64_t shared_last_ino;
+		u64 next = atomic64_add_return(LAST_INO_BATCH,
+					       &shared_last_ino);
+
+		/*
+		 * This might get truncated if ino_t is 32-bit, and so be more
+		 * susceptible to wrap around than on environments where ino_t
+		 * is 64-bit.
+		 */
+		res = next - LAST_INO_BATCH;
+	}
+#endif
+
+	res++;
+	/* get_next_ino_full should not provide a 0 inode number */
+	if (unlikely(!res))
+		res++;
+	*p = res;
+	put_cpu_var(last_ino_full);
+	return res;
+}
+EXPORT_SYMBOL(get_next_ino_full);
+
 /**
  *	new_inode_pseudo 	- obtain an inode
  *	@sb: superblock
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 190c45039359..c73896d993c1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3053,6 +3053,7 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
 extern void unlock_new_inode(struct inode *);
 extern void discard_new_inode(struct inode *);
 extern unsigned int get_next_ino(void);
+extern ino_t get_next_ino_full(void);
 extern void evict_inodes(struct super_block *sb);
 
 extern void __iget(struct inode * inode);
-- 
2.24.1



* [PATCH 3/3] shmem: Add support for using full width of ino_t
  2019-12-27 14:30 [PATCH 0/3] fs: inode: shmem: Reduce risk of inum overflow Chris Down
  2019-12-27 14:30 ` [PATCH 1/3] fs: inode: Recycle volatile inode numbers from private slabs Chris Down
  2019-12-27 14:30 ` [PATCH 2/3] fs: inode: Add API to retrieve global next ino using full ino_t width Chris Down
@ 2019-12-27 14:30 ` Chris Down
  2019-12-27 15:26   ` Amir Goldstein
  2 siblings, 1 reply; 8+ messages in thread
From: Chris Down @ 2019-12-27 14:30 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

The new inode64 option now uses get_next_ino_full, which always uses the
full width of ino_t (as opposed to get_next_ino, which always uses
unsigned int).

Using inode64 makes inode number wraparound significantly less likely,
at the cost of breaking features that rely on the underlying filesystem
leaving the highest 32 bits of ino_t clear (eg. overlayfs' xino).

Signed-off-by: Chris Down <chris@chrisdown.name>
Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 include/linux/shmem_fs.h |  1 +
 mm/shmem.c               | 41 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index de8e4b71e3ba..d7727d0d687f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@ struct shmem_sb_info {
 	unsigned char huge;	    /* Whether to try for hugepages */
 	kuid_t uid;		    /* Mount uid for root directory */
 	kgid_t gid;		    /* Mount gid for root directory */
+	bool small_inums;	    /* i_ino width unsigned int or ino_t */
 	struct mempolicy *mpol;     /* default memory policy for mappings */
 	spinlock_t shrinklist_lock;   /* Protects shrinklist */
 	struct list_head shrinklist;  /* List of shinkable inodes */
diff --git a/mm/shmem.c b/mm/shmem.c
index ff041cb15550..56cf581ec66d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -115,11 +115,13 @@ struct shmem_options {
 	kuid_t uid;
 	kgid_t gid;
 	umode_t mode;
+	bool small_inums;
 	int huge;
 	int seen;
 #define SHMEM_SEEN_BLOCKS 1
 #define SHMEM_SEEN_INODES 2
 #define SHMEM_SEEN_HUGE 4
+#define SHMEM_SEEN_INUMS 8
 };
 
 #ifdef CONFIG_TMPFS
@@ -2248,8 +2250,12 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 	inode = new_inode(sb);
 	if (inode) {
 		/* Recycle to avoid 32-bit wraparound where possible */
-		if (!inode->i_ino)
-			inode->i_ino = get_next_ino();
+		if (!inode->i_ino) {
+			if (sbinfo->small_inums)
+				inode->i_ino = get_next_ino();
+			else
+				inode->i_ino = get_next_ino_full();
+		}
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -3380,6 +3386,8 @@ enum shmem_param {
 	Opt_nr_inodes,
 	Opt_size,
 	Opt_uid,
+	Opt_inode32,
+	Opt_inode64,
 };
 
 static const struct fs_parameter_spec shmem_param_specs[] = {
@@ -3391,6 +3399,8 @@ static const struct fs_parameter_spec shmem_param_specs[] = {
 	fsparam_string("nr_inodes",	Opt_nr_inodes),
 	fsparam_string("size",		Opt_size),
 	fsparam_u32   ("uid",		Opt_uid),
+	fsparam_flag  ("inode32",	Opt_inode32),
+	fsparam_flag  ("inode64",	Opt_inode64),
 	{}
 };
 
@@ -3415,6 +3425,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	unsigned long long size;
 	char *rest;
 	int opt;
+	const char *err;
 
 	opt = fs_parse(fc, &shmem_fs_parameters, param, &result);
 	if (opt < 0)
@@ -3476,6 +3487,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 			break;
 		}
 		goto unsupported_parameter;
+	case Opt_inode32:
+		ctx->small_inums = true;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
+	case Opt_inode64:
+		if (sizeof(ino_t) < 8) {
+			err = "Cannot use inode64 with <64bit inums in kernel";
+			goto err_msg;
+		}
+		ctx->small_inums = false;
+		ctx->seen |= SHMEM_SEEN_INUMS;
+		break;
 	}
 	return 0;
 
@@ -3483,6 +3506,8 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param)
 	return invalf(fc, "tmpfs: Unsupported parameter '%s'", param->key);
 bad_value:
 	return invalf(fc, "tmpfs: Bad value for '%s'", param->key);
+err_msg:
+	return invalf(fc, "tmpfs: %s", err);
 }
 
 static int shmem_parse_options(struct fs_context *fc, void *data)
@@ -3567,6 +3592,15 @@ static int shmem_reconfigure(struct fs_context *fc)
 		}
 	}
 
+	/*
+	 * get_next_ino and get_next_ino_full have different static counters, so
+	 * it's not safe to change once we started or we could get duplicates
+	 */
+	if (ctx->seen & SHMEM_SEEN_INUMS) {
+		err = "Cannot retroactively change inode number size";
+		goto out;
+	}
+
 	if (ctx->seen & SHMEM_SEEN_HUGE)
 		sbinfo->huge = ctx->huge;
 	if (ctx->seen & SHMEM_SEEN_BLOCKS)
@@ -3608,6 +3642,7 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
 	if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
 		seq_printf(seq, ",gid=%u",
 				from_kgid_munged(&init_user_ns, sbinfo->gid));
+	seq_printf(seq, ",inode%d", (sbinfo->small_inums ? 32 : 64));
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
 	/* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
 	if (sbinfo->huge)
@@ -3667,6 +3702,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
 	sbinfo->uid = ctx->uid;
 	sbinfo->gid = ctx->gid;
+	sbinfo->small_inums = ctx->small_inums;
 	sbinfo->mode = ctx->mode;
 	sbinfo->huge = ctx->huge;
 	sbinfo->mpol = ctx->mpol;
@@ -3879,6 +3915,7 @@ int shmem_init_fs_context(struct fs_context *fc)
 	ctx->mode = 0777 | S_ISVTX;
 	ctx->uid = current_fsuid();
 	ctx->gid = current_fsgid();
+	ctx->small_inums = true;
 
 	fc->fs_private = ctx;
 	fc->ops = &shmem_fs_context_ops;
-- 
2.24.1



* Re: [PATCH 3/3] shmem: Add support for using full width of ino_t
  2019-12-27 14:30 ` [PATCH 3/3] shmem: Add support for using full width of ino_t Chris Down
@ 2019-12-27 15:26   ` Amir Goldstein
  2019-12-27 16:35     ` Chris Down
  0 siblings, 1 reply; 8+ messages in thread
From: Amir Goldstein @ 2019-12-27 15:26 UTC (permalink / raw)
  To: Chris Down
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

On Fri, Dec 27, 2019 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
>
> The new inode64 option now uses get_next_ino_full, which always uses the
> full width of ino_t (as opposed to get_next_ino, which always uses
> unsigned int).
>
> Using inode64 makes inode number wraparound significantly less likely,
> at the cost of making some features that rely on the underlying
> filesystem not setting any of the highest 32 bits (eg. overlayfs' xino)
> not usable.

That's not an accurate statement. overlayfs xino just needs some high
bits available. Therefore I never had any objection to having tmpfs use
64bit ino values (from overlayfs' perspective). My only objection is to
using the same pool "irresponsibly" instead of a per-sb pool for the
heavy users.

>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  include/linux/shmem_fs.h |  1 +
>  mm/shmem.c               | 41 ++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 40 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index de8e4b71e3ba..d7727d0d687f 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -35,6 +35,7 @@ struct shmem_sb_info {
>         unsigned char huge;         /* Whether to try for hugepages */
>         kuid_t uid;                 /* Mount uid for root directory */
>         kgid_t gid;                 /* Mount gid for root directory */
> +       bool small_inums;           /* i_ino width unsigned int or ino_t */
>         struct mempolicy *mpol;     /* default memory policy for mappings */
>         spinlock_t shrinklist_lock;   /* Protects shrinklist */
>         struct list_head shrinklist;  /* List of shinkable inodes */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index ff041cb15550..56cf581ec66d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -115,11 +115,13 @@ struct shmem_options {
>         kuid_t uid;
>         kgid_t gid;
>         umode_t mode;
> +       bool small_inums;
>         int huge;
>         int seen;
>  #define SHMEM_SEEN_BLOCKS 1
>  #define SHMEM_SEEN_INODES 2
>  #define SHMEM_SEEN_HUGE 4
> +#define SHMEM_SEEN_INUMS 8
>  };
>
>  #ifdef CONFIG_TMPFS
> @@ -2248,8 +2250,12 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
>         inode = new_inode(sb);
>         if (inode) {
>                 /* Recycle to avoid 32-bit wraparound where possible */
> -               if (!inode->i_ino)
> -                       inode->i_ino = get_next_ino();
> +               if (!inode->i_ino) {
> +                       if (sbinfo->small_inums)
> +                               inode->i_ino = get_next_ino();
> +                       else
> +                               inode->i_ino = get_next_ino_full();
> +               }

Ouch! You set yourself a trap in patch #1 and stepped into it here.
The shmem driver has a single shmem_inode_cachep serving all tmpfs
instances, so you cannot use different ino allocators and recycle ino
numbers from the same inode cache pool. Sorry I did not see it
coming...

I'm afraid that the recycling method cannot work alongside a per-sb
ino allocator :/ (unless get_next_ino() were a truncated version of the
same counter as get_next_ino_full()).

You could still apply the recycling method if the shmem inode cache
were never tainted by any other ino allocator, but that would lead to
unpredictable results for users and hard-to-document behavior.

IOW, I don't see a decent way out besides making the shmem ino
allocator properly per-sb, and doing that *before* adding the per-sb
mount option inode64.

And because of this, I think the global get_next_ino_full() API is
a mistake.

Thanks,
Amir.


* Re: [PATCH 2/3] fs: inode: Add API to retrieve global next ino using full ino_t width
  2019-12-27 14:30 ` [PATCH 2/3] fs: inode: Add API to retrieve global next ino using full ino_t width Chris Down
@ 2019-12-27 15:32   ` Amir Goldstein
  0 siblings, 0 replies; 8+ messages in thread
From: Amir Goldstein @ 2019-12-27 15:32 UTC (permalink / raw)
  To: Chris Down
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

On Fri, Dec 27, 2019 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
>
> We can't just wholesale replace get_next_ino to use ino_t and 64-bit
> atomics because of a few setbacks:
>
> 1. This may break some 32-bit userspace applications on a 64-bit kernel
>    which cannot handle a 64-bit ino_t -- see the comment above
>    get_next_ino;
> 2. Some applications inside the kernel already make use of the ino_t
>    high bits. For example, overlayfs' xino feature uses these to merge
>    inode numbers and fsid indexes to form a new identifier.
>
> As such we need to make using the full width of ino_t an option for
> users without being a requirement.
>
> This will later be used to power inode64 in tmpfs, and can be used
> elsewhere for other filesystems as desired.
>

Unless I am missing something, I think making get_next_ino() a global
counter was short-sighted to begin with, or it was never intended to be
used by filesystems with massive inode usage.

So I think that introducing another global counter intended for such
heavy use is a mistake.

I think tmpfs should be converted to use a per-sb ino allocator.
With a per-sb allocator, offering a per-instance choice between 32-bit
and 64-bit ino numbers won't be a problem.

Thanks,
Amir.


* Re: [PATCH 3/3] shmem: Add support for using full width of ino_t
  2019-12-27 15:26   ` Amir Goldstein
@ 2019-12-27 16:35     ` Chris Down
  2019-12-28  4:20       ` Amir Goldstein
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Down @ 2019-12-27 16:35 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team

Amir Goldstein writes:
>On Fri, Dec 27, 2019 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
>>
>> The new inode64 option now uses get_next_ino_full, which always uses the
>> full width of ino_t (as opposed to get_next_ino, which always uses
>> unsigned int).
>>
>> Using inode64 makes inode number wraparound significantly less likely,
>> at the cost of making some features that rely on the underlying
>> filesystem not setting any of the highest 32 bits (eg. overlayfs' xino)
>> not usable.
>
>That's not an accurate statement. overlayfs xino just needs some high
>bits available. Therefore I never had any objection to having tmpfs use
>64bit ino values (from overlayfs perspective). My only objection is to
>use the same pool "irresponsibly" instead of per-sb pool for the heavy
>users.

Per-sb get_next_ino is fine, but seems less important if inode64 is used. Or is 
your point about people who would still be using inode32?

I think things have become quite unclear in previous discussions, so I want to 
make sure we're all on the same page here. Are you saying you would 
theoretically ack the following series?

1. Recycle volatile slabs in tmpfs/hugetlbfs
2. Make get_next_ino per-sb
3. Make get_next_ino_full (which is also per-sb)
4. Add inode{32,64} to tmpfs

To keep this thread as high signal as possible, I'll avoid sending any other 
patches until I hear back on that :-)

Thanks again,

Chris


* Re: [PATCH 3/3] shmem: Add support for using full width of ino_t
  2019-12-27 16:35     ` Chris Down
@ 2019-12-28  4:20       ` Amir Goldstein
  0 siblings, 0 replies; 8+ messages in thread
From: Amir Goldstein @ 2019-12-28  4:20 UTC (permalink / raw)
  To: Chris Down
  Cc: linux-fsdevel, Al Viro, Matthew Wilcox, Jeff Layton,
	Johannes Weiner, Tejun Heo, linux-kernel, kernel-team,
	Hugh Dickins, zhengbin (A)

On Fri, Dec 27, 2019 at 6:35 PM Chris Down <chris@chrisdown.name> wrote:
>
> Amir Goldstein writes:
> >On Fri, Dec 27, 2019 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
> >>
> >> The new inode64 option now uses get_next_ino_full, which always uses the
> >> full width of ino_t (as opposed to get_next_ino, which always uses
> >> unsigned int).
> >>
> >> Using inode64 makes inode number wraparound significantly less likely,
> >> at the cost of making some features that rely on the underlying
> >> filesystem not setting any of the highest 32 bits (eg. overlayfs' xino)
> >> not usable.
> >
> >That's not an accurate statement. overlayfs xino just needs some high
> >bits available. Therefore I never had any objection to having tmpfs use
> >64bit ino values (from overlayfs perspective). My only objection is to
> >use the same pool "irresponsibly" instead of per-sb pool for the heavy
> >users.
>
> Per-sb get_next_ino is fine, but seems less important if inode64 is used. Or is
> your point about people who would still be using inode32?
>
> I think things have become quite unclear in previous discussions, so I want to
> make sure we're all on the same page here. Are you saying you would
> theoretically ack the following series?
>
> 1. Recycle volatile slabs in tmpfs/hugetlbfs
> 2. Make get_next_ino per-sb
> 3. Make get_next_ino_full (which is also per-sb)
> 4. Add inode{32,64} to tmpfs

Not what I meant. On the contrary:
1. Recycling ino numbers from the slab is a nice idea, but it is not
    applicable alongside a per-sb ino allocator, so you can't use it
    for tmpfs
2. Leave get_next_ino() alone - it is used by things like pipe(2)
    that you don't want to mess with
3. Don't create another global ino allocator
4. inode{32,64} option to tmpfs is the only thing you need

We've made quite a big mess of a problem that is not really that big.

In the thread on zhengbin's patch you can find the simple solution
that Google is using for this problem:
https://patchwork.kernel.org/patch/11254001/#23014383

The only thing keeping this solution away from upstream, according to
the tmpfs maintainer, is the concern about breaking legacy 32-bit apps.

If exposing the high ino bits is made opt-in via a mount and/or Kconfig
option, this concern would be mitigated and Google's private solution
for tmpfs ino numbers could go upstream.

Hugh did not specify whether sbinfo->next_ino is incremented under
sbinfo->stat_lock or some other lock (maybe he can share a link to
the actual patch?), but shmem_reserve_inode() already takes that
lock anyway, so I don't see the need for any further
micro-optimizations.

Chris, I hope the solution I am proposing is clear now and I hope I am
not leading you by mistake into another trap...

To be clear, solution should be dead simple and contained to tmpfs.
If you like, you could clone exact same solution to hugetlbfs, but no
new vfs helpers please.

Thanks,
Amir.

