* [PATCH v5 0/2] fs: inode: shmem: Reduce risk of inum overflow
@ 2020-01-05 12:05 Chris Down
  2020-01-05 12:06 ` [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support Chris Down
  2020-01-05 12:06 ` [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
  0 siblings, 2 replies; 27+ messages in thread
From: Chris Down @ 2020-01-05 12:05 UTC (permalink / raw)
To: linux-mm
Cc: Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein,
    Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel,
    kernel-team

In Facebook production we are seeing heavy i_ino wraparounds on tmpfs. On
affected tiers, in excess of 10% of hosts show multiple files with different
content and the same inode number, with some servers even having as many as
150 duplicated inode numbers with differing file content.

This causes actual, tangible problems in production. For example, we have
complaints from those working on remote caches that their application reports
cache corruption: it uses (device, inodenum) to establish the identity of a
particular cache object, but because those identifiers are no longer unique,
the application refuses to continue. Even worse, sometimes applications may
not detect the corruption at all and continue anyway, causing phantom and
hard-to-debug behaviour.

In general, userspace applications expect that (device, inodenum) should be
enough to uniquely identify one inode, which seems fair enough. One might
also need to check the generation, but in this case:

1. That's not currently exposed to userspace
   (ioctl(...FS_IOC_GETVERSION...) returns ENOTTY on tmpfs);
2. Even with generation, there shouldn't be two live inodes with the same
   inode number on one device.

In order to mitigate this, we take a two-pronged approach:

1. Moving inum generation from being global to per-sb for tmpfs. This itself
   allows some reduction in i_ino churn. This works on both 64- and 32-bit
   machines.
2. Adding inode{64,32} for tmpfs.
This fix is supported on machines with 64-bit ino_t only: we allow users to
mount tmpfs with a new inode64 option that uses the full width of ino_t, or
via CONFIG_TMPFS_INODE64.

You can see how this compares to previous related patches which didn't
implement this per-superblock:

- https://patchwork.kernel.org/patch/11254001/
- https://patchwork.kernel.org/patch/11023915/

Chris Down (2):
  tmpfs: Add per-superblock i_ino support
  tmpfs: Support 64-bit inums per-sb

 Documentation/filesystems/tmpfs.txt | 11 ++++
 fs/Kconfig                          | 15 +++++
 include/linux/shmem_fs.h            |  2 +
 mm/shmem.c                          | 94 ++++++++++++++++++++++++++++-
 4 files changed, 121 insertions(+), 1 deletion(-)

-- 
2.24.1

^ permalink raw reply	[flat|nested] 27+ messages in thread
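The (device, inodenum) identity assumption described in the cover letter can
be sketched as follows. This is an illustrative userspace model, not code
from any real cache implementation; "file_key" and "same_identity" are
hypothetical names:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative cache key: the identity of a file as userspace sees it. */
struct file_key {
	uint64_t dev;	/* st_dev from stat(2) */
	uint64_t ino;	/* st_ino from stat(2) */
};

/*
 * Two objects are assumed to refer to the same file iff (st_dev, st_ino)
 * match -- exactly the invariant that i_ino wraparound on tmpfs breaks:
 * two distinct live files can compare as identical.
 */
static bool same_identity(struct file_key a, struct file_key b)
{
	return a.dev == b.dev && a.ino == b.ino;
}
```

A cache that keys objects this way has no recourse once two live inodes on
one device share an inode number, which is why the fix has to happen in the
filesystem rather than in userspace.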
* [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support
  2020-01-05 12:05 [PATCH v5 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down
@ 2020-01-05 12:06 ` Chris Down
  2020-01-06  2:03   ` zhengbin (A)
  2020-01-05 12:06 ` [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
  1 sibling, 1 reply; 27+ messages in thread
From: Chris Down @ 2020-01-05 12:06 UTC (permalink / raw)
To: linux-mm
Cc: Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein,
    Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel,
    kernel-team

get_next_ino has a number of problems:

- It uses and returns a uint, which is susceptible to overflow if a lot of
  volatile inodes that use get_next_ino are created.
- It's global, with no specificity per-sb or even per-filesystem. This means
  it's not that difficult to cause inode number wraparounds on a single
  device, which can result in having multiple distinct inodes with the same
  inode number.

This patch adds a per-superblock counter that mitigates the second case.
This design also allows us to later have a specific i_ino size per-device,
for example, allowing users to choose whether to use 32- or 64-bit inodes
for each tmpfs mount. This is implemented in the next commit.

Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Amir Goldstein <amir73il@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 include/linux/shmem_fs.h |  1 +
 mm/shmem.c               | 30 +++++++++++++++++++++++++++++-
 2 files changed, 30 insertions(+), 1 deletion(-)

v5: Nothing in code, just resending with correct linux-mm domain.
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index de8e4b71e3ba..7fac91f490dc 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -35,6 +35,7 @@ struct shmem_sb_info {
 	unsigned char huge;	/* Whether to try for hugepages */
 	kuid_t uid;		/* Mount uid for root directory */
 	kgid_t gid;		/* Mount gid for root directory */
+	ino_t next_ino;		/* The next per-sb inode number to use */
 	struct mempolicy *mpol;	/* default memory policy for mappings */
 	spinlock_t shrinklist_lock; /* Protects shrinklist */
 	struct list_head shrinklist;	/* List of shinkable inodes */
diff --git a/mm/shmem.c b/mm/shmem.c
index 8793e8cc1a48..9e97ba972225 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2236,6 +2236,12 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 	return 0;
 }
 
+/*
+ * shmem_get_inode - reserve, allocate, and initialise a new inode
+ *
+ * If this tmpfs is from kern_mount we use get_next_ino, which is global, since
+ * inum churn there is low and this avoids taking locks.
+ */
 static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir,
 				     umode_t mode, dev_t dev, unsigned long flags)
 {
@@ -2248,7 +2254,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode
 
 	inode = new_inode(sb);
 	if (inode) {
-		inode->i_ino = get_next_ino();
+		if (sb->s_flags & SB_KERNMOUNT) {
+			/*
+			 * __shmem_file_setup, one of our callers, is lock-free:
+			 * it doesn't hold stat_lock in shmem_reserve_inode
+			 * since max_inodes is always 0, and is called from
+			 * potentially unknown contexts. As such, use the global
+			 * allocator which doesn't require the per-sb stat_lock.
+			 */
+			inode->i_ino = get_next_ino();
+		} else {
+			spin_lock(&sbinfo->stat_lock);
+			if (unlikely(sbinfo->next_ino > UINT_MAX)) {
+				/*
+				 * Emulate get_next_ino uint wraparound for
+				 * compatibility
+				 */
+				sbinfo->next_ino = 1;
+			}
+			inode->i_ino = sbinfo->next_ino++;
+			spin_unlock(&sbinfo->stat_lock);
+		}
+
 		inode_init_owner(inode, dir, mode);
 		inode->i_blocks = 0;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
@@ -3662,6 +3689,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc)
 #else
 	sb->s_flags |= SB_NOUSER;
 #endif
+	sbinfo->next_ino = 1;
 	sbinfo->max_blocks = ctx->blocks;
 	sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes;
 	sbinfo->uid = ctx->uid;
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 27+ messages in thread
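The non-kernmount branch of the patch reduces to a small amount of counter
logic. A minimal userspace model of it, with the spinlock omitted (in the
kernel this runs under sbinfo->stat_lock) and ino_t widened to unsigned long
long so the wraparound condition is visible on all hosts:

```c
#include <limits.h>

/* Userspace model of the per-sb counter added by this patch. */
struct sb_model {
	unsigned long long next_ino;	/* starts at 1, as in shmem_fill_super */
};

static unsigned long long sb_next_ino(struct sb_model *sb)
{
	/* Emulate get_next_ino()'s uint wraparound for compatibility. */
	if (sb->next_ino > UINT_MAX)
		sb->next_ino = 1;
	return sb->next_ino++;
}
```

The wraparound back to 1 means collisions are still possible within one sb
after ~2^32 allocations; the second patch in the series (inode64) removes
that limit where ino_t is 64-bit.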
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-05 12:06 ` [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support Chris Down @ 2020-01-06 2:03 ` zhengbin (A) 2020-01-06 6:41 ` Amir Goldstein 2020-01-06 13:17 ` Chris Down 0 siblings, 2 replies; 27+ messages in thread From: zhengbin (A) @ 2020-01-06 2:03 UTC (permalink / raw) To: Chris Down, linux-mm Cc: Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On 2020/1/5 20:06, Chris Down wrote: > get_next_ino has a number of problems: > > - It uses and returns a uint, which is susceptible to become overflowed > if a lot of volatile inodes that use get_next_ino are created. > - It's global, with no specificity per-sb or even per-filesystem. This > means it's not that difficult to cause inode number wraparounds on a > single device, which can result in having multiple distinct inodes > with the same inode number. > > This patch adds a per-superblock counter that mitigates the second case. > This design also allows us to later have a specific i_ino size > per-device, for example, allowing users to choose whether to use 32- or > 64-bit inodes for each tmpfs mount. This is implemented in the next > commit. 
> > Signed-off-by: Chris Down <chris@chrisdown.name> > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > Cc: Hugh Dickins <hughd@google.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Matthew Wilcox <willy@infradead.org> > Cc: Jeff Layton <jlayton@kernel.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Tejun Heo <tj@kernel.org> > Cc: linux-mm@kvack.org > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: kernel-team@fb.com > --- > include/linux/shmem_fs.h | 1 + > mm/shmem.c | 30 +++++++++++++++++++++++++++++- > 2 files changed, 30 insertions(+), 1 deletion(-) > > v5: Nothing in code, just resending with correct linux-mm domain. > > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > index de8e4b71e3ba..7fac91f490dc 100644 > --- a/include/linux/shmem_fs.h > +++ b/include/linux/shmem_fs.h > @@ -35,6 +35,7 @@ struct shmem_sb_info { > unsigned char huge; /* Whether to try for hugepages */ > kuid_t uid; /* Mount uid for root directory */ > kgid_t gid; /* Mount gid for root directory */ > + ino_t next_ino; /* The next per-sb inode number to use */ > struct mempolicy *mpol; /* default memory policy for mappings */ > spinlock_t shrinklist_lock; /* Protects shrinklist */ > struct list_head shrinklist; /* List of shinkable inodes */ > diff --git a/mm/shmem.c b/mm/shmem.c > index 8793e8cc1a48..9e97ba972225 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2236,6 +2236,12 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > return 0; > } > > +/* > + * shmem_get_inode - reserve, allocate, and initialise a new inode > + * > + * If this tmpfs is from kern_mount we use get_next_ino, which is global, since > + * inum churn there is low and this avoids taking locks. 
> + */ > static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, > umode_t mode, dev_t dev, unsigned long flags) > { > @@ -2248,7 +2254,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode > > inode = new_inode(sb); > if (inode) { > - inode->i_ino = get_next_ino(); > + if (sb->s_flags & SB_KERNMOUNT) { > + /* > + * __shmem_file_setup, one of our callers, is lock-free: > + * it doesn't hold stat_lock in shmem_reserve_inode > + * since max_inodes is always 0, and is called from > + * potentially unknown contexts. As such, use the global > + * allocator which doesn't require the per-sb stat_lock. > + */ > + inode->i_ino = get_next_ino(); > + } else { > + spin_lock(&sbinfo->stat_lock); Use spin_lock will affect performance, how about define unsigned long __percpu *last_ino_number; /* Last inode number */ atomic64_t shared_last_ino_number; /* Shared last inode number */ in shmem_sb_info, whose performance will be better? > + if (unlikely(sbinfo->next_ino > UINT_MAX)) { > + /* > + * Emulate get_next_ino uint wraparound for > + * compatibility > + */ > + sbinfo->next_ino = 1; > + } > + inode->i_ino = sbinfo->next_ino++; > + spin_unlock(&sbinfo->stat_lock); > + } > + > inode_init_owner(inode, dir, mode); > inode->i_blocks = 0; > inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); > @@ -3662,6 +3689,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) > #else > sb->s_flags |= SB_NOUSER; > #endif > + sbinfo->next_ino = 1; > sbinfo->max_blocks = ctx->blocks; > sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes; > sbinfo->uid = ctx->uid; ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-06 2:03 ` zhengbin (A) @ 2020-01-06 6:41 ` Amir Goldstein 2020-01-07 8:01 ` Hugh Dickins 2020-01-06 13:17 ` Chris Down 1 sibling, 1 reply; 27+ messages in thread From: Amir Goldstein @ 2020-01-06 6:41 UTC (permalink / raw) To: zhengbin (A) Cc: Chris Down, Linux MM, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Mon, Jan 6, 2020 at 4:04 AM zhengbin (A) <zhengbin13@huawei.com> wrote: > > > On 2020/1/5 20:06, Chris Down wrote: > > get_next_ino has a number of problems: > > > > - It uses and returns a uint, which is susceptible to become overflowed > > if a lot of volatile inodes that use get_next_ino are created. > > - It's global, with no specificity per-sb or even per-filesystem. This > > means it's not that difficult to cause inode number wraparounds on a > > single device, which can result in having multiple distinct inodes > > with the same inode number. > > > > This patch adds a per-superblock counter that mitigates the second case. > > This design also allows us to later have a specific i_ino size > > per-device, for example, allowing users to choose whether to use 32- or > > 64-bit inodes for each tmpfs mount. This is implemented in the next > > commit. 
> > > > Signed-off-by: Chris Down <chris@chrisdown.name> > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > Cc: Hugh Dickins <hughd@google.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > Cc: Matthew Wilcox <willy@infradead.org> > > Cc: Jeff Layton <jlayton@kernel.org> > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Tejun Heo <tj@kernel.org> > > Cc: linux-mm@kvack.org > > Cc: linux-fsdevel@vger.kernel.org > > Cc: linux-kernel@vger.kernel.org > > Cc: kernel-team@fb.com > > --- > > include/linux/shmem_fs.h | 1 + > > mm/shmem.c | 30 +++++++++++++++++++++++++++++- > > 2 files changed, 30 insertions(+), 1 deletion(-) > > > > v5: Nothing in code, just resending with correct linux-mm domain. > > > > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > > index de8e4b71e3ba..7fac91f490dc 100644 > > --- a/include/linux/shmem_fs.h > > +++ b/include/linux/shmem_fs.h > > @@ -35,6 +35,7 @@ struct shmem_sb_info { > > unsigned char huge; /* Whether to try for hugepages */ > > kuid_t uid; /* Mount uid for root directory */ > > kgid_t gid; /* Mount gid for root directory */ > > + ino_t next_ino; /* The next per-sb inode number to use */ > > struct mempolicy *mpol; /* default memory policy for mappings */ > > spinlock_t shrinklist_lock; /* Protects shrinklist */ > > struct list_head shrinklist; /* List of shinkable inodes */ > > diff --git a/mm/shmem.c b/mm/shmem.c > > index 8793e8cc1a48..9e97ba972225 100644 > > --- a/mm/shmem.c > > +++ b/mm/shmem.c > > @@ -2236,6 +2236,12 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > > return 0; > > } > > > > +/* > > + * shmem_get_inode - reserve, allocate, and initialise a new inode > > + * > > + * If this tmpfs is from kern_mount we use get_next_ino, which is global, since > > + * inum churn there is low and this avoids taking locks. 
> > + */ > > static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, > > umode_t mode, dev_t dev, unsigned long flags) > > { > > @@ -2248,7 +2254,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode > > > > inode = new_inode(sb); > > if (inode) { > > - inode->i_ino = get_next_ino(); > > + if (sb->s_flags & SB_KERNMOUNT) { > > + /* > > + * __shmem_file_setup, one of our callers, is lock-free: > > + * it doesn't hold stat_lock in shmem_reserve_inode > > + * since max_inodes is always 0, and is called from > > + * potentially unknown contexts. As such, use the global > > + * allocator which doesn't require the per-sb stat_lock. > > + */ > > + inode->i_ino = get_next_ino(); > > + } else { > > + spin_lock(&sbinfo->stat_lock); > > Use spin_lock will affect performance, how about define > > unsigned long __percpu *last_ino_number; /* Last inode number */ > atomic64_t shared_last_ino_number; /* Shared last inode number */ > in shmem_sb_info, whose performance will be better? > Please take a look at shmem_reserve_inode(). spin lock is already being taken in shmem_get_inode() so there is nothing to be gained from complicating next_ino counter. This fact would have been a lot clearer if next_ino was incremented inside shmem_reserve_inode() and its value returned to be used by shmem_get_inode(), but I am also fine with code as it is with the comment above. Thanks, Amir. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-06 6:41 ` Amir Goldstein @ 2020-01-07 8:01 ` Hugh Dickins 2020-01-07 8:35 ` Amir Goldstein 0 siblings, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2020-01-07 8:01 UTC (permalink / raw) To: Chris Down, Amir Goldstein Cc: zhengbin (A), J. R. Okajima, Linux MM, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Mon, 6 Jan 2020, Amir Goldstein wrote: > On Mon, Jan 6, 2020 at 4:04 AM zhengbin (A) <zhengbin13@huawei.com> wrote: > > On 2020/1/5 20:06, Chris Down wrote: > > > get_next_ino has a number of problems: > > > > > > - It uses and returns a uint, which is susceptible to become overflowed > > > if a lot of volatile inodes that use get_next_ino are created. > > > - It's global, with no specificity per-sb or even per-filesystem. This > > > means it's not that difficult to cause inode number wraparounds on a > > > single device, which can result in having multiple distinct inodes > > > with the same inode number. > > > > > > This patch adds a per-superblock counter that mitigates the second case. > > > This design also allows us to later have a specific i_ino size > > > per-device, for example, allowing users to choose whether to use 32- or > > > 64-bit inodes for each tmpfs mount. This is implemented in the next > > > commit. 
> > > > > > Signed-off-by: Chris Down <chris@chrisdown.name> > > > Reviewed-by: Amir Goldstein <amir73il@gmail.com> > > > Cc: Hugh Dickins <hughd@google.com> > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > > Cc: Matthew Wilcox <willy@infradead.org> > > > Cc: Jeff Layton <jlayton@kernel.org> > > > Cc: Johannes Weiner <hannes@cmpxchg.org> > > > Cc: Tejun Heo <tj@kernel.org> > > > Cc: linux-mm@kvack.org > > > Cc: linux-fsdevel@vger.kernel.org > > > Cc: linux-kernel@vger.kernel.org > > > Cc: kernel-team@fb.com > > > --- > > > include/linux/shmem_fs.h | 1 + > > > mm/shmem.c | 30 +++++++++++++++++++++++++++++- > > > 2 files changed, 30 insertions(+), 1 deletion(-) > > > > > > v5: Nothing in code, just resending with correct linux-mm domain. > > > > > > diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h > > > index de8e4b71e3ba..7fac91f490dc 100644 > > > --- a/include/linux/shmem_fs.h > > > +++ b/include/linux/shmem_fs.h > > > @@ -35,6 +35,7 @@ struct shmem_sb_info { > > > unsigned char huge; /* Whether to try for hugepages */ > > > kuid_t uid; /* Mount uid for root directory */ > > > kgid_t gid; /* Mount gid for root directory */ > > > + ino_t next_ino; /* The next per-sb inode number to use */ > > > struct mempolicy *mpol; /* default memory policy for mappings */ > > > spinlock_t shrinklist_lock; /* Protects shrinklist */ > > > struct list_head shrinklist; /* List of shinkable inodes */ > > > diff --git a/mm/shmem.c b/mm/shmem.c > > > index 8793e8cc1a48..9e97ba972225 100644 > > > --- a/mm/shmem.c > > > +++ b/mm/shmem.c > > > @@ -2236,6 +2236,12 @@ static int shmem_mmap(struct file *file, struct vm_area_struct *vma) > > > return 0; > > > } > > > > > > +/* > > > + * shmem_get_inode - reserve, allocate, and initialise a new inode > > > + * > > > + * If this tmpfs is from kern_mount we use get_next_ino, which is global, since > > > + * inum churn there is low and this avoids taking locks. 
> > > + */ > > > static struct inode *shmem_get_inode(struct super_block *sb, const struct inode *dir, > > > umode_t mode, dev_t dev, unsigned long flags) > > > { > > > @@ -2248,7 +2254,28 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode > > > > > > inode = new_inode(sb); > > > if (inode) { > > > - inode->i_ino = get_next_ino(); > > > + if (sb->s_flags & SB_KERNMOUNT) { > > > + /* > > > + * __shmem_file_setup, one of our callers, is lock-free: > > > + * it doesn't hold stat_lock in shmem_reserve_inode > > > + * since max_inodes is always 0, and is called from > > > + * potentially unknown contexts. As such, use the global > > > + * allocator which doesn't require the per-sb stat_lock. > > > + */ > > > + inode->i_ino = get_next_ino(); > > > + } else { > > > + spin_lock(&sbinfo->stat_lock); > > > > Use spin_lock will affect performance, how about define > > > > unsigned long __percpu *last_ino_number; /* Last inode number */ > > atomic64_t shared_last_ino_number; /* Shared last inode number */ > > in shmem_sb_info, whose performance will be better? > > > > Please take a look at shmem_reserve_inode(). > spin lock is already being taken in shmem_get_inode() > so there is nothing to be gained from complicating next_ino counter. > > This fact would have been a lot clearer if next_ino was incremented > inside shmem_reserve_inode() and its value returned to be used by > shmem_get_inode(), but I am also fine with code as it is with the > comment above. Chris, Amir, thank you both for all the work you have been putting into this over the holiday. I'm only now catching up, sorry. It appears that what you are ending up with above is very close to the patch Google has been using for several years. I'm glad Chris has explored some interesting options, disappointed that you had no more success than I had in trying to solve it efficiently with 32-bit inos, agree with the way in which Amir cut it back. 
That we've come to the same conclusions independently is good confirmation of this approach. Only yesterday did I get to see that Amir had asked to see my patch on the 27th: rediffed against 5.5-rc5, appended now below. I'm showing it, though it's known deficient in three ways (not to mention lack of config or mount options, but I now see Dave Chinner has an interesting take on those) - better make it visible to you now, than me delay you further. It does indicate a couple of improvements to Chris's current patch: reducing use of stat_lock, as Amir suggested (in both nr_inodes limited and unlimited cases); and need to fix shmem_encode_fh(), which neither of us did yet. Where we should go from here, fix Chris's or fix mine, I've not decided. From: Hugh Dickins <hughd@google.com> Date: Fri, 7 Aug 2015 20:08:53 -0700 Subject: [PATCH] tmpfs: provide 64-bit inode numbers Some programs (including ld.so and clang) try to deduplicate opened files by comparing st_dev and st_ino, which breaks down when two files on the same tmpfs have the same inode number: we are now hitting this problem too often. J. R. Okajima has reported the same problem with backup tools. tmpfs is currently sharing the same 32-bit get_next_ino() pool as sockets, pipes, ramfs, hugetlbfs etc. It delivers 32-bit inos even on a 64-bit kernel for one reason: because if a 32-bit executable compiled without _FILE_OFFSET_BITS=64 tries to stat a file with an ino which won't fit in 32 bits, glibc fails that with EOVERFLOW. glibc is being correct, but unhelpful: so 2.6.22 commit 866b04fccbf1 ("inode numbering: make static counters in new_inode and iunique be 32 bits") forced get_next_ino() to unsigned int. But whatever the upstream need to avoid surprising a mis-built 32-bit executable, Google production can be sure of the 64-bit environment, and any remaining 32-bit executables built with _FILE_OFFSET_BITS=64 (our files are sometimes bigger than 2G). 
So, leave get_next_ino() as it is, but convert tmpfs to supply unsigned long inos, and let each superblock keep its own account: it was weird to be sharing a pool with such disparate uses before. shmem_reserve_inode() already provides a good place to do this; and normally it has to take stat_lock to update free_inodes, so no overhead to increment its next_ino under the same lock. But if it's the internal shm_mnt, or mounted with "nr_inodes=0" to increase scalability by avoiding that stat_lock, then use the same percpu batching technique as get_next_ino(). Take on board 4.2 commit 2adc376c5519 ("vfs: avoid creation of inode number 0 in get_next_ino"): for safety, skip any ino whose low 32 bits is 0. Upstream? That's tougher: maybe try to upstream this as is, and see what falls out; maybe add ino32 or ino64 mount options before trying; or maybe upstream has to stick with the 32-bit ino, and a scheme more complicated than this be implemented for tmpfs. Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com> 1) Probably needs minor corrections for the 32-bit kernel: I wasn't worrying about that at the time, and expect some "unsigned long"s I didn't need to change for the 64-bit kernel actually need to be "u64"s or "ino_t"s now. 2) The "return 1" from shmem_encode_fh() would nowadays be written as "return FILEID_INO32_GEN" (and overlayfs takes particular interest in that fileid); yet it already put the high 32 bits of the ino into the fh: probably needs a different fileid once high 32 bits are set. 3) When this patch was written, a tmpfs could not be remounted from limited nr_inodes to unlimited nr_inodes=0; but that restriction was accidentally (though sensibly) lifted during v5.4's mount API mods; so would need to be taken into account (or reverted) here. 
--- include/linux/shmem_fs.h | 2 + mm/shmem.c | 59 ++++++++++++++++++++++++++++++++++--- 2 files changed, 57 insertions(+), 4 deletions(-) --- 5.5-rc5/include/linux/shmem_fs.h 2019-11-24 16:32:01.000000000 -0800 +++ linux/include/linux/shmem_fs.h 2020-01-06 19:58:04.487301841 -0800 @@ -30,6 +30,8 @@ struct shmem_sb_info { struct percpu_counter used_blocks; /* How many are allocated */ unsigned long max_inodes; /* How many inodes are allowed */ unsigned long free_inodes; /* How many are left for allocation */ + unsigned long next_ino; /* Next inode number to be allocated */ + unsigned long __percpu *ino_batch; /* Next per cpu, if unlimited */ spinlock_t stat_lock; /* Serialize shmem_sb_info changes */ umode_t mode; /* Mount mode for root directory */ unsigned char huge; /* Whether to try for hugepages */ --- 5.5-rc5/mm/shmem.c 2019-12-08 18:04:55.053566640 -0800 +++ linux/mm/shmem.c 2020-01-06 19:58:04.487301841 -0800 @@ -261,9 +261,24 @@ bool vma_is_shmem(struct vm_area_struct static LIST_HEAD(shmem_swaplist); static DEFINE_MUTEX(shmem_swaplist_mutex); -static int shmem_reserve_inode(struct super_block *sb) +/* + * shmem_reserve_inode() reserves "space" for a shmem inode, and allocates + * an ino. Unlike get_next_ino() in fs/inode.c, the 64-bit kernel here goes + * for a 64-bit st_ino, expecting even 32-bit userspace to have been built + * with _FILE_OFFSET_BITS=64 nowadays; but lest it has not, each sb uses its + * own supply, independent of sockets and pipes etc, and of other tmpfs sbs. + * Maybe add a 32-bit ino mount option later, if it turns out to be needed. + * Don't worry about 64-bit ino support from a 32-bit kernel at this time. + * + * shmem_reserve_inode() is also called when making a hard link, to allow for + * the space needed by each dentry; but no new ino is wanted then (NULL inop). 
+ */ +#define SHMEM_INO_BATCH 1024 +static int shmem_reserve_inode(struct super_block *sb, unsigned long *inop) { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + unsigned long ino; + if (sbinfo->max_inodes) { spin_lock(&sbinfo->stat_lock); if (!sbinfo->free_inodes) { @@ -271,7 +286,35 @@ static int shmem_reserve_inode(struct su return -ENOSPC; } sbinfo->free_inodes--; + if (inop) { + ino = sbinfo->next_ino++; + /* Avoid 0 in the low 32 bits: might appear deleted */ + if (unlikely((unsigned int)ino == 0)) + ino = sbinfo->next_ino++; + *inop = ino; + } spin_unlock(&sbinfo->stat_lock); + } else if (inop) { + unsigned long *next_ino; + /* + * Internal shm_mnt, or mounted with unlimited nr_inodes=0 for + * maximum scalability: we don't need to take stat_lock every + * time here, so use percpu batching like get_next_ino(). + */ + next_ino = per_cpu_ptr(sbinfo->ino_batch, get_cpu()); + ino = *next_ino; + if (unlikely((ino & (SHMEM_INO_BATCH-1)) == 0)) { + spin_lock(&sbinfo->stat_lock); + ino = sbinfo->next_ino; + sbinfo->next_ino += SHMEM_INO_BATCH; + spin_unlock(&sbinfo->stat_lock); + /* Avoid 0 in the low 32 bits: might appear deleted */ + if (unlikely((unsigned int)ino == 0)) + ino++; + } + *inop = ino; + *next_ino = ++ino; + put_cpu(); } return 0; } @@ -2241,13 +2284,14 @@ static struct inode *shmem_get_inode(str struct inode *inode; struct shmem_inode_info *info; struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + unsigned long ino; - if (shmem_reserve_inode(sb)) + if (shmem_reserve_inode(sb, &ino)) return NULL; inode = new_inode(sb); if (inode) { - inode->i_ino = get_next_ino(); + inode->i_ino = ino; inode_init_owner(inode, dir, mode); inode->i_blocks = 0; inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); @@ -2960,7 +3004,7 @@ static int shmem_link(struct dentry *old * first link must skip that, to get the accounting right. 
*/ if (inode->i_nlink) { - ret = shmem_reserve_inode(inode->i_sb); + ret = shmem_reserve_inode(inode->i_sb, NULL); if (ret) goto out; } @@ -3621,6 +3665,7 @@ static void shmem_put_super(struct super { struct shmem_sb_info *sbinfo = SHMEM_SB(sb); + free_percpu(sbinfo->ino_batch); percpu_counter_destroy(&sbinfo->used_blocks); mpol_put(sbinfo->mpol); kfree(sbinfo); @@ -3663,6 +3708,12 @@ static int shmem_fill_super(struct super #endif sbinfo->max_blocks = ctx->blocks; sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes; + /* If inodes are unlimited for scalability, use percpu ino batching */ + if (!sbinfo->max_inodes) { + sbinfo->ino_batch = alloc_percpu(unsigned long); + if (!sbinfo->ino_batch) + goto failed; + } sbinfo->uid = ctx->uid; sbinfo->gid = ctx->gid; sbinfo->mode = ctx->mode; ^ permalink raw reply [flat|nested] 27+ messages in thread
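The SHMEM_INO_BATCH scheme in Hugh's patch can be modelled in userspace with
one cursor standing in for a per-CPU variable. This is a sketch of the
technique only; "ino_pool", "cpu_cursor", and "batch_next_ino" are
illustrative names, and the lock protecting the shared counter is omitted:

```c
#include <stdint.h>

#define INO_BATCH 1024ULL	/* mirrors SHMEM_INO_BATCH */

struct ino_pool { uint64_t shared_next; };	/* per-sb, lock-protected in the kernel */
struct cpu_cursor { uint64_t next; };		/* one per CPU; zero-initialised */

static uint64_t batch_next_ino(struct ino_pool *pool, struct cpu_cursor *cur)
{
	uint64_t ino = cur->next;

	/* Cursor exhausted (or fresh): grab a new batch from the shared pool. */
	if ((ino & (INO_BATCH - 1)) == 0) {
		ino = pool->shared_next;
		pool->shared_next += INO_BATCH;
		/* Avoid 0 in the low 32 bits: might appear deleted. */
		if ((uint32_t)ino == 0)
			ino++;
	}
	cur->next = ino + 1;
	return ino;
}
```

Only one allocation in every INO_BATCH touches the shared counter, which is
why the unlimited-nr_inodes case can skip stat_lock on the fast path.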
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-07 8:01 ` Hugh Dickins @ 2020-01-07 8:35 ` Amir Goldstein 2020-01-08 10:58 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Amir Goldstein @ 2020-01-07 8:35 UTC (permalink / raw) To: Hugh Dickins Cc: Chris Down, zhengbin (A), J. R. Okajima, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team > Chris, Amir, thank you both for all the work you have been putting > into this over the holiday. I'm only now catching up, sorry. > > It appears that what you are ending up with above is very close to > the patch Google has been using for several years. I'm glad Chris > has explored some interesting options, disappointed that you had no > more success than I had in trying to solve it efficiently with 32-bit > inos, agree with the way in which Amir cut it back. That we've come to > the same conclusions independently is good confirmation of this approach. > Indeed :) > Only yesterday did I get to see that Amir had asked to see my patch on > the 27th: rediffed against 5.5-rc5, appended now below. I'm showing it, > though it's known deficient in three ways (not to mention lack of config > or mount options, but I now see Dave Chinner has an interesting take on > those) - better make it visible to you now, than me delay you further. > > It does indicate a couple of improvements to Chris's current patch: > reducing use of stat_lock, as Amir suggested (in both nr_inodes limited > and unlimited cases); and need to fix shmem_encode_fh(), which neither > of us did yet. Where we should go from here, fix Chris's or fix mine, > I've not decided. > I vote in favor or best of both patches to result in a simpler outcome: 1. use inop approach from Hugh's patch 2. 
use get_next_ino() instead of per-sb ino_batch for kern_mount Hugh, do you have any evidence or suspect that shmem_file_setup() could be contributing to depleting the global get_next_ino pool? Do you have an objection to the combination above? > From: Hugh Dickins <hughd@google.com> > Date: Fri, 7 Aug 2015 20:08:53 -0700 > Subject: [PATCH] tmpfs: provide 64-bit inode numbers > > Some programs (including ld.so and clang) try to deduplicate opened > files by comparing st_dev and st_ino, which breaks down when two files > on the same tmpfs have the same inode number: we are now hitting this > problem too often. J. R. Okajima has reported the same problem with > backup tools. > > tmpfs is currently sharing the same 32-bit get_next_ino() pool as > sockets, pipes, ramfs, hugetlbfs etc. It delivers 32-bit inos even > on a 64-bit kernel for one reason: because if a 32-bit executable > compiled without _FILE_OFFSET_BITS=64 tries to stat a file with an > ino which won't fit in 32 bits, glibc fails that with EOVERFLOW. > glibc is being correct, but unhelpful: so 2.6.22 commit 866b04fccbf1 > ("inode numbering: make static counters in new_inode and iunique be > 32 bits") forced get_next_ino() to unsigned int. > > But whatever the upstream need to avoid surprising a mis-built > 32-bit executable, Google production can be sure of the 64-bit > environment, and any remaining 32-bit executables built with > _FILE_OFFSET_BITS=64 (our files are sometimes bigger than 2G). > > So, leave get_next_ino() as it is, but convert tmpfs to supply > unsigned long inos, and let each superblock keep its own account: > it was weird to be sharing a pool with such disparate uses before. > > shmem_reserve_inode() already provides a good place to do this; > and normally it has to take stat_lock to update free_inodes, so > no overhead to increment its next_ino under the same lock. 
But > if it's the internal shm_mnt, or mounted with "nr_inodes=0" to > increase scalability by avoiding that stat_lock, then use the > same percpu batching technique as get_next_ino(). > > Take on board 4.2 commit 2adc376c5519 ("vfs: avoid creation of > inode number 0 in get_next_ino"): for safety, skip any ino whose > low 32 bits is 0. > > Upstream? That's tougher: maybe try to upstream this as is, and > see what falls out; maybe add ino32 or ino64 mount options before > trying; or maybe upstream has to stick with the 32-bit ino, and a > scheme more complicated than this be implemented for tmpfs. > > Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com> > > 1) Probably needs minor corrections for the 32-bit kernel: I wasn't > worrying about that at the time, and expect some "unsigned long"s > I didn't need to change for the 64-bit kernel actually need to be > "u64"s or "ino_t"s now. > 2) The "return 1" from shmem_encode_fh() would nowadays be written as > "return FILEID_INO32_GEN" (and overlayfs takes particular interest > in that fileid); yet it already put the high 32 bits of the ino into > the fh: probably needs a different fileid once high 32 bits are set. Nice spotting, but this really isn't a problem for overlayfs. I agree that it would be nice to follow the same practice as xfs with XFS_FILEID_TYPE_64FLAG, but as one can see from the private flag, this is by no means a standard of any sort. As the name_to_handle_at() man page says: "Other than the use of the handle_bytes field, the caller should treat the file_handle structure as an opaque data type: the handle_type and f_handle fields are needed only by a subsequent call to open_by_handle_at()." It is true that one of my initial versions was checking value returned from encode_fh, but Miklos has pointed out to me that this value is arbitrary and we shouldn't rely on it. 
In current overlayfs, the value FILEID_INO32_GEN is only used internally to indicate the case where filesystem has no encode_fh op (see comment above ovl_can_decode_fh()). IOW, only the special case of encoding by the default export_encode_fh() is considered a proof for 32bit inodes. So tmpfs has never been considered as a 32bit inodes filesystem by overlayfs. Thanks, Amir. ^ permalink raw reply [flat|nested] 27+ messages in thread
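[Editor's note] Hugh's patch notes above describe batching per-sb inos with the same percpu technique as get_next_ino(), while skipping any ino whose low 32 bits are 0 (per commit 2adc376c5519). The sketch below is a hypothetical, simplified userspace model of that scheme, not the kernel code: the shared counter stands in for the stat_lock-protected per-sb next_ino, and each `cpu_ino_cache` stands in for one CPU's percpu batch. Names and the batch size are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of percpu-batched per-sb ino allocation: the shared
 * counter (which the kernel would protect with stat_lock) hands out inos
 * one batch at a time, and each "CPU" then allocates from its private
 * batch without touching the shared counter, as get_next_ino() does.
 */
#define SHMEM_INO_BATCH 1024

static uint64_t shared_next_ino = 1; /* per-sb; under stat_lock in-kernel */

struct cpu_ino_cache {
	uint64_t next; /* next ino to hand out from this CPU's batch */
	uint64_t end;  /* end of the current batch (exclusive) */
};

static uint64_t shmem_next_ino(struct cpu_ino_cache *c)
{
	uint64_t ino;

	do {
		if (c->next == c->end) {
			/* Refill: the only shared-counter access per batch. */
			c->next = shared_next_ino;
			shared_next_ino += SHMEM_INO_BATCH;
			c->end = c->next + SHMEM_INO_BATCH;
		}
		ino = c->next++;
	} while ((uint32_t)ino == 0); /* skip ino 0, cf. 2adc376c5519 */

	return ino;
}
```

Two "CPUs" drawing from the same pool consume disjoint batches, so the shared counter (and hence the lock protecting it) is only touched once per SHMEM_INO_BATCH allocations.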
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-07 8:35 ` Amir Goldstein @ 2020-01-08 10:58 ` Hugh Dickins 2020-01-08 12:51 ` Amir Goldstein 0 siblings, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2020-01-08 10:58 UTC (permalink / raw) To: Amir Goldstein Cc: Hugh Dickins, Chris Down, zhengbin (A), J. R. Okajima, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, 7 Jan 2020, Amir Goldstein wrote: > > I vote in favor or best of both patches to result in a simpler outcome: > 1. use inop approach from Hugh's patch > 2. use get_next_ino() instead of per-sb ino_batch for kern_mount > > Hugh, do you have any evidence or suspect that shmem_file_setup() > could be contributing to depleting the global get_next_ino pool? Depends on what the system is used for: on one system, it will make very little use of that pool, on another it will be one of the major depleters of the pool. Generally, it would be kinder to the other users of the pool (those that might also care about ino duplication) if shmem were to cede all use of it; but a bigger patch, yes. > Do you have an objection to the combination above? Objection would be too strong: I'm uncomfortable with it, and not tempted to replace our internal implementation by one reverting to use get_next_ino(); but not as uncomfortable as I am with holding up progress on the issue. Uncomfortable because of the "depletion" you mention. Uncomfortable because half the reason for ever offering the unlimited "nr_inodes=0" mount option was to avoid stat_lock overhead (though, for all I know, maybe nobody anywhere uses that option now - and I'll be surprised if 0-day uses it and reports any slowdown). 
Also uncomfortable because your (excellent, and I'd never thought about it that way) argument that the shm_mnt objects are internal and unstatable (that may not resemble how you expressed it, I've not gone back to search the mails to find where you made the point), is not entirely correct nowadays - memfd_create()d objects come from that same shm_mnt, and they can be fstat()ed. What is the likelihood that anything would care about duplicated inos of memfd_create()d objects? Fairly low, I think we'll agree; and we can probably also say that their inos are an odd implementation detail that no memfd_create() user should depend upon anyway. But I mention it in case it changes your own feeling about the shm_mnt. > > Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com> > > > > 1) Probably needs minor corrections for the 32-bit kernel: I wasn't > > worrying about that at the time, and expect some "unsigned long"s > > I didn't need to change for the 64-bit kernel actually need to be > > "u64"s or "ino_t"s now. > > 2) The "return 1" from shmem_encode_fh() would nowadays be written as > > "return FILEID_INO32_GEN" (and overlayfs takes particular interest > > in that fileid); yet it already put the high 32 bits of the ino into > > the fh: probably needs a different fileid once high 32 bits are set. > > Nice spotting, but this really isn't a problem for overlayfs. > I agree that it would be nice to follow the same practice as xfs with > XFS_FILEID_TYPE_64FLAG, but as one can see from the private > flag, this is by no means a standard of any sort. > > As the name_to_handle_at() man page says: > "Other than the use of the handle_bytes field, the caller should treat the > file_handle structure as an opaque data type: the handle_type and f_handle > fields are needed only by a subsequent call to open_by_handle_at()." 
> > It is true that one of my initial versions was checking value returned from > encode_fh, but Miklos has pointed out to me that this value is arbitrary > and we shouldn't rely on it. > > In current overlayfs, the value FILEID_INO32_GEN is only used internally > to indicate the case where filesystem has no encode_fh op (see comment > above ovl_can_decode_fh()). IOW, only the special case of encoding > by the default export_encode_fh() is considered a proof for 32bit inodes. > So tmpfs has never been considered as a 32bit inodes filesystem by > overlayfs. Thanks, you know far more about encode_fh() than I ever shall, so for now I'll take your word for it that we don't need to make any change there for this 64-bit ino support. One day I should look back, go through the encode_fh() callsites, and decide what that "return 1" really ought to say. It's inconvenient, I'm sorry, but I shall have to go quiet again for a few days - I'm here, but cannot afford to split my attention. Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-08 10:58 ` Hugh Dickins @ 2020-01-08 12:51 ` Amir Goldstein 0 siblings, 0 replies; 27+ messages in thread From: Amir Goldstein @ 2020-01-08 12:51 UTC (permalink / raw) To: Hugh Dickins Cc: Chris Down, zhengbin (A), J. R. Okajima, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Wed, Jan 8, 2020 at 12:59 PM Hugh Dickins <hughd@google.com> wrote: > > On Tue, 7 Jan 2020, Amir Goldstein wrote: > > > > I vote in favor or best of both patches to result in a simpler outcome: > > 1. use inop approach from Hugh's patch > > 2. use get_next_ino() instead of per-sb ino_batch for kern_mount > > > > Hugh, do you have any evidence or suspect that shmem_file_setup() > > could be contributing to depleting the global get_next_ino pool? > > Depends on what the system is used for: on one system, it will make > very little use of that pool, on another it will be one of the major > depleters of the pool. Generally, it would be kinder to the other > users of the pool (those that might also care about ino duplication) > if shmem were to cede all use of it; but a bigger patch, yes. > > > Do you have an objection to the combination above? > > Objection would be too strong: I'm uncomfortable with it, and not > tempted to replace our internal implementation by one reverting to > use get_next_ino(); but not as uncomfortable as I am with holding > up progress on the issue. > > Uncomfortable because of the "depletion" you mention. Uncomfortable > because half the reason for ever offering the unlimited "nr_inodes=0" > mount option was to avoid stat_lock overhead (though, for all I know, > maybe nobody anywhere uses that option now - and I'll be surprised if > 0-day uses it and reports any slowdown). 
> > Also uncomfortable because your (excellent, and I'd never thought > about it that way) argument that the shm_mnt objects are internal > and unstatable (that may not resemble how you expressed it, I've not > gone back to search the mails to find where you made the point), is > not entirely correct nowadays - memfd_create()d objects come from > that same shm_mnt, and they can be fstat()ed. What is the > likelihood that anything would care about duplicated inos of > memfd_create()d objects? Fairly low, I think we'll agree; and > we can probably also say that their inos are an odd implementation > detail that no memfd_create() user should depend upon anyway. But > I mention it in case it changes your own feeling about the shm_mnt. > I have no strong feeling either. I just didn't know how intensive its use is. You have provided sufficient arguments IMO to go with your version of the patch. > > > Not-Yet-Signed-off-by: Hugh Dickins <hughd@google.com> > > > > > > 1) Probably needs minor corrections for the 32-bit kernel: I wasn't > > > worrying about that at the time, and expect some "unsigned long"s > > > I didn't need to change for the 64-bit kernel actually need to be > > > "u64"s or "ino_t"s now. > > > 2) The "return 1" from shmem_encode_fh() would nowadays be written as > > > "return FILEID_INO32_GEN" (and overlayfs takes particular interest > > > in that fileid); yet it already put the high 32 bits of the ino into > > > the fh: probably needs a different fileid once high 32 bits are set. > > > > Nice spotting, but this really isn't a problem for overlayfs. > > I agree that it would be nice to follow the same practice as xfs with > > XFS_FILEID_TYPE_64FLAG, but as one can see from the private > > flag, this is by no means a standard of any sort. 
> > > > As the name_to_handle_at() man page says: > > "Other than the use of the handle_bytes field, the caller should treat the > > file_handle structure as an opaque data type: the handle_type and f_handle > > fields are needed only by a subsequent call to open_by_handle_at()." > > > > It is true that one of my initial versions was checking value returned from > > encode_fh, but Miklos has pointed out to me that this value is arbitrary > > and we shouldn't rely on it. > > > > In current overlayfs, the value FILEID_INO32_GEN is only used internally > > to indicate the case where filesystem has no encode_fh op (see comment > > above ovl_can_decode_fh()). IOW, only the special case of encoding > > by the default export_encode_fh() is considered a proof for 32bit inodes. > > So tmpfs has never been considered as a 32bit inodes filesystem by > > overlayfs. > > Thanks, you know far more about encode_fh() than I ever shall, so for > now I'll take your word for it that we don't need to make any change > there for this 64-bit ino support. One day I should look back, go > through the encode_fh() callsites, and decide what that "return 1" > really ought to say. > TBH, I meant to say that return 1 shouldn't matter functionally, but it would be much nicer to change it to FILEID_INO64_GEN and define that as 0x81 in include/linux/exportfs.h and actually order the gen word after the inode64 words. But that is independent of the per-sb ino change, because the layout of the file handle has always been [gen,inode64] and never really matched that standard of FILEID_INO32_GEN. I may send a patch later on to change that. Thanks, Amir. ^ permalink raw reply [flat|nested] 27+ messages in thread
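[Editor's note] The handle layout Amir describes above — the generation word first, followed by the 64-bit ino split across the next two words ([gen,inode64]), which never matched FILEID_INO32_GEN's [ino32, gen] layout — can be illustrated with a small userspace sketch. Helper names are hypothetical and this is a simplification of what shmem_encode_fh() actually does:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the three-word shmem handle layout discussed above:
 * fh[0] carries the generation, fh[1]/fh[2] the low/high halves of
 * the 64-bit ino.  With inode64, fh[2] can be nonzero.
 */
static void pack_shmem_fh(uint32_t fh[3], uint32_t gen, uint64_t ino)
{
	fh[0] = gen;
	fh[1] = (uint32_t)ino;         /* low 32 bits of the ino */
	fh[2] = (uint32_t)(ino >> 32); /* high 32 bits; nonzero with inode64 */
}

static uint64_t unpack_shmem_ino(const uint32_t fh[3])
{
	return ((uint64_t)fh[2] << 32) | fh[1];
}
```

The round trip preserves the full 64-bit ino, which is why the existing encoder already "put the high 32 bits of the ino into the fh" as Hugh noted; only the fileid type constant returned alongside it is in question.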
* Re: [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support 2020-01-06 2:03 ` zhengbin (A) 2020-01-06 6:41 ` Amir Goldstein @ 2020-01-06 13:17 ` Chris Down 1 sibling, 0 replies; 27+ messages in thread From: Chris Down @ 2020-01-06 13:17 UTC (permalink / raw) To: zhengbin (A) Cc: linux-mm, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team zhengbin (A) writes: >Using spin_lock will affect performance "Performance" isn't a binary. In discussions, you should avoid invoking the performance boogeyman without presenting any real-world data. :-) We already have to take this spin lock just before this, when setting the free inode count. The two sites can be merged, but it seems unnecessary to conflate their purpose at this time. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-05 12:05 [PATCH v5 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down 2020-01-05 12:06 ` [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support Chris Down @ 2020-01-05 12:06 ` Chris Down 2020-01-07 0:10 ` Dave Chinner 1 sibling, 1 reply; 27+ messages in thread From: Chris Down @ 2020-01-05 12:06 UTC (permalink / raw) To: linux-mm Cc: Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team The default is still set to inode32 for backwards compatibility, but system administrators can opt in to the new 64-bit inode numbers by either: 1. Passing inode64 on the command line when mounting, or 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y The inode64 and inode32 names are used based on existing precedent from XFS. Signed-off-by: Chris Down <chris@chrisdown.name> Reviewed-by: Amir Goldstein <amir73il@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Matthew Wilcox <willy@infradead.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: linux-mm@kvack.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: kernel-team@fb.com --- Documentation/filesystems/tmpfs.txt | 11 +++++ fs/Kconfig | 15 +++++++ include/linux/shmem_fs.h | 1 + mm/shmem.c | 66 ++++++++++++++++++++++++++++- 4 files changed, 92 insertions(+), 1 deletion(-) v5: Nothing in code, just resending with correct linux-mm domain. diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 5ecbc03e6b2f..203e12a684c9 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -136,6 +136,15 @@ These options do not have any effect on remount. 
You can change these parameters with chmod(1), chown(1) and chgrp(1) on a mounted filesystem. +tmpfs has a mount option to select whether it will wrap at 32- or 64-bit inode +numbers: + +inode64 Use 64-bit inode numbers +inode32 Use 32-bit inode numbers + +On 64-bit, the default is set by CONFIG_TMPFS_INODE64. On 32-bit, inode64 is +not legal and will produce an error at mount time. + So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' will give you tmpfs instance on /mytmpfs which can allocate 10GB RAM/SWAP in 10240 inodes and it is only accessible by root. @@ -147,3 +156,5 @@ Updated: Hugh Dickins, 4 June 2007 Updated: KOSAKI Motohiro, 16 Mar 2010 +Updated: + Chris Down, 2 Jan 2020 diff --git a/fs/Kconfig b/fs/Kconfig index 7b623e9fc1b0..e74f127df22a 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -199,6 +199,21 @@ config TMPFS_XATTR If unsure, say N. +config TMPFS_INODE64 + bool "Use 64-bit ino_t by default in tmpfs" + depends on TMPFS && 64BIT + default n + help + tmpfs has historically used only inode numbers as wide as an unsigned + int. In some cases this can cause wraparound, potentially resulting in + multiple files with the same inode number on a single device. This option + makes tmpfs use the full width of ino_t by default, similarly to the + inode64 mount option. + + To override this default, use the inode32 or inode64 mount options. + + If unsure, say N. 
+ config HUGETLBFS bool "HugeTLB file system support" depends on X86 || IA64 || SPARC64 || (S390 && 64BIT) || \ diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 7fac91f490dc..8925eb774518 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -35,6 +35,7 @@ struct shmem_sb_info { unsigned char huge; /* Whether to try for hugepages */ kuid_t uid; /* Mount uid for root directory */ kgid_t gid; /* Mount gid for root directory */ + bool full_inums; /* If i_ino should be uint or ino_t */ ino_t next_ino; /* The next per-sb inode number to use */ struct mempolicy *mpol; /* default memory policy for mappings */ spinlock_t shrinklist_lock; /* Protects shrinklist */ diff --git a/mm/shmem.c b/mm/shmem.c index 9e97ba972225..05de65112bed 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -115,11 +115,13 @@ struct shmem_options { kuid_t uid; kgid_t gid; umode_t mode; + bool full_inums; int huge; int seen; #define SHMEM_SEEN_BLOCKS 1 #define SHMEM_SEEN_INODES 2 #define SHMEM_SEEN_HUGE 4 +#define SHMEM_SEEN_INUMS 8 }; #ifdef CONFIG_TMPFS @@ -2261,15 +2263,24 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode * since max_inodes is always 0, and is called from * potentially unknown contexts. As such, use the global * allocator which doesn't require the per-sb stat_lock. + * + * No special behaviour is needed for + * sbinfo->full_inums, because it's not possible to + * manually set on callers of this type, and + * CONFIG_TMPFS_INODE64 only applies to user-visible + * mounts. 
*/ inode->i_ino = get_next_ino(); } else { spin_lock(&sbinfo->stat_lock); - if (unlikely(sbinfo->next_ino > UINT_MAX)) { + if (unlikely(!sbinfo->full_inums && + sbinfo->next_ino > UINT_MAX)) { /* * Emulate get_next_ino uint wraparound for * compatibility */ + pr_warn("%s: inode number overflow on device %d, consider using inode64 mount option\n", + __func__, MINOR(sb->s_dev)); sbinfo->next_ino = 1; } inode->i_ino = sbinfo->next_ino++; @@ -3406,6 +3417,8 @@ enum shmem_param { Opt_nr_inodes, Opt_size, Opt_uid, + Opt_inode32, + Opt_inode64, }; static const struct fs_parameter_spec shmem_param_specs[] = { @@ -3417,6 +3430,8 @@ static const struct fs_parameter_spec shmem_param_specs[] = { fsparam_string("nr_inodes", Opt_nr_inodes), fsparam_string("size", Opt_size), fsparam_u32 ("uid", Opt_uid), + fsparam_flag ("inode32", Opt_inode32), + fsparam_flag ("inode64", Opt_inode64), {} }; @@ -3441,6 +3456,7 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) unsigned long long size; char *rest; int opt; + const char *err; opt = fs_parse(fc, &shmem_fs_parameters, param, &result); if (opt < 0) @@ -3502,6 +3518,18 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) break; } goto unsupported_parameter; + case Opt_inode32: + ctx->full_inums = false; + ctx->seen |= SHMEM_SEEN_INUMS; + break; + case Opt_inode64: + if (sizeof(ino_t) < 8) { + err = "Cannot use inode64 with <64bit inums in kernel"; + goto err_msg; + } + ctx->full_inums = true; + ctx->seen |= SHMEM_SEEN_INUMS; + break; } return 0; @@ -3509,6 +3537,8 @@ static int shmem_parse_one(struct fs_context *fc, struct fs_parameter *param) return invalf(fc, "tmpfs: Unsupported parameter '%s'", param->key); bad_value: return invalf(fc, "tmpfs: Bad value for '%s'", param->key); +err_msg: + return invalf(fc, "tmpfs: %s", err); } static int shmem_parse_options(struct fs_context *fc, void *data) @@ -3593,8 +3623,16 @@ static int shmem_reconfigure(struct fs_context *fc) } } + if 
((ctx->seen & SHMEM_SEEN_INUMS) && !ctx->full_inums && + sbinfo->next_ino > UINT_MAX) { + err = "Current inum too high to switch to 32-bit inums"; + goto out; + } + if (ctx->seen & SHMEM_SEEN_HUGE) sbinfo->huge = ctx->huge; + if (ctx->seen & SHMEM_SEEN_INUMS) + sbinfo->full_inums = ctx->full_inums; if (ctx->seen & SHMEM_SEEN_BLOCKS) sbinfo->max_blocks = ctx->blocks; if (ctx->seen & SHMEM_SEEN_INODES) { @@ -3634,6 +3672,29 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root) if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID)) seq_printf(seq, ",gid=%u", from_kgid_munged(&init_user_ns, sbinfo->gid)); + + /* + * Showing inode{64,32} might be useful even if it's the system default, + * since then people don't have to resort to checking both here and + * /proc/config.gz to confirm 64-bit inums were successfully applied + * (which may not even exist if IKCONFIG_PROC isn't enabled). + * + * We hide it when inode64 isn't the default and we are using 32-bit + * inodes, since that probably just means the feature isn't even under + * consideration. + * + * As such: + * + * +-----------------+-----------------+ + * | TMPFS_INODE64=y | TMPFS_INODE64=n | + * +------------------+-----------------+-----------------+ + * | full_inums=true | show | show | + * | full_inums=false | show | hide | + * +------------------+-----------------+-----------------+ + * + */ + if (IS_ENABLED(CONFIG_TMPFS_INODE64) || sbinfo->full_inums) + seq_printf(seq, ",inode%d", (sbinfo->full_inums ? 
64 : 32)); #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */ if (sbinfo->huge) @@ -3681,6 +3742,8 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) ctx->blocks = shmem_default_max_blocks(); if (!(ctx->seen & SHMEM_SEEN_INODES)) ctx->inodes = shmem_default_max_inodes(); + if (!(ctx->seen & SHMEM_SEEN_INUMS)) + ctx->full_inums = IS_ENABLED(CONFIG_TMPFS_INODE64); } else { sb->s_flags |= SB_NOUSER; } @@ -3694,6 +3757,7 @@ static int shmem_fill_super(struct super_block *sb, struct fs_context *fc) sbinfo->free_inodes = sbinfo->max_inodes = ctx->inodes; sbinfo->uid = ctx->uid; sbinfo->gid = ctx->gid; + sbinfo->full_inums = ctx->full_inums; sbinfo->mode = ctx->mode; sbinfo->huge = ctx->huge; sbinfo->mpol = ctx->mpol; -- 2.24.1 ^ permalink raw reply related [flat|nested] 27+ messages in thread
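[Editor's note] The inode32 wraparound emulation in the shmem_get_inode() hunk above is compact but worth isolating: with full_inums (inode64) off, the per-sb counter wraps back to 1 once it would exceed UINT_MAX, matching get_next_ino()'s unsigned-int behaviour instead of handing out an ino that a legacy 32-bit stat() would reject with EOVERFLOW. A minimal userspace model, with illustrative names (in the kernel this runs under sbinfo->stat_lock):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the per-sb allocation logic above: when full_inums is off
 * and the counter has passed UINT_MAX, wrap back to 1 (never 0) to
 * emulate get_next_ino()'s uint wraparound; with inode64, keep counting
 * through the full width of ino_t.
 */
static uint64_t sb_next_ino(uint64_t *next_ino, int full_inums)
{
	if (!full_inums && *next_ino > 0xffffffffULL)
		*next_ino = 1; /* emulate uint wraparound, skipping ino 0 */
	return (*next_ino)++;
}
```

Note that UINT_MAX itself is still handed out under inode32; only the value one past it triggers the wrap, which is when the patch also emits its pr_warn() suggesting the inode64 mount option.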
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-05 12:06 ` [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb Chris Down @ 2020-01-07 0:10 ` Dave Chinner 2020-01-07 0:16 ` Chris Down 2020-01-08 14:37 ` Mikael Magnusson 0 siblings, 2 replies; 27+ messages in thread From: Dave Chinner @ 2020-01-07 0:10 UTC (permalink / raw) To: Chris Down Cc: linux-mm, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Sun, Jan 05, 2020 at 12:06:05PM +0000, Chris Down wrote: > The default is still set to inode32 for backwards compatibility, but > system administrators can opt in to the new 64-bit inode numbers by > either: > > 1. Passing inode64 on the command line when mounting, or > 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y > > The inode64 and inode32 names are used based on existing precedent from > XFS. Please don't copy this misfeature of XFS. The inode32/inode64 XFS options were a horrible hack made more than 20 years ago when NFSv2 was still in use and 64 bit inodes could not be used for NFSv2 exports. It then continued to be used because 32bit NFSv3 clients were unable to handle 64 bit inodes. It took 15 years for us to be able to essentially deprecate inode32 (inode64 is the default behaviour), and we were very happy to get that albatross off our necks. In reality, almost everything out there in the world handles 64 bit inodes correctly, including 32 bit machines and 32bit binaries on 64 bit machines. And, IMNSHO, there's no excuse these days for 32 bit binaries that don't use the *64() syscall variants directly and hence support 64 bit inodes correctly out of the box on all platforms. I don't think we should be repeating past mistakes by trying to cater for broken 32 bit applications on 64 bit machines in this day and age. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 0:10 ` Dave Chinner @ 2020-01-07 0:16 ` Chris Down 2020-01-07 0:39 ` Dave Chinner 2020-01-08 14:37 ` Mikael Magnusson 1 sibling, 1 reply; 27+ messages in thread From: Chris Down @ 2020-01-07 0:16 UTC (permalink / raw) To: Dave Chinner Cc: linux-mm, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team Dave Chinner writes: >It took 15 years for us to be able to essentially deprecate >inode32 (inode64 is the default behaviour), and we were very happy >to get that albatross off our necks. In reality, almost everything >out there in the world handles 64 bit inodes correctly >including 32 bit machines and 32bit binaries on 64 bit machines. >And, IMNSHO, there no excuse these days for 32 bit binaries that >don't using the *64() syscall variants directly and hence support >64 bit inodes correctlyi out of the box on all platforms. > >I don't think we should be repeating past mistakes by trying to >cater for broken 32 bit applications on 64 bit machines in this day >and age. I'm very glad to hear that. I strongly support moving to 64-bit inums in all cases if there is precedent that it's not a compatibility issue, but from the comments on my original[0] patch (especially that they strayed from the original patches' change to use ino_t directly into slab reuse), I'd been given the impression that it was known to be one. From my perspective I have no evidence that inode32 is needed other than the comment from Jeff above get_next_ino. If that turns out not to be a problem, I am more than happy to just wholesale migrate 64-bit inodes per-sb in tmpfs. 0: https://lore.kernel.org/patchwork/patch/1170963/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 0:16 ` Chris Down @ 2020-01-07 0:39 ` Dave Chinner 2020-01-07 6:54 ` Amir Goldstein 0 siblings, 1 reply; 27+ messages in thread From: Dave Chinner @ 2020-01-07 0:39 UTC (permalink / raw) To: Chris Down Cc: linux-mm, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Amir Goldstein, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote: > Dave Chinner writes: > > It took 15 years for us to be able to essentially deprecate > > inode32 (inode64 is the default behaviour), and we were very happy > > to get that albatross off our necks. In reality, almost everything > > out there in the world handles 64 bit inodes correctly > > including 32 bit machines and 32bit binaries on 64 bit machines. > > And, IMNSHO, there no excuse these days for 32 bit binaries that > > don't using the *64() syscall variants directly and hence support > > 64 bit inodes correctlyi out of the box on all platforms. > > > > I don't think we should be repeating past mistakes by trying to > > cater for broken 32 bit applications on 64 bit machines in this day > > and age. > > I'm very glad to hear that. I strongly support moving to 64-bit inums in all > cases if there is precedent that it's not a compatibility issue, but from > the comments on my original[0] patch (especially that they strayed from the > original patches' change to use ino_t directly into slab reuse), I'd been > given the impression that it was known to be one. > > From my perspective I have no evidence that inode32 is needed other than the > comment from Jeff above get_next_ino. If that turns out not to be a problem, > I am more than happy to just wholesale migrate 64-bit inodes per-sb in > tmpfs. Well, that's my comment above about 32 bit apps using non-LFS compliant interfaces in this day and age. 
It's essentially a legacy interface these days, and anyone trying to access a modern linux filesystem (btrfs, XFS, ext4, etc) on 64 bit systems needs to handle 64 bit inodes because they all can create >32bit inode numbers in their default configurations. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 0:39 ` Dave Chinner @ 2020-01-07 6:54 ` Amir Goldstein 2020-01-07 8:36 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Amir Goldstein @ 2020-01-07 6:54 UTC (permalink / raw) To: Dave Chinner Cc: Chris Down, Linux MM, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote: > > Dave Chinner writes: > > > It took 15 years for us to be able to essentially deprecate > > > inode32 (inode64 is the default behaviour), and we were very happy > > > to get that albatross off our necks. In reality, almost everything > > > out there in the world handles 64 bit inodes correctly > > > including 32 bit machines and 32bit binaries on 64 bit machines. > > > And, IMNSHO, there no excuse these days for 32 bit binaries that > > > don't using the *64() syscall variants directly and hence support > > > 64 bit inodes correctlyi out of the box on all platforms. > > > > > > I don't think we should be repeating past mistakes by trying to > > > cater for broken 32 bit applications on 64 bit machines in this day > > > and age. > > > > I'm very glad to hear that. I strongly support moving to 64-bit inums in all > > cases if there is precedent that it's not a compatibility issue, but from > > the comments on my original[0] patch (especially that they strayed from the > > original patches' change to use ino_t directly into slab reuse), I'd been > > given the impression that it was known to be one. > > > > From my perspective I have no evidence that inode32 is needed other than the > > comment from Jeff above get_next_ino. If that turns out not to be a problem, > > I am more than happy to just wholesale migrate 64-bit inodes per-sb in > > tmpfs. 
> > Well, that's my comment above about 32 bit apps using non-LFS > compliant interfaces in this day and age. It's essentially a legacy > interface these days, and anyone trying to access a modern linux > filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle > 64 bit inodes because they all can create >32bit inode numbers > in their default configurations. > Chris, Following Dave's comment, let's keep the config option, but make it default to Y and remove the mount option for now. If somebody shouts, mount option could be re-added later. This way at least we leave an option for workaround for an unlikely breakage. Thanks, Amir. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 6:54 ` Amir Goldstein @ 2020-01-07 8:36 ` Hugh Dickins 2020-01-07 10:12 ` Amir Goldstein 2020-01-07 20:59 ` Dave Chinner 0 siblings, 2 replies; 27+ messages in thread From: Hugh Dickins @ 2020-01-07 8:36 UTC (permalink / raw) To: Amir Goldstein Cc: Dave Chinner, Chris Down, Linux MM, Hugh Dickins, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, 7 Jan 2020, Amir Goldstein wrote: > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote: > > > Dave Chinner writes: > > > > It took 15 years for us to be able to essentially deprecate > > > > inode32 (inode64 is the default behaviour), and we were very happy > > > > to get that albatross off our necks. In reality, almost everything > > > > out there in the world handles 64 bit inodes correctly > > > > including 32 bit machines and 32bit binaries on 64 bit machines. > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that > > > > don't using the *64() syscall variants directly and hence support > > > > 64 bit inodes correctlyi out of the box on all platforms. Interesting take on it. I'd all along imagined we would have to resort to a mount option for safety, but Dave is right that I was too focused on avoiding tmpfs regressions, without properly realizing that people were very unlikely to have written such tools for tmpfs in particular, but written them for all filesystems, and already encountered and fixed such EOVERFLOWs for other filesystems. Hmm, though how readily does XFS actually reach the high inos on ordinary users' systems? > > > > > > > > I don't think we should be repeating past mistakes by trying to > > > > cater for broken 32 bit applications on 64 bit machines in this day > > > > and age. > > > > > > I'm very glad to hear that. 
I strongly support moving to 64-bit inums in all > > > cases if there is precedent that it's not a compatibility issue, but from > > > the comments on my original[0] patch (especially that they strayed from the > > > original patches' change to use ino_t directly into slab reuse), I'd been > > > given the impression that it was known to be one. > > > > > > From my perspective I have no evidence that inode32 is needed other than the > > > comment from Jeff above get_next_ino. If that turns out not to be a problem, > > > I am more than happy to just wholesale migrate 64-bit inodes per-sb in > > > tmpfs. > > > > Well, that's my comment above about 32 bit apps using non-LFS > > compliant interfaces in this day and age. It's essentially a legacy > > interface these days, and anyone trying to access a modern linux > > filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle > > 64 bit inodes because they all can create >32bit inode numbers > > in their default configurations. I thought ext4 still deals in 32-bit inos, so ext4-world would not necessarily have caught up? Though, arguing the other way, IIUC 64-bit ino support comes bundled with file sizes over 2GB (on all architectures?), and it's hard to imagine who gets by with a 2GB file size limit nowadays. > > > > Chris, > > Following Dave's comment, let's keep the config option, but make it > default to Y and remove the mount option for now. > If somebody shouts, mount option could be re-added later. > This way at least we leave an option for workaround for an unlikely > breakage. Though I don't share Dave's strength of aversion to the inode32 and inode64 mount options, I do agree that it's preferable not to offer them if they're so very unlikely to be necessary. Do we even need the config option? 
We certainly do need to hold the patch with config option and mount options "in our back pocket", so that if a regression does get reported, then upstream and stable can respond to fix it quickly; but do we need more than that? My main concern is that a new EOVERFLOW might not be reported as such, and waste a lot of someone's time to track down. Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 8:36 ` Hugh Dickins @ 2020-01-07 10:12 ` Amir Goldstein 2020-01-07 21:07 ` Dave Chinner 2020-01-07 20:59 ` Dave Chinner 1 sibling, 1 reply; 27+ messages in thread From: Amir Goldstein @ 2020-01-07 10:12 UTC (permalink / raw) To: Hugh Dickins Cc: Dave Chinner, Chris Down, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, Jan 7, 2020 at 10:36 AM Hugh Dickins <hughd@google.com> wrote: > > On Tue, 7 Jan 2020, Amir Goldstein wrote: > > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote: > > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote: > > > > Dave Chinner writes: > > > > > It took 15 years for us to be able to essentially deprecate > > > > > inode32 (inode64 is the default behaviour), and we were very happy > > > > > to get that albatross off our necks. In reality, almost everything > > > > > out there in the world handles 64 bit inodes correctly > > > > > including 32 bit machines and 32bit binaries on 64 bit machines. > > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that > > > > > don't using the *64() syscall variants directly and hence support > > > > > 64 bit inodes correctlyi out of the box on all platforms. > > Interesting take on it. I'd all along imagined we would have to resort > to a mount option for safety, but Dave is right that I was too focused on > avoiding tmpfs regressions, without properly realizing that people were > very unlikely to have written such tools for tmpfs in particular, but > written them for all filesystems, and already encountered and fixed > such EOVERFLOWs for other filesystems. > > Hmm, though how readily does XFS actually reach the high inos on > ordinary users' systems? 
> Define 'ordinary'? If my calculations are correct, with default mkfs.xfs any inode allocated from logical offset > 2TB on a volume has high ino bits set. Besides, a deployment with more than 4G inodes shouldn't be hard to find. Thanks, Amir. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 10:12 ` Amir Goldstein @ 2020-01-07 21:07 ` Dave Chinner 2020-01-07 21:37 ` Chris Mason 0 siblings, 1 reply; 27+ messages in thread From: Dave Chinner @ 2020-01-07 21:07 UTC (permalink / raw) To: Amir Goldstein Cc: Hugh Dickins, Chris Down, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, Jan 07, 2020 at 12:12:00PM +0200, Amir Goldstein wrote: > On Tue, Jan 7, 2020 at 10:36 AM Hugh Dickins <hughd@google.com> wrote: > > > > On Tue, 7 Jan 2020, Amir Goldstein wrote: > > > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote: > > > > > Dave Chinner writes: > > > > > > It took 15 years for us to be able to essentially deprecate > > > > > > inode32 (inode64 is the default behaviour), and we were very happy > > > > > > to get that albatross off our necks. In reality, almost everything > > > > > > out there in the world handles 64 bit inodes correctly > > > > > > including 32 bit machines and 32bit binaries on 64 bit machines. > > > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that > > > > > > don't using the *64() syscall variants directly and hence support > > > > > > 64 bit inodes correctlyi out of the box on all platforms. > > > > Interesting take on it. I'd all along imagined we would have to resort > > to a mount option for safety, but Dave is right that I was too focused on > > avoiding tmpfs regressions, without properly realizing that people were > > very unlikely to have written such tools for tmpfs in particular, but > > written them for all filesystems, and already encountered and fixed > > such EOVERFLOWs for other filesystems. > > > > Hmm, though how readily does XFS actually reach the high inos on > > ordinary users' systems? 
> > > > Define 'ordinary' > I my calculations are correct, with default mkfs.xfs any inode allocated > from logical offset > 2TB on a volume has high ino bits set. > Besides, a deployment with more than 4G inodes shouldn't be hard to find. You don't need to allocate 4 billion inodes to get >32bit inodes in XFS - the inode number is an encoding of the physical location of the inode in the filesystem. Hence we just have to allocate the inode at a disk address higher than 2TB into the device and we overflow 32bits. e.g. make a sparse fs image file of 10TB, mount it, create 50 subdirs, then start creating zero length files spread across the 50 separate subdirectories. You should see >32bit inode numbers almost immediately. (i.e. as soon as we allocate inodes in AG 2 or higher) IOWs, there are *lots* of 64bit inode numbers out there on XFS filesystems.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 21:07 ` Dave Chinner @ 2020-01-07 21:37 ` Chris Mason 2020-01-08 11:24 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Chris Mason @ 2020-01-07 21:37 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Hugh Dickins, Chris Down, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, Kernel Team On 7 Jan 2020, at 16:07, Dave Chinner wrote: > IOWs, there are *lots* of 64bit inode numbers out there on XFS > filesystems.... It's less likely in btrfs but +1 to all of Dave's comments. I'm happy to run a scan on machines in the fleet and see how many have 64 bit inodes (either buttery or x-y), but it's going to be a lot. -chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 21:37 ` Chris Mason @ 2020-01-08 11:24 ` Hugh Dickins 2020-01-09 0:43 ` Jeff Layton 2020-01-10 16:45 ` Chris Down 0 siblings, 2 replies; 27+ messages in thread From: Hugh Dickins @ 2020-01-08 11:24 UTC (permalink / raw) To: Chris Mason Cc: Dave Chinner, Amir Goldstein, Hugh Dickins, Chris Down, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, Kernel Team On Tue, 7 Jan 2020, Chris Mason wrote: > On 7 Jan 2020, at 16:07, Dave Chinner wrote: > > > IOWs, there are *lots* of 64bit inode numbers out there on XFS > > filesystems.... > > It's less likely in btrfs but +1 to all of Dave's comments. I'm happy > to run a scan on machines in the fleet and see how many have 64 bit > inodes (either buttery or x-y), but it's going to be a lot. Dave, Amir, Chris, many thanks for the info you've filled in - and absolutely no need to run any scan on your fleet for this, I think we can be confident that even if fb had some 15-year-old tool in use on its fleet of 2GB-file filesystems, it would not be the one to insist on a kernel revert of 64-bit tmpfs inos. The picture looks clear now: while ChrisD does need to hold on to his config option and inode32/inode64 mount option patch, it is much better left out of the kernel until (very unlikely) proved necessary. Thanks, Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-08 11:24 ` Hugh Dickins @ 2020-01-09 0:43 ` Jeff Layton 2020-01-10 16:45 ` Chris Down 1 sibling, 0 replies; 27+ messages in thread From: Jeff Layton @ 2020-01-09 0:43 UTC (permalink / raw) To: Hugh Dickins, Chris Mason Cc: Dave Chinner, Amir Goldstein, Chris Down, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, Kernel Team On Wed, 2020-01-08 at 03:24 -0800, Hugh Dickins wrote: > On Tue, 7 Jan 2020, Chris Mason wrote: > > On 7 Jan 2020, at 16:07, Dave Chinner wrote: > > > > > IOWs, there are *lots* of 64bit inode numbers out there on XFS > > > filesystems.... > > > > It's less likely in btrfs but +1 to all of Dave's comments. I'm happy > > to run a scan on machines in the fleet and see how many have 64 bit > > inodes (either buttery or x-y), but it's going to be a lot. > > Dave, Amir, Chris, many thanks for the info you've filled in - > and absolutely no need to run any scan on your fleet for this, > I think we can be confident that even if fb had some 15-year-old tool > in use on its fleet of 2GB-file filesystems, it would not be the one > to insist on a kernel revert of 64-bit tmpfs inos. > > The picture looks clear now: while ChrisD does need to hold on to his > config option and inode32/inode64 mount option patch, it is much better > left out of the kernel until (very unlikely) proved necessary. This approach seems like the best course to me. FWIW, at the time we capped this at 32-bits (2007), 64-bit machines were really just becoming widely available, and it was quite common to run 32-bit, non-LFS apps on a 64-bit kernel. Users were hitting spurious EOVERFLOW errors all over the place so this seemed like the best way to address it. The world has changed a lot since then though, and one would hope that almost everything these days is compiled with FILE_OFFSET_BITS=64. Fingers crossed! 
-- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-08 11:24 ` Hugh Dickins 2020-01-09 0:43 ` Jeff Layton @ 2020-01-10 16:45 ` Chris Down 2020-01-13 7:36 ` Hugh Dickins 1 sibling, 1 reply; 27+ messages in thread From: Chris Down @ 2020-01-10 16:45 UTC (permalink / raw) To: Hugh Dickins, Dave Chinner Cc: Chris Mason, Amir Goldstein, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, Kernel Team Hi Hugh, Dave, Hugh Dickins writes: >Dave, Amir, Chris, many thanks for the info you've filled in - >and absolutely no need to run any scan on your fleet for this, >I think we can be confident that even if fb had some 15-year-old tool >in use on its fleet of 2GB-file filesystems, it would not be the one >to insist on a kernel revert of 64-bit tmpfs inos. > >The picture looks clear now: while ChrisD does need to hold on to his >config option and inode32/inode64 mount option patch, it is much better >left out of the kernel until (very unlikely) proved necessary. Based on Mikael's comment above about Steam binaries, and the lack of likelihood that they can be rebuilt, I'm inclined to still keep inode{64,32}, but make legacy behaviour require explicit opt-in. That is: - Default it to inode64 - Remove the Kconfig option - Only print it as an option if tmpfs was explicitly mounted with inode32 The reason I suggest keeping this is that I'm mildly concerned that the kind of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to be the kind of people that would find it in an rc. Other than that, the first patch could be similar to how it is now, incorporating Hugh's improvements to the first patch to put everything under the same stat_lock in shmem_reserve_inode. What do you folks think? Thanks, Chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-10 16:45 ` Chris Down @ 2020-01-13 7:36 ` Hugh Dickins 2020-01-20 15:11 ` Chris Down 0 siblings, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2020-01-13 7:36 UTC (permalink / raw) To: Chris Down Cc: Hugh Dickins, Dave Chinner, Chris Mason, Amir Goldstein, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, Mikael Magnusson, linux-fsdevel, linux-kernel, Kernel Team On Fri, 10 Jan 2020, Chris Down wrote: > Hugh Dickins writes: > > Dave, Amir, Chris, many thanks for the info you've filled in - > > and absolutely no need to run any scan on your fleet for this, > > I think we can be confident that even if fb had some 15-year-old tool > > in use on its fleet of 2GB-file filesystems, it would not be the one > > to insist on a kernel revert of 64-bit tmpfs inos. > > > > The picture looks clear now: while ChrisD does need to hold on to his > > config option and inode32/inode64 mount option patch, it is much better > > left out of the kernel until (very unlikely) proved necessary. > > Based on Mikael's comment above about Steam binaries, and the lack of > likelihood that they can be rebuilt, I'm inclined to still keep inode{64,32}, > but make legacy behaviour require explicit opt-in. That is: > > - Default it to inode64 > - Remove the Kconfig option > - Only print it as an option if tmpfs was explicitly mounted with inode32 > > The reason I suggest keeping this is that I'm mildly concerned that the kind > of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS > -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to > be the kind of people that would find it in an rc. Okay. None of us are thrilled with it, but I agree that Mikael's observation should override our developer's preference. 
So the "inode64" option will be accepted but redundant on mounting, but exists for use as a remount option after mounting or remounting with "inode32": allowing the admin to switch temporarily to mask off the high ino bits with "inode32" when needing to run a limited binary. Documentation and commit message to alert Andrew and Linus and distros that we are risking some breakage with this, but supplying the antidote (not breakage of any distros themselves, no doubt they're all good; but breakage of what some users might run on them). > > Other than that, the first patch could be similar to how it is now, > incorporating Hugh's improvements to the first patch to put everything under > the same stat_lock in shmem_reserve_inode. So, I persuaded Amir to the other aspects my version, but did not persuade you? Well, I can live with that (or if not, can send mods on top of yours): but please read again why I was uncomfortable with yours, to check that you still prefer it (I agree that your patch is simpler, and none of my discomfort decisive). Thanks, Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-13 7:36 ` Hugh Dickins @ 2020-01-20 15:11 ` Chris Down 2020-02-25 23:14 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Chris Down @ 2020-01-20 15:11 UTC (permalink / raw) To: Hugh Dickins Cc: Dave Chinner, Chris Mason, Amir Goldstein, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, Mikael Magnusson, linux-fsdevel, linux-kernel, Kernel Team Hi Hugh, Sorry this response took so long, I had some non-work issues that took a lot of time last week. Hugh Dickins writes: >On Fri, 10 Jan 2020, Chris Down wrote: >> Hugh Dickins writes: >> > Dave, Amir, Chris, many thanks for the info you've filled in - >> > and absolutely no need to run any scan on your fleet for this, >> > I think we can be confident that even if fb had some 15-year-old tool >> > in use on its fleet of 2GB-file filesystems, it would not be the one >> > to insist on a kernel revert of 64-bit tmpfs inos. >> > >> > The picture looks clear now: while ChrisD does need to hold on to his >> > config option and inode32/inode64 mount option patch, it is much better >> > left out of the kernel until (very unlikely) proved necessary. >> >> Based on Mikael's comment above about Steam binaries, and the lack of >> likelihood that they can be rebuilt, I'm inclined to still keep inode{64,32}, >> but make legacy behaviour require explicit opt-in. That is: >> >> - Default it to inode64 >> - Remove the Kconfig option >> - Only print it as an option if tmpfs was explicitly mounted with inode32 >> >> The reason I suggest keeping this is that I'm mildly concerned that the kind >> of users who might be impacted by this change due to 32-bit _FILE_OFFSET_BITS >> -- like the not-too-uncommon case that Mikael brings up -- seem unlikely to >> be the kind of people that would find it in an rc. > >Okay. 
None of us are thrilled with it, but I agree that >Mikael's observation should override our developer's preference. > >So the "inode64" option will be accepted but redundant on mounting, >but exists for use as a remount option after mounting or remounting >with "inode32": allowing the admin to switch temporarily to mask off >the high ino bits with "inode32" when needing to run a limited binary. > >Documentation and commit message to alert Andrew and Linus and distros >that we are risking some breakage with this, but supplying the antidote >(not breakage of any distros themselves, no doubt they're all good; >but breakage of what some users might run on them). Sounds good. >> >> Other than that, the first patch could be similar to how it is now, >> incorporating Hugh's improvements to the first patch to put everything under >> the same stat_lock in shmem_reserve_inode. > >So, I persuaded Amir to the other aspects my version, but did not >persuade you? Well, I can live with that (or if not, can send mods >on top of yours): but please read again why I was uncomfortable with >yours, to check that you still prefer it (I agree that your patch is >simpler, and none of my discomfort decisive). Hmm, which bit were you thinking of? The lack of batching, shmem_encode_fh(), or the fact that nr_inodes can now be 0 on non-internal mounts? For batching, I'm neutral. I'm happy to use the approach from your patch and integrate it (and credit you, of course). For shmem_encode_fh, I'm not totally sure I understand the concern, if that's what you mean. For nr_inodes, I agree that intentional or unintentional, we should at least handle this case for now and can adjust later if the behaviour changes. Thanks again, Chris ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-20 15:11 ` Chris Down @ 2020-02-25 23:14 ` Hugh Dickins 0 siblings, 0 replies; 27+ messages in thread From: Hugh Dickins @ 2020-02-25 23:14 UTC (permalink / raw) To: Chris Down Cc: Hugh Dickins, Dave Chinner, Chris Mason, Amir Goldstein, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, Mikael Magnusson, linux-fsdevel, linux-kernel, Kernel Team On Mon, 20 Jan 2020, Chris Down wrote: > Hi Hugh, > > Sorry this response took so long, I had some non-work issues that took a lot > of time last week. No, clearly it's I who must apologize to you for very slow response. > > Hugh Dickins writes: > > > > So the "inode64" option will be accepted but redundant on mounting, > > but exists for use as a remount option after mounting or remounting > > with "inode32": allowing the admin to switch temporarily to mask off > > the high ino bits with "inode32" when needing to run a limited binary. > > > > Documentation and commit message to alert Andrew and Linus and distros > > that we are risking some breakage with this, but supplying the antidote > > (not breakage of any distros themselves, no doubt they're all good; > > but breakage of what some users might run on them). > > Sounds good. > > > > > > > Other than that, the first patch could be similar to how it is now, > > > incorporating Hugh's improvements to the first patch to put everything > > > under > > > the same stat_lock in shmem_reserve_inode. > > > > So, I persuaded Amir to the other aspects my version, but did not > > persuade you? Well, I can live with that (or if not, can send mods > > on top of yours): but please read again why I was uncomfortable with > > yours, to check that you still prefer it (I agree that your patch is > > simpler, and none of my discomfort decisive). > > Hmm, which bit were you thinking of? 
The lack of batching, shmem_encode_fh(), > or the fact that nr_inodes can now be 0 on non-internal mounts? I was uncomfortable with tmpfs depleting get_next_ino()'s pool in some configurations, and wanted the (get_next_ino-like) per-cpu but per-sb batching for nr_inodes=0, to minimize its stat_lock contention. I did not have a patch to shmem_encode_fh(), had gone through thinking that 64-bit inos made an additional fix there necessary; but Amir then educated us that it is safe as is, though could be cleaned up later. nr_inodes can be 0 on non-internal mounts, for many years. > > For batching, I'm neutral. I'm happy to use the approach from your patch and > integrate it (and credit you, of course). Credit not important, but you may well want to blame me for that complication :) > > For shmem_encode_fh, I'm not totally sure I understand the concern, if that's > what you mean. My concern had been that shmem_encode_fh() builds up an fh from i_ino and more, looks well prepared for a 64-bit ino, but appeared to be announcing a 32-bit ino in its return value: Amir reassures us that that return value does not matter. > > For nr_inodes, I agree that intentional or unintentional, we should at least > handle this case for now and can adjust later if the behaviour changes. nr_inodes=0 is an intentional configuration (but 0 denoting infinity: not very clean, and I've sometimes regretted that choice). If there's any behavior change, that's a separate matter from these 64-bit ino patches; maybe I mentioned it in passing and confused us - that we seem to have recently allowed a remounting limited<->unlimited that was not permitted before, and might or might not need a fix patch. Sorry again for delaying you, Chris: not at all what I'd wanted to do. Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 8:36 ` Hugh Dickins 2020-01-07 10:12 ` Amir Goldstein @ 2020-01-07 20:59 ` Dave Chinner 1 sibling, 0 replies; 27+ messages in thread From: Dave Chinner @ 2020-01-07 20:59 UTC (permalink / raw) To: Hugh Dickins Cc: Amir Goldstein, Chris Down, Linux MM, Andrew Morton, Al Viro, Matthew Wilcox, Jeff Layton, Johannes Weiner, Tejun Heo, linux-fsdevel, linux-kernel, kernel-team On Tue, Jan 07, 2020 at 12:36:25AM -0800, Hugh Dickins wrote: > On Tue, 7 Jan 2020, Amir Goldstein wrote: > > On Tue, Jan 7, 2020 at 2:40 AM Dave Chinner <david@fromorbit.com> wrote: > > > On Tue, Jan 07, 2020 at 12:16:43AM +0000, Chris Down wrote: > > > > Dave Chinner writes: > > > > > It took 15 years for us to be able to essentially deprecate > > > > > inode32 (inode64 is the default behaviour), and we were very happy > > > > > to get that albatross off our necks. In reality, almost everything > > > > > out there in the world handles 64 bit inodes correctly > > > > > including 32 bit machines and 32bit binaries on 64 bit machines. > > > > > And, IMNSHO, there no excuse these days for 32 bit binaries that > > > > > don't using the *64() syscall variants directly and hence support > > > > > 64 bit inodes correctlyi out of the box on all platforms. > > Interesting take on it. I'd all along imagined we would have to resort > to a mount option for safety, but Dave is right that I was too focused on > avoiding tmpfs regressions, without properly realizing that people were > very unlikely to have written such tools for tmpfs in particular, but > written them for all filesystems, and already encountered and fixed > such EOVERFLOWs for other filesystems. > > Hmm, though how readily does XFS actually reach the high inos on > ordinary users' systems? Quite frequently these days - the threshold for exceeding 32 bits in the inode number is dependent on the inode size. 
Old v4 filesystems use 256 byte inodes by default, so they overflow 32bits when the filesystem size is >1TB. Current v5 filesystems use 512 byte inodes, so they overflow 32 bits on filesystems >2TB. > > > > I'm very glad to hear that. I strongly support moving to 64-bit inums in all > > > > cases if there is precedent that it's not a compatibility issue, but from > > > > the comments on my original[0] patch (especially that they strayed from the > > > > original patches' change to use ino_t directly into slab reuse), I'd been > > > > given the impression that it was known to be one. > > > > > > > > From my perspective I have no evidence that inode32 is needed other than the > > > > comment from Jeff above get_next_ino. If that turns out not to be a problem, > > > > I am more than happy to just wholesale migrate 64-bit inodes per-sb in > > > > tmpfs. > > > > > > Well, that's my comment above about 32 bit apps using non-LFS > > > compliant interfaces in this day and age. It's essentially a legacy > > > interface these days, and anyone trying to access a modern linux > > > filesystem (btrfs, XFS, ext4, etc) ion 64 bit systems need to handle > > > 64 bit inodes because they all can create >32bit inode numbers > > > in their default configurations. > > I thought ext4 still deals in 32-bit inos, so ext4-world would not > necessarily have caught up? Hmmm - I might have got my wires crossed there - there *was* a project some time ago to increase the ext4 inode number size for large filesystems. Ah, yeah, there we go: https://lore.kernel.org/linux-ext4/20171101212455.47964-1-artem.blagodarenko@gmail.com/ I thought it had been rolled in with all the other "make stuff larger" features that ext4 has... > Though, arguing the other way, IIUC 64-bit > ino support comes bundled with file sizes over 2GB (on all architectures?), > and it's hard to imagine who gets by with a 2GB file size limit nowadays.
Right, you need to define LFS support for >2GB file support and you get 64 bit inode support with that for free. It's only legacy binaries that haven't been rebuilt in the past 15 years that are an issue here, but there are so many other things that can go trivially wrong with such binaries I think that inode numbers are the least of the worries here... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb 2020-01-07 0:10 ` Dave Chinner 2020-01-07 0:16 ` Chris Down @ 2020-01-08 14:37 ` Mikael Magnusson 2020-01-13 6:58 ` Hugh Dickins 1 sibling, 1 reply; 27+ messages in thread From: Mikael Magnusson @ 2020-01-08 14:37 UTC (permalink / raw) To: david Cc: akpm, amir73il, chris, hannes, hughd, jlayton, kernel-team, linux-fsdevel, linux-kernel, linux-mm, tj, viro Dave Chinner writes: > On Sun, Jan 05, 2020 at 12:06:05PM +0000, Chris Down wrote: > > The default is still set to inode32 for backwards compatibility, but > > system administrators can opt in to the new 64-bit inode numbers by > > either: > > > > 1. Passing inode64 on the command line when mounting, or > > 2. Configuring the kernel with CONFIG_TMPFS_INODE64=y > > > > The inode64 and inode32 names are used based on existing precedent from > > XFS. > > Please don't copy this misfeature of XFS. > > The inode32/inode64 XFS options were a horrible hack made more than > 20 years ago when NFSv2 was still in use and 64 bit inodes could > not be used for NFSv2 exports. It was then continued to be used > because 32bit NFSv3 clients were unable to handle 64 bit inodes. > > It took 15 years for us to be able to essentially deprecate > inode32 (inode64 is the default behaviour), and we were very happy > to get that albatross off our necks. In reality, almost everything > out there in the world handles 64 bit inodes correctly > including 32 bit machines and 32bit binaries on 64 bit machines. > And, IMNSHO, there no excuse these days for 32 bit binaries that > don't using the *64() syscall variants directly and hence support > 64 bit inodes correctlyi out of the box on all platforms. > > I don't think we should be repeating past mistakes by trying to > cater for broken 32 bit applications on 64 bit machines in this day > and age. Hi, It's unfortunately not true that everything handles this correctly. 
32-bit binaries for games on Steam that use stat() without the 64 are
so prevalent that I got tired of adding the LD_PRELOAD wrapper script
and just patched out the EOVERFLOW return from glibc instead. (They
obviously don't care about the inode value at all, and I don't use any
other 32-bit binaries that do.) This is probably a class of binaries
you don't care very much about, and not very likely to be installed on
a tmpfs that has wrapped around, but I thought it was worth mentioning
that they do exist anyway.

--
Mikael Magnusson
* Re: [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb
  2020-01-08 14:37 ` Mikael Magnusson
@ 2020-01-13  6:58 ` Hugh Dickins
  0 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2020-01-13 6:58 UTC (permalink / raw)
To: Mikael Magnusson
Cc: david, akpm, amir73il, chris, hannes, hughd, jlayton, kernel-team,
    linux-fsdevel, linux-kernel, linux-mm, tj, viro

On Wed, 8 Jan 2020, Mikael Magnusson wrote:
>
> It's unfortunately not true that everything handles this correctly.
> 32-bit binaries for games on Steam that use stat() without the 64 are
> so prevalent that I got tired of adding the LD_PRELOAD wrapper script
> and just patched out the EOVERFLOW return from glibc instead. (They
> obviously don't care about the inode value at all, and I don't use any
> other 32-bit binaries that do.) This is probably a class of binaries
> you don't care very much about, and not very likely to be installed on
> a tmpfs that has wrapped around, but I thought it was worth mentioning
> that they do exist anyway.

Thank you for alerting us to reality, Mikael: not what any of us wanted
to hear, but we do care about them, and it was well worth mentioning.

Hugh
end of thread, other threads:[~2020-02-25 23:14 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-01-05 12:05 [PATCH v5 0/2] fs: inode: shmem: Reduce risk of inum overflow Chris Down
2020-01-05 12:06 ` [PATCH v5 1/2] tmpfs: Add per-superblock i_ino support Chris Down
2020-01-06  2:03   ` zhengbin (A)
2020-01-06  6:41   ` Amir Goldstein
2020-01-07  8:01     ` Hugh Dickins
2020-01-07  8:35       ` Amir Goldstein
2020-01-08 10:58         ` Hugh Dickins
2020-01-08 12:51           ` Amir Goldstein
2020-01-06 13:17   ` Chris Down
2020-01-05 12:06 ` [PATCH v5 2/2] tmpfs: Support 64-bit inums per-sb Chris Down
2020-01-07  0:10   ` Dave Chinner
2020-01-07  0:16     ` Chris Down
2020-01-07  0:39       ` Dave Chinner
2020-01-07  6:54         ` Amir Goldstein
2020-01-07  8:36           ` Hugh Dickins
2020-01-07 10:12             ` Amir Goldstein
2020-01-07 21:07             ` Dave Chinner
2020-01-07 21:37               ` Chris Mason
2020-01-08 11:24                 ` Hugh Dickins
2020-01-09  0:43                   ` Jeff Layton
2020-01-10 16:45                     ` Chris Down
2020-01-13  7:36                       ` Hugh Dickins
2020-01-20 15:11                         ` Chris Down
2020-02-25 23:14                           ` Hugh Dickins
2020-01-07 20:59         ` Dave Chinner
2020-01-08 14:37   ` Mikael Magnusson
2020-01-13  6:58     ` Hugh Dickins