From: "zhengbin (A)" <zhengbin13@huawei.com>
To: Chris Down <chris@chrisdown.name>, <linux-fsdevel@vger.kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Jeff Layton <jlayton@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>, Tejun Heo <tj@kernel.org>,
	<linux-kernel@vger.kernel.org>, <kernel-team@fb.com>
Subject: Re: [PATCH] fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit
Date: Fri, 20 Dec 2019 11:05:13 +0800
Message-ID: <be85a39b-e4b5-9b93-a4ff-fc598d7e31f2@huawei.com>
In-Reply-To: <20191220024936.GA380394@chrisdown.name>


On 2019/12/20 10:49, Chris Down wrote:
> In Facebook production we are seeing heavy inode number wraparounds on
> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> with different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but
> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to uniquely point to one inode, which seems fair enough.
> This patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes
> to reduce the likelihood of wraparound. On architectures with 32-bit
> ino_t the problem is, at least, not made any worse than it is right now.
>
> I noted the concern in the comment above about 32-bit applications on a
> 64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
> in the commit message for 866b04fc, but these applications are going to
> get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
> since those will likely be 64-bit. Concerns about that seem slimmer
> compared to the disadvantages this presents for known, real users of
> this functionality on platforms with a 64-bit ino_t.
>
> Other approaches I've considered:
>
> - Use an IDA. If this is a problem for users with 32-bit ino_t as well,
>   this seems a feasible approach. For now this change is non-intrusive
>   enough, though, and doesn't make the situation any worse for them than
>   present at least.
> - Look for other approaches in userspace. I think this is less
>   feasible -- users do need to have a way to reliably determine inode
>   identity, and the risk of wraparound with a 2^32-sized counter is
>   pretty high, quite clearly manifesting in production for workloads
>   which make heavy use of tmpfs.
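
To make the identity assumption above concrete, here is a minimal userspace sketch of how an application might key a file on (device, inodenum) using stat(2). The names struct file_id, file_id_from_path() and file_id_equal() are hypothetical and exist only for this illustration; they are not taken from the application mentioned in the commit message.

```c
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical identity key: the (device, inode number) pair from stat(2). */
struct file_id {
	dev_t dev;
	ino_t ino;
};

/* Fill *out from the file at path. Returns 0 on success, -1 if stat() fails. */
int file_id_from_path(const char *path, struct file_id *out)
{
	struct stat st;

	if (stat(path, &st) != 0)
		return -1;

	out->dev = st.st_dev;
	out->ino = st.st_ino;
	return 0;
}

/*
 * Two keys compare equal when both fields match. If the filesystem hands
 * out the same inode number to two live files, this check can no longer
 * tell them apart.
 */
bool file_id_equal(const struct file_id *a, const struct file_id *b)
{
	return a->dev == b->dev && a->ino == b->ino;
}
```

A caller would typically record the key when an object is cached and compare it again on lookup; after a wraparound, two different tmpfs files can produce equal keys.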

I have sent an IDA-based approach before; see the details at

https://patchwork.kernel.org/patch/11254001/
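
For readers unfamiliar with the idea, the following is a rough userspace sketch of an allocator that always hands out the lowest free ID and lets freed IDs be reused. It is not the kernel's IDA API and not the code from the linked patch; struct id_pool, id_pool_alloc(), id_pool_free() and the pool size are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_IDS 1024u /* hypothetical pool size for this sketch */

/* One bit per ID: set means the ID is currently in use. */
struct id_pool {
	uint64_t used[SKETCH_MAX_IDS / 64];
};

/* Return the lowest free ID and mark it used, or -1 if the pool is full. */
int id_pool_alloc(struct id_pool *pool)
{
	for (unsigned int id = 0; id < SKETCH_MAX_IDS; id++) {
		uint64_t mask = UINT64_C(1) << (id % 64);

		if (!(pool->used[id / 64] & mask)) {
			pool->used[id / 64] |= mask;
			return (int)id;
		}
	}
	return -1;
}

/* Release an ID so that a later allocation may hand it out again. */
void id_pool_free(struct id_pool *pool, unsigned int id)
{
	assert(id < SKETCH_MAX_IDS);
	pool->used[id / 64] &= ~(UINT64_C(1) << (id % 64));
}
```

With such a scheme two live objects never share an ID, because an ID is only handed out again after it has been freed. The trade-off is that numbers are recycled rather than monotonically increasing.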

>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  fs/inode.c         | 29 ++++++++++++++++++-----------
>  include/linux/fs.h |  2 +-
>  2 files changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index aff2b5831168..8193c17e2d16 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -870,26 +870,33 @@ static struct inode *find_inode_fast(struct super_block *sb,
>   * This does not significantly increase overflow rate because every CPU can
>   * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
>   * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> - * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> - * overflow rate by 2x, which does not seem too significant.
> + * 2^32 range (for 32-bit ino_t), and is a worst-case. Even a 50% wastage would
> + * only increase overflow rate by 2x, which does not seem too significant. With
> + * a 64-bit ino_t, overflow in general is fairly hard to achieve.
>   *
> - * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> - * error if st_ino won't fit in target struct field. Use 32bit counter
> - * here to attempt to avoid that.
> + * Care should be taken not to overflow when at all possible, since generally
> + * userspace depends on (device, inodenum) being reliably unique.
>   */
>  #define LAST_INO_BATCH 1024
> -static DEFINE_PER_CPU(unsigned int, last_ino);
> +static DEFINE_PER_CPU(ino_t, last_ino);
>  
> -unsigned int get_next_ino(void)
> +ino_t get_next_ino(void)
>  {
> -	unsigned int *p = &get_cpu_var(last_ino);
> -	unsigned int res = *p;
> +	ino_t *p = &get_cpu_var(last_ino);
> +	ino_t res = *p;
>  
>  #ifdef CONFIG_SMP
>  	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
> -		static atomic_t shared_last_ino;
> -		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +		static atomic64_t shared_last_ino;
> +		u64 next = atomic64_add_return(LAST_INO_BATCH,
> +					       &shared_last_ino);
>  
> +		/*
> +		 * This might get truncated if ino_t is 32-bit, and so be more
> +		 * susceptible to wrap around than on environments where ino_t
> +		 * is 64-bit, but that's really no worse than always encoding
> +		 * `res` as unsigned int.
> +		 */
>  		res = next - LAST_INO_BATCH;
>  	}

This approach is the same as https://patchwork.kernel.org/patch/11023915/

which was

>  #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 190c45039359..ca1a04334c9e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3052,7 +3052,7 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
>  #endif
>  extern void unlock_new_inode(struct inode *);
>  extern void discard_new_inode(struct inode *);
> -extern unsigned int get_next_ino(void);
> +extern ino_t get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>  
>  extern void __iget(struct inode * inode);
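
The effect of widening the counter can be sketched in userspace. The model below follows the batching logic in the quoted hunk for a single CPU and emulates either a 32-bit or a 64-bit ino_t. The struct ino_sim type and the sim_* helpers are hypothetical, there is no real per-CPU or atomic handling, and the increment-and-skip-zero tail is an assumption about the part of get_next_ino() that the hunk does not show.

```c
#include <stdint.h>

#define LAST_INO_BATCH 1024

/* Userspace model of one CPU drawing batches from a shared counter. */
struct ino_sim {
	uint64_t shared_last_ino; /* stands in for the global atomic counter */
	uint64_t last_ino;        /* stands in for this CPU's last_ino */
	unsigned int ino_bits;    /* emulated ino_t width: 32 or 64 */
};

/* Truncate a value to the emulated ino_t width. */
static uint64_t sim_truncate(const struct ino_sim *s, uint64_t v)
{
	return s->ino_bits == 32 ? (uint64_t)(uint32_t)v : v;
}

uint64_t sim_get_next_ino(struct ino_sim *s)
{
	uint64_t res = s->last_ino;

	/* Local batch exhausted: reserve the next LAST_INO_BATCH numbers. */
	if ((res & (LAST_INO_BATCH - 1)) == 0) {
		s->shared_last_ino += LAST_INO_BATCH;
		res = sim_truncate(s, s->shared_last_ino - LAST_INO_BATCH);
	}

	/* Assumed tail (not in the quoted hunk): advance, never return 0. */
	res = sim_truncate(s, res + 1);
	if (res == 0)
		res = 1;

	s->last_ino = res;
	return res;
}

/*
 * Start one batch below 2^32, call the allocator `calls` times and
 * return the last inode number handed out (0 if calls is 0).
 */
uint64_t sim_run_near_wrap(unsigned int ino_bits, unsigned int calls)
{
	uint64_t start = (UINT64_C(1) << 32) - LAST_INO_BATCH;
	struct ino_sim s = { start, start, ino_bits };
	uint64_t last = 0;

	for (unsigned int i = 0; i < calls; i++)
		last = sim_get_next_ino(&s);
	return last;
}
```

In this model, sim_run_near_wrap(32, 1024) ends on inode number 1, which was already handed out at the start of the counter's life, while sim_run_near_wrap(64, 1024) continues past 2^32 without repeating.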

