fs: inode: Reduce volatile inode wraparound risk when ino_t is 64 bit

Message ID 20191220024936.GA380394@chrisdown.name
State Superseded

Commit Message

Chris Down Dec. 20, 2019, 2:49 a.m. UTC
In Facebook production we are seeing heavy inode number wraparounds on
tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
with different content and the same inode number, with some servers even
having as many as 150 duplicated inode numbers with differing file
content.

This causes actual, tangible problems in production. For example, we
have complaints from those working on remote caches that their
application is reporting cache corruptions because it uses (device,
inodenum) to establish the identity of a particular cache object, but
because it's not unique any more, the application refuses to continue
and reports cache corruption. Even worse, sometimes applications may not
even detect the corruption but may continue anyway, causing phantom and
hard to debug behaviour.

In general, userspace applications expect that (device, inodenum) should
be enough to uniquely point to one inode, which seems fair enough.
This patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes
to reduce the likelihood of wraparound. On architectures with 32-bit
ino_t the problem is, at least, not made any worse than it is right now.
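The batching scheme that get_next_ino() uses can be sketched in plain userspace C. This is an illustrative, single-threaded model only: the kernel uses per-CPU variables and atomic64_t, and the names below are not the kernel's.

```c
#include <stdint.h>

#define LAST_INO_BATCH 1024
#define NCPUS 4

static uint64_t shared_last_ino;     /* atomic64_t in the kernel */
static uint64_t last_ino[NCPUS];     /* DEFINE_PER_CPU(ino_t, last_ino) */

static uint64_t next_ino_64(int cpu)
{
	uint64_t res = last_ino[cpu];

	/* At a batch boundary, grab LAST_INO_BATCH numbers from the
	 * shared counter (atomic64_add_return() in the kernel). */
	if ((res & (LAST_INO_BATCH - 1)) == 0) {
		shared_last_ino += LAST_INO_BATCH;
		res = shared_last_ino - LAST_INO_BATCH;
	}

	last_ino[cpu] = ++res;
	return res;
}

/* With a 32-bit ino_t the same stream truncates, so numbers 2^32 apart
 * collide -- the duplicate-inode symptom described above. */
static uint32_t truncate_ino(uint64_t ino64)
{
	return (uint32_t)ino64;
}
```

The truncation helper shows the failure mode directly: once the 64-bit stream passes 2^32, a 32-bit ino_t maps two live inodes to the same number.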

I noted the concern in the comment above about 32-bit applications on a
64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
in the commit message for 866b04fc, but these applications are going to
get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
since those will likely be 64-bit. Concerns about that seem slimmer
compared to the disadvantages this presents for known, real users of
this functionality on platforms with a 64-bit ino_t.

Other approaches I've considered:

- Use an IDA. If this is a problem for users with 32-bit ino_t as well,
  this seems a feasible approach. For now this change is non-intrusive
  enough, though, and doesn't make the situation any worse for them than
  at present, at least.
- Look for other approaches in userspace. I think this is less
  feasible -- users do need to have a way to reliably determine inode
  identity, and the risk of wraparound with a 2^32-sized counter is
  pretty high, quite clearly manifesting in production for workloads
  which make heavy use of tmpfs.
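For illustration, the IDA option above amounts to recycling the lowest free number on delete. A minimal fixed-size bitmap sketch follows; the kernel's real IDA is a scalable tree-backed allocator, and these names are illustrative only.

```c
#define MAX_INO 4096

static unsigned char ino_used[MAX_INO / 8];   /* 1 bit per inode number */

/* Hand out the lowest free inode number, like ida_alloc() would. */
static long ino_alloc(void)
{
	for (long i = 1; i < MAX_INO; i++) {  /* 0 is reserved: "no inode" */
		if (!(ino_used[i / 8] & (1u << (i % 8)))) {
			ino_used[i / 8] |= 1u << (i % 8);
			return i;
		}
	}
	return -1;                            /* space exhausted */
}

static void ino_free(long i)
{
	ino_used[i / 8] &= ~(1u << (i % 8));
}

/* Numbers are recycled rather than growing forever: deleting an inode
 * makes its number immediately reusable. */
static long alloc_free_alloc(void)
{
	long ino = ino_alloc();

	ino_free(ino);              /* "unlink": number becomes free again */
	return ino_alloc();         /* "create": same number is handed back */
}
```

Because numbers are reused on delete, the counter can never wrap while fewer than MAX_INO inodes are live, at the cost of allocator work on every create.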

Signed-off-by: Chris Down <chris@chrisdown.name>
Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kernel-team@fb.com
---
 fs/inode.c         | 29 ++++++++++++++++++-----------
 include/linux/fs.h |  2 +-
 2 files changed, 19 insertions(+), 12 deletions(-)

Comments

Zheng Bin Dec. 20, 2019, 3:05 a.m. UTC | #1
On 2019/12/20 10:49, Chris Down wrote:
> In Facebook production we are seeing heavy inode number wraparounds on
> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> with different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but
> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to be uniquely point to one inode, which seems fair enough.
> This patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes
> to reduce the likelihood of wraparound. On architectures with 32-bit
> ino_t the problem is, at least, not made any worse than it is right now.
>
> I noted the concern in the comment above about 32-bit applications on a
> 64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
> in the commit message for 866b04fc, but these applications are going to
> get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
> since those will likely be 64-bit. Concerns about that seem slimmer
> compared to the disadvantages this presents for known, real users of
> this functionality on platforms with a 64-bit ino_t.
>
> Other approaches I've considered:
>
> - Use an IDA. If this is a problem for users with 32-bit ino_t as well,
>   this seems a feasible approach. For now this change is non-intrusive
>   enough, though, and doesn't make the situation any worse for them than
>   present at least.
> - Look for other approaches in userspace. I think this is less
>   feasible -- users do need to have a way to reliably determine inode
>   identity, and the risk of wraparound with a 2^32-sized counter is
>   pretty high, quite clearly manifesting in production for workloads
>   which make heavy use of tmpfs.

I have sent an IDA approach before; see details at

https://patchwork.kernel.org/patch/11254001/

>
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  fs/inode.c         | 29 ++++++++++++++++++-----------
>  include/linux/fs.h |  2 +-
>  2 files changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index aff2b5831168..8193c17e2d16 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -870,26 +870,33 @@ static struct inode *find_inode_fast(struct super_block *sb,
>   * This does not significantly increase overflow rate because every CPU can
>   * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
>   * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> - * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> - * overflow rate by 2x, which does not seem too significant.
> + * 2^32 range (for 32-bit ino_t), and is a worst-case. Even a 50% wastage would
> + * only increase overflow rate by 2x, which does not seem too significant. With
> + * a 64-bit ino_t, overflow in general is fairly hard to achieve.
>   *
> - * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> - * error if st_ino won't fit in target struct field. Use 32bit counter
> - * here to attempt to avoid that.
> + * Care should be taken not to overflow when at all possible, since generally
> + * userspace depends on (device, inodenum) being reliably unique.
>   */
>  #define LAST_INO_BATCH 1024
> -static DEFINE_PER_CPU(unsigned int, last_ino);
> +static DEFINE_PER_CPU(ino_t, last_ino);
>  
> -unsigned int get_next_ino(void)
> +ino_t get_next_ino(void)
>  {
> -	unsigned int *p = &get_cpu_var(last_ino);
> -	unsigned int res = *p;
> +	ino_t *p = &get_cpu_var(last_ino);
> +	ino_t res = *p;
>  
>  #ifdef CONFIG_SMP
>  	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
> -		static atomic_t shared_last_ino;
> -		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +		static atomic64_t shared_last_ino;
> +		u64 next = atomic64_add_return(LAST_INO_BATCH,
> +					       &shared_last_ino);
>  
> +		/*
> +		 * This might get truncated if ino_t is 32-bit, and so be more
> +		 * susceptible to wrap around than on environments where ino_t
> +		 * is 64-bit, but that's really no worse than always encoding
> +		 * `res` as unsigned int.
> +		 */
>  		res = next - LAST_INO_BATCH;
>  	}

This approach is the same as https://patchwork.kernel.org/patch/11023915/

>  #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 190c45039359..ca1a04334c9e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3052,7 +3052,7 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
>  #endif
>  extern void unlock_new_inode(struct inode *);
>  extern void discard_new_inode(struct inode *);
> -extern unsigned int get_next_ino(void);
> +extern ino_t get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>  
>  extern void __iget(struct inode * inode);
Amir Goldstein Dec. 20, 2019, 8:32 a.m. UTC | #2
On Fri, Dec 20, 2019 at 4:50 AM Chris Down <chris@chrisdown.name> wrote:
>
> In Facebook production we are seeing heavy inode number wraparounds on
> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> with different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
>
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but
> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
>
> In general, userspace applications expect that (device, inodenum) should
> be enough to be uniquely point to one inode, which seems fair enough.
> This patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes
> to reduce the likelihood of wraparound. On architectures with 32-bit
> ino_t the problem is, at least, not made any worse than it is right now.
>
> I noted the concern in the comment above about 32-bit applications on a
> 64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
> in the commit message for 866b04fc, but these applications are going to
> get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
> since those will likely be 64-bit. Concerns about that seem slimmer
> compared to the disadvantages this presents for known, real users of
> this functionality on platforms with a 64-bit ino_t.
>
> Other approaches I've considered:
>
> - Use an IDA. If this is a problem for users with 32-bit ino_t as well,
>   this seems a feasible approach. For now this change is non-intrusive
>   enough, though, and doesn't make the situation any worse for them than
>   present at least.
> - Look for other approaches in userspace. I think this is less
>   feasible -- users do need to have a way to reliably determine inode
>   identity, and the risk of wraparound with a 2^32-sized counter is
>   pretty high, quite clearly manifesting in production for workloads
>   which make heavy use of tmpfs.

How about something like this:

/* just to explain - use an existing macro */
shmem_ino_shift = ilog2(sizeof(void *));
inode->i_ino = (__u64)inode >> shmem_ino_shift;

This should solve the reported problem with little complexity,
but it exposes internal kernel address to userspace.
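In userspace terms, the pointer-derived scheme quoted above looks roughly like this (hypothetical names; the shift assumes ilog2(sizeof(void *)) == 3 on a 64-bit build, as in the snippet above):

```c
#include <stdint.h>
#include <stdlib.h>

/* Derive an "inode number" from the object's address, dropping the
 * alignment bits.  Two live objects can never share an address, so the
 * numbers are unique while both are alive -- but a recycled allocation
 * gets the old number back, and the value leaks an address to userspace. */
static uint64_t ptr_ino(const void *obj)
{
	return (uint64_t)(uintptr_t)obj >> 3;
}

/* Two simultaneously live heap objects always map to distinct numbers. */
static int ptr_inos_distinct(void)
{
	void *a = malloc(32), *b = malloc(32);
	int distinct = a && b && ptr_ino(a) != ptr_ino(b);

	free(a);
	free(b);
	return distinct;
}
```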

Can we do anything to mitigate this risk?

For example, instead of trying to maintain a unique map of
ino_t to struct shmem_inode_info * in the system
it would be enough (and less expensive) to maintain a unique map of
shmem_ino_range_t to slab.
The ino_range id can then be mixed with the relative object index in
slab to compose i_ino.

The big win here is not merely allocating an id per bunch of inodes
instead of per inode, but the fact that recycled (i.e. delete/create)
shmem_inode_info objects get the same i_ino without having to
allocate any id.

This mimics a standard behavior of blockdev filesystem like ext4/xfs
where inode number is determined by logical offset on disk and is
quite often recycled on delete/create.

I realize that the method I described with slab is crossing module layers
and would probably be NACKed.
Similar result could be achieved by shmem keeping a small stash of
recycled inode objects, which are not returned to slab right away and
retain their allocated i_ino. This at least should significantly reduce the
rate of burning get_next_ino allocation.

Anyway, to add another consideration to the mix, overlayfs uses
the high ino bits to multiplex several layers into a single ino domain
(mount option xino=on).

tmpfs is a very commonly used filesystem as overlayfs upper layer,
so many users are going to benefit from keeping the uppermost bits
of tmpfs inode numbers unused.
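The xino multiplexing mentioned above boils down to bit-packing along these lines. The 8-bit layer field is an illustrative split, not overlayfs's actual layout, which depends on the number of lower layers:

```c
#include <stdint.h>

#define XINO_LAYER_BITS 8   /* illustrative; overlayfs sizes this per-mount */

/* Compose a single ino domain: layer (fsid) index in the high bits,
 * the underlying filesystem's inode number in the low bits.  This only
 * works if the underlying fs leaves the high bits unused. */
static uint64_t xino_pack(uint32_t layer, uint64_t real_ino)
{
	return ((uint64_t)layer << (64 - XINO_LAYER_BITS)) | real_ino;
}

static uint32_t xino_layer(uint64_t xino)
{
	return (uint32_t)(xino >> (64 - XINO_LAYER_BITS));
}
```

A "grow forever" 64-bit counter would eventually spill into those high bits and collide with the layer field, which is the objection being made here.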

For this reason, I dislike the current "grow forever" approach of
get_next_ino() and prefer that we use a smarter scheme when
switching over to 64bit values.

Thanks,
Amir.
Chris Down Dec. 20, 2019, 12:16 p.m. UTC | #3
Hi Amir,

Thanks for getting back, I appreciate it.

Amir Goldstein writes:
>How about something like this:
>
>/* just to explain - use an existing macro */
>shmem_ino_shift = ilog2(sizeof(void *));
>inode->i_ino = (__u64)inode >> shmem_ino_shift;
>
>This should solve the reported problem with little complexity,
>but it exposes internal kernel address to userspace.

One problem I can see with that approach is that get_next_ino doesn't 
discriminate based on the context (for example, when it is called for a 
particular tmpfs mount) which means that eventually wraparound risk is still 
pushed to the limit on such machines for other users of get_next_ino (like 
named pipes, sockets, procfs, etc). Granted, collisions between them are less
likely given the number of inodes these users have live at any one time
compared to some tmpfs workloads, but still.

>Can we do anything to mitigate this risk?
>
>For example, instead of trying to maintain a unique map of
>ino_t to struct shmem_inode_info * in the system
>it would be enough (and less expensive) to maintain a unique map of
>shmem_ino_range_t to slab.
>The ino_range id can then be mixes with the relative object index in
>slab to compose i_ino.
>
>The big win here is not having to allocate an id every bunch of inodes
>instead of every inode, but the fact that recycled (i.e. delete/create)
>shmem_inode_info objects get the same i_ino without having to
>allocate any id.
>
>This mimics a standard behavior of blockdev filesystem like ext4/xfs
>where inode number is determined by logical offset on disk and is
>quite often recycled on delete/create.
>
>I realize that the method I described with slab it crossing module layers
>and would probably be NACKED.

Yeah, that's more or less my concern with that approach as well, hence why I 
went for something that seemed less intrusive and keeps with the current inode 
allocation strategy :-)

>Similar result could be achieved by shmem keeping a small stash of
>recycled inode objects, which are not returned to slab right away and
>retain their allocated i_ino. This at least should significantly reduce the
>rate of burning get_next_ino allocation.

While this issue happens to present itself currently on tmpfs, I'm worried that 
future users who adopt get_next_ino based on historic precedent might end up hitting 
this as well. That's the main reason why I'm inclined to try and improve 
get_next_ino's strategy itself.

>Anyway, to add another consideration to the mix, overlayfs uses
>the high ino bits to multiplex several layers into a single ino domain
>(mount option xino=on).
>
>tmpfs is a very commonly used filesystem as overlayfs upper layer,
>so many users are going to benefit from keeping the higher most bits
>of tmpfs ino inodes unused.
>
>For this reason, I dislike the current "grow forever" approach of
>get_next_ino() and prefer that we use a smarter scheme when
>switching over to 64bit values.

By "a smarter scheme when switching over to 64bit values", you mean keeping 
i_ino as low magnitude as possible while still avoiding simultaneous reuse, 
right?

To that extent, if we can reliably and expediently recycle inode numbers, I'm 
not against sticking to the existing typing scheme in get_next_ino. It's just a 
matter of agreeing by what method and at what level of the stack that should 
take place :-)

I'd appreciate your thoughts on approaches forward. One potential option is to 
reimplement get_next_ino using an IDA, as mentioned in my patch message. Other 
than the potential to upset microbenchmarks, do you have concerns with that as 
a patch?

Thanks,

Chris
Amir Goldstein Dec. 20, 2019, 1:41 p.m. UTC | #4
On Fri, Dec 20, 2019 at 2:16 PM Chris Down <chris@chrisdown.name> wrote:
>
> Hi Amir,
>
> Thanks for getting back, I appreciate it.
>
> Amir Goldstein writes:
> >How about something like this:
> >
> >/* just to explain - use an existing macro */
> >shmem_ino_shift = ilog2(sizeof(void *));
> >inode->i_ino = (__u64)inode >> shmem_ino_shift;
> >
> >This should solve the reported problem with little complexity,
> >but it exposes internal kernel address to userspace.
>
> One problem I can see with that approach is that get_next_ino doesn't
> discriminate based on the context (for example, when it is called for a
> particular tmpfs mount) which means that eventually wraparound risk is still
> pushed to the limit on such machines for other users of get_next_ino (like
> named pipes, sockets, procfs, etc). Granted then the space for collisions
> between them is less likely due to their general magnitude of inodes at one
> time compared to some tmpfs workloads, but still.

If you ask me, trying to solve all the problems that may or may not exist
is not the best way to solve "your" problem. I am not saying you shouldn't
look around to see if you can improve something for more cases, but every
case is different, so not sure there is one solution that fits all.

If you came forward with a suggestion to improve get_next_ino() because
it solves a microbenchmark, I suspect that you wouldn't have gotten far.
Instead, you came forward with a report of a real life problem:
"In Facebook production we are seeing heavy inode number wraparounds
on tmpfs..." and Hugh confessed that Google are facing the same problem
and carry a private patch (per-sb ino counter).

There is no doubt that tmpfs is growing to bigger scales than it used to
be accustomed to in the past. I do doubt that other filesystems that use
get_next_ino() like pseudo filesystems really have a wraparound problem.
If there is such a real world problem, let someone come forward with the
report for the use case.

IMO, tmpfs should be taken out of the list of get_next_ino() (ab)users and
then the rest of the cases should be fine. When I say "tmpfs" I mean
every filesystem with a use pattern similar to tmpfs: one that uses an inode
cache pool and has the potential to recycle a very large number of inodes.

>
> >Can we do anything to mitigate this risk?
> >
> >For example, instead of trying to maintain a unique map of
> >ino_t to struct shmem_inode_info * in the system
> >it would be enough (and less expensive) to maintain a unique map of
> >shmem_ino_range_t to slab.
> >The ino_range id can then be mixes with the relative object index in
> >slab to compose i_ino.
> >
> >The big win here is not having to allocate an id every bunch of inodes
> >instead of every inode, but the fact that recycled (i.e. delete/create)
> >shmem_inode_info objects get the same i_ino without having to
> >allocate any id.
> >
> >This mimics a standard behavior of blockdev filesystem like ext4/xfs
> >where inode number is determined by logical offset on disk and is
> >quite often recycled on delete/create.
> >
> >I realize that the method I described with slab it crossing module layers
> >and would probably be NACKED.
>
> Yeah, that's more or less my concern with that approach as well, hence why I
> went for something that seemed less intrusive and keeps with the current inode
> allocation strategy :-)
>
> >Similar result could be achieved by shmem keeping a small stash of
> >recycled inode objects, which are not returned to slab right away and
> >retain their allocated i_ino. This at least should significantly reduce the
> >rate of burning get_next_ino allocation.
>
> While this issue happens to present itself currently on tmpfs, I'm worried that
> future users of get_next_ino based on historic precedent might end up hitting
> this as well. That's the main reason why I'm inclined to try and improve
> get_next_ino's strategy itself.
>

I am not going to stop you from trying to improve get_next_ino()
I just think there is a MUCH simpler solution to your problem (see below).

> >Anyway, to add another consideration to the mix, overlayfs uses
> >the high ino bits to multiplex several layers into a single ino domain
> >(mount option xino=on).
> >
> >tmpfs is a very commonly used filesystem as overlayfs upper layer,
> >so many users are going to benefit from keeping the higher most bits
> >of tmpfs ino inodes unused.
> >
> >For this reason, I dislike the current "grow forever" approach of
> >get_next_ino() and prefer that we use a smarter scheme when
> >switching over to 64bit values.
>
> By "a smarter scheme when switching over to 64bit values", you mean keeping
> i_ino as low magnitude as possible while still avoiding simultaneous reuse,
> right?

Yes.

>
> To that extent, if we can reliably and expediently recycle inode numbers, I'm
> not against sticking to the existing typing scheme in get_next_ino. It's just a
> matter of agreeing by what method and at what level of the stack that should
> take place :-)
>

Suggestion:
1. Extend the kmem_cache API to let the ctor() know if it is
initializing an object
    for the first time (new page) or recycling an object.
2. Let shmem_init_inode retain the value of i_ino of recycled shmem_inode_info
    objects
3. i_ino is initialized with get_next_ino() only in case it is zero

Alternatively to 1., if simpler to implement and acceptable by slab developers:
1.b. remove the assertion from cache_grow_begin()/new_slab_objects():
       WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
       and pass __GFP_ZERO in shmem_alloc_inode()

You see, when you look at the big picture, all the smarts of an id allocator
that you could possibly need are already there in the slab allocator for shmem
inode objects. You just need a way to access that "id" information for recycled
objects without having to write any performance-sensitive code.
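The retain-on-recycle idea in steps 2 and 3 above can be sketched like this (hypothetical names, not the shmem code; the counter stands in for get_next_ino()):

```c
#include <stdint.h>

static uint64_t next_ino_counter;        /* stands in for get_next_ino() */

struct fake_inode {
	uint64_t i_ino;                  /* zeroed only on first construction */
};

static void fake_inode_init(struct fake_inode *inode)
{
	if (inode->i_ino == 0)           /* fresh object: draw a new number */
		inode->i_ino = ++next_ino_counter;
	/* recycled object: keep its old, still-unique i_ino */
}

/* Re-initializing a recycled object keeps its number and burns no new
 * one from the shared counter. */
static int reinit_keeps_ino(void)
{
	struct fake_inode in = { 0 };
	uint64_t first;

	fake_inode_init(&in);            /* "create" */
	first = in.i_ino;
	fake_inode_init(&in);            /* "delete, then recreate from cache" */
	return in.i_ino == first && next_ino_counter == first;
}
```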

> I'd appreciate your thoughts on approaches forward. One potential option is to
> reimplement get_next_ino using an IDA, as mentioned in my patch message. Other
> than the potential to upset microbenchmarks, do you have concerns with that as
> a patch?
>

Only that it will be subject to performance regression reports from hardware
and workloads that you do not have access to - it's going to be hard for you
to prove that you did not hurt any workload, so it's not an easy way forward.

Thanks,
Amir.
Matthew Wilcox (Oracle) Dec. 20, 2019, 4:46 p.m. UTC | #5
On Fri, Dec 20, 2019 at 03:41:11PM +0200, Amir Goldstein wrote:
> Suggestion:
> 1. Extend the kmem_cache API to let the ctor() know if it is
> initializing an object
>     for the first time (new page) or recycling an object.

Uh, what?  The ctor is _only_ called when new pages are allocated.
Part of the contract with the slab user is that objects are returned to
the slab in an initialised state.

> 2. Let shmem_init_inode retain the value of i_ino of recycled shmem_inode_info
>     objects
> 3. i_ino is initialized with get_next_ino() only in case it it zero
> 
> Alternatively to 1., if simpler to implement and acceptable by slab developers:
> 1.b. remove the assertion from cache_grow_begin()/new_slab_objects():
>        WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
>        and pass __GFP_ZERO in shmem_alloc_inode()

WTF would that _mean_?  I want this object to contain zeroes and whatever
the constructor sets it to.  Which one wins?
Amir Goldstein Dec. 20, 2019, 5:35 p.m. UTC | #6
On Fri, Dec 20, 2019 at 6:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Dec 20, 2019 at 03:41:11PM +0200, Amir Goldstein wrote:
> > Suggestion:
> > 1. Extend the kmem_cache API to let the ctor() know if it is
> > initializing an object
> >     for the first time (new page) or recycling an object.
>
> Uh, what?  The ctor is _only_ called when new pages are allocated.
> Part of the contract with the slab user is that objects are returned to
> the slab in an initialised state.

Right. I mixed up the ctor() with alloc_inode().
So is there anything stopping us from reusing an existing non-zero
value of i_ino in shmem_get_inode() for recycling shmem ino
numbers?

Thanks,
Amir.
Matthew Wilcox (Oracle) Dec. 20, 2019, 7:50 p.m. UTC | #7
On Fri, Dec 20, 2019 at 07:35:38PM +0200, Amir Goldstein wrote:
> On Fri, Dec 20, 2019 at 6:46 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Fri, Dec 20, 2019 at 03:41:11PM +0200, Amir Goldstein wrote:
> > > Suggestion:
> > > 1. Extend the kmem_cache API to let the ctor() know if it is
> > > initializing an object
> > >     for the first time (new page) or recycling an object.
> >
> > Uh, what?  The ctor is _only_ called when new pages are allocated.
> > Part of the contract with the slab user is that objects are returned to
> > the slab in an initialised state.
> 
> Right. I mixed up the ctor() with alloc_inode().
> So is there anything stopping us from reusing an existing non-zero
> value of  i_ino in shmem_get_inode()? for recycling shmem ino
> numbers?

I think that would be an excellent solution to the problem!  At least,
I can't think of any problems with it.
Darrick J. Wong Dec. 20, 2019, 9:30 p.m. UTC | #8
On Fri, Dec 20, 2019 at 02:49:36AM +0000, Chris Down wrote:
> In Facebook production we are seeing heavy inode number wraparounds on
> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> with different content and the same inode number, with some servers even
> having as many as 150 duplicated inode numbers with differing file
> content.
> 
> This causes actual, tangible problems in production. For example, we
> have complaints from those working on remote caches that their
> application is reporting cache corruptions because it uses (device,
> inodenum) to establish the identity of a particular cache object, but

...but you cannot delete the (dev, inum) tuple from the cache index when
you remove a cache object??

> because it's not unique any more, the application refuses to continue
> and reports cache corruption. Even worse, sometimes applications may not
> even detect the corruption but may continue anyway, causing phantom and
> hard to debug behaviour.
> 
> In general, userspace applications expect that (device, inodenum) should
> be enough to be uniquely point to one inode, which seems fair enough.

Except that it's not.  (dev, inum, generation) uniquely points to an
instance of an inode from creation to the last unlink.
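On filesystems that support it, the generation component of that tuple can be read from userspace with the Linux FS_IOC_GETVERSION ioctl. A sketch (Linux-specific; tmpfs and most pseudo filesystems return ENOTTY for it, which is part of why applications fall back to (dev, inum) alone):

```c
#include <sys/ioctl.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/fs.h>

/* Return the inode's i_generation, or -1 if the path can't be opened
 * or the filesystem doesn't implement FS_IOC_GETVERSION. */
static long inode_generation(const char *path)
{
	int fd = open(path, O_RDONLY);
	int gen;

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_GETVERSION, &gen) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return gen;
}
```

Combined with st_dev and st_ino from stat(2), this gives the full (dev, inum, generation) identity for one instance of an inode.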

--D

> This patch changes get_next_ino to use up to min(sizeof(ino_t), 8) bytes
> to reduce the likelihood of wraparound. On architectures with 32-bit
> ino_t the problem is, at least, not made any worse than it is right now.
> 
> I noted the concern in the comment above about 32-bit applications on a
> 64-bit kernel with 32-bit wide ino_t in userspace, as documented by Jeff
> in the commit message for 866b04fc, but these applications are going to
> get EOVERFLOW on filesystems with non-volatile inode numbers anyway,
> since those will likely be 64-bit. Concerns about that seem slimmer
> compared to the disadvantages this presents for known, real users of
> this functionality on platforms with a 64-bit ino_t.
> 
> Other approaches I've considered:
> 
> - Use an IDA. If this is a problem for users with 32-bit ino_t as well,
>   this seems a feasible approach. For now this change is non-intrusive
>   enough, though, and doesn't make the situation any worse for them than
>   present at least.
> - Look for other approaches in userspace. I think this is less
>   feasible -- users do need to have a way to reliably determine inode
>   identity, and the risk of wraparound with a 2^32-sized counter is
>   pretty high, quite clearly manifesting in production for workloads
>   which make heavy use of tmpfs.
> 
> Signed-off-by: Chris Down <chris@chrisdown.name>
> Reported-by: Phyllipe Medeiros <phyllipe@fb.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: kernel-team@fb.com
> ---
>  fs/inode.c         | 29 ++++++++++++++++++-----------
>  include/linux/fs.h |  2 +-
>  2 files changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index aff2b5831168..8193c17e2d16 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -870,26 +870,33 @@ static struct inode *find_inode_fast(struct super_block *sb,
>   * This does not significantly increase overflow rate because every CPU can
>   * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
>   * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
> - * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
> - * overflow rate by 2x, which does not seem too significant.
> + * 2^32 range (for 32-bit ino_t), and is a worst-case. Even a 50% wastage would
> + * only increase overflow rate by 2x, which does not seem too significant. With
> + * a 64-bit ino_t, overflow in general is fairly hard to achieve.
>   *
> - * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
> - * error if st_ino won't fit in target struct field. Use 32bit counter
> - * here to attempt to avoid that.
> + * Care should be taken not to overflow when at all possible, since generally
> + * userspace depends on (device, inodenum) being reliably unique.
>   */
>  #define LAST_INO_BATCH 1024
> -static DEFINE_PER_CPU(unsigned int, last_ino);
> +static DEFINE_PER_CPU(ino_t, last_ino);
>  
> -unsigned int get_next_ino(void)
> +ino_t get_next_ino(void)
>  {
> -	unsigned int *p = &get_cpu_var(last_ino);
> -	unsigned int res = *p;
> +	ino_t *p = &get_cpu_var(last_ino);
> +	ino_t res = *p;
>  
>  #ifdef CONFIG_SMP
>  	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
> -		static atomic_t shared_last_ino;
> -		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
> +		static atomic64_t shared_last_ino;
> +		u64 next = atomic64_add_return(LAST_INO_BATCH,
> +					       &shared_last_ino);
>  
> +		/*
> +		 * This might get truncated if ino_t is 32-bit, and so be more
> +		 * susceptible to wrap around than on environments where ino_t
> +		 * is 64-bit, but that's really no worse than always encoding
> +		 * `res` as unsigned int.
> +		 */
>  		res = next - LAST_INO_BATCH;
>  	}
>  #endif
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 190c45039359..ca1a04334c9e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3052,7 +3052,7 @@ static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
>  #endif
>  extern void unlock_new_inode(struct inode *);
>  extern void discard_new_inode(struct inode *);
> -extern unsigned int get_next_ino(void);
> +extern ino_t get_next_ino(void);
>  extern void evict_inodes(struct super_block *sb);
>  
>  extern void __iget(struct inode * inode);
> -- 
> 2.24.1
>
Amir Goldstein Dec. 21, 2019, 8:43 a.m. UTC | #9
On Fri, Dec 20, 2019 at 11:33 PM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Fri, Dec 20, 2019 at 02:49:36AM +0000, Chris Down wrote:
> > In Facebook production we are seeing heavy inode number wraparounds on
> > tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> > with different content and the same inode number, with some servers even
> > having as many as 150 duplicated inode numbers with differing file
> > content.
> >
> > This causes actual, tangible problems in production. For example, we
> > have complaints from those working on remote caches that their
> > application is reporting cache corruptions because it uses (device,
> > inodenum) to establish the identity of a particular cache object, but
>
> ...but you cannot delete the (dev, inum) tuple from the cache index when
> you remove a cache object??
>
> > because it's not unique any more, the application refuses to continue
> > and reports cache corruption. Even worse, sometimes applications may not
> > even detect the corruption but may continue anyway, causing phantom and
> > hard to debug behaviour.
> >
> > In general, userspace applications expect that (device, inodenum) should
> > be enough to be uniquely point to one inode, which seems fair enough.
>
> Except that it's not.  (dev, inum, generation) uniquely points to an
> instance of an inode from creation to the last unlink.
>

Yes, but also:
There should not exist two live inodes on the system with the same
(dev, inum). The problem is that ino 1 may still be alive when
wraparound happens, and then two different inodes with ino 1 exist on
the same dev.

Take the 'diff' utility, for example: it will report that those files
are identical if they have the same dev, ino, size and mtime. I suspect
that 'mv' will not let you move one over the other, assuming they are
hardlinks. generation is not even exposed to legacy applications using
stat(2).

Thanks,
Amir.
Chris Down Dec. 21, 2019, 10:16 a.m. UTC | #10
Darrick J. Wong writes:
>On Fri, Dec 20, 2019 at 02:49:36AM +0000, Chris Down wrote:
>> In Facebook production we are seeing heavy inode number wraparounds on
>> tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
>> with different content and the same inode number, with some servers even
>> having as many as 150 duplicated inode numbers with differing file
>> content.
>>
>> This causes actual, tangible problems in production. For example, we
>> have complaints from those working on remote caches that their
>> application is reporting cache corruptions because it uses (device,
>> inodenum) to establish the identity of a particular cache object, but
>
>...but you cannot delete the (dev, inum) tuple from the cache index when
>you remove a cache object??

There are some cache objects which may be long-lived. In these kinds of cases, 
the cache objects aren't removed until they're conclusively not needed.

Since tmpfs shares the i_ino counter with every other user of get_next_ino, 
it's then entirely possible that we thrash through 2^32 inodes within the 
lifetime of a single cache file.

>> because it's not unique any more, the application refuses to continue
>> and reports cache corruption. Even worse, sometimes applications may not
>> even detect the corruption but may continue anyway, causing phantom and
>> hard to debug behaviour.
>>
>> In general, userspace applications expect that (device, inodenum) should
>> be enough to be uniquely point to one inode, which seems fair enough.
>
>Except that it's not.  (dev, inum, generation) uniquely points to an
>instance of an inode from creation to the last unlink.

I didn't mention generation because, even though it's set on tmpfs (to 
prandom_u32()), it's not possible to evaluate it from userspace since `ioctl` 
returns ENOTTY. We can't ask userspace applications to introspect on an inode 
attribute that they can't even access :-)
Darrick J. Wong Dec. 21, 2019, 6:05 p.m. UTC | #11
On Sat, Dec 21, 2019 at 10:43:05AM +0200, Amir Goldstein wrote:
> On Fri, Dec 20, 2019 at 11:33 PM Darrick J. Wong
> <darrick.wong@oracle.com> wrote:
> >
> > On Fri, Dec 20, 2019 at 02:49:36AM +0000, Chris Down wrote:
> > > In Facebook production we are seeing heavy inode number wraparounds on
> > > tmpfs. On affected tiers, in excess of 10% of hosts show multiple files
> > > with different content and the same inode number, with some servers even
> > > having as many as 150 duplicated inode numbers with differing file
> > > content.
> > >
> > > This causes actual, tangible problems in production. For example, we
> > > have complaints from those working on remote caches that their
> > > application is reporting cache corruptions because it uses (device,
> > > inodenum) to establish the identity of a particular cache object, but
> >
> > ...but you cannot delete the (dev, inum) tuple from the cache index when
> > you remove a cache object??
> >
> > > because it's not unique any more, the application refuses to continue
> > > and reports cache corruption. Even worse, sometimes applications may not
> > > even detect the corruption but may continue anyway, causing phantom and
> > > hard to debug behaviour.
> > >
> > > In general, userspace applications expect that (device, inodenum) should
> > > be enough to be uniquely point to one inode, which seems fair enough.
> >
> > Except that it's not.  (dev, inum, generation) uniquely points to an
> > instance of an inode from creation to the last unlink.
> >
> 
> Yes, but also:
> There should not exist two live inodes on the system with the same (dev, inum)
> The problem is that ino 1 may still be alive when wraparound happens
> and then two different inodes with ino 1 exist on same dev.

*OH* that's different then.  Most sane filesystems <cough>btrfs<cough>
should never have the same inode numbers for different files.  Sorry for
the noise, I misunderstood what the issue was. :)

> Take the 'diff' utility for example, it will report that those files
> are identical
> if they have the same dev,ino,size,mtime. I suspect that 'mv' will not
> let you move one over the other, assuming they are hardlinks.
> generation is not even exposed to legacy application using stat(2).

Yeah, I was surprised to see it's not even in statx. :/

--D

> Thanks,
> Amir.
Chris Down Dec. 23, 2019, 8:45 p.m. UTC | #12
Matthew Wilcox writes:
>On Fri, Dec 20, 2019 at 07:35:38PM +0200, Amir Goldstein wrote:
>> On Fri, Dec 20, 2019 at 6:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>> >
>> > On Fri, Dec 20, 2019 at 03:41:11PM +0200, Amir Goldstein wrote:
>> > > Suggestion:
>> > > 1. Extend the kmem_cache API to let the ctor() know if it is
>> > > initializing an object
>> > >     for the first time (new page) or recycling an object.
>> >
>> > Uh, what?  The ctor is _only_ called when new pages are allocated.
>> > Part of the contract with the slab user is that objects are returned to
>> > the slab in an initialised state.
>>
>> Right. I mixed up the ctor() with alloc_inode().
>> So is there anything stopping us from reusing an existing non-zero
>> value of  i_ino in shmem_get_inode()? for recycling shmem ino
>> numbers?
>
>I think that would be an excellent solution to the problem!  At least,
>I can't think of any problems with it.

Thanks for the suggestions and feedback, Amir and Matthew :-)

The slab i_ino recycling approach works somewhat, but is unfortunately neutered 
quite a lot by the fact that slab recycling is per-memcg. That is, replacing 
with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial 
callsites only leads to about 10% slab reuse, which doesn't really stem the 
bleeding of 32-bit inums on an affected workload:

     # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
         4454 recycle_or_get_next_ino: not recycled
          546 recycle_or_get_next_ino: recycled

Roman (who I've just added to cc) tells me that currently we only have 
per-memcg slab reuse instead of global when using CONFIG_MEMCG. This 
contributes fairly significantly here since there are multiple tasks across 
multiple cgroups which are contributing to the get_next_ino() thrash.

I think this is a good start, but we need something of a different magnitude in 
order to actually solve this problem with the current slab infrastructure. How 
about something like the following?

1. Add get_next_ino_full, which uses whatever the full width of ino_t is
2. Use get_next_ino_full in tmpfs (et al.)
3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can 
    pass if they want the 32-bit inode numbers back. This would still allow 
    people who want to make this tradeoff to use xino.
4. (If you like) Also add a CONFIG option to disable this at compile time.

I'd appreciate your thoughts on that approach or others you have ideas about. 
Thanks! :-)

0:

unsigned int recycle_or_get_next_ino(ino_t old_ino)
{
	/*
	 * get_next_ino returns unsigned int. If this fires then i_ino must be
	 * >32 bits and have been changed later, so the caller shouldn't be
	 * recycling inode numbers
	 */
	WARN_ONCE((u64)old_ino >> (sizeof(unsigned int) * 8),
		  "Recyclable i_ino uses more bits than unsigned int: %llu",
		  (u64)old_ino);

	if (old_ino) {
		if (prandom_u32() % 100 == 0)
			trace_printk("recycled\n");
		return old_ino;
	} else {
		if (prandom_u32() % 100 == 0)
			trace_printk("not recycled\n");
		return get_next_ino();
	}
}
Amir Goldstein Dec. 24, 2019, 3:04 a.m. UTC | #13
> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
> callsites only leads to about 10% slab reuse, which doesn't really stem the
> bleeding of 32-bit inums on an affected workload:
>
>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>          4454 recycle_or_get_next_ino: not recycled
>           546 recycle_or_get_next_ino: recycled
>

Too bad..
Maybe recycled inos should be implemented all the same, because the approach
is simple and may improve workloads that are not so memcg-intensive.

> Roman (who I've just added to cc) tells me that currently we only have
> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
> contributes fairly significantly here since there are multiple tasks across
> multiple cgroups which are contributing to the get_next_ino() thrash.
>
> I think this is a good start, but we need something of a different magnitude in
> order to actually solve this problem with the current slab infrastructure. How
> about something like the following?
>
> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
> 2. Use get_next_ino_full in tmpfs (et al.)

I would prefer that filesystems making heavy use of get_next_ino, be converted
to use a private ino pool per sb:

ino_pool_create()
ino_pool_get_next()

flags to ino_pool_create() can determine the desired ino range.
Does the Facebook use case involve a single large tmpfs or many
small ones? I would guess the latter, and therefore we are trying to solve
a problem that nobody really needs to solve (i.e. an efficient global ino pool).

> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>     pass if they want the 32-bit inode numbers back. This would still allow
>     people who want to make this tradeoff to use xino.

inode32|inode64 (see man xfs(5)).

> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>

I don't know about disable, but the default mode for tmpfs (inode32|inode64)
might be best determined by a CONFIG option, so distro builders could decide
if they want to take the risk of breaking applications on tmpfs.

But if you implement per sb ino pool, maybe inode64 will no longer be
required for your use case?

Thanks,
Amir.
Chris Down Dec. 25, 2019, 12:54 p.m. UTC | #14
Amir Goldstein writes:
>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
>> callsites only leads to about 10% slab reuse, which doesn't really stem the
>> bleeding of 32-bit inums on an affected workload:
>>
>>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>>          4454 recycle_or_get_next_ino: not recycled
>>           546 recycle_or_get_next_ino: recycled
>>
>
>Too bad..
>Maybe recycled ino should be implemented all the same because it is simple
>and may improve workloads that are not so MEMCG intensive.

Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for 
this) since it's not a total solution for the problem, but still helps somewhat 
and we all seem to agree that it's overall an uncontroversial improvement.

>> Roman (who I've just added to cc) tells me that currently we only have
>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>> contributes fairly significantly here since there are multiple tasks across
>> multiple cgroups which are contributing to the get_next_ino() thrash.
>>
>> I think this is a good start, but we need something of a different magnitude in
>> order to actually solve this problem with the current slab infrastructure. How
>> about something like the following?
>>
>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>> 2. Use get_next_ino_full in tmpfs (et al.)
>
>I would prefer that filesystems making heavy use of get_next_ino, be converted
>to use a private ino pool per sb:
>
>ino_pool_create()
>ino_pool_get_next()
>
>flags to ino_pool_create() can determine the desired ino range.
>Does the Facebook use case involve a single large tmpfs or many
>small ones? I would guess the latter and therefore we are trying to solve
>a problem that nobody really needs to solve (i.e. global efficient ino pool).

Unfortunately in the case under discussion, it's all in one large tmpfs in 
/dev/shm. I can empathise with that -- application owners often prefer to use 
the mounts provided to them rather than having to set up their own. For this 
one case we can change that, but I think it seems reasonable to support this 
case since using a single tmpfs can be a reasonable decision as an application 
developer, especially if you only have unprivileged access to the system.

>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>>     pass if they want the 32-bit inode numbers back. This would still allow
>>     people who want to make this tradeoff to use xino.
>
>inode32|inode64 (see man xfs(5)).

Ah great, thanks! I'll reuse precedent from those.

>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>>
>
>I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
>might me best determined by CONFIG option, so distro builders could decide
>if they want to take the risk of breaking applications on tmpfs.

Sounds good.

>But if you implement per sb ino pool, maybe inode64 will no longer be
>required for your use case?

In this case I think per-sb ino pool will help a bit, but unfortunately not by 
an order of magnitude. As with the recycling patch this will help reduce thrash 
a bit but not conclusively prevent the problem from happening long-term. To fix 
that, I think we really do need the option to use ino_t-sized get_next_ino_full 
(or per-sb equivalent).

Happy holidays, and thanks for your feedback!

Chris
Zheng Bin Dec. 26, 2019, 1:40 a.m. UTC | #15
On 2019/12/25 20:54, Chris Down wrote:
> Amir Goldstein writes:
>>> The slab i_ino recycling approach works somewhat, but is unfortunately neutered
>>> quite a lot by the fact that slab recycling is per-memcg. That is, replacing
>>> with recycle_or_get_next_ino(old_ino)[0] for shmfs and a few other trivial
>>> callsites only leads to about 10% slab reuse, which doesn't really stem the
>>> bleeding of 32-bit inums on an affected workload:
>>>
>>>      # tail -5000 /sys/kernel/debug/tracing/trace | grep -o 'recycle_or_get_next_ino:.*' | sort | uniq -c
>>>          4454 recycle_or_get_next_ino: not recycled
>>>           546 recycle_or_get_next_ino: recycled
>>>
>>
>> Too bad..
>> Maybe recycled ino should be implemented all the same because it is simple
>> and may improve workloads that are not so MEMCG intensive.
>
> Yeah, I agree. I'll send the full patch over separately (ie. not as v2 for this) since it's not a total solution for the problem, but still helps somewhat and we all seem to agree that it's overall an uncontroversial improvement.

Please cc me as well, thanks.


>
>>> Roman (who I've just added to cc) tells me that currently we only have
>>> per-memcg slab reuse instead of global when using CONFIG_MEMCG. This
>>> contributes fairly significantly here since there are multiple tasks across
>>> multiple cgroups which are contributing to the get_next_ino() thrash.
>>>
>>> I think this is a good start, but we need something of a different magnitude in
>>> order to actually solve this problem with the current slab infrastructure. How
>>> about something like the following?
>>>
>>> 1. Add get_next_ino_full, which uses whatever the full width of ino_t is
>>> 2. Use get_next_ino_full in tmpfs (et al.)
>>
>> I would prefer that filesystems making heavy use of get_next_ino, be converted
>> to use a private ino pool per sb:
>>
>> ino_pool_create()
>> ino_pool_get_next()
>>
>> flags to ino_pool_create() can determine the desired ino range.
>> Does the Facebook use case involve a single large tmpfs or many
>> small ones? I would guess the latter and therefore we are trying to solve
>> a problem that nobody really needs to solve (i.e. global efficient ino pool).
>
> Unfortunately in the case under discussion, it's all in one large tmpfs in /dev/shm. I can empathise with that -- application owners often prefer to use the mounts provided to them rather than having to set up their own. For this one case we can change that, but I think it seems reasonable to support this case since using a single tmpfs can be a reasonable decision as an application developer, especially if you only have unprivileged access to the system.
>
>>> 3. Add a mount option to tmpfs (et al.), say `32bit-inums`, which people can
>>>     pass if they want the 32-bit inode numbers back. This would still allow
>>>     people who want to make this tradeoff to use xino.
>>
>> inode32|inode64 (see man xfs(5)).
>
> Ah great, thanks! I'll reuse precedent from those.
>
>>> 4. (If you like) Also add a CONFIG option to disable this at compile time.
>>>
>>
>> I Don't know about disable, but the default mode for tmpfs (inode32|inode64)
>> might me best determined by CONFIG option, so distro builders could decide
>> if they want to take the risk of breaking applications on tmpfs.
>
> Sounds good.
>
>> But if you implement per sb ino pool, maybe inode64 will no longer be
>> required for your use case?
>
> In this case I think per-sb ino pool will help a bit, but unfortunately not by an order of magnitude. As with the recycling patch this will help reduce thrash a bit but not conclusively prevent the problem from happening long-term. To fix that, I think we really do need the option to use ino_t-sized get_next_ino_full (or per-sb equivalent).
>
> Happy holidays, and thanks for your feedback!
>
> Chris
>
> .
>
J. Bruce Fields Jan. 7, 2020, 5:35 p.m. UTC | #16
On Sat, Dec 21, 2019 at 10:16:52AM +0000, Chris Down wrote:
> Darrick J. Wong writes:
> >On Fri, Dec 20, 2019 at 02:49:36AM +0000, Chris Down wrote:
> >>In general, userspace applications expect that (device, inodenum) should
> >>be enough to be uniquely point to one inode, which seems fair enough.
> >
> >Except that it's not.  (dev, inum, generation) uniquely points to an
> >instance of an inode from creation to the last unlink.

I thought that (dev, inum) was supposed to be unique from creation to
last unlink (and last close), and (dev, inum, generation) was supposed
to be unique for all time.

> I didn't mention generation because, even though it's set on tmpfs
> (to prandom_u32()), it's not possible to evaluate it from userspace
> since `ioctl` returns ENOTTY. We can't ask userspace applications to
> introspect on an inode attribute that they can't even access :-)

Is there any reason not to add IOC_GETVERSION support to tmpfs?

I wonder if statx should return it too?

--b.
Chris Down Jan. 7, 2020, 5:44 p.m. UTC | #17
J. Bruce Fields writes:
>I thought that (dev, inum) was supposed to be unique from creation to
>last unlink (and last close), and (dev, inum, generation) was supposed
>to be unique for all time.

Sure, but I mean, we don't really protect against even the first case.

>> I didn't mention generation because, even though it's set on tmpfs
>> (to prandom_u32()), it's not possible to evaluate it from userspace
>> since `ioctl` returns ENOTTY. We can't ask userspace applications to
>> introspect on an inode attribute that they can't even access :-)
>
>Is there any reason not to add IOC_GETVERSION support to tmpfs?
>
>I wonder if statx should return it too?

We can, but that seems like a tangential discussion/patch series. For the 
second case especially, that's something we should do separately from this 
patchset, since this demonstrably fixes issues encountered in production, and 
extending a user-facing API is likely to be a much more extensive discussion.

(Also, this one in particular has advanced quite a lot since this v1 patch :-))
J. Bruce Fields Jan. 8, 2020, 3 a.m. UTC | #18
On Tue, Jan 07, 2020 at 05:44:07PM +0000, Chris Down wrote:
> J. Bruce Fields writes:
> >I thought that (dev, inum) was supposed to be unique from creation to
> >last unlink (and last close), and (dev, inum, generation) was supposed
> >to be unique for all time.
> 
> Sure, but I mean, we don't really protect against even the first case.
> 
> >>I didn't mention generation because, even though it's set on tmpfs
> >>(to prandom_u32()), it's not possible to evaluate it from userspace
> >>since `ioctl` returns ENOTTY. We can't ask userspace applications to
> >>introspect on an inode attribute that they can't even access :-)
> >
> >Is there any reason not to add IOC_GETVERSION support to tmpfs?
> >
> >I wonder if statx should return it too?
> 
> We can, but that seems like a tangential discussion/patch series.
> For the second case especially, that's something we should do
> separately from this patchset,

Oh, of course, I'm not objecting to this patchset at all, it's a "why
not also do this?" question.

> since this demonstrably fixes issues encountered in production, and
> extending a user-facing APIs is likely to be a much more extensive
> discussion.

Though if it's a question of just a new implementation of an existing
ioctl, I doubt it's such a big deal.  (Not that I'm volunteering to
write the patch.)

--b.

Patch
diff mbox series

diff --git a/fs/inode.c b/fs/inode.c
index aff2b5831168..8193c17e2d16 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -870,26 +870,33 @@  static struct inode *find_inode_fast(struct super_block *sb,
  * This does not significantly increase overflow rate because every CPU can
  * consume at most LAST_INO_BATCH-1 unused inode numbers. So there is
  * NR_CPUS*(LAST_INO_BATCH-1) wastage. At 4096 and 1024, this is ~0.1% of the
- * 2^32 range, and is a worst-case. Even a 50% wastage would only increase
- * overflow rate by 2x, which does not seem too significant.
+ * 2^32 range (for 32-bit ino_t), and is a worst-case. Even a 50% wastage would
+ * only increase overflow rate by 2x, which does not seem too significant. With
+ * a 64-bit ino_t, overflow in general is fairly hard to achieve.
  *
- * On a 32bit, non LFS stat() call, glibc will generate an EOVERFLOW
- * error if st_ino won't fit in target struct field. Use 32bit counter
- * here to attempt to avoid that.
+ * Care should be taken not to overflow when at all possible, since generally
+ * userspace depends on (device, inodenum) being reliably unique.
  */
 #define LAST_INO_BATCH 1024
-static DEFINE_PER_CPU(unsigned int, last_ino);
+static DEFINE_PER_CPU(ino_t, last_ino);
 
-unsigned int get_next_ino(void)
+ino_t get_next_ino(void)
 {
-	unsigned int *p = &get_cpu_var(last_ino);
-	unsigned int res = *p;
+	ino_t *p = &get_cpu_var(last_ino);
+	ino_t res = *p;
 
 #ifdef CONFIG_SMP
 	if (unlikely((res & (LAST_INO_BATCH-1)) == 0)) {
-		static atomic_t shared_last_ino;
-		int next = atomic_add_return(LAST_INO_BATCH, &shared_last_ino);
+		static atomic64_t shared_last_ino;
+		u64 next = atomic64_add_return(LAST_INO_BATCH,
+					       &shared_last_ino);
 
+		/*
+		 * This might get truncated if ino_t is 32-bit, and so be more
+		 * susceptible to wrap around than on environments where ino_t
+		 * is 64-bit, but that's really no worse than always encoding
+		 * `res` as unsigned int.
+		 */
 		res = next - LAST_INO_BATCH;
 	}
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 190c45039359..ca1a04334c9e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3052,7 +3052,7 @@  static inline void lockdep_annotate_inode_mutex_key(struct inode *inode) { };
 #endif
 extern void unlock_new_inode(struct inode *);
 extern void discard_new_inode(struct inode *);
-extern unsigned int get_next_ino(void);
+extern ino_t get_next_ino(void);
 extern void evict_inodes(struct super_block *sb);
 
 extern void __iget(struct inode * inode);