* new_inode_pseudo vs locked inode->i_state = 0
From: Mateusz Guzik @ 2023-08-08 16:05 UTC
  To: Al Viro, Christian Brauner; +Cc: linux-fsdevel, linux-kernel

Hello,

new_inode_pseudo is:
        struct inode *inode = alloc_inode(sb);

	if (inode) {
		spin_lock(&inode->i_lock);
		inode->i_state = 0;
		spin_unlock(&inode->i_lock);
	}

I'm trying to understand:
1. why is it zeroing i_state (as opposed to having it happen in inode_init_always)
2. why is the zeroing taking place with i_lock held

The inode is freshly allocated, not yet added to the hash -- I would
expect that nobody else can see it.

Moreover, another consumer of alloc_inode zeroes without bothering to
lock -- see iget5_locked:
[snip]
	struct inode *new = alloc_inode(sb);

		if (new) {
			new->i_state = 0;
[/snip]

I tried to find a justification for it in git; the pre-history-wipe repo
(git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git)
says it came in with "Import 2.1.45pre1" in 1997. That is where my
digging stopped.

As is, I strongly suspect this is a leftover waiting for cleanup.
Moving i_state = 0 back to inode_init_always would result in a few
simplifications in the area. I'm happy to make them, provided this is
indeed safe.

If the lock is required, then shouldn't it also be added to iget5_locked?

UNRELATED:

While here, new_inode starts with: spin_lock_prefetch(&sb->s_inode_list_lock)

This too dates *way* back, to a huge import commit; ever since, the
line has only been patched to keep it compiling.

This is the only remaining spin_lock_prefetch use in the tree.

I don't know the original justification, nor whether it made sense at
the time, but it is definitely problematic in today's rather heavily
multicore era -- there is a ton of work happening between the prefetch
and actually taking s_inode_list_lock, meaning that if there is
contention, the cacheline is going to be invalidated by the time
spin_lock is called on it. The prefetch then only adds to the cacheline
bouncing.
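
To spell out the timeline, here is new_inode() as it stands, annotated
with the problematic interleaving (the comments describe one possible
schedule, not guaranteed behaviour):

        spin_lock_prefetch(&sb->s_inode_list_lock); /* line pulled in early */
        inode = new_inode_pseudo(sb);  /* long: slab alloc + full inode init */
        /*
         * If another CPU takes s_inode_list_lock in the meantime, its
         * store invalidates our prefetched copy -- we eat the cache miss
         * below anyway, plus an extra coherence transaction for the
         * wasted prefetch.
         */
        if (inode)
                inode_sb_list_add(inode);       /* takes the lock for real */
        return inode;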

Personally I would just remove this line without even trying to benchmark.
-- 
Mateusz Guzik <mjguzik gmail.com>


* Re: new_inode_pseudo vs locked inode->i_state = 0
From: Dave Chinner @ 2023-08-08 23:15 UTC
  To: Mateusz Guzik; +Cc: Al Viro, Christian Brauner, linux-fsdevel, linux-kernel

On Tue, Aug 08, 2023 at 06:05:33PM +0200, Mateusz Guzik wrote:
> Hello,
> 
> new_inode_pseudo is:
>         struct inode *inode = alloc_inode(sb);
> 
> 	if (inode) {
> 		spin_lock(&inode->i_lock);
> 		inode->i_state = 0;
> 		spin_unlock(&inode->i_lock);
> 	}
> 
> I'm trying to understand:
> 1. why is it zeroing i_state (as opposed to having it happen in inode_init_always)
> 2. why is the zeroing taking place with i_lock held
> 
> The inode is freshly allocated, not yet added to the hash -- I would
> expect that nobody else can see it.

Maybe not at this point, but as soon as the function returns with
the new inode, it can be published on some list that is accessed
concurrently, and then the i_state visible on other CPUs had better
be correct.

I'll come back to this, because the answer lies in this code:

> Moreover, another consumer of alloc_inode zeroes without bothering to
> lock -- see iget5_locked:
> [snip]
> 	struct inode *new = alloc_inode(sb);
> 
> 		if (new) {
> 			new->i_state = 0;
> [/snip]

Yes, that one is fine because the inode has not been published yet.
The actual i_state serialisation needed to publish the inode happens
in the function called in the very next line - inode_insert5().

That does:

	spin_lock(&inode_hash_lock);

	.....
        /*
         * Return the locked inode with I_NEW set, the
         * caller is responsible for filling in the contents
         */
        spin_lock(&inode->i_lock);
        inode->i_state |= I_NEW;
        hlist_add_head_rcu(&inode->i_hash, head);
        spin_unlock(&inode->i_lock);
	.....

	spin_unlock(&inode_hash_lock);

The i_lock is held across the inode state initialisation and the hash
list insert so that if anything finds the inode in the hash
immediately after insert, it will see an initialised value.

Don't be fooled by the inode_hash_lock here. We have
find_inode_rcu(), which walks hash lists without holding the hash
lock; hence anything that needs to do a state check on a found
inode is guaranteed to see I_NEW after grabbing the i_lock....

Further, inode_insert5() adds the inode to the superblock inode
list, which means concurrent sb inode list walkers can also see this
inode whilst the inode_hash_lock is still held by inode_insert5().
Those inode list walkers *must* see I_NEW at this point, and they
are guaranteed to do so by taking i_lock before checking i_state....

IOWs, the initialisation of inode->i_state for normal inodes must be
done under i_lock so that lookups that occur after hash/sb list
insert are guaranteed to see the correct value.
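
For reference, the reader side looks roughly like this (simplified from
the find_inode()-style hash walkers; the wait-on-freeing and error
paths are trimmed):

        hlist_for_each_entry(inode, head, i_hash) {
                if (inode->i_sb != sb)
                        continue;
                if (!test(inode, data))
                        continue;
                spin_lock(&inode->i_lock);      /* pairs with the init side */
                if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                /* i_state is guaranteed to be an initialised value here */
                __iget(inode);
                spin_unlock(&inode->i_lock);
                return inode;
        }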

If we now go back to new_inode_pseudo(), we see one of the callers
is new_inode(), and it does this:

struct inode *new_inode(struct super_block *sb)
{
        struct inode *inode;

        spin_lock_prefetch(&sb->s_inode_list_lock);

        inode = new_inode_pseudo(sb);
        if (inode)
                inode_sb_list_add(inode);
        return inode;
}

IOWs, the inode is immediately published on the superblock inode
list, and so inode list walkers can see it immediately. As per
inode_insert5(), this requires the inode state to be fully
initialised and memory barriers in place such that any walker will
see the correct value of i_state. The simplest, safest way to do
this is to initialise i_state under the i_lock....

> I don't know the original justification, nor whether it made sense at
> the time, but it is definitely problematic in today's rather heavily
> multicore era -- there is a ton of work happening between the prefetch
> and actually taking s_inode_list_lock, meaning that if there is
> contention, the cacheline is going to be invalidated by the time
> spin_lock is called on it. The prefetch then only adds to the cacheline
> bouncing.

Well-known problem - sb->s_inode_list_lock has been the most heavily
contended lock in the VFS for various XFS workloads for the best
part of a decade, yet XFS does not use new_inode(). See this branch
for how we fix the s_inode_list_lock contention issues:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/log/?h=vfs-scale

This commit removes the spin_lock_prefetch() you talk about:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=vfs-scale&id=fb17545b70d8c228295105044dd6b52085197d75

If only I had the time and resources available right now to push
this to completion.....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: new_inode_pseudo vs locked inode->i_state = 0
From: Mateusz Guzik @ 2023-08-09  0:23 UTC
  To: Dave Chinner; +Cc: Al Viro, Christian Brauner, linux-fsdevel, linux-kernel

On 8/9/23, Dave Chinner <david@fromorbit.com> wrote:
[snip]
> If we now go back to new_inode_pseudo(), we see one of the callers
> is new_inode(), and it does this:
>
> struct inode *new_inode(struct super_block *sb)
> {
>         struct inode *inode;
>
>         spin_lock_prefetch(&sb->s_inode_list_lock);
>
>         inode = new_inode_pseudo(sb);
>         if (inode)
>                 inode_sb_list_add(inode);
>         return inode;
> }
>
> IOWs, the inode is immediately published on the superblock inode
> list, and so inode list walkers can see it immediately. As per
> inode_insert5(), this requires the inode state to be fully
> initialised and memory barriers in place such that any walker will
> see the correct value of i_state. The simplest, safest way to do
> this is to initialise i_state under the i_lock....
>

Thanks for the detailed answer. I do think you have a valid point, but
I don't think it works with the given example. ;)

inode_sb_list_add is:
        spin_lock(&inode->i_sb->s_inode_list_lock);
        list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
        spin_unlock(&inode->i_sb->s_inode_list_lock);

... thus i_state is published by the time it unlocks.

According to my grep all iterations over the list hold the
s_inode_list_lock, thus they are guaranteed to see the update, making
the release fence in new_inode_pseudo redundant for this case.
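
E.g. a typical walker, modelled on evict_inodes() (simplified):

        spin_lock(&sb->s_inode_list_lock);      /* acquire pairs with the
                                                   release in inode_sb_list_add */
        list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
                spin_lock(&inode->i_lock);
                if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
                        spin_unlock(&inode->i_lock);
                        continue;
                }
                /*
                 * The walker cannot see the inode before list_add, and
                 * the lock handoff guarantees it sees the earlier
                 * i_state store.
                 */
                .....
                spin_unlock(&inode->i_lock);
        }
        spin_unlock(&sb->s_inode_list_lock);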

With this in mind I'm assuming the fence was there as a safety
measure, for consumers which might need it.

Then the code can:
        struct inode *inode = alloc_inode(sb);

        if (inode) {
                inode->i_state = 0;
                /* make sure i_state update will be visible before we insert
                 * the inode anywhere */
                smp_wmb();
        }

Upshots:
- replaces 2 atomics with a mere release fence, which is way cheaper
to do everywhere and virtually free on x86-64
- people reading the code don't wonder who on earth we are locking against

All that said, if the (possibly redundant) fence is literally the only
reason for the lock trip, I would once more propose zeroing in
inode_init_always:
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..ce9664c4efe9 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
                return -ENOMEM;
        this_cpu_inc(nr_inodes);

+       inode->i_state = 0;
+       /*
+        * Make sure i_state update is visible before this inode gets inserted
+        * anywhere.
+        */
+       smp_wmb();
+
        return 0;
 }
 EXPORT_SYMBOL(inode_init_always);

This is more in the spirit of making sure everybody has published
i_state = 0 and facilitates cleanup.
- new_inode_pseudo is now just alloc_inode
- confusing unlocked/unfenced i_state = 0 disappears from iget5_locked

And probably some more tidyups.

Now, I'm not going to get into a flame war with anyone over doing
smp_wmb instead of the lock trip (looks like a no-brainer to me, but I
got flamed over another one earlier today ;>).

I am however going to /strongly suggest/ that a comment explaining
what's going on is added there, if the current state is to remain.

As far as I'm concerned, *locking* when a mere smp_wmb would suffice
is heavily misleading and should be whacked if only for that reason.

Cheers,
-- 
Mateusz Guzik <mjguzik gmail.com>


* Re: new_inode_pseudo vs locked inode->i_state = 0
From: Dave Chinner @ 2023-08-09  2:30 UTC
  To: Mateusz Guzik; +Cc: Al Viro, Christian Brauner, linux-fsdevel, linux-kernel

On Wed, Aug 09, 2023 at 02:23:59AM +0200, Mateusz Guzik wrote:
[snip]
> Thanks for the detailed answer. I do think you have a valid point, but
> I don't think it works with the given example. ;)
> 
> inode_sb_list_add is:
>         spin_lock(&inode->i_sb->s_inode_list_lock);
>         list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
>         spin_unlock(&inode->i_sb->s_inode_list_lock);
> 
> ... thus i_state is published by the time it unlocks.
> 
> According to my grep all iterations over the list hold the
> s_inode_list_lock, thus they are guaranteed to see the update, making
> the release fence in new_inode_pseudo redundant for this case.

I don't believe that is the case - the i_state modification is not
within the critical region the s_inode_list_lock covers, nor is the
cacheline i_state lies on referenced within the critical section.
Hence there is no explicit ordering dependency created by
inode_sb_list_add on the value of i_state.

Your argument seems to be that we can rely on the side effect of
some unrelated lock to provide ordered memory access - that's just
really poor locking design and will result in unmaintainable code
that gets broken without realising in the future.

That's why the spin lock. It's *obviously correct*, and it doesn't
require any of the other code that checks i_state to have to care
about any lock or memory barrier mechanism other than taking
i_lock...

> With this in mind I'm assuming the fence was there as a safety
> measure, for consumers which might need it.
> 
> Then the code can:
>         struct inode *inode = alloc_inode(sb);
> 
>         if (inode) {
>                 inode->i_state = 0;
>                 /* make sure i_state update will be visible before we insert
>                  * the inode anywhere */
>                 smp_wmb();
>         }

AFAIA, that doesn't work by itself without a matching smp_rmb()
prior to the i_state reader - memory barriers need to be paired for
ordering to be valid. Hence this also seems to assume that we can
rely on some other unrelated lock pairing to actually order memory
accesses to i_state....
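
IOWs, the pairing that would be required, side by side (a sketch; the
list load on the reader is illustrative):

        /* writer */                            /* reader */
        inode->i_state = 0;                     inode = <found via some list>;
        smp_wmb();                              smp_rmb();  /* who does this? */
        <publish inode on a list>;              state = inode->i_state;

Without the reader-side barrier (or an address dependency standing in
for it), the writer-side smp_wmb() orders nothing by itself.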

> Upshots:
> - replaces 2 atomics with a mere release fence, which is way cheaper
> to do everywhere and virtually free on x86-64
> - people reading the code don't wonder who on earth we are locking against

Downsides:
- memory barriers are hard to get right,
- nobody will be able to look at the code and say "this is obviously
  correct".
- random unpaired memory barriers in code end up making it
  unmaintainable.
- impossible to test for correctness

> All that said, if the (possibly redundant) fence is literally the only
> reason for the lock trip, I would once more propose zeroing in
> inode_init_always:
> diff --git a/fs/inode.c b/fs/inode.c
> index 8fefb69e1f84..ce9664c4efe9 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -232,6 +232,13 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>                 return -ENOMEM;
>         this_cpu_inc(nr_inodes);
> 
> +       inode->i_state = 0;
> +       /*
> +        * Make sure i_state update is visible before this inode gets inserted
> +        * anywhere.
> +        */
> +       smp_wmb();

We've had code calling inode_init_always() for a long time that
requires i_state to remain untouched until the caller is ready to
change it. This would break that code....
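
A purely illustrative sketch of the sort of caller pattern meant here
(hypothetical, not a real in-tree caller):

        /*
         * Inode-reuse path: i_state still carries e.g. I_FREEING so
         * concurrent walkers keep skipping this inode while it is
         * rebuilt. Zeroing inside inode_init_always() would briefly
         * expose i_state == 0 to them.
         */
        err = inode_init_always(sb, inode);     /* must leave i_state alone */
        .....
        spin_lock(&inode->i_lock);
        inode->i_state = I_NEW;         /* caller flips the state when ready */
        spin_unlock(&inode->i_lock);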

> Now, I'm not going to get into a flame war with anyone over doing
> smp_wmb instead of the lock trip (looks like a no-brainer to me, but I
> got flamed over another one earlier today ;>).

Yes, I read that thread, and nobody was flaming anyone. People just
asked hard questions about why we need to change code that has been
working largely untouched for a decade.

> I am however going to /strongly suggest/ that a comment explaining
> what's going on is added there, if the current state is to remain.

If you really think that improves the code, send a patch....

> As far as I'm concerned, *locking* when a mere smp_wmb would suffice
> is heavily misleading and should be whacked if only for that reason.

If it ain't broke, don't fix it.

The code we have now is widely used (there are over a hundred
callers of new_inode()) and we know it works correctly. Nobody is
complaining that it is too slow, and generally speaking the overhead
of this lock traversal is lost in the noise of all the other
operations needed to be performed to initialise a new inode.

If there's a compelling reason to change the code (e.g. benchmark
improvements), then spend the time to do the audit to determine the
change is safe and test all the filesystems the change affects to
validate there are no regressions...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: new_inode_pseudo vs locked inode->i_state = 0
From: Mateusz Guzik @ 2023-08-09  9:11 UTC
  To: Dave Chinner; +Cc: Al Viro, Christian Brauner, linux-fsdevel, linux-kernel

On 8/9/23, Dave Chinner <david@fromorbit.com> wrote:
> On Wed, Aug 09, 2023 at 02:23:59AM +0200, Mateusz Guzik wrote:
[snip]
>> inode_sb_list_add is:
>>         spin_lock(&inode->i_sb->s_inode_list_lock);
>>         list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
>>         spin_unlock(&inode->i_sb->s_inode_list_lock);
>>
>> ... thus i_state is published by the time it unlocks.
>>
>> According to my grep all iterations over the list hold the
>> s_inode_list_lock, thus they are guaranteed to see the update, making
>> the release fence in new_inode_pseudo redundant for this case.
>
> I don't believe that is the case - the i_state modification is not
> within the critical region the s_inode_list_lock covers, nor is the
> cacheline i_state lies on referenced within the critical section.
> Hence there is no explicit ordering dependency created by
> inode_sb_list_add on the value of i_state.
>
> Your argument seems to be that we can rely on the side effect of
> some unrelated lock to provide ordered memory access - that's just
> really poor locking design and will result in unmaintainable code
> that gets broken without realising in the future.
>

In that spirit: *normally*, when you add an object to some
list/whatever, there are probably other changes made after i_state
= 0 (i.e. another fence is ultimately needed just before insertion
anyway).

The least error-prone approach, as far as I'm concerned, would be to
make sure the inserting routine issues the relevant fences -- that way
consumers which sneak in some updates prior to calling it are still
covered just fine, and as it happens the sb list iteration *is* covered
in this manner. But I guess that's not the expected way here.

I feel compelled to note that not issuing the fence in iget5_locked,
and instead relying on the hash insertion to sort it out, is exactly
this approach, albeit internally it is done explicitly with i_lock.
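
To sketch what I mean from the hash-insert side: the release that
lock-free walkers rely on is buried in hlist_add_head_rcu(), which
publishes the node via rcu_assign_pointer() -- a store-release:

        new->i_state = 0;               /* plain store, no fence needed here */
        .....
        /* inside hlist_add_head_rcu(): */
        n->next = first;
        rcu_assign_pointer(hlist_first_rcu(h), n);      /* store-release: all
                                                           prior stores are
                                                           visible once the
                                                           inode is reachable */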

> That's why the spin lock. It's *obviously correct*, and it doesn't
> require any of the other code that checks i_state to have to care
> about any lock or memory barrier mechanism other than taking
> i_lock...
>

It is obviously not *wrong* in the sense of not breaking code, just
deeply confusing for the reader -- I don't know about you, but if I see
a lock taken and released on what should be an object invisible to
other threads, the question immediately pops up whether the inode *is*
in fact visible somewhere. Making sure the i_state update is published
by acquiring the lock on a freshly allocated inode is not something I
had thought of.

>> With this in mind I'm assuming the fence was there as a safety
>> measure, for consumers which might need it.
>>
>> Then the code can:
>>         struct inode *inode = alloc_inode(sb);
>>
>>         if (inode) {
>>                 inode->i_state = 0;
>>                 /* make sure i_state update will be visible before we
>> insert
>>                  * the inode anywhere */
>>                 smp_wmb();
>>         }
>
> AFAIA, that doesn't work by itself without a matching smp_rmb()
> prior to the i_state reader - memory barriers need to be paired for
> ordering to be valid. Hence this also seems to assume that we can
> rely on some other unrelated lock pairing to actually order memory
> accesses to i_state....
>

It has to be either smp_rmb or a consume barrier.
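
Sketch of the two reader-side options (the pointer load from the list
is illustrative):

        /* consume variant: the address dependency orders the two loads */
        inode = rcu_dereference(pos);
        state = inode->i_state;

        /* smp_rmb variant: explicit read fence between the loads */
        inode = READ_ONCE(pos);
        smp_rmb();
        state = inode->i_state;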

Spelling it out, instead of having it done implicitly through
spin_unlock, points out that it is redundant for the new_inode vs sb
list iteration case. People just don't have a strong reaction to plain
lock/unlock trips, whereas they get very worried about any explicitly
mentioned barriers (justified for the latter, but they should be just
as wary of the former).

[snip]
>> I am however going to /strongly suggest/ that a comment explaining
>> what's going on is added there, if the current state is to remain.
>
> If you really think that improves the code, send a patch....
>
>> As far as I'm concerned, *locking* when a mere smp_wmb would suffice
>> is heavily misleading and should be whacked if only for that reason.
>
> If it ain't broke, don't fix it.
>
> The code we have now is widely used (there are over a hundred
> callers of new_inode()) and we know it works correctly. Nobody is
> complaining that it is too slow, and generally speaking the overhead
> of this lock traversal is lost in the noise of all the other
> operations needed to be performed to initialise a new inode.
>

Maybe you have been around this code too long to see it from an
outsider's perspective, or maybe I did not do a good job spelling it
out.

I'm not complaining about speed; I'm complaining that a standalone
lock trip there is *confusing*, especially when another consumer in
the same file does not do it and filesystems have *custom* allocation
routines -- is the inode visible to other threads?

I also noted that the reasoning you initially gave is trivially
satisfied with smp_wmb, but I'm definitely not going to push for it.

However, after more poking around, what I think you were trying to say
is that inode lookups/traversals always take the inode lock to inspect
i_state, thus having the zeroing enclosed by it makes the code easier
to reason about (but if that's the case, why is iget5_locked standing
out, and why do you seem fine with it?). Now that is a justification I
can understand at least.

All that said, how about this (modulo wording):
diff --git a/fs/inode.c b/fs/inode.c
index 8fefb69e1f84..977e72942706 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1018,6 +1018,13 @@ struct inode *new_inode_pseudo(struct super_block *sb)
        struct inode *inode = alloc_inode(sb);

        if (inode) {
+               /*
+                * Make sure the i_state update is visible before this inode
+                * gets inserted anywhere.
+                *
+                * The safe way for consumers to inspect the field is with
+                * i_lock held; match that for simplicity.
+                */
                spin_lock(&inode->i_lock);
                inode->i_state = 0;
                spin_unlock(&inode->i_lock);
@@ -1285,6 +1292,11 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
                struct inode *new = alloc_inode(sb);

                if (new) {
+                       /*
+                        * new_inode_pseudo takes a lock to zero i_state; here
+                        * we can safely skip it because the hash insertion
+                        * will take care of it.
+                        */
                        new->i_state = 0;
                        inode = inode_insert5(new, hashval, test, set, data);
                        if (unlikely(inode != new))


or patch iget5_locked to use new_inode_pseudo, but then I don't want
to be on the lookout for someone's microbenchmark results changing.

-- 
Mateusz Guzik <mjguzik gmail.com>

