linux-kernel.vger.kernel.org archive mirror
* 5.5 XFS getdents regression?
@ 2020-03-10  8:45 Rantala, Tommi T. (Nokia - FI/Espoo)
  2020-03-10 11:12 ` Bhaskar Chowdhury
  2020-03-10 22:14 ` Dave Chinner
  0 siblings, 2 replies; 8+ messages in thread
From: Rantala, Tommi T. (Nokia - FI/Espoo) @ 2020-03-10  8:45 UTC (permalink / raw)
  To: darrick.wong, linux-xfs, linux-kernel; +Cc: hch

Hello,

One of my GitLab CI jobs stopped working after upgrading the server from
5.4.18-100.fc30.x86_64 to 5.5.7-100.fc30.x86_64
(tested 5.5.8-100.fc30.x86_64 too, no change).
The server is Fedora 30 with an XFS rootfs.
The problem always reproduces, and takes only a couple of minutes to run.

The CI job fails at the beginning when running "git clean" in a Docker
container, which fails to rmdir a directory:
"warning: failed to remove .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/internal/socket: Directory not empty"

A quick Google search finds other people reporting similar problems
with 5.5.0:
https://gitlab.com/gitlab-org/gitlab-runner/issues/3185


I collected some data with strace, and it seems that getdents64() is not
returning all entries:

5.4 getdents64() returns 52+50+1+0 entries 
=> all files in directory are deleted and rmdir() is OK

5.5 getdents64() returns 52+50+0+0 entries
=> rmdir() fails with ENOTEMPTY
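
The per-call entry counts above were read from strace output. As a minimal sketch of collecting the same counts without strace, assuming Linux and the raw getdents64(2) syscall with the same 2048-byte buffer seen in the traces, something like the following could be used. The names count_dirents() and BUF_SIZE are hypothetical and do not come from git or the kernel.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define BUF_SIZE 2048	/* same buffer size as in the strace output */

/* Layout of one record returned by getdents64(2), as given in its man page. */
struct linux_dirent64 {
	uint64_t	d_ino;
	int64_t		d_off;
	unsigned short	d_reclen;
	unsigned char	d_type;
	char		d_name[];
};

/*
 * Hypothetical helper: read the directory at `path` with raw getdents64()
 * calls, print how many entries each call returned, and return the total
 * (including "." and ".."). Returns -1 on error.
 */
long count_dirents(const char *path)
{
	uint64_t buf[BUF_SIZE / sizeof(uint64_t)];	/* 8-byte aligned buffer */
	char *base = (char *)buf;
	long total = 0;
	int fd = open(path, O_RDONLY | O_DIRECTORY);

	if (fd < 0)
		return -1;

	for (;;) {
		long nread = syscall(SYS_getdents64, fd, base, sizeof(buf));
		long entries = 0;

		if (nread < 0) {
			close(fd);
			return -1;
		}

		/* Step through the variable-length records in this buffer. */
		for (long pos = 0; pos < nread; ) {
			struct linux_dirent64 *d =
				(struct linux_dirent64 *)(base + pos);

			assert(d->d_reclen > 0);
			pos += d->d_reclen;
			entries++;
		}

		printf("getdents64() = %ld bytes, %ld entries\n", nread, entries);
		total += entries;
		if (nread == 0)		/* 0 bytes means end of directory */
			break;
	}

	close(fd);
	return total;
}
```

Calling count_dirents() on a directory that nothing else is modifying should report every entry exactly once. Unlike the traces above, this sketch does not unlink anything between calls.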


Working 5.4 strace:
10:00:12 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 52 entries */, 2048) = 2024 <0.000020>
10:00:12 unlink(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket/cmsghdr.go") = 0 <0.000068>
10:00:12 unlink(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket/cmsghdr_bsd.go") = 0 <0.000048>
[...]
10:00:12 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 50 entries */, 2048) = 2048 <0.000023>
10:00:12 unlink(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket/sys_linux_386.s") = 0 <0.000062>
[...]
10:00:12 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 1 entries */, 2048) = 48 <0.000017>
10:00:12 unlink(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket/zsys_solaris_amd64.go") = 0 <0.000039>
10:00:12 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 0 entries */, 2048) = 0 <0.000015>
10:00:12 rmdir(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket") = 0 <0.000055>


Failing 5.5 strace:
10:09:15 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 52 entries */, 2048) = 2024 <0.000031>
10:09:15 unlink(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket/cmsghdr.go") = 0 <0.006174>
[...]
10:09:15 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 50 entries */, 2048) = 2048 <0.000034>
10:09:15 unlink(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket/sys_linux_386.s") = 0 <0.000054>
[...]
10:09:16 getdents64(10</builds/xyz/.vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket>, /* 0 entries */, 2048) = 0 <0.000020>
10:09:16 rmdir(".vendor/pkg/mod/golang.org/x/net@v0.0.0-20200301022130-244492dfa37a/internal/socket") = -1 ENOTEMPTY (Directory not empty) <0.000029>


Any ideas what's going wrong here?

-Tommi



* Re: 5.5 XFS getdents regression?
  2020-03-10  8:45 5.5 XFS getdents regression? Rantala, Tommi T. (Nokia - FI/Espoo)
@ 2020-03-10 11:12 ` Bhaskar Chowdhury
  2020-03-10 14:41   ` Eric Sandeen
  2020-03-10 22:14 ` Dave Chinner
  1 sibling, 1 reply; 8+ messages in thread
From: Bhaskar Chowdhury @ 2020-03-10 11:12 UTC (permalink / raw)
  To: Rantala, Tommi T. (Nokia - FI/Espoo)
  Cc: darrick.wong, linux-xfs, linux-kernel, hch


On 08:45 Tue 10 Mar 2020, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:

Okay, hang on! Don't you think you should ask on the Fedora mailing list
instead of here?

Because you are running a Fedora kernel, and I believe it is patched by
their team. So they might have a much more concrete answer than the
filesystem developers here.

Kindly provide the bug report to them to fix your woes.

~Bhaskar







* Re: 5.5 XFS getdents regression?
  2020-03-10 11:12 ` Bhaskar Chowdhury
@ 2020-03-10 14:41   ` Eric Sandeen
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Sandeen @ 2020-03-10 14:41 UTC (permalink / raw)
  To: Bhaskar Chowdhury, Rantala, Tommi T. (Nokia - FI/Espoo),
	darrick.wong, linux-xfs, linux-kernel, hch



On 3/10/20 6:12 AM, Bhaskar Chowdhury wrote:
> On 08:45 Tue 10 Mar 2020, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> 
> Okay, hang on! don't you think you should query at fedora mailing list
> instead here??
> 
> Because you are running fedora kernel and I believe it is patched by
> their team. So, they might have much more concrete answer than to ask
> the file system developer here for the outcome.

The Fedora kernel isn't very heavily patched most of the time, but it would
be a good idea to test an upstream kernel (easy enough to just re-use the
Fedora kernel config) to confirm that it is an upstream problem.

OTOH the gitlab link in the original email seems to indicate problems on
Windows as well, so it may require some work to determine whether this is
a test harness problem, kernel problem, etc?

Tommi, if the problem is easy to reproduce perhaps you can try a bisect
on upstream kernels between 5.4.0 and 5.5.0?

Also, testing on ext4 on 5.5.7 would help determine whether the problem
you are seeing is xfs-specific or not.

-Eric

> Kindly, provide the bug report to them fix your owes.




* Re: 5.5 XFS getdents regression?
  2020-03-10  8:45 5.5 XFS getdents regression? Rantala, Tommi T. (Nokia - FI/Espoo)
  2020-03-10 11:12 ` Bhaskar Chowdhury
@ 2020-03-10 22:14 ` Dave Chinner
  2020-03-11 17:06   ` Rantala, Tommi T. (Nokia - FI/Espoo)
  1 sibling, 1 reply; 8+ messages in thread
From: Dave Chinner @ 2020-03-10 22:14 UTC (permalink / raw)
  To: Rantala, Tommi T. (Nokia - FI/Espoo)
  Cc: darrick.wong, linux-xfs, linux-kernel, hch

On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> Hello,
> 
> One of my GitLab CI jobs stopped working after upgrading server 5.4.18-
> 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64.
> (tested 5.5.8-100.fc30.x86_64 too, no change)
> The server is fedora30 with XFS rootfs.
> The problem reproduces always, and takes only couple minutes to run.
> 
> The CI job fails in the beginning when doing "git clean" in docker
> container, and failing to rmdir some directory:
> "warning: failed to remove 
> .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/intern
> al/socket: Directory not empty"
> 
> Quick google search finds some other people reporting similar problems
> with 5.5.0:
> https://gitlab.com/gitlab-org/gitlab-runner/issues/3185

Which appears to be caused by multiple gitlab processes modifying
the directory at the same time. i.e. something is adding an entry to
the directory at the same time something is trying to rm -rf it.
That's a race condition, and would lead to the exact symptoms you
see here, depending on where in the directory the new entry is
added.

> Collected some data with strace, and it seems that getdents is not
> returning all entries:
> 
> 5.4 getdents64() returns 52+50+1+0 entries 
> => all files in directory are deleted and rmdir() is OK
> 
> 5.5 getdents64() returns 52+50+0+0 entries
> => rmdir() fails with ENOTEMPTY

Yup, that's a classic userspace TOCTOU race.

Remember, getdents() is effectively a sequential walk through the
directory data - subsequent calls start at the offset (cookie) where
the previous one left off. New entries can be added between
getdents() syscalls.

If that new entry is put at the tail of the directory, then the last
getdents() call will return that entry rather than none because it
was placed at an offset in the directory that the getdents() sweep
has not yet reached, and hence will be found by a future getdents()
call in the sweep.


However, if there is a hole in the directory structure before the
current getdents cookie offset, a new entry can be added in that
hole. i.e. at an offset in the directory that getdents has already
passed over. That dirent will never be reported by the current
getdents() sequence - a directory rewind and re-read is required to
find it. i.e. there's an inherent userspace TOCTOU race condition in
'rm -rf' operations.

IOWs, this is exactly what you'd expect to see when there are
concurrent userspace modifications to a directory that is currently
being read. Hence you need to rule out application and userspace
level issues before looking for filesystem level problems.
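
The "rewind and re-read" point above can be sketched in userspace as follows. This is a minimal illustration, assuming a flat directory of regular files. remove_flat_dir() is a hypothetical helper and is not how git or rm implement their cleanup. It keeps rewinding and re-scanning until a full pass finds nothing left to unlink, and only then calls rmdir().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: unlink every entry in the directory `path`,
 * rewinding and re-reading until one full pass removes nothing, then
 * rmdir() the directory. Does not recurse: a subdirectory makes
 * unlinkat() fail and the helper gives up. Returns 0 on success, -1 on
 * error with errno set.
 */
int remove_flat_dir(const char *path)
{
	DIR *dir = opendir(path);
	int removed;

	if (!dir)
		return -1;

	do {
		struct dirent *de;

		removed = 0;
		rewinddir(dir);		/* restart the sweep from the beginning */

		while ((de = readdir(dir)) != NULL) {
			if (!strcmp(de->d_name, ".") || !strcmp(de->d_name, ".."))
				continue;

			/* ENOENT means someone else removed it first; ignore. */
			if (unlinkat(dirfd(dir), de->d_name, 0) != 0 &&
			    errno != ENOENT) {
				int saved = errno;

				closedir(dir);
				errno = saved;
				return -1;
			}
			removed++;
		}
	} while (removed > 0);

	closedir(dir);
	return rmdir(path);
}
```

A concurrent writer that keeps adding entries can still make the final rmdir() fail with ENOTEMPTY, or keep the loop running, so this narrows the window rather than closing it.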

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: 5.5 XFS getdents regression?
  2020-03-10 22:14 ` Dave Chinner
@ 2020-03-11 17:06   ` Rantala, Tommi T. (Nokia - FI/Espoo)
  2020-03-11 17:22     ` hch
  0 siblings, 1 reply; 8+ messages in thread
From: Rantala, Tommi T. (Nokia - FI/Espoo) @ 2020-03-11 17:06 UTC (permalink / raw)
  To: hch, david; +Cc: darrick.wong, linux-xfs, linux-kernel

On Wed, 2020-03-11 at 09:14 +1100, Dave Chinner wrote:
> On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia -
> FI/Espoo) wrote:
> > Hello,
> > 
> > One of my GitLab CI jobs stopped working after upgrading server
> > 5.4.18-
> > 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64.
> > (tested 5.5.8-100.fc30.x86_64 too, no change)
> > The server is fedora30 with XFS rootfs.
> > The problem reproduces always, and takes only couple minutes to run.
> > 
> > The CI job fails in the beginning when doing "git clean" in docker
> > container, and failing to rmdir some directory:
> > "warning: failed to remove 
> > .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/in
> > tern
> > al/socket: Directory not empty"
> > 
> > Quick google search finds some other people reporting similar problems
> > with 5.5.0:
> > https://gitlab.com/gitlab-org/gitlab-runner/issues/3185
> 
> Which appears to be caused by multiple gitlab processes modifying
> the directory at the same time. i.e. something is adding an entry to
> the directory at the same time something is trying to rm -rf it.
> That's a race condition, and would lead to the exact symptoms you
> see here, depending on where in the directory the new entry is
> added.

OK, I traced "execve" with strace too, and it shows that a single
"git clean -ffdx" process is executed in the container and does the
cleanup.

Tested with 5.6-rc5, it's failing the same way.

I spent some time bisecting this, and the problem was introduced by this commit:

commit 263dde869bd09b1a709fd92118c7fff832773689
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Nov 8 15:05:32 2019 -0800

    xfs: cleanup xfs_dir2_block_getdents
    
    Use an offset as the main means for iteration, and only do pointer
    arithmetics to find the data/unused entries.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>



Hmmmmm, looking at that commit, I think it slightly changed how the
"offset" is used compared to how the pointers were used.

This cures the issue for me, tested (briefly) on top of 5.6-rc5.
Does it make sense...?
(Email client probably damages white-space, sorry, I'll send this properly
signed-off with git-send-email if it's OK)


diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 0d3b640cf1cc..af945ec9df3b 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -179,6 +179,7 @@ xfs_dir2_block_getdents(
        struct xfs_dir2_data_unused     *dup = bp->b_addr + offset;
        struct xfs_dir2_data_entry      *dep = bp->b_addr + offset;
        uint8_t filetype;
+       unsigned int dep_offset;
 
        /*
         * Unused, skip it.
@@ -188,18 +189,21 @@ xfs_dir2_block_getdents(
                continue;
        }
 
+       dep_offset = offset;
+
        /*
-        * Bump pointer for the next iteration.
+        * Bump offset for the next iteration.
         */
        offset += xfs_dir2_data_entsize(dp->i_mount, dep->namelen);

        /*
         * The entry is before the desired starting point, skip it.
         */
-       if (offset < wantoff)
+       if (dep_offset < wantoff)
                continue;
 
-       cook = xfs_dir2_db_off_to_dataptr(geo, geo->datablk, offset);
+       cook = xfs_dir2_db_off_to_dataptr(geo, geo->datablk,
+                                         dep_offset);
 
        ctx->pos = cook & 0x7fffffff;
        filetype = xfs_dir2_data_get_ftype(dp->i_mount, dep);
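
The difference between the two offsets in the patch above can be modelled in userspace. The following is a toy sketch, not kernel code: walk_block(), struct entry, enum cookie_mode and the UINT_MAX end marker are all hypothetical, and the model ignores unused space, unlinks between calls and directory format changes. It only shows that recording the post-bump offset makes the resume cookie name the entry after the one that did not fit, so a reader that treats the cookie as an entry start would skip one entry. Whether that is the exact path by which the entry is lost in the traces is an assumption; the thread does not spell it out.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Toy stand-in for a directory entry: only its size in bytes matters. */
struct entry {
	unsigned int size;
};

enum cookie_mode {
	COOKIE_ENTRY_START,	/* compare/store the entry's own offset (patched) */
	COOKIE_NEXT_START,	/* compare/store the already-bumped offset (5.5) */
};

/*
 * Model of one getdents call over a single block. Entries whose key is
 * below `wantoff` are skipped, at most `cap` entries are emitted into
 * out[] (as indexes), and the resume cookie is stored in *pos.
 * Returns the number of entries emitted.
 */
size_t walk_block(const struct entry *ents, size_t n, unsigned int wantoff,
		  size_t cap, enum cookie_mode mode, size_t *out,
		  unsigned int *pos)
{
	unsigned int offset = 0;
	size_t emitted = 0;

	*pos = wantoff;
	for (size_t i = 0; i < n; i++) {
		unsigned int dep_offset = offset;	/* start of this entry */
		unsigned int key;

		assert(ents[i].size > 0);
		offset += ents[i].size;			/* start of the next one */

		key = (mode == COOKIE_ENTRY_START) ? dep_offset : offset;
		if (key < wantoff)
			continue;	/* before the desired starting point */

		*pos = key;		/* cookie is recorded before emitting */
		if (emitted == cap)
			return emitted;	/* buffer full, resume from *pos later */
		out[emitted++] = i;
	}

	*pos = UINT_MAX;		/* hypothetical "end of directory" marker */
	return emitted;
}
```

With three entries of sizes 16, 24 and 16 and room for two entries per call, COOKIE_ENTRY_START leaves the cookie at 40, the start of the third entry, while COOKIE_NEXT_START leaves it at 56, past that entry. Resuming from 56 with entry-start semantics emits nothing.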






* Re: 5.5 XFS getdents regression?
  2020-03-11 17:06   ` Rantala, Tommi T. (Nokia - FI/Espoo)
@ 2020-03-11 17:22     ` hch
  2020-03-12  8:09       ` Rantala, Tommi T. (Nokia - FI/Espoo)
  0 siblings, 1 reply; 8+ messages in thread
From: hch @ 2020-03-11 17:22 UTC (permalink / raw)
  To: Rantala, Tommi T. (Nokia - FI/Espoo)
  Cc: hch, david, darrick.wong, linux-xfs, linux-kernel

On Wed, Mar 11, 2020 at 05:06:16PM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> On Wed, 2020-03-11 at 09:14 +1100, Dave Chinner wrote:
> > On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia -
> > FI/Espoo) wrote:
> > > Hello,
> > > 
> > > One of my GitLab CI jobs stopped working after upgrading server
> > > 5.4.18-
> > > 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64.
> > > (tested 5.5.8-100.fc30.x86_64 too, no change)
> > > The server is fedora30 with XFS rootfs.
> > > The problem reproduces always, and takes only couple minutes to run.
> > > 
> > > The CI job fails in the beginning when doing "git clean" in docker
> > > container, and failing to rmdir some directory:
> > > "warning: failed to remove 
> > > .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/in
> > > tern
> > > al/socket: Directory not empty"
> > > 
> > > Quick google search finds some other people reporting similar problems
> > > with 5.5.0:
> > > https://gitlab.com/gitlab-org/gitlab-runner/issues/3185
> > 
> > Which appears to be caused by multiple gitlab processes modifying
> > the directory at the same time. i.e. something is adding an entry to
> > the directory at the same time something is trying to rm -rf it.
> > That's a race condition, and would lead to the exact symptoms you
> > see here, depending on where in the directory the new entry is
> > added.
> 
> OK traced "execve" with strace too, and it shows that it's "git clean
> -ffdx" command (single process) that is being executed in the container,
> which is doing the cleanup.
> 
> Tested with 5.6-rc5, it's failing the same way.
> 
> Spent some time to bisect this, and the problem is introduced by this:
> 
> commit 263dde869bd09b1a709fd92118c7fff832773689
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Fri Nov 8 15:05:32 2019 -0800
> 
>     xfs: cleanup xfs_dir2_block_getdents
>     
>     Use an offset as the main means for iteration, and only do pointer
>     arithmetics to find the data/unused entries.
>     
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
>     Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> 
> 
> 
> Hmmmmm, looking at that commit, I think it slightly changed how the
> "offset" is used compared to how the pointers were used.
> 
> This cures the issue for me, tested (briefly) on top of 5.6-rc5.
> Does it make sense...?
> (Email client probably damages white-space, sorry, I'll send this properly
> signed-off with git-send-email if it's OK)

Thanks, this looks good.  Although I wonder if the slightly different
version below might be a little more elegant?

diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 0d3b640cf1cc..871ec22c9aee 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -147,7 +147,7 @@ xfs_dir2_block_getdents(
 	xfs_off_t		cook;
 	struct xfs_da_geometry	*geo = args->geo;
 	int			lock_mode;
-	unsigned int		offset;
+	unsigned int		offset, next_offset;
 	unsigned int		end;
 
 	/*
@@ -173,9 +173,10 @@ xfs_dir2_block_getdents(
 	 * Loop over the data portion of the block.
 	 * Each object is a real entry (dep) or an unused one (dup).
 	 */
-	offset = geo->data_entry_offset;
 	end = xfs_dir3_data_end_offset(geo, bp->b_addr);
-	while (offset < end) {
+	for (offset = geo->data_entry_offset;
+	     offset < end;
+	     offset = next_offset) {
 		struct xfs_dir2_data_unused	*dup = bp->b_addr + offset;
 		struct xfs_dir2_data_entry	*dep = bp->b_addr + offset;
 		uint8_t filetype;
@@ -184,14 +185,15 @@ xfs_dir2_block_getdents(
 		 * Unused, skip it.
 		 */
 		if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) {
-			offset += be16_to_cpu(dup->length);
+			next_offset = offset + be16_to_cpu(dup->length);
 			continue;
 		}
 
 		/*
 		 * Bump pointer for the next iteration.
 		 */
-		offset += xfs_dir2_data_entsize(dp->i_mount, dep->namelen);
+		next_offset = offset +
+			xfs_dir2_data_entsize(dp->i_mount, dep->namelen);
 
 		/*
 		 * The entry is before the desired starting point, skip it.


* Re: 5.5 XFS getdents regression?
  2020-03-11 17:22     ` hch
@ 2020-03-12  8:09       ` Rantala, Tommi T. (Nokia - FI/Espoo)
  2020-03-12  8:18         ` hch
  0 siblings, 1 reply; 8+ messages in thread
From: Rantala, Tommi T. (Nokia - FI/Espoo) @ 2020-03-12  8:09 UTC (permalink / raw)
  To: hch; +Cc: darrick.wong, david, linux-xfs, linux-kernel

On Wed, 2020-03-11 at 18:22 +0100, hch@lst.de wrote:
> On Wed, Mar 11, 2020 at 05:06:16PM +0000, Rantala, Tommi T. (Nokia -
> FI/Espoo) wrote:
> > On Wed, 2020-03-11 at 09:14 +1100, Dave Chinner wrote:
> > > On Tue, Mar 10, 2020 at 08:45:58AM +0000, Rantala, Tommi T. (Nokia -
> > > FI/Espoo) wrote:
> > > > Hello,
> > > > 
> > > > One of my GitLab CI jobs stopped working after upgrading server
> > > > 5.4.18-
> > > > 100.fc30.x86_64 -> 5.5.7-100.fc30.x86_64.
> > > > (tested 5.5.8-100.fc30.x86_64 too, no change)
> > > > The server is fedora30 with XFS rootfs.
> > > > The problem reproduces always, and takes only couple minutes to
> > > > run.
> > > > 
> > > > The CI job fails in the beginning when doing "git clean" in docker
> > > > container, and failing to rmdir some directory:
> > > > "warning: failed to remove 
> > > > .vendor/pkg/mod/golang.org/x/net@v0.0.0-20200114155413-6afb5195e5aa/in
> > > > tern
> > > > al/socket: Directory not empty"
> > > > 
> > > > Quick google search finds some other people reporting similar
> > > > problems
> > > > with 5.5.0:
> > > > https://gitlab.com/gitlab-org/gitlab-runner/issues/3185
> > > 
> > > Which appears to be caused by multiple gitlab processes modifying
> > > the directory at the same time. i.e. something is adding an entry to
> > > the directory at the same time something is trying to rm -rf it.
> > > That's a race condition, and would lead to the exact symptoms you
> > > see here, depending on where in the directory the new entry is
> > > added.
> > 
> > OK traced "execve" with strace too, and it shows that it's "git clean
> > -ffdx" command (single process) that is being executed in the
> > container,
> > which is doing the cleanup.
> > 
> > Tested with 5.6-rc5, it's failing the same way.
> > 
> > Spent some time to bisect this, and the problem is introduced by this:
> > 
> > commit 263dde869bd09b1a709fd92118c7fff832773689
> > Author: Christoph Hellwig <hch@lst.de>
> > Date:   Fri Nov 8 15:05:32 2019 -0800
> > 
> >     xfs: cleanup xfs_dir2_block_getdents
> >     
> >     Use an offset as the main means for iteration, and only do pointer
> >     arithmetics to find the data/unused entries.
> >     
> >     Signed-off-by: Christoph Hellwig <hch@lst.de>
> >     Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> >     Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > 
> > 
> > 
> > Hmmmmm, looking at that commit, I think it slightly changed how the
> > "offset" is used compared to how the pointers were used.
> > 
> > This cures the issue for me, tested (briefly) on top of 5.6-rc5.
> > Does it make sense...?
> > (Email client probably damages white-space, sorry, I'll send this
> > properly
> > signed-off with git-send-email if it's OK)
> 
> Thanks, this looks good.  Although I wonder if the slightly different
> version below might be a little more elegant?

Yes that's better indeed, thanks!
-Tommi





* Re: 5.5 XFS getdents regression?
  2020-03-12  8:09       ` Rantala, Tommi T. (Nokia - FI/Espoo)
@ 2020-03-12  8:18         ` hch
  0 siblings, 0 replies; 8+ messages in thread
From: hch @ 2020-03-12  8:18 UTC (permalink / raw)
  To: Rantala, Tommi T. (Nokia - FI/Espoo)
  Cc: hch, darrick.wong, david, linux-xfs, linux-kernel

On Thu, Mar 12, 2020 at 08:09:53AM +0000, Rantala, Tommi T. (Nokia - FI/Espoo) wrote:
> > Thanks, this looks good.  Although I wonder if the slightly different
> > version below might be a little more elegant?
> 
> Yes that's better indeed, thanks!

As this is just a slight tweak on all your work, can you submit it with
your signoff and a Fixes: tag?  Thanks for all your work!


