linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Possible readdir regression with BTRFS
@ 2023-09-09  1:07 Ian Johnson
  2023-09-09  7:27 ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Johnson @ 2023-09-09  1:07 UTC (permalink / raw)
  To: linux-btrfs

I was debugging an issue recently and came across what seems to be a
change in the behavior of opendir and readdir in Btrfs seemingly
starting with 9b378f6ad48cfa195ed868db9123c09ee7ec5ea2, specifically
when the contents of a directory are changed between the call to opendir
and the initial call to readdir. The following C program demonstrates
the behavior:

#import <dirent.h>
#import <stdio.h>

int main(void) {
  DIR *dir = opendir("test");

  FILE *file;
  file = fopen("test/1", "w");
  fwrite("1", 1, 1, file);
  fclose(file);

  file = fopen("test/2", "w");
  fwrite("2", 1, 1, file);
  fclose(file);

  /* rewinddir(dir); */

  struct dirent *entry;
  while ((entry = readdir(dir))) {
    printf("%s\n", entry->d_name);
  }
  closedir(dir);
  return 0;
}

Running this program with an initially empty test directory will print
entry 1 but not entry 2 on Btrfs, while on Ext4 it prints both entries.

Evidently this particular result is not an issue, as POSIX states:

 > If a file is removed from or added to the directory after the most
 > recent call to opendir() or rewinddir(), whether a subsequent call to
 > readdir_r() returns an entry for that file is unspecified.

However, the same program with rewinddir uncommented above yields the
same behavior, seemingly in contradiction of POSIX:

 > It [rewinddir] shall also cause the directory stream to refer to the
 > current state of the corresponding directory, as a call to opendir()
 > would have done.

I know that the functions used here are part of the C library and not
the kernel, but given this seems to be a recent change, I wanted to
bring this up. I looked over the man pages for getdents and lseek and
didn't see any mention of guarantees (or lack thereof) in the face of
concurrent directory updates, so it's quite possible that this is within
the realm of expected/allowed behavior.




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible readdir regression with BTRFS
  2023-09-09  1:07 Possible readdir regression with BTRFS Ian Johnson
@ 2023-09-09  7:27 ` Filipe Manana
  2023-09-09 11:52   ` Evangelos Foutras
  2023-09-09 12:16   ` Filipe Manana
  0 siblings, 2 replies; 7+ messages in thread
From: Filipe Manana @ 2023-09-09  7:27 UTC (permalink / raw)
  To: Ian Johnson; +Cc: linux-btrfs

On Fri, Sep 08, 2023 at 09:07:10PM -0400, Ian Johnson wrote:
> I was debugging an issue recently and came across what seems to be a
> change in the behavior of opendir and readdir in Btrfs seemingly
> starting with 9b378f6ad48cfa195ed868db9123c09ee7ec5ea2, specifically
> when the contents of a directory are changed between the call to opendir
> and the initial call to readdir. The following C program demonstrates
> the behavior:
> 
> #import <dirent.h>
> #import <stdio.h>
> 
> int main(void) {
>  DIR *dir = opendir("test");
> 
>  FILE *file;
>  file = fopen("test/1", "w");
>  fwrite("1", 1, 1, file);
>  fclose(file);
> 
>  file = fopen("test/2", "w");
>  fwrite("2", 1, 1, file);
>  fclose(file);
> 
>  /* rewinddir(dir); */
> 
>  struct dirent *entry;
>  while ((entry = readdir(dir))) {
>    printf("%s\n", entry->d_name);
>  }
>  closedir(dir);
>  return 0;
> }
> 
> Running this program with an initially empty test directory will print
> entry 1 but not entry 2 on Btrfs, while on Ext4 it prints both entries.
> 
> Evidently this particular result is not an issue, as POSIX states:
> 
> > If a file is removed from or added to the directory after the most
> > recent call to opendir() or rewinddir(), whether a subsequent call to
> > readdir_r() returns an entry for that file is unspecified.

Yes, I can modify it to be more "logical" and don't return any new entry
after the opendir() call. As it is now, it always returns only the first
new entry after opendir().

> 
> However, the same program with rewinddir uncommented above yields the
> same behavior, seemingly in contradiction of POSIX:
> 
> > It [rewinddir] shall also cause the directory stream to refer to the
> > current state of the corresponding directory, as a call to opendir()
> > would have done.

That seems to make sense.
I'll prepare a patch to ensure that behaviour and let you know when it's
ready for you to test.

Thanks.

> 
> I know that the functions used here are part of the C library and not
> the kernel, but given this seems to be a recent change, I wanted to
> bring this up. I looked over the man pages for getdents and lseek and
> didn't see any mention of guarantees (or lack thereof) in the face of
> concurrent directory updates, so it's quite possible that this is within
> the realm of expected/allowed behavior.
> 
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible readdir regression with BTRFS
  2023-09-09  7:27 ` Filipe Manana
@ 2023-09-09 11:52   ` Evangelos Foutras
  2023-09-09 12:18     ` Filipe Manana
  2023-09-09 12:16   ` Filipe Manana
  1 sibling, 1 reply; 7+ messages in thread
From: Evangelos Foutras @ 2023-09-09 11:52 UTC (permalink / raw)
  To: Filipe Manana, Ian Johnson; +Cc: linux-btrfs

Hi Filipe,

Please be aware that this bug might not be as harmless as it seems. I'm 
not sure if the fix you're preparing would also fix an issue we saw at 
Arch Linux but I thought I'd mention it here.

We have a package repository server with 4x10 TB drives in RAID10 (btrfs 
only, no mdadm). On multiple mirrors syncing from it we have seen rsync 
occasionally delete ~4 small (<10 MB) files that get frequently updated 
by renaming temporary files into them. This only happened with 6.4.12 
and went away after going back to 6.4.10 (the former had the commit Ian 
mentioned).

Unfortunately I don't have a reproducer for this. I can only describe 
what our repo-add script does and how rsync behaves during problematic 
syncs.

Our repo-add script frequently adds packages to the extra repo by doing:

   ln -f extra.db.tar.gz extra.db.tar.gz.old
   mv .tmp.extra.db.tar.gz extra.db.tar.gz

And the same for extra.files.tar.gz:

   ln -f extra.files.tar.gz extra.files.tar.gz.old
   mv .tmp.extra.files.tar.gz extra.files.tar.gz

While the server was running Linux 6.4.12, rsync on some mirrors would 
occasionally (3-4 times in the day) delete these files:

   deleting extra/os/x86_64/extra.files.tar.gz.old
   deleting extra/os/x86_64/extra.files.tar.gz
   deleting extra/os/x86_64/extra.db.tar.gz.old
   deleting extra/os/x86_64/extra.db.tar.gz

Since renames are atomic, I would expect this scenario to never happen.

Again, sorry for not being able to provide a proper reproducer like Ian; 
there is probably some timing interaction with how rsync does directory 
scanning and repo-add updating the directory entry during this time.

[1] 
https://gitlab.archlinux.org/pacman/pacman/-/blob/v6.0.2/scripts/repo-add.sh.in#L473

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible readdir regression with BTRFS
  2023-09-09  7:27 ` Filipe Manana
  2023-09-09 11:52   ` Evangelos Foutras
@ 2023-09-09 12:16   ` Filipe Manana
  2023-09-09 19:27     ` Ian Johnson
  1 sibling, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2023-09-09 12:16 UTC (permalink / raw)
  To: Ian Johnson; +Cc: linux-btrfs

On Sat, Sep 09, 2023 at 08:27:03AM +0100, Filipe Manana wrote:
> On Fri, Sep 08, 2023 at 09:07:10PM -0400, Ian Johnson wrote:
> > I was debugging an issue recently and came across what seems to be a
> > change in the behavior of opendir and readdir in Btrfs seemingly
> > starting with 9b378f6ad48cfa195ed868db9123c09ee7ec5ea2, specifically
> > when the contents of a directory are changed between the call to opendir
> > and the initial call to readdir. The following C program demonstrates
> > the behavior:
> > 
> > #import <dirent.h>
> > #import <stdio.h>
> > 
> > int main(void) {
> >  DIR *dir = opendir("test");
> > 
> >  FILE *file;
> >  file = fopen("test/1", "w");
> >  fwrite("1", 1, 1, file);
> >  fclose(file);
> > 
> >  file = fopen("test/2", "w");
> >  fwrite("2", 1, 1, file);
> >  fclose(file);
> > 
> >  /* rewinddir(dir); */
> > 
> >  struct dirent *entry;
> >  while ((entry = readdir(dir))) {
> >    printf("%s\n", entry->d_name);
> >  }
> >  closedir(dir);
> >  return 0;
> > }
> > 
> > Running this program with an initially empty test directory will print
> > entry 1 but not entry 2 on Btrfs, while on Ext4 it prints both entries.
> > 
> > Evidently this particular result is not an issue, as POSIX states:
> > 
> > > If a file is removed from or added to the directory after the most
> > > recent call to opendir() or rewinddir(), whether a subsequent call to
> > > readdir_r() returns an entry for that file is unspecified.
> 
> Yes, I can modify it to be more "logical" and don't return any new entry
> after the opendir() call. As it is now, it always returns only the first
> new entry after opendir().
> 
> > 
> > However, the same program with rewinddir uncommented above yields the
> > same behavior, seemingly in contradiction of POSIX:
> > 
> > > It [rewinddir] shall also cause the directory stream to refer to the
> > > current state of the corresponding directory, as a call to opendir()
> > > would have done.
> 
> That seems to make sense.
> I'll prepare a patch to ensure that behaviour and let you know when it's
> ready for you to test.

Here it is (I CC'ed you, but just to link the thread):

https://lore.kernel.org/linux-btrfs/cover.1694260751.git.fdmanana@suse.com/

> 
> Thanks.
> 
> > 
> > I know that the functions used here are part of the C library and not
> > the kernel, but given this seems to be a recent change, I wanted to
> > bring this up. I looked over the man pages for getdents and lseek and
> > didn't see any mention of guarantees (or lack thereof) in the face of
> > concurrent directory updates, so it's quite possible that this is within
> > the realm of expected/allowed behavior.
> > 
> > 
> > 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible readdir regression with BTRFS
  2023-09-09 11:52   ` Evangelos Foutras
@ 2023-09-09 12:18     ` Filipe Manana
  2023-09-11 11:56       ` Filipe Manana
  0 siblings, 1 reply; 7+ messages in thread
From: Filipe Manana @ 2023-09-09 12:18 UTC (permalink / raw)
  To: Evangelos Foutras; +Cc: Ian Johnson, linux-btrfs

On Sat, Sep 09, 2023 at 02:52:19PM +0300, Evangelos Foutras wrote:
> Hi Filipe,
> 
> Please be aware that this bug might not be as harmless as it seems. I'm not
> sure if the fix you're preparing would also fix an issue we saw at Arch
> Linux but I thought I'd mention it here.
> 
> We have a package repository server with 4x10 TB drives in RAID10 (btrfs
> only, no mdadm). On multiple mirrors syncing from it we have seen rsync
> occasionally delete ~4 small (<10 MB) files that get frequently updated by
> renaming temporary files into them. This only happened with 6.4.12 and went
> away after going back to 6.4.10 (the former had the commit Ian mentioned).
> 
> Unfortunately I don't have a reproducer for this. I can only describe what
> our repo-add script does and how rsync behaves during problematic syncs.
> 
> Our repo-add script frequently adds packages to the extra repo by doing:
> 
>   ln -f extra.db.tar.gz extra.db.tar.gz.old
>   mv .tmp.extra.db.tar.gz extra.db.tar.gz
> 
> And the same for extra.files.tar.gz:
> 
>   ln -f extra.files.tar.gz extra.files.tar.gz.old
>   mv .tmp.extra.files.tar.gz extra.files.tar.gz
> 
> While the server was running Linux 6.4.12, rsync on some mirrors would
> occasionally (3-4 times in the day) delete these files:
> 
>   deleting extra/os/x86_64/extra.files.tar.gz.old
>   deleting extra/os/x86_64/extra.files.tar.gz
>   deleting extra/os/x86_64/extra.db.tar.gz.old
>   deleting extra/os/x86_64/extra.db.tar.gz
> 
> Since renames are atomic, I would expect this scenario to never happen.
> 
> Again, sorry for not being able to provide a proper reproducer like Ian;
> there is probably some timing interaction with how rsync does directory
> scanning and repo-add updating the directory entry during this time.

No worries, I've just sent a patchset with 2 patches:

https://lore.kernel.org/linux-btrfs/cover.1694260751.git.fdmanana@suse.com/

I've only seen your message after sending it, but I think the first patch
should fix what you are seeing.

Thanks.

> 
> [1] https://gitlab.archlinux.org/pacman/pacman/-/blob/v6.0.2/scripts/repo-add.sh.in#L473

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible readdir regression with BTRFS
  2023-09-09 12:16   ` Filipe Manana
@ 2023-09-09 19:27     ` Ian Johnson
  0 siblings, 0 replies; 7+ messages in thread
From: Ian Johnson @ 2023-09-09 19:27 UTC (permalink / raw)
  To: Filipe Manana; +Cc: linux-btrfs


On Sat, Sep 9 2023 at 01:16:15 PM +01:00:00, Filipe Manana 
<fdmanana@kernel.org> wrote:
> Here it is (I CC'ed you, but just to link the thread):
> 
> https://lore.kernel.org/linux-btrfs/cover.1694260751.git.fdmanana@suse.com/

Thank you! I've applied the patchset at my end and confirmed that it
works both for my sample program (prints neither 1 nor 2 if rewinddir is
not used, and prints both if it is) and for the other project I was
originally debugging.



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Possible readdir regression with BTRFS
  2023-09-09 12:18     ` Filipe Manana
@ 2023-09-11 11:56       ` Filipe Manana
  0 siblings, 0 replies; 7+ messages in thread
From: Filipe Manana @ 2023-09-11 11:56 UTC (permalink / raw)
  To: Evangelos Foutras; +Cc: Ian Johnson, linux-btrfs

On Sat, Sep 09, 2023 at 01:18:49PM +0100, Filipe Manana wrote:
> On Sat, Sep 09, 2023 at 02:52:19PM +0300, Evangelos Foutras wrote:
> > Hi Filipe,
> > 
> > Please be aware that this bug might not be as harmless as it seems. I'm not
> > sure if the fix you're preparing would also fix an issue we saw at Arch
> > Linux but I thought I'd mention it here.
> > 
> > We have a package repository server with 4x10 TB drives in RAID10 (btrfs
> > only, no mdadm). On multiple mirrors syncing from it we have seen rsync
> > occasionally delete ~4 small (<10 MB) files that get frequently updated by
> > renaming temporary files into them. This only happened with 6.4.12 and went
> > away after going back to 6.4.10 (the former had the commit Ian mentioned).
> > 
> > Unfortunately I don't have a reproducer for this. I can only describe what
> > our repo-add script does and how rsync behaves during problematic syncs.
> > 
> > Our repo-add script frequently adds packages to the extra repo by doing:
> > 
> >   ln -f extra.db.tar.gz extra.db.tar.gz.old
> >   mv .tmp.extra.db.tar.gz extra.db.tar.gz
> > 
> > And the same for extra.files.tar.gz:
> > 
> >   ln -f extra.files.tar.gz extra.files.tar.gz.old
> >   mv .tmp.extra.files.tar.gz extra.files.tar.gz
> > 
> > While the server was running Linux 6.4.12, rsync on some mirrors would
> > occasionally (3-4 times in the day) delete these files:
> > 
> >   deleting extra/os/x86_64/extra.files.tar.gz.old
> >   deleting extra/os/x86_64/extra.files.tar.gz
> >   deleting extra/os/x86_64/extra.db.tar.gz.old
> >   deleting extra/os/x86_64/extra.db.tar.gz
> > 
> > Since renames are atomic, I would expect this scenario to never happen.
> > 
> > Again, sorry for not being able to provide a proper reproducer like Ian;
> > there is probably some timing interaction with how rsync does directory
> > scanning and repo-add updating the directory entry during this time.
> 
> No worries, I've just sent a patchset with 2 patches:
> 
> https://lore.kernel.org/linux-btrfs/cover.1694260751.git.fdmanana@suse.com/
> 
> I've only seen your message after sending it, but I think the first patch
> should fix what you are seeing.

Ok, so I took a more detailed look at your issue this morning.
It's unrelated to what Ian reported, as rsync doesn't even use rewinddir(3).

Here's what I think it's happening (speculating a bit about how rsync
works):

1) rsync uses opendir() and readdir() to iterate over the source and
   destination (backup) directories, to obtain a list of files in each;

2) While it's iterating the source directory, the renames and "ln -f"
   happen. Because of this the readdir() calls don't return neither the
   old file names neither the new ones.

   This is because when opendir() is called, btrfs gets the index of the
   last (most recently added) directory entry, and then never iterates
   beyond that index in readdir() calls - this behaviour was introduced
   in commit 9b378f6ad48c ("btrfs: fix infinite directory reads"), and it's
   intentional to prevent readdir() never finishing while renames (or new
   files added) inside the directory are happening.

   On a rename, the new file name is assigned a new index number, larger
   then the last one we had when openddir() was called. It's effectively
   like removing an entry from the directory and adding a new one.
   The same goes for a 'ln -f' - if the destination name exists, it is
   removed and then the name is added again but pointing to a different
   inode (and with a higher index number);

3) As rsync sees that one of those renamed files is in the destination
   directory but not on the source directory, it deletes those files from
   the destination.

Looking at readdir() requirements from POSIX we have:

  "If a file is removed from or added to the directory after the most recent call
   to opendir() or rewinddir(), whether a subsequent call to readdir() returns an
   entry for that file is unspecified."

  (from https://pubs.opengroup.org/onlinepubs/007904875/functions/readdir_r.html)

Yes, renames are not explicit there, even though they are like adding a new name
and removing another one. So googling around, to be extra sure, I found this old
thread from Ted (ext4 maintainer):

   https://yarchive.net/comp/linux/readdir_nonatomicity.html

Where the most important part is this:

   "This is not a bug; the POSIX specification explicitly allows this
    behavior.  If a filename is renamed during a readdir() session of a
    directory, it is undefined where that neither, either, or both of the
    new and old filenames will be returned."

So from a POSIX point of view, we are not doing anything wrong after that commit
in btrfs. So my advise is to not have rsync running while the renames and "ln -f"
are happening.

I understand this may be a bummer as some applications may be relying on the old
behaviour that happened to guarantee that at least the new file names would always
be visible in readdir() calls, but that effectively was due to a bug in btrfs that
caused infinite directory reads as mentioned in that commit and the linked thread.
As if new indexes were added after opendir(), we would always read and return them.

We could change opendir() to allow reading up to the current last index plus N,
where N may be 10 or 100 for example, so that in the case of concurrent renames we
would still (very likely but no guaranteed) at least return the new names - but
not only that is not required by POSIX it would also not be always reliable - what
if we have N + 1 renames? Then file name N + 1 would still be not returned, or
if N new files are added and then rename happens at N + 1, we would also not return
it - i.e., it would never be reliable and it would be a hack - it would encourage
applications to rely on a behaviour that can not always be guaranteed.

Thanks.

> 
> Thanks.
> 
> > 
> > [1] https://gitlab.archlinux.org/pacman/pacman/-/blob/v6.0.2/scripts/repo-add.sh.in#L473

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-09-11 22:07 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-09  1:07 Possible readdir regression with BTRFS Ian Johnson
2023-09-09  7:27 ` Filipe Manana
2023-09-09 11:52   ` Evangelos Foutras
2023-09-09 12:18     ` Filipe Manana
2023-09-11 11:56       ` Filipe Manana
2023-09-09 12:16   ` Filipe Manana
2023-09-09 19:27     ` Ian Johnson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).