* [PATCH v2 0/4] Make overlayfs volatile mounts reusable
@ 2020-11-27  9:20 Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
  ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27 9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

This adds some documentation on how the current behaviour of overlayfs
works, and adds some non-obvious caveats to the documentation. We may want
to pick those up independently.

The volatile option is great for "ephemeral" containers. Unfortunately, it
doesn't capture all uses. There are two ways to use it safely right now:
 1. Throw away the entire upperdir between mounts
 2. Manually syncfs between mounts

For certain use-cases like serverless, or short-lived containers, it is
advantageous to be able to stop the container (runtime) and start it up on
demand / invocation of the function. Usually, there is some bootstrap
process which involves downloading some artifacts, or putting secrets on
disk, and then upon invocation of the function, you want to (re)start the
container. If you have to syncfs every time you do this, it can lead to
excess filesystem overhead for all of the other containers on the machine,
and stall out every container whose upperdir is on the same underlying
filesystem, unless your filesystem offers something like subvolumes and
sync can be restricted to a subvolume.

The kernel has information that it can use to determine whether or not
this is safe -- primarily whether the underlying FS has had writeback
errors or not. Overlayfs doesn't delay writes, so the consistency of the
upperdir is not contingent on the mount of overlayfs, but rather the mount
of the underlying filesystem. It can also make sure the underlying
filesystem wasn't remounted.
Although it was suggested that we derive this information from the
upperdir's inode[1], we can checkpoint this data on disk in an xattr.
Specifically, we checkpoint:

 * Superblock "id": This is a new concept introduced in one of the patches
   which keeps track of (re)mounts of filesystems, by having a per-boot
   monotonically increasing integer identifying the superblock. This is
   safer than trying to obfuscate the pointer and putting it into an xattr
   (due to leak risk, and address reuse), and during the course of a boot,
   the u64 should not wrap.
 * Overlay "boot id": This is a new UUID that is overlayfs specific, as
   overlayfs is a module that's independent from the rest of the system
   and can be (re)loaded independently -- thus it generates a UUID at load
   time which can be used to uniquely identify it.
 * upperdir / workdir errseq: A sample of the errseq_t on the workdir /
   upperdir's superblock. Since the errseq_t is implemented as a u32 with
   errno + error counter, we can safely store it in a checkpoint.

This has to be done in kernelspace because userspace does not have access
to information such as superblock (re)mounts, the writeback errseq value,
and whether the module has been (re)loaded.

Although this mechanism only determines a subset of the error scenarios,
it lays out the groundwork for adding more. In the future, we may want to
make it so overlayfs shuts down on errors, or have some kind of other
behaviour when errors are detected. If we want, a per-file (on the
upperdir) piece of metadata can be added indicating the errseq_t for just
that file, and we can check it on each close / open, and then allow the
user to make an intelligent decision about how they want overlayfs to
handle the error. This would allow for errors to be captured during normal
operation as opposed to just between remounts.

This also allows for:
volatile -> volatile
non-volatile -> volatile
non-volatile -> non-volatile

But, for now, it prevents volatile -> non-volatile.
Perhaps the future patches around selective inode sync and tracking errors
on every interaction will make this possible.

Changes since:
V1 [2]:
 * Add documentation commit about current shortcomings of the volatile
   implementation
 * Do not allow volatile mounts to be mounted as non-volatile
RFC [3]:
 * Changed data structure names
 * No longer delete the volatile dir / dirty file
 * Moved xattr to volatile directory
 * Split errseq check

[1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhadzC3-kh-igfxv3pAmC3ocDtAQTxByu4hrn8KtZuieQ@mail.gmail.com/
[2]: https://lore.kernel.org/linux-unionfs/20201125104621.18838-1-sargun@sargun.me/T/#t
[3]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#t

Sargun Dhillon (4):
  fs: Add s_instance_id field to superblock for unique identification
  overlay: Document current outstanding shortcoming of volatile
  overlay: Add the ability to remount volatile directories when safe
  overlay: Add rudimentary checking of writeback errseq on volatile
    remount

 Documentation/filesystems/overlayfs.rst |  24 ++++-
 fs/overlayfs/overlayfs.h                |  38 ++++++++-
 fs/overlayfs/readdir.c                  | 104 +++++++++++++++++++++---
 fs/overlayfs/super.c                    |  74 +++++++++++++----
 fs/overlayfs/util.c                     |   2 +
 fs/super.c                              |   3 +
 include/linux/fs.h                      |   7 ++
 7 files changed, 213 insertions(+), 39 deletions(-)


base-commit: be4df0cea08a8b59eb38d73de988b7ba8022df41
--
2.25.1

^ permalink raw reply	[flat|nested] 31+ messages in thread
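[Editor's note: the errseq_t sampling that this series checkpoints is compact enough to model in userspace. The sketch below is a simplified, illustrative rendition of the kernel's lib/errseq.c -- the real bit layout, SEEN handling, and memory ordering differ in details -- showing why a stored sample can later detect writeback errors, including errors that predate the sample but were never observed:]

```c
/* Simplified userspace model of errseq_t (see lib/errseq.c for the real
 * thing).  Assumed layout for this sketch: low 12 bits hold the errno
 * value, bit 12 is a "seen" flag, and the bits above it are a counter. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef uint32_t errseq_t;

#define ERRSEQ_ERRNO_MASK 0xfffu   /* low 12 bits: errno value */
#define ERRSEQ_SEEN       0x1000u  /* someone has observed this error */
#define ERRSEQ_CTR_INC    0x2000u  /* counter lives above the seen bit */

/* Record a writeback error (err is a negative errno).  The counter is
 * only bumped if the previous error was seen, so unseen errors are
 * never silently replaced by an identical-looking value. */
static void errseq_set(errseq_t *eseq, int err)
{
	errseq_t old = *eseq;
	errseq_t new = (old & ~(ERRSEQ_ERRNO_MASK | ERRSEQ_SEEN)) |
		       (uint32_t)(-err & ERRSEQ_ERRNO_MASK);
	if (old & ERRSEQ_SEEN)
		new += ERRSEQ_CTR_INC;
	*eseq = new;
}

/* Take a cursor.  If the current error was never observed, return 0 so
 * that a later check still reports it as new. */
static errseq_t errseq_sample(errseq_t *eseq)
{
	errseq_t old = *eseq;

	return (old & ERRSEQ_SEEN) ? old : 0;
}

/* Report an error if anything changed since @since, and advance the
 * cursor so the same error is only reported once per observer. */
static int errseq_check_and_advance(errseq_t *eseq, errseq_t *since)
{
	errseq_t cur = *eseq;
	int err = 0;

	if (cur != *since) {
		err = -(int)(cur & ERRSEQ_ERRNO_MASK);
		*eseq = cur | ERRSEQ_SEEN;	/* mark as observed */
		*since = *eseq;
	}
	return err;
}
```

Because the counter bump happens in errseq_set(), two independent observers (e.g. two processes that opened the same file) each see the error exactly once, which is the property the syncfs discussion later in this thread relies on.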
* [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27 9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

This assigns a per-boot unique number to each superblock. This allows
other components to know whether a filesystem has been remounted since
they last interacted with it. At every boot it is reset to 0. There is no
specific reason it is set to 0, other than repeatability versus using some
random starting number. Because of this, you must store it alongside some
other piece of data which is initialized at boot time.

This doesn't have any of the overhead of idr, and a u64 won't wrap any
time soon. There is no forward lookup requirement, so an idr is not
needed.

In the future, we may want to expose this to userspace. Userspace programs
can benefit from this if they have large chunks of dirty or mmapped memory
that they're interacting with, and they want to see if that volume has
been unmounted and remounted. Along with a mechanism to inspect the
superblock's errseq, a user can determine whether they need to throw away
their cache or similar. This is another benefit in comparison to just
using a pointer to the superblock to uniquely identify it. Although this
doesn't expose an ioctl or similar yet, in the future we could add an
ioctl that allows for fetching the s_instance_id for a given cache, and
inspection of the errseq associated with that.
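[Editor's note: the "store it alongside some other piece of data which is initialized at boot time" requirement can be sketched concretely. The names and struct layout below are illustrative, not from the patches; the idea is that a checkpoint of (boot-time UUID, s_instance_id) is only trustworthy if both values still match:]

```c
/* Hypothetical consumer of s_instance_id.  Since the counter restarts
 * at 0 every boot, it must be paired with a value regenerated at boot
 * (here, a module-load UUID) to be meaningful across reboots. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct ovl_checkpoint {
	uint8_t  boot_id[16];	/* UUID generated at module load */
	uint64_t instance_id;	/* s_instance_id at checkpoint time */
};

/* The checkpoint only proves "same superblock instance" if neither the
 * module nor the superblock has been reinstantiated since it was taken. */
static bool checkpoint_matches(const struct ovl_checkpoint *cp,
			       const uint8_t cur_boot_id[16],
			       uint64_t cur_instance_id)
{
	return memcmp(cp->boot_id, cur_boot_id, 16) == 0 &&
	       cp->instance_id == cur_instance_id;
}
```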
Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
---
 fs/super.c         | 3 +++
 include/linux/fs.h | 7 +++++++
 2 files changed, 10 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 904459b35119..e47ace7f8c3d 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -42,6 +42,7 @@
 static int thaw_super_locked(struct super_block *sb);

+static u64 s_instance_id_counter;
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);

@@ -546,6 +547,7 @@ struct super_block *sget_fc(struct fs_context *fc,
 	s->s_iflags |= fc->s_iflags;
 	strlcpy(s->s_id, s->s_type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
+	s->s_instance_id = s_instance_id_counter++;
 	hlist_add_head(&s->s_instances, &s->s_type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(s->s_type);
@@ -625,6 +627,7 @@ struct super_block *sget(struct file_system_type *type,
 	s->s_type = type;
 	strlcpy(s->s_id, type->name, sizeof(s->s_id));
 	list_add_tail(&s->s_list, &super_blocks);
+	s->s_instance_id = s_instance_id_counter++;
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(type);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7519ae003a08..09bf54382a54 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1472,6 +1472,13 @@ struct super_block {
 	char			s_id[32];	/* Informational name */
 	uuid_t			s_uuid;		/* UUID */

+	/*
+	 * ID identifying this particular instance of the superblock. It can
+	 * be used to determine if a particular filesystem has been remounted.
+	 * It may be exposed to userspace.
+	 */
+	u64			s_instance_id;
+
 	unsigned int		s_max_links;
 	fmode_t			s_mode;
--
2.25.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-27 12:52   ` Amir Goldstein
  2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
  3 siblings, 1 reply; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27 9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

This documents behaviour that was discussed in a thread about the volatile
feature. Specifically, how failures can go hidden from asynchronous writes
(such as from mmap, or writes that are not immediately flushed to the
filesystem). Although we pass through calls like msync, fallocate, and
write, and will still return errors on those, it doesn't guarantee all
kinds of errors will happen at those times, and thus may hide errors.

In the future, we can add error checking to all interactions with the
upperdir, and pass through errseq_t from the upperdir on mappings, and
other interactions with the filesystem[1].
[1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-unionfs@vger.kernel.org
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
---
 Documentation/filesystems/overlayfs.rst | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 580ab9a0fe31..c6e30c1bc2f2 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -570,7 +570,11 @@ Volatile mount
 This is enabled with the "volatile" mount option. Volatile mounts are not
 guaranteed to survive a crash. It is strongly recommended that volatile
 mounts are only used if data written to the overlay can be recreated
-without significant effort.
+without significant effort. In addition to this, the sync family of syscalls
+are not sufficient to determine whether a write failed as sync calls are
+omitted. For this reason, it is important that the filesystem used by the
+upperdir handles failure in a fashion that's suitable for the user. For
+example, upon detecting a fault, ext4 can be configured to panic.

 The advantage of mounting with the "volatile" option is that all forms of
 sync calls to the upper filesystem are omitted.
--
2.25.1

^ permalink raw reply related	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
@ 2020-11-27 12:52   ` Amir Goldstein
  2020-11-27 22:11     ` Sargun Dhillon
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2020-11-27 12:52 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
>
> This documents behaviour that was discussed in a thread about the volatile
> feature. Specifically, how failures can go hidden from asynchronous writes
> (such as from mmap, or writes that are not immediately flushed to the
> filesystem). Although we pass through calls like msync, fallocate, and
> write, and will still return errors on those, it doesn't guarantee all
> kinds of errors will happen at those times, and thus may hide errors.
>
> In the future, we can add error checking to all interactions with the
> upperdir, and pass through errseq_t from the upperdir on mappings,
> and other interactions with the filesystem[1].
>
> [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
>
> Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> Cc: linux-fsdevel@vger.kernel.org
> Cc: linux-unionfs@vger.kernel.org
> Cc: Miklos Szeredi <miklos@szeredi.hu>
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Vivek Goyal <vgoyal@redhat.com>
> ---
>  Documentation/filesystems/overlayfs.rst | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> index 580ab9a0fe31..c6e30c1bc2f2 100644
> --- a/Documentation/filesystems/overlayfs.rst
> +++ b/Documentation/filesystems/overlayfs.rst
> @@ -570,7 +570,11 @@ Volatile mount
>  This is enabled with the "volatile" mount option. Volatile mounts are not
>  guaranteed to survive a crash. It is strongly recommended that volatile
>  mounts are only used if data written to the overlay can be recreated
> -without significant effort.
> +without significant effort. In addition to this, the sync family of syscalls
> +are not sufficient to determine whether a write failed as sync calls are
> +omitted. For this reason, it is important that the filesystem used by the
> +upperdir handles failure in a fashion that's suitable for the user. For
> +example, upon detecting a fault, ext4 can be configured to panic.
>

Reading this now, I think I may have wrongly analysed the issue.
Specifically, when I wrote that the very minimum is to document the issue,
it was under the assumption that a proper fix is hard. I think I was wrong
and that the very minimum is to check for errseq since mount on the fsync
and syncfs calls.

Why? First of all, because it is very very simple and goes a long way to
fix the broken contract with applications -- not the contract about
durability, obviously, but the contract that write+fsync+read expects to
find the written data (during the same mount era).
Second, because the sentence you added above is hard for users to
understand and out of context. If we properly handle the writeback error
in fsync/syncfs, then this sentence becomes irrelevant. The fact that ext4
can lose data if the application did not fsync after write is true for
volatile as well as non-volatile, and it is therefore not relevant in the
context of overlayfs documentation at all.

Am I wrong saying that it is very very simple to fix? Would you mind
making that fix at the bottom of the patch series, so it can be easily
applied to stable kernels?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27 12:52 ` Amir Goldstein
@ 2020-11-27 22:11   ` Sargun Dhillon
  2020-11-28  2:01     ` Jeff Layton
  2020-11-28  8:56     ` Amir Goldstein
  0 siblings, 2 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27 22:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells,
	Jeff Layton

On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> >
> > This documents behaviour that was discussed in a thread about the volatile
> > feature. Specifically, how failures can go hidden from asynchronous writes
> > (such as from mmap, or writes that are not immediately flushed to the
> > filesystem). Although we pass through calls like msync, fallocate, and
> > write, and will still return errors on those, it doesn't guarantee all
> > kinds of errors will happen at those times, and thus may hide errors.
> >
> > In the future, we can add error checking to all interactions with the
> > upperdir, and pass through errseq_t from the upperdir on mappings,
> > and other interactions with the filesystem[1].
> >
> > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> >
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > Cc: linux-fsdevel@vger.kernel.org
> > Cc: linux-unionfs@vger.kernel.org
> > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > Cc: Amir Goldstein <amir73il@gmail.com>
> > Cc: Vivek Goyal <vgoyal@redhat.com>
> > ---
> >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > --- a/Documentation/filesystems/overlayfs.rst
> > +++ b/Documentation/filesystems/overlayfs.rst
> > @@ -570,7 +570,11 @@ Volatile mount
> >  This is enabled with the "volatile" mount option. Volatile mounts are not
> >  guaranteed to survive a crash. It is strongly recommended that volatile
> >  mounts are only used if data written to the overlay can be recreated
> > -without significant effort.
> > +without significant effort. In addition to this, the sync family of syscalls
> > +are not sufficient to determine whether a write failed as sync calls are
> > +omitted. For this reason, it is important that the filesystem used by the
> > +upperdir handles failure in a fashion that's suitable for the user. For
> > +example, upon detecting a fault, ext4 can be configured to panic.
> >
>
> Reading this now, I think I may have wrongly analysed the issue.
> Specifically, when I wrote that the very minimum is to document the
> issue, it was under the assumption that a proper fix is hard.
> I think I was wrong and that the very minimum is to check for errseq
> since mount on the fsync and syncfs calls.
>
> Why?
> First of all, because it is very very simple and goes a long way to
> fix the broken contract with applications, not the contract about durability
> obviously, but the contract that write+fsync+read expects to find the written
> data (during the same mount era).
>
> Second, because the sentence you added above is hard for users to understand
> and out of context. If we properly handle the writeback error in fsync/syncfs,
> then this sentence becomes irrelevant.
> The fact that ext4 can lose data if the application did not fsync after
> write is true for volatile as well as non-volatile, and it is therefore
> not relevant in the context of overlayfs documentation at all.
>
> Am I wrong saying that it is very very simple to fix?
> Would you mind making that fix at the bottom of the patch series, so it can
> be easily applied to stable kernels?
>
> Thanks,
> Amir.

I'm not sure it's so easy. In VFS, there are places where the superblock's
errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not
the errseq of the corresponding real file's superblock. One solution might
be that as part of all these callbacks we set our errseq to the errseq of
the filesystem that the upperdir is on, and then rely on VFS's checking.

I'm having a hard time figuring out how to deal with the per-mapping based
error tracking. It seems like this infrastructure was only partially
completed by Jeff Layton[2]. I don't know how it's actually supposed to
work right now, as not all of his patches landed.

How about I split this into two patchsets? One, where I add the logic to
copy the errseq on callbacks to fsync from the upperdir to the ovl fs
superblock, thus allowing VFS to bubble up errors, and the documentation.
We can CC stable on those because I think it has an effect that's
universal across all filesystems.

P.S. I notice you maintain overlay tests outside of the kernel.
Unfortunately, I think for this kind of test, it requires in-kernel code
to artificially bump the writeback error count on the upperdir, or it
requires the failure injection infrastructure. Simulating this behaviour
is non-trivial without in-kernel support:

P1: Open(f) -> p1.fd
P2: Open(f) -> p2.fd
P1: syncfs(p1.fd) -> -errno
P2: syncfs(p2.fd) -> 0 # This should return an error

[1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
[2]: https://lwn.net/Articles/722250/

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-27 22:11 ` Sargun Dhillon
@ 2020-11-28  2:01   ` Jeff Layton
  2020-11-28  4:45     ` Sargun Dhillon
  2020-11-28  8:56     ` Amir Goldstein
  1 sibling, 1 reply; 31+ messages in thread
From: Jeff Layton @ 2020-11-28 2:01 UTC (permalink / raw)
  To: Sargun Dhillon, Amir Goldstein
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells

On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > >
> > > This documents behaviour that was discussed in a thread about the volatile
> > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > (such as from mmap, or writes that are not immediately flushed to the
> > > filesystem). Although we pass through calls like msync, fallocate, and
> > > write, and will still return errors on those, it doesn't guarantee all
> > > kinds of errors will happen at those times, and thus may hide errors.
> > >
> > > In the future, we can add error checking to all interactions with the
> > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > and other interactions with the filesystem[1].
> > >
> > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > >
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > --- a/Documentation/filesystems/overlayfs.rst
> > > +++ b/Documentation/filesystems/overlayfs.rst
> > > @@ -570,7 +570,11 @@ Volatile mount
> > >  This is enabled with the "volatile" mount option. Volatile mounts are not
> > >  guaranteed to survive a crash. It is strongly recommended that volatile
> > >  mounts are only used if data written to the overlay can be recreated
> > > -without significant effort.
> > > +without significant effort. In addition to this, the sync family of syscalls
> > > +are not sufficient to determine whether a write failed as sync calls are
> > > +omitted. For this reason, it is important that the filesystem used by the
> > > +upperdir handles failure in a fashion that's suitable for the user. For
> > > +example, upon detecting a fault, ext4 can be configured to panic.
> > >
> >
> > Reading this now, I think I may have wrongly analysed the issue.
> > Specifically, when I wrote that the very minimum is to document the
> > issue, it was under the assumption that a proper fix is hard.
> > I think I was wrong and that the very minimum is to check for errseq
> > since mount on the fsync and syncfs calls.
> >
> > Why?
> > First of all, because it is very very simple and goes a long way to
> > fix the broken contract with applications, not the contract about durability
> > obviously, but the contract that write+fsync+read expects to find the written
> > data (during the same mount era).
> >
> > Second, because the sentence you added above is hard for users to understand
> > and out of context. If we properly handle the writeback error in fsync/syncfs,
> > then this sentence becomes irrelevant.
> > The fact that ext4 can lose data if the application did not fsync after
> > write is true for volatile as well as non-volatile, and it is therefore
> > not relevant in the context of overlayfs documentation at all.
> >
> > Am I wrong saying that it is very very simple to fix?
> > Would you mind making that fix at the bottom of the patch series, so it can
> > be easily applied to stable kernels?
> >
> > Thanks,
> > Amir.
>
> I'm not sure it's so easy. In VFS, there are places where the superblock's
> errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not
> the errseq of the corresponding real file's superblock. One solution might
> be that as part of all these callbacks we set our errseq to the errseq of
> the filesystem that the upperdir is on, and then rely on VFS's checking.
>
> I'm having a hard time figuring out how to deal with the per-mapping based
> error tracking. It seems like this infrastructure was only partially
> completed by Jeff Layton[2]. I don't know how it's actually supposed to
> work right now, as not all of his patches landed.
>

The patches in the series were all merged, but we ended up going with a
simpler solution [1] than the first series I posted. Instead of plumbing
the errseq_t handling down into sync_fs, we did it in the syscall wrapper.

I think the tricky part here is that there is no struct file plumbed into
->sync_fs, so you don't have an errseq_t cursor to work with by the time
that gets called.
What may be easiest is to just propagate the s_wb_err value from the
upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get
called before the errseq_check_and_advance in the syncfs syscall wrapper
and should ensure that syncfs() calls against the overlayfs sb see any
errors that occurred on the upper_sb.

Something like this maybe? Totally untested of course. May also need to
think about how to ensure ordering between racing syncfs calls too (don't
really want the s_wb_err going "backward"):

----------------------------8<---------------------------------
$ git diff
diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
index 290983bcfbb3..d725705abdac 100644
--- a/fs/overlayfs/super.c
+++ b/fs/overlayfs/super.c
@@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait)
 	ret = sync_filesystem(upper_sb);
 	up_read(&upper_sb->s_umount);

+	/* Propagate s_wb_err from upper_sb to overlayfs sb */
+	WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err));
+
 	return ret;
 }
----------------------------8<---------------------------------

[1]: https://www.spinics.net/lists/linux-api/msg41276.html

> How about I split this into two patchsets? One, where I add the logic to
> copy the errseq on callbacks to fsync from the upperdir to the ovl fs
> superblock, thus allowing VFS to bubble up errors, and the documentation.
> We can CC stable on those because I think it has an effect that's
> universal across all filesystems.
>
> P.S. I notice you maintain overlay tests outside of the kernel.
> Unfortunately, I think for this kind of test, it requires in-kernel code
> to artificially bump the writeback error count on the upperdir, or it
> requires the failure injection infrastructure.
> Simulating this behaviour is non-trivial without in-kernel support:
>
> P1: Open(f) -> p1.fd
> P2: Open(f) -> p2.fd
> P1: syncfs(p1.fd) -> -errno
> P2: syncfs(p2.fd) -> 0 # This should return an error
>
> [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> [2]: https://lwn.net/Articles/722250/

--
Jeff Layton <jlayton@redhat.com>

^ permalink raw reply related	[flat|nested] 31+ messages in thread
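[Editor's note: the propagation Jeff sketches above can be modeled in a few lines of userspace C. This is an illustrative model, not kernel code -- `ovl_sync_fs_model` and `syncfs_check` are hypothetical names, and the error-sequence values are treated as opaque here -- showing why copying the upper superblock's s_wb_err before the syscall wrapper's cursor comparison makes the upper filesystem's error visible to a syncfs() caller on the overlay:]

```c
/* Model: the overlay sb mirrors the upper sb's writeback error value
 * during sync_fs, and the syscall wrapper compares it against a cursor
 * the file took earlier (cf. errseq_check_and_advance in fs/sync.c). */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

struct sb {
	uint32_t s_wb_err;	/* stand-in for the sb's errseq_t */
};

struct file {
	struct sb *sb;
	uint32_t f_sb_err;	/* cursor sampled when the file was opened */
};

/* ovl_sync_fs() per the suggestion above: after syncing the upper fs,
 * mirror its s_wb_err into the overlay superblock. */
static void ovl_sync_fs_model(struct sb *ovl, const struct sb *upper)
{
	ovl->s_wb_err = upper->s_wb_err;
}

/* The error-reporting tail of the syncfs() wrapper: report a change
 * since the cursor once, then advance the cursor. */
static int syncfs_check(struct file *f)
{
	if (f->sb->s_wb_err != f->f_sb_err) {
		f->f_sb_err = f->sb->s_wb_err;
		return -EIO;	/* model: report a writeback error */
	}
	return 0;
}
```

Without the propagation step, the overlay sb's value never changes, and the wrapper's comparison can never fire -- which is exactly the hidden-error case the thread is discussing.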
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  2:01   ` Jeff Layton
@ 2020-11-28  4:45     ` Sargun Dhillon
  2020-11-28  7:12       ` Amir Goldstein
  2020-11-28 12:04       ` Jeff Layton
  0 siblings, 2 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-28 4:45 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro,
	Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel,
	David Howells

On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote:
> On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote:
> > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote:
> > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote:
> > > >
> > > > This documents behaviour that was discussed in a thread about the volatile
> > > > feature. Specifically, how failures can go hidden from asynchronous writes
> > > > (such as from mmap, or writes that are not immediately flushed to the
> > > > filesystem). Although we pass through calls like msync, fallocate, and
> > > > write, and will still return errors on those, it doesn't guarantee all
> > > > kinds of errors will happen at those times, and thus may hide errors.
> > > >
> > > > In the future, we can add error checking to all interactions with the
> > > > upperdir, and pass through errseq_t from the upperdir on mappings,
> > > > and other interactions with the filesystem[1].
> > > >
> > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33
> > > >
> > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > > Cc: linux-fsdevel@vger.kernel.org
> > > > Cc: linux-unionfs@vger.kernel.org
> > > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > > ---
> > > >  Documentation/filesystems/overlayfs.rst | 6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
> > > > index 580ab9a0fe31..c6e30c1bc2f2 100644
> > > > --- a/Documentation/filesystems/overlayfs.rst
> > > > +++ b/Documentation/filesystems/overlayfs.rst
> > > > @@ -570,7 +570,11 @@ Volatile mount
> > > >  This is enabled with the "volatile" mount option. Volatile mounts are not
> > > >  guaranteed to survive a crash. It is strongly recommended that volatile
> > > >  mounts are only used if data written to the overlay can be recreated
> > > > -without significant effort.
> > > > +without significant effort. In addition to this, the sync family of syscalls
> > > > +are not sufficient to determine whether a write failed as sync calls are
> > > > +omitted. For this reason, it is important that the filesystem used by the
> > > > +upperdir handles failure in a fashion that's suitable for the user. For
> > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > >
> > >
> > > Reading this now, I think I may have wrongly analysed the issue.
> > > Specifically, when I wrote that the very minimum is to document the
> > > issue, it was under the assumption that a proper fix is hard.
> > > I think I was wrong and that the very minimum is to check for errseq
> > > since mount on the fsync and syncfs calls.
> > >
> > > Why?
first of all because it is very very simple and goes a long way to > > > fix the broken contract with applications, not the contract about durability > > > obviously, but the contract about write+fsync+read expects to find the written > > > data (during the same mount era). > > > > > > Second, because the sentence you added above is hard for users to understand > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > then this sentence becomes irrelevant. > > > The fact that ext4 can lose data if application did not fsync after > > > write is true > > > for volatile as well as non-volatile and it is therefore not relevant > > > in the context > > > of overlayfs documentation at all. > > > > > > Am I wrong saying that it is very very simple to fix? > > > Would you mind making that fix at the bottom of the patch series, so it can > > > be easily applied to stable kernels? > > > > > > Thanks, > > > Amir. > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > errseq of the corresponding real file's superblock. One solution might be > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > that the upperdir is on, and then rely on VFS's checking. > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > error tracking. It seems like this infrastructure was only partially completed > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > as not all of his patches landed. > > > > The patches in the series were all merged, but we ended up going with a > simpler solution [1] than the first series I posted. Instead of plumbing > the errseq_t handling down into sync_fs, we did it in the syscall > wrapper. Jeff, Thanks for replying. I'm still a little confused as to how the per-address_space wb_err is supposed to work. 
It seems like we should implement the flush method, like so: diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c index efccb7c1f9bc..32e5bc0aacd6 100644 --- a/fs/overlayfs/file.c +++ b/fs/overlayfs/file.c @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, remap_flags, op); } +static int ovl_flush(struct file *file, fl_owner_t id) +{ + struct file *realfile = file->private_data; + + if (realfile->f_op->flush) + return realfile->f_op->flush(file, id); + + return 0; +} + const struct file_operations ovl_file_operations = { .open = ovl_open, .release = ovl_release, @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = { .fallocate = ovl_fallocate, .fadvise = ovl_fadvise, .unlocked_ioctl = ovl_ioctl, + .flush = ovl_flush, #ifdef CONFIG_COMPAT .compat_ioctl = ovl_compat_ioctl, #endif > > I think the tricky part here is that there is no struct file plumbed > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > time that gets called. > > What may be easiest is to just propagate the s_wb_err value from the > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get > called before the errseq_check_and_advance in the syncfs syscall wrapper > and should ensure that syncfs() calls against the overlayfs sb see any > errors that occurred on the upper_sb. > > Something like this maybe? Totally untested of course. 
May also need to > think about how to ensure ordering between racing syncfs calls too > (don't really want the s_wb_err going "backward"): > > ----------------------------8<--------------------------------- > $ git diff > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > index 290983bcfbb3..d725705abdac 100644 > --- a/fs/overlayfs/super.c > +++ b/fs/overlayfs/super.c > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > ret = sync_filesystem(upper_sb); > up_read(&upper_sb->s_umount); > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > + > return ret; > } > ----------------------------8<--------------------------------- > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html So, I think this will have bad behaviour because syncfs() twice in a row will return the error twice. The reason is that normally in syncfs it calls errseq_check_and_advance marking the error on the super block as seen. If we copy-up the error value each time, it will break this semantic, as we do not set seen on the upperdir. 
Either we need to set the seen flag on the upperdir's errseq_t, or have a sync method, like: diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index 290983bcfbb3..4931d1797c03 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) { struct ovl_fs *ofs = sb->s_fs_info; struct super_block *upper_sb; + errseq_t src; int ret; if (!ovl_upper_mnt(ofs)) @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) ret = sync_filesystem(upper_sb); up_read(&upper_sb->s_umount); + /* Propagate the error up from the upper_sb once */ + src = READ_ONCE(upper_sb->s_wb_err); + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); + return ret; } @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); diff --git a/lib/errseq.c b/lib/errseq.c index 81f9e33aa7e7..53275c168265 100644 --- a/lib/errseq.c +++ b/lib/errseq.c @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) return err; } EXPORT_SYMBOL(errseq_check_and_advance); + +/** + * errseq_counter() - Get the value of the errseq counter + * @eseq: Value being checked + * + * This returns the errseq_t counter value. Reminder, it can wrap because + * there are only a limited number of counter bits. + * + * Return: The counter portion of the errseq_t. + */ +int errseq_counter(errseq_t eseq) +{ + return eseq >> (ERRSEQ_SHIFT + 1); +} This also needs some locking sprinkled on it because racing can occur with sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go backwards, because you could just use a simple cmpxchg64. 
I know that would make a bunch of structures larger, and if it's just overlayfs that has to pay the tax we can just sprinkle a mutex on it. We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, and we haven't done a copy yet, we'll get it. Since this will be strictly serialized a simple equivalence check will work versus actually having to deal with happens-before behaviour. There is still a correctness flaw here though, which is that if exactly enough errors occur to wrap the counter back to the same value, it will break. By the way, how do y'all test this error handling behaviour? I didn't find any automated testing for what currently exists. > > How about I split this into two patchsets? One, where I add the logic to copy > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock, > and thus allowing VFS to bubble up errors, and the documentation. We can CC > stable on those because I think it has an effect that's universal across > all filesystems. > > P.S. > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > think for this kind of test, it requires in kernel code to artificially bump the > writeback error count on the upperdir, or it requires the failure injection > infrastructure. > > Simulating this behaviour is non-trivial without in-kernel support: > > P1: Open(f) -> p1.fd > P2: Open(f) -> p2.fd > P1: syncfs(p1.fd) -> errno > P2: syncfs(p2.fd) -> 0 # This should return an error > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175 > [2]: https://lwn.net/Articles/722250/ > > > -- > Jeff Layton <jlayton@redhat.com> > ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-11-28 4:45 ` Sargun Dhillon @ 2020-11-28 7:12 ` Amir Goldstein 2020-11-28 8:52 ` Sargun Dhillon 2020-11-28 12:04 ` Jeff Layton 1 sibling, 1 reply; 31+ messages in thread From: Amir Goldstein @ 2020-11-28 7:12 UTC (permalink / raw) To: Sargun Dhillon Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote: > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > --- > > > > > Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > > This is enabled with the "volatile" mount option. Volatile mounts are not > > > > > guaranteed to survive a crash. It is strongly recommended that volatile > > > > > mounts are only used if data written to the overlay can be recreated > > > > > -without significant effort. > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > +example, upon detecting a fault, ext4 can be configured to panic. > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. > > > > Specifically, when I wrote that the very minimum is to document the > > > > issue, it was under the assumption that a proper fix is hard. > > > > I think I was wrong and that the very minimum is to check for errseq > > > > since mount on the fsync and syncfs calls. > > > > > > > > Why? 
first of all because it is very very simple and goes a long way to > > > > fix the broken contract with applications, not the contract about durability > > > > obviously, but the contract about write+fsync+read expects to find the written > > > > data (during the same mount era). > > > > > > > > Second, because the sentence you added above is hard for users to understand > > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > > then this sentence becomes irrelevant. > > > > The fact that ext4 can lose data if application did not fsync after > > > > write is true > > > > for volatile as well as non-volatile and it is therefore not relevant > > > > in the context > > > > of overlayfs documentation at all. > > > > > > > > Am I wrong saying that it is very very simple to fix? > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > be easily applied to stable kernels? > > > > > > > > Thanks, > > > > Amir. > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > > errseq of the corresponding real file's superblock. One solution might be > > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > > that the upperdir is on, and then rely on VFS's checking. > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > > error tracking. It seems like this infrastructure was only partially completed > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > > as not all of his patches landed. > > > > > > > The patches in the series were all merged, but we ended up going with a > > simpler solution [1] than the first series I posted. Instead of plumbing > > the errseq_t handling down into sync_fs, we did it in the syscall > > wrapper. > Jeff, > > Thanks for replying. 
I'm still a little confused as to how the > per-address_space wb_err is supposed to work. It seems like we should implement the > flush method, like so: > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > index efccb7c1f9bc..32e5bc0aacd6 100644 > --- a/fs/overlayfs/file.c > +++ b/fs/overlayfs/file.c > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, > remap_flags, op); > } > > +static int ovl_flush(struct file *file, fl_owner_t id) > +{ > + struct file *realfile = file->private_data; > + > + if (realfile->f_op->flush) > + return realfile->f_op->flush(file, id); > + > + return 0; > +} > + > const struct file_operations ovl_file_operations = { > .open = ovl_open, > .release = ovl_release, > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = { > .fallocate = ovl_fallocate, > .fadvise = ovl_fadvise, > .unlocked_ioctl = ovl_ioctl, > + .flush = ovl_flush, > #ifdef CONFIG_COMPAT > .compat_ioctl = ovl_compat_ioctl, > #endif > > > > > > I think the tricky part here is that there is no struct file plumbed > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > > time that gets called. > > > > What may be easiest is to just propagate the s_wb_err value from the > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get > > called before the errseq_check_and_advance in the syncfs syscall wrapper > > and should ensure that syncfs() calls against the overlayfs sb see any > > errors that occurred on the upper_sb. > > > > Something like this maybe? Totally untested of course. 
May also need to > > think about how to ensure ordering between racing syncfs calls too > > (don't really want the s_wb_err going "backward"): > > > > ----------------------------8<--------------------------------- > > $ git diff > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > index 290983bcfbb3..d725705abdac 100644 > > --- a/fs/overlayfs/super.c > > +++ b/fs/overlayfs/super.c > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > ret = sync_filesystem(upper_sb); > > up_read(&upper_sb->s_umount); > > > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > > + > > return ret; > > } > > ----------------------------8<--------------------------------- > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html > > So, > I think this will have bad behaviour because syncfs() twice in a row will > return the error twice. The reason is that normally in syncfs it calls > errseq_check_and_advance marking the error on the super block as seen. If we > copy-up the error value each time, it will break this semantic, as we do not set > seen on the upperdir. 
> > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync > method, like: > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > index 290983bcfbb3..4931d1797c03 100644 > --- a/fs/overlayfs/super.c > +++ b/fs/overlayfs/super.c > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > { > struct ovl_fs *ofs = sb->s_fs_info; > struct super_block *upper_sb; > + errseq_t src; > int ret; > > if (!ovl_upper_mnt(ofs)) > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > ret = sync_filesystem(upper_sb); > up_read(&upper_sb->s_umount); > > + /* Propagate the error up from the upper_sb once */ > + src = READ_ONCE(upper_sb->s_wb_err); > + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) > + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); > + > return ret; > } > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) > > sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; > sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; > + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); > > diff --git a/lib/errseq.c b/lib/errseq.c > index 81f9e33aa7e7..53275c168265 100644 > --- a/lib/errseq.c > +++ b/lib/errseq.c > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > return err; > } > EXPORT_SYMBOL(errseq_check_and_advance); > + > +/** > + * errseq_counter() - Get the value of the errseq counter > + * @eseq: Value being checked > + * > + * This returns the errseq_t counter value. Reminder, it can wrap because > + * there are only a limited number of counter bits. > + * > + * Return: The counter portion of the errseq_t. > + */ > +int errseq_counter(errseq_t eseq) > +{ > + return eseq >> (ERRSEQ_SHIFT + 1); > +} > > This also needs some locking sprinkled on it because racing can occur with > sync_fs. 
This would be much easier if the errseq_t was 64-bits, and didn't go > backwards, because you could just use a simple cmpxchg64. I know that would make > a bunch of structures larger, and if it's just overlayfs that has to pay the tax > we can just sprinkle a mutex on it. > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, > and we haven't done a copy yet, we'll get it. Since this will be strictly > serialized a simple equivalence check will work versus actually having to deal > with happens-before behaviour. There is still a correctness flaw here though, > which is that if exactly enough errors occur to wrap the counter back to the same > value, it will break. > > By the way, how do y'all test this error handling behaviour? I didn't > find any automated testing for what currently exists. > > > > How about I split this into two patchsets? One, where I add the logic to copy > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock, > > and thus allowing VFS to bubble up errors, and the documentation. We can CC > > stable on those because I think it has an effect that's universal across > > all filesystems. > > > > P.S. > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > > think for this kind of test, it requires in kernel code to artificially bump the > > writeback error count on the upperdir, or it requires the failure injection > > infrastructure. > > > > Simulating this behaviour is non-trivial without in-kernel support: > > > > P1: Open(f) -> p1.fd > > P2: Open(f) -> p2.fd > > P1: syncfs(p1.fd) -> errno > > P2: syncfs(p2.fd) -> 0 # This should return an error > > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175 > > [2]: https://lwn.net/Articles/722250/ > > > > Sargun (and Jeff), Thank you for this discussion. It would be very nice to work on testing and making sure errseq propagation is correct on overlayfs. Alas, this is not what I suggested. 
What I suggested was a solution only for the volatile overlay issue where data can vaporise without applications noticing: "...the very minimum is to check for errseq since mount on the fsync and syncfs calls." Do you get it? There is no per-file state in the game, not for fsync and not for syncfs. Any single error, no matter how temporary it is and what damage it may or may not have caused to upper layer consistency, permanently invalidates the reliability of the volatile overlay, resulting in: Effective immediately: every fsync/syncfs returns EIO. Going forward: maybe implement overlay shutdown, so every access returns EIO. So now that I hopefully explained myself better, I'll ask again: Am I wrong saying that it is very very simple to fix? Would you mind making that fix at the bottom of the patch series, so it can be easily applied to stable kernels? Thanks, Amir. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-11-28 7:12 ` Amir Goldstein @ 2020-11-28 8:52 ` Sargun Dhillon 2020-11-28 9:04 ` Amir Goldstein 2020-12-01 11:09 ` Sargun Dhillon 0 siblings, 2 replies; 31+ messages in thread From: Sargun Dhillon @ 2020-11-28 8:52 UTC (permalink / raw) To: Amir Goldstein Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote: > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > > --- > > > > > > Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > > > This is enabled with the "volatile" mount option. Volatile mounts are not > > > > > > guaranteed to survive a crash. It is strongly recommended that volatile > > > > > > mounts are only used if data written to the overlay can be recreated > > > > > > -without significant effort. > > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > > +example, upon detecting a fault, ext4 can be configured to panic. > > > > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. > > > > > Specifically, when I wrote that the very minimum is to document the > > > > > issue, it was under the assumption that a proper fix is hard. 
> > > > > I think I was wrong and that the very minimum is to check for errseq > > > > > since mount on the fsync and syncfs calls. > > > > > > > > > > Why? first of all because it is very very simple and goes a long way to > > > > > fix the broken contract with applications, not the contract about durability > > > > > obviously, but the contract about write+fsync+read expects to find the written > > > > > data (during the same mount era). > > > > > > > > > > Second, because the sentence you added above is hard for users to understand > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > > > then this sentence becomes irrelevant. > > > > > The fact that ext4 can lose data if application did not fsync after > > > > > write is true > > > > > for volatile as well as non-volatile and it is therefore not relevant > > > > > in the context > > > > > of overlayfs documentation at all. > > > > > > > > > > Am I wrong saying that it is very very simple to fix? > > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > > be easily applied to stable kernels? > > > > > > > > > > Thanks, > > > > > Amir. > > > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > > > errseq of the corresponding real file's superblock. One solution might be > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > > > that the upperdir is on, and then rely on VFS's checking. > > > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > > > error tracking. It seems like this infrastructure was only partially completed > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > > > as not all of his patches landed. 
> > > > > > > > > The patches in the series were all merged, but we ended up going with a > > > simpler solution [1] than the first series I posted. Instead of plumbing > > > the errseq_t handling down into sync_fs, we did it in the syscall > > > wrapper. > > Jeff, > > > > Thanks for replying. I'm still a little confused as to how the > > per-address_space wb_err is supposed to work. It seems like we should implement the > > flush method, like so: > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > > index efccb7c1f9bc..32e5bc0aacd6 100644 > > --- a/fs/overlayfs/file.c > > +++ b/fs/overlayfs/file.c > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, > > remap_flags, op); > > } > > > > +static int ovl_flush(struct file *file, fl_owner_t id) > > +{ > > + struct file *realfile = file->private_data; > > + > > + if (realfile->f_op->flush) > > + return realfile->f_op->flush(file, id); > > + > > + return 0; > > +} > > + > > const struct file_operations ovl_file_operations = { > > .open = ovl_open, > > .release = ovl_release, > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = { > > .fallocate = ovl_fallocate, > > .fadvise = ovl_fadvise, > > .unlocked_ioctl = ovl_ioctl, > > + .flush = ovl_flush, > > #ifdef CONFIG_COMPAT > > .compat_ioctl = ovl_compat_ioctl, > > #endif > > > > > > > > > > I think the tricky part here is that there is no struct file plumbed > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > > > time that gets called. > > > > > > What may be easiest is to just propagate the s_wb_err value from the > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get > > > called before the errseq_check_and_advance in the syncfs syscall wrapper > > > and should ensure that syncfs() calls against the overlayfs sb see any > > > errors that occurred on the upper_sb. > > > > > > Something like this maybe? Totally untested of course. 
May also need to > > > think about how to ensure ordering between racing syncfs calls too > > > (don't really want the s_wb_err going "backward"): > > > > > > ----------------------------8<--------------------------------- > > > $ git diff > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > index 290983bcfbb3..d725705abdac 100644 > > > --- a/fs/overlayfs/super.c > > > +++ b/fs/overlayfs/super.c > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > ret = sync_filesystem(upper_sb); > > > up_read(&upper_sb->s_umount); > > > > > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > > > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > > > + > > > return ret; > > > } > > > ----------------------------8<--------------------------------- > > > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html > > > > So, > > I think this will have bad behaviour because syncfs() twice in a row will > > return the error twice. The reason is that normally in syncfs it calls > > errseq_check_and_advance marking the error on the super block as seen. If we > > copy-up the error value each time, it will break this semantic, as we do not set > > seen on the upperdir. 
> > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync > > method, like: > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > index 290983bcfbb3..4931d1797c03 100644 > > --- a/fs/overlayfs/super.c > > +++ b/fs/overlayfs/super.c > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > { > > struct ovl_fs *ofs = sb->s_fs_info; > > struct super_block *upper_sb; > > + errseq_t src; > > int ret; > > > > if (!ovl_upper_mnt(ofs)) > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > ret = sync_filesystem(upper_sb); > > up_read(&upper_sb->s_umount); > > > > + /* Propagate the error up from the upper_sb once */ > > + src = READ_ONCE(upper_sb->s_wb_err); > > + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) > > + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); > > + > > return ret; > > } > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) > > > > sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; > > sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; > > + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); > > > > diff --git a/lib/errseq.c b/lib/errseq.c > > index 81f9e33aa7e7..53275c168265 100644 > > --- a/lib/errseq.c > > +++ b/lib/errseq.c > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > > return err; > > } > > EXPORT_SYMBOL(errseq_check_and_advance); > > + > > +/** > > + * errseq_counter() - Get the value of the errseq counter > > + * @eseq: Value being checked > > + * > > + * This returns the errseq_t counter value. Reminder, it can wrap because > > + * there are only a limited number of counter bits. > > + * > > + * Return: The counter portion of the errseq_t. 
> > + */ > > +int errseq_counter(errseq_t eseq) > > +{ > > + return eseq >> (ERRSEQ_SHIFT + 1); > > +} > > > > This also needs some locking sprinkled on it because racing can occur with > > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go > > backwards, because you could just use a simple cmpxchg64. I know that would make > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax > > we can just sprinkle a mutex on it. > > > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, > > and we haven't done a copy yet, we'll get it. Since this will be strictly > > serialized a simple equivalence check will work versus actually having to deal > > with happens-before behaviour. There is still a correctness flaw here though, > > which is that if exactly enough errors occur to wrap the counter back to the same > > value, it will break. > > > > By the way, how do y'all test this error handling behaviour? I didn't > > find any automated testing for what currently exists. > > > > > > How about I split this into two patchsets? One, where I add the logic to copy > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock, > > > and thus allowing VFS to bubble up errors, and the documentation. We can CC > > > stable on those because I think it has an effect that's universal across > > > all filesystems. > > > > > > P.S. > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > > > think for this kind of test, it requires in kernel code to artificially bump the > > > writeback error count on the upperdir, or it requires the failure injection > > > infrastructure. 
> > > > > > Simulating this behaviour is non-trivial without in-kernel support: > > > > > > P1: Open(f) -> p1.fd > > > P2: Open(f) -> p2.fd > > > P1: syncfs(p1.fd) -> errno > > > P2: syncfs(p2.fd) -> 0 # This should return an error > > > > > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175 > > > [2]: https://lwn.net/Articles/722250/ > > > > > > > > Sargun (and Jeff), > > Thank you for this discussion. It would be very nice to work on testing > and fixing errseq propagation on overlayfs. > Alas, this is not what I suggested. > > What I suggested was a solution only for the volatile overlay issue > where data can vaporise without applications noticing: > "...the very minimum is to check for errseq since mount on the fsync > and syncfs calls." > Yeah, I was confusing the checking that VFS does on our behalf and the checking that we can do ourselves in the sync callback. If we return an error prior to the VFS checking, it short-circuits that entirely. > Do you get it? There is no per-file state in the game, not for fsync and not > for syncfs. > > Any single error, no matter how temporary it is and what damage it may > or may not have caused to upper layer consistency, permanently > invalidates the reliability of the volatile overlay, resulting in: > Effective immediately: every fsync/syncfs returns EIO. > Going forward: maybe implement overlay shutdown, so every access > returns EIO. > > So now that I hopefully explained myself better, I'll ask again: > Am I wrong saying that it is very very simple to fix? > Would you mind making that fix at the bottom of the patch series, so it can > be easily applied to stable kernels? > > Thanks, > Amir. I think that this should be easy enough if the semantic is such that volatile overlayfs mounts will return EIO on every syncfs call if the upperdir's super block has experienced errors since the initial mount. 
I imagine we do not want to make it such that we return EIO on syncfs if the upperdir has ever experienced errors. The one caveat that I see is that if the errseq wraps, we can silently begin swallowing errors again. Thus, on the first failed syncfs we should just store a flag indicating that the volatile fs is bad, and continue to return EIO rather than go through the process of checking the errseq_t, but that's easy enough to write.
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-11-28 8:52 ` Sargun Dhillon @ 2020-11-28 9:04 ` Amir Goldstein 2020-12-01 11:09 ` Sargun Dhillon 1 sibling, 0 replies; 31+ messages in thread From: Amir Goldstein @ 2020-11-28 9:04 UTC (permalink / raw) To: Sargun Dhillon Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells > > What I suggested was a solution only for the volatile overlay issue > > where data can vaporise without applications noticing: > > "...the very minimum is to check for errseq since mount on the fsync > > and syncfs calls." > > > Yeah, I was confusing the checking that VFS does on our behalf and the checking > that we can do ourselves in the sync callback. If we return an error prior to > the VFS checking, it short-circuits that entirely. > > > Do you get it? There is no per-file state in the game, not for fsync and not > > for syncfs. > > > > Any single error, no matter how temporary it is and what damage it may > > or may not have caused to upper layer consistency, permanently > > invalidates the reliability of the volatile overlay, resulting in: > > Effective immediately: every fsync/syncfs returns EIO. > > Going forward: maybe implement overlay shutdown, so every access > > returns EIO. > > > > So now that I hopefully explained myself better, I'll ask again: > > Am I wrong saying that it is very very simple to fix? > > Would you mind making that fix at the bottom of the patch series, so it can > > be easily applied to stable kernels? > > > > Thanks, > > Amir. > > I think that this should be easy enough if the semantic is such that volatile > overlayfs mounts will return EIO on every syncfs call if the upperdir's > super block has experienced errors since the initial mount. I imagine we do not > want to make it such that we return EIO on syncfs if the upperdir has ever > experienced errors. 
> > The one caveat that I see is that if the errseq wraps, we can silently begin > swallowing errors again. Thus, on the first failed syncfs we should just > store a flag indicating that the volatile fs is bad, and continue to return > EIO rather than go through the process of checking errseq_t, but that's easy > enough to write. I agree. I sent another reply to your question about testing. The test I suggested, generic/019, only tests that the first fsync after a writeback error fails and that umount succeeds, so the logic is good for a volatile overlay. Thanks, Amir.
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-11-28 8:52 ` Sargun Dhillon 2020-11-28 9:04 ` Amir Goldstein @ 2020-12-01 11:09 ` Sargun Dhillon 2020-12-01 11:29 ` Amir Goldstein 2020-12-01 13:01 ` Jeff Layton 1 sibling, 2 replies; 31+ messages in thread From: Sargun Dhillon @ 2020-12-01 11:09 UTC (permalink / raw) To: Amir Goldstein Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote: > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote: > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > > > --- > > > > > > > Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > > > > This is enabled with the "volatile" mount option. Volatile mounts are not > > > > > > > guaranteed to survive a crash. It is strongly recommended that volatile > > > > > > > mounts are only used if data written to the overlay can be recreated > > > > > > > -without significant effort. > > > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic. > > > > > > > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. > > > > > > Specifically, when I wrote that the very minimum is to document the > > > > > > issue, it was under the assumption that a proper fix is hard. 
> > > > > > I think I was wrong and that the very minimum is to check for errseq > > > > > > since mount on the fsync and syncfs calls. > > > > > > > > > > > > Why? First of all because it is very very simple and goes a long way to > > > > > > fix the broken contract with applications, not the contract about durability > > > > > > obviously, but the contract about write+fsync+read expects to find the written > > > > > > data (during the same mount era). > > > > > > > > > > > > Second, because the sentence you added above is hard for users to understand > > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > > > > then this sentence becomes irrelevant. > > > > > > The fact that ext4 can lose data if the application did not fsync after > > > > > > write is true > > > > > > for volatile as well as non-volatile and it is therefore not relevant > > > > > > in the context > > > > > > of overlayfs documentation at all. > > > > > > > > > > > > Am I wrong saying that it is very very simple to fix? > > > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > > > be easily applied to stable kernels? > > > > > > > > > > > > Thanks, > > > > > > Amir. > > > > > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > > > > errseq of the corresponding real file's superblock. One solution might be > > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > > > > that the upperdir is on, and then rely on VFS's checking. > > > > > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > > > > error tracking. It seems like this infrastructure was only partially completed > > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > > > > as not all of his patches landed. 
> > > > > The patches in the series were all merged, but we ended up going with a > > > > simpler solution [1] than the first series I posted. Instead of plumbing > > > > the errseq_t handling down into sync_fs, we did it in the syscall > > > > wrapper. > > > Jeff, > > > > > > Thanks for replying. I'm still a little confused as to what the > > > per-address_space wb_err is for. It seems like we should implement the > > > flush method, like so: > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > > > index efccb7c1f9bc..32e5bc0aacd6 100644 > > > --- a/fs/overlayfs/file.c > > > +++ b/fs/overlayfs/file.c > > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, > > > remap_flags, op); > > > } > > > > > > +static int ovl_flush(struct file *file, fl_owner_t id) > > > +{ > > > + struct file *realfile = file->private_data; > > > + > > > + if (realfile->f_op->flush) > > > + return realfile->f_op->flush(file, id); > > > + > > > + return 0; > > > +} > > > + > > > const struct file_operations ovl_file_operations = { > > > .open = ovl_open, > > > .release = ovl_release, > > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = { > > > .fallocate = ovl_fallocate, > > > .fadvise = ovl_fadvise, > > > .unlocked_ioctl = ovl_ioctl, > > > + .flush = ovl_flush, > > > #ifdef CONFIG_COMPAT > > > .compat_ioctl = ovl_compat_ioctl, > > > #endif > > > > > > > > > > > > > > I think the tricky part here is that there is no struct file plumbed > > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > > > > time that gets called. > > > > > > > > What may be easiest is to just propagate the s_wb_err value from the > > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). 
That should get > > > > called before the errseq_check_and_advance in the syncfs syscall wrapper > > > > and should ensure that syncfs() calls against the overlayfs sb see any > > > > errors that occurred on the upper_sb. > > > > > > > > Something like this maybe? Totally untested of course. May also need to > > > > think about how to ensure ordering between racing syncfs calls too > > > > (don't really want the s_wb_err going "backward"): > > > > > > > > ----------------------------8<--------------------------------- > > > > $ git diff > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > > index 290983bcfbb3..d725705abdac 100644 > > > > --- a/fs/overlayfs/super.c > > > > +++ b/fs/overlayfs/super.c > > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > ret = sync_filesystem(upper_sb); > > > > up_read(&upper_sb->s_umount); > > > > > > > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > > > > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > > > > + > > > > return ret; > > > > } > > > > ----------------------------8<--------------------------------- > > > > > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html > > > > > > So, > > > I think this will have bad behaviour because syncfs() twice in a row will > > > return the error twice. The reason is that normally in syncfs it calls > > > errseq_check_and_advance marking the error on the super block as seen. If we > > > copy-up the error value each time, it will break this semantic, as we do not set > > > seen on the upperdir. 
> > > > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync > > > method, like: > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > index 290983bcfbb3..4931d1797c03 100644 > > > --- a/fs/overlayfs/super.c > > > +++ b/fs/overlayfs/super.c > > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > { > > > struct ovl_fs *ofs = sb->s_fs_info; > > > struct super_block *upper_sb; > > > + errseq_t src; > > > int ret; > > > > > > if (!ovl_upper_mnt(ofs)) > > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > ret = sync_filesystem(upper_sb); > > > up_read(&upper_sb->s_umount); > > > > > > + /* Propagate the error up from the upper_sb once */ > > > + src = READ_ONCE(upper_sb->s_wb_err); > > > + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) > > > + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); > > > + > > > return ret; > > > } > > > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) > > > > > > sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; > > > sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; > > > + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); > > > > > > diff --git a/lib/errseq.c b/lib/errseq.c > > > index 81f9e33aa7e7..53275c168265 100644 > > > --- a/lib/errseq.c > > > +++ b/lib/errseq.c > > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > > > return err; > > > } > > > EXPORT_SYMBOL(errseq_check_and_advance); > > > + > > > +/** > > > + * errseq_counter() - Get the value of the errseq counter > > > + * @eseq: Value being checked > > > + * > > > + * This returns the errseq_t counter value. Reminder, it can wrap because > > > + * there are only a limited number of counter bits. > > > + * > > > + * Return: The counter portion of the errseq_t. 
> > > + */ > > > +int errseq_counter(errseq_t eseq) > > > +{ > > > + return eseq >> (ERRSEQ_SHIFT + 1); > > > +} > > > > > > This also needs some locking sprinkled on it because racing can occur with > > > sync_fs. This would be much easier if the errseq_t were 64 bits and didn't go > > > backwards, because you could just use a simple cmpxchg64. I know that would make > > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax > > > we can just sprinkle a mutex on it. > > > > > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, > > > and we haven't done a copy yet, we'll get it. Since this will be strictly > > > serialized, a simple equivalence check will work versus actually having to deal > > > with happens-before behaviour. There is still a correctness flaw here though, > > > which is that if exactly enough errors occur to wrap the counter back to the same > > > value, it will break. > > > > > > By the way, how do y'all test this error handling behaviour? I didn't > > > find any automated testing for what currently exists. > > > > > > > > How about I split this into two patchsets? One, where I add the logic to copy > > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock, > > > > and thus allow VFS to bubble up errors, and the documentation. We can CC > > > > stable on those because I think it has an effect that's universal across > > > > all filesystems. > > > > > > > > P.S. > > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > > > > think for this kind of test, it requires in-kernel code to artificially bump the > > > > writeback error count on the upperdir, or it requires the failure injection > > > > infrastructure. 
> > > > > > > > Simulating this behaviour is non-trivial without in-kernel support: > > > > > > > > P1: Open(f) -> p1.fd > > > > P2: Open(f) -> p2.fd > > > > P1: syncfs(p1.fd) -> errno > > > > P2: syncfs(p2.fd) -> 0 # This should return an error > > > > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175 > > > > [2]: https://lwn.net/Articles/722250/ > > > > > > > > > > > > Sargun (and Jeff), > > > > Thank you for this discussion. It would be very nice to work on testing > > and fixing errseq propagation on overlayfs. > > Alas, this is not what I suggested. > > > > What I suggested was a solution only for the volatile overlay issue > > where data can vaporise without applications noticing: > > "...the very minimum is to check for errseq since mount on the fsync > > and syncfs calls." > > > Yeah, I was confusing the checking that VFS does on our behalf and the checking > that we can do ourselves in the sync callback. If we return an error prior to > the VFS checking, it short-circuits that entirely. > > > Do you get it? There is no per-file state in the game, not for fsync and not > > for syncfs. > > > > Any single error, no matter how temporary it is and what damage it may > > or may not have caused to upper layer consistency, permanently > > invalidates the reliability of the volatile overlay, resulting in: > > Effective immediately: every fsync/syncfs returns EIO. > > Going forward: maybe implement overlay shutdown, so every access > > returns EIO. > > > > So now that I hopefully explained myself better, I'll ask again: > > Am I wrong saying that it is very very simple to fix? > > Would you mind making that fix at the bottom of the patch series, so it can > > be easily applied to stable kernels? > > > > Thanks, > > Amir. 
> > I think that this should be easy enough if the semantic is such that volatile > overlayfs mounts will return EIO on every syncfs call if the upperdir's > super block has experienced errors since the initial mount. I imagine we do not > want to make it such that we return EIO on syncfs if the upperdir has ever > experienced errors. > > The one caveat that I see is that if the errseq wraps, we can silently begin > swallowing errors again. Thus, on the first failed syncfs we should just > store a flag indicating that the volatile fs is bad, and continue to return > EIO rather than go through the process of checking errseq_t, but that's easy > enough to write. It turns out this is much more complicated than I initially thought. I'm not entirely sure why VFS even defines sync_fs (on the superblock) as returning an int. The way that sync_fs works is: sys_syncfs -> sync_filesystem(sb) -> __sync_filesystem(sb, N) -> sb->s_op->sync_fs /* __sync_blockdev has its own writeback error checking */ __sync_blockdev errseq_check_and_advance(sb...) __sync_filesystem calls the sync_fs callback in the superblock's superblock operations: static int __sync_filesystem(struct super_block *sb, int wait) { if (wait) sync_inodes_sb(sb); else writeback_inodes_sb(sb, WB_REASON_SYNC); if (sb->s_op->sync_fs) sb->s_op->sync_fs(sb, wait); return __sync_blockdev(sb->s_bdev, wait); } But it completely discards the result. On the other hand, fsync and fdatasync exhibit different behaviour: sys_fdatasync -> do_fsync (see below) sys_fsync -> do_fsync -> vfs_fsync -> vfs_fsync_range -> return file->f_op->fsync ---- syncfs seems to enforce the semantics laid out by VFS[1]. Specifically the statement: When there is an error during writeback, they expect that error to be reported when a file sync request is made. 
After an error has been reported on one request, subsequent requests on the same file descriptor should return 0, unless further writeback errors have occurred since the previous file synchronization. This is enforced by the errseq_check_and_advance logic. We can hack around this logic by resetting the errseq (setting the error on it) every time we get the sync_fs callback, but that to me seems wrong. FWIW, implementing this behaviour for fdatasync and fsync is easier, because the error is bubbled up from the filesystem to the VFS. I don't actually think this is a good idea because it seems like this sync_fs behaviour is a bit...not necessarily what all filesystems expect. For example, btrfs_sync_fs returns an error if it is unable to finish the current transaction. Nowhere in the btrfs code does it set the errseq on the superblock if this fails. I think we have a couple of paths forward: 1. Change the semantic of sync_fs so the error is always bubbled up if it returns a non-zero value. If we do this, we have to decide whether or not we would _also_ call errseq_check_and_advance on the SB, or leave that to a subsequent call. 2. Have overlayfs forcefully set an error on the superblock on every callback to sync_fs. This seems ugly, but I wrote a little patch, and it seems to solve the problem for all the fsync / fdatasync / sync / syncfs variants without having to do plumbing in VFS. 3. Choose a different set of semantics for how we want to handle errors in volatile mounts. [1]: https://www.kernel.org/doc/html/v5.9/filesystems/vfs.html?highlight=handling%20errors%20during%20writeback#handling-errors-during-writeback
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-12-01 11:09 ` Sargun Dhillon @ 2020-12-01 11:29 ` Amir Goldstein 2020-12-01 13:01 ` Jeff Layton 1 sibling, 0 replies; 31+ messages in thread From: Amir Goldstein @ 2020-12-01 11:29 UTC (permalink / raw) To: Sargun Dhillon Cc: Jeff Layton, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells > syncfs seems to enforce the semantics laid out by VFS[1]. Specifically the > statement: > When there is an error during writeback, they expect that error to be reported > when a file sync request is made. After an error has been reported on one > request, subsequent requests on the same file descriptor should return 0, unless > further writeback errors have occurred since the previous file synchronization. > > This is enforced by the errseq_check_and_advance logic. We can hack around this > logic by resetting the errseq (setting the error on it) every time we get the > sync_fs callback, but that to me seems wrong. FWIW, implementing this behaviour > for fdatasync and fsync is easier, because the error is bubbled up from the > filesystem to the VFS. I don't actually think this is a good idea because > it seems like this sync_fs behaviour is a bit...not necessarily what all > filesystems expect. For example, btrfs_sync_fs returns an error if it > is unable to finish the current transaction. Nowhere in the btrfs code > does it set the errseq on the superblock if this fails. > > I think we have a couple of paths forward: > 1. Change the semantic of sync_fs so the error is always bubbled up if > it returns a non-zero value. If we do this, we have to decide whether > or not we would _also_ call errseq_check_and_advance on the SB, > or leave that to a subsequent call. > 2. Have overlayfs forcefully set an error on the superblock on every > callback to sync_fs. 
This seems ugly, but I wrote a little patch, > and it seems to solve the problem for all the fsync / fdatasync / > sync / syncfs variants without having to do plumbing in VFS. > 3. Choose a different set of semantics for how we want to handle > errors in volatile mounts. > IMO it is best if you post your patch to fix volatile overlayfs, because it seems to me that the volatile overlayfs issue is worse than the generic sync_fs issues and it would be good to have a small backportable patch, even if it is a bit ugly. Later you can pursue the sync_fs semantic fixes if you wish; they do not contradict the fix in overlayfs, they are just a way to remove a hack. Thanks, Amir.
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-12-01 11:09 ` Sargun Dhillon 2020-12-01 11:29 ` Amir Goldstein @ 2020-12-01 13:01 ` Jeff Layton 2020-12-01 15:24 ` Vivek Goyal 1 sibling, 1 reply; 31+ messages in thread From: Jeff Layton @ 2020-12-01 13:01 UTC (permalink / raw) To: Sargun Dhillon, Amir Goldstein Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells On Tue, 2020-12-01 at 11:09 +0000, Sargun Dhillon wrote: > On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote: > > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote: > > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > > > > --- > > > > > > > > Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > > > > > This is enabled with the "volatile" mount option. Volatile mounts are not > > > > > > > > guaranteed to survive a crash. It is strongly recommended that volatile > > > > > > > > mounts are only used if data written to the overlay can be recreated > > > > > > > > -without significant effort. > > > > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic. > > > > > > > > > > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. 
> > > > > > > Specifically, when I wrote that the very minimum is to document the > > > > > > > issue, it was under the assumption that a proper fix is hard. > > > > > > > I think I was wrong and that the very minimum is to check for errseq > > > > > > > since mount on the fsync and syncfs calls. > > > > > > > > > > > > > > Why? First of all because it is very very simple and goes a long way to > > > > > > > fix the broken contract with applications, not the contract about durability > > > > > > > obviously, but the contract about write+fsync+read expects to find the written > > > > > > > data (during the same mount era). > > > > > > > > > > > > > > Second, because the sentence you added above is hard for users to understand > > > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > > > > > then this sentence becomes irrelevant. > > > > > > > The fact that ext4 can lose data if the application did not fsync after > > > > > > > write is true > > > > > > > for volatile as well as non-volatile and it is therefore not relevant > > > > > > > in the context > > > > > > > of overlayfs documentation at all. > > > > > > > > > > > > > > Am I wrong saying that it is very very simple to fix? > > > > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > > > > be easily applied to stable kernels? > > > > > > > > > > > > > > Thanks, > > > > > > > Amir. > > > > > > > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > > > > > errseq of the corresponding real file's superblock. One solution might be > > > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > > > > > that the upperdir is on, and then rely on VFS's checking. > > > > > > > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > > > > > error tracking. 
It seems like this infrastructure was only partially completed > > > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > > > > > as not all of his patches landed. > > > > > > > > > > The patches in the series were all merged, but we ended up going with a > > > > > simpler solution [1] than the first series I posted. Instead of plumbing > > > > > the errseq_t handling down into sync_fs, we did it in the syscall > > > > > wrapper. > > > > Jeff, > > > > > > > > Thanks for replying. I'm still a little confused as to what the > > > > per-address_space wb_err is for. It seems like we should implement the > > > > flush method, like so: > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > > > > index efccb7c1f9bc..32e5bc0aacd6 100644 > > > > --- a/fs/overlayfs/file.c > > > > +++ b/fs/overlayfs/file.c > > > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, > > > > remap_flags, op); > > > > } > > > > > > > > +static int ovl_flush(struct file *file, fl_owner_t id) > > > > +{ > > > > + struct file *realfile = file->private_data; > > > > + > > > > + if (realfile->f_op->flush) > > > > + return realfile->f_op->flush(file, id); > > > > + > > > > + return 0; > > > > +} > > > > + > > > > const struct file_operations ovl_file_operations = { > > > > .open = ovl_open, > > > > .release = ovl_release, > > > > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = { > > > > .fallocate = ovl_fallocate, > > > > .fadvise = ovl_fadvise, > > > > .unlocked_ioctl = ovl_ioctl, > > > > + .flush = ovl_flush, > > > > #ifdef CONFIG_COMPAT > > > > .compat_ioctl = ovl_compat_ioctl, > > > > #endif > > > > > > > > > > > > > > > > > > I think the tricky part here is that there is no struct file plumbed > > > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > > > > > time that gets called. 
> > > > > > > > > > What may be easiest is to just propagate the s_wb_err value from the > > > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get > > > > > called before the errseq_check_and_advance in the syncfs syscall wrapper > > > > > and should ensure that syncfs() calls against the overlayfs sb see any > > > > > errors that occurred on the upper_sb. > > > > > > > > > > Something like this maybe? Totally untested of course. May also need to > > > > > think about how to ensure ordering between racing syncfs calls too > > > > > (don't really want the s_wb_err going "backward"): > > > > > > > > > > ----------------------------8<--------------------------------- > > > > > $ git diff > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > > > index 290983bcfbb3..d725705abdac 100644 > > > > > --- a/fs/overlayfs/super.c > > > > > +++ b/fs/overlayfs/super.c > > > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > > ret = sync_filesystem(upper_sb); > > > > > up_read(&upper_sb->s_umount); > > > > > > > > > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > > > > > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > > > > > + > > > > > return ret; > > > > > } > > > > > ----------------------------8<--------------------------------- > > > > > > > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html > > > > > > > > So, > > > > I think this will have bad behaviour because syncfs() twice in a row will > > > > return the error twice. The reason is that normally in syncfs it calls > > > > errseq_check_and_advance marking the error on the super block as seen. If we > > > > copy-up the error value each time, it will break this semantic, as we do not set > > > > seen on the upperdir. 
> > > > > > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync > > > > method, like: > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > > index 290983bcfbb3..4931d1797c03 100644 > > > > --- a/fs/overlayfs/super.c > > > > +++ b/fs/overlayfs/super.c > > > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > { > > > > struct ovl_fs *ofs = sb->s_fs_info; > > > > struct super_block *upper_sb; > > > > + errseq_t src; > > > > int ret; > > > > > > > > if (!ovl_upper_mnt(ofs)) > > > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > ret = sync_filesystem(upper_sb); > > > > up_read(&upper_sb->s_umount); > > > > > > > > + /* Propagate the error up from the upper_sb once */ > > > > + src = READ_ONCE(upper_sb->s_wb_err); > > > > + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) > > > > + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); > > > > + > > > > return ret; > > > > } > > > > > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) > > > > > > > > sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; > > > > sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; > > > > + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); > > > > > > > > diff --git a/lib/errseq.c b/lib/errseq.c > > > > index 81f9e33aa7e7..53275c168265 100644 > > > > --- a/lib/errseq.c > > > > +++ b/lib/errseq.c > > > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > > > > return err; > > > > } > > > > EXPORT_SYMBOL(errseq_check_and_advance); > > > > + > > > > +/** > > > > + * errseq_counter() - Get the value of the errseq counter > > > > + * @eseq: Value being checked > > > > + * > > > > + * This returns the errseq_t counter value. Reminder, it can wrap because > > > > + * there are only a limited number of counter bits. 
> > > > + *
> > > > + * Return: The counter portion of the errseq_t.
> > > > + */
> > > > +int errseq_counter(errseq_t eseq)
> > > > +{
> > > > +	return eseq >> (ERRSEQ_SHIFT + 1);
> > > > +}
> > > >
> > > > This also needs some locking sprinkled on it because racing can occur with
> > > > sync_fs. This would be much easier if the errseq_t were 64 bits and didn't go
> > > > backwards, because you could just use a simple cmpxchg64. I know that would make
> > > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax
> > > > we can just sprinkle a mutex on it.
> > > >
> > > > We can always "copy-up" the errseq_t, because if the upperdir's errseq_t is read
> > > > and we haven't done a copy yet, we'll get it. Since this will be strictly
> > > > serialized, a simple equivalence check will work versus actually having to deal
> > > > with happens-before behaviour. There is still a correctness flaw here though:
> > > > if exactly enough errors occur to wrap the counter back to the same value, it
> > > > will break.
> > > >
> > > > By the way, how do y'all test this error handling behaviour? I didn't
> > > > find any automated testing for what currently exists.
> > > >
> > > > > How about I split this into two patchsets? One where I add the logic to copy
> > > > > the errseq on fsync callbacks from the upperdir to the ovl fs superblock,
> > > > > thus allowing VFS to bubble up errors, and one with the documentation. We can
> > > > > CC stable on those because I think it has an effect that's universal across
> > > > > all filesystems.
> > > > >
> > > > > P.S.
> > > > > I notice you maintain overlay tests outside of the kernel.
> > > > > > Unfortunately, I think for this kind of test, it requires in-kernel code
> > > > > > to artificially bump the writeback error count on the upperdir, or it
> > > > > > requires the failure injection infrastructure.
> > > > > >
> > > > > > Simulating this behaviour is non-trivial without in-kernel support:
> > > > > >
> > > > > > P1: Open(f) -> p1.fd
> > > > > > P2: Open(f) -> p2.fd
> > > > > > P1: syncfs(p1.fd) -> errno
> > > > > > P2: syncfs(p2.fd) -> 0 # This should return an error
> > > > > >
> > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175
> > > > > > [2]: https://lwn.net/Articles/722250/
> > >
> > > Sargun (and Jeff),
> > >
> > > Thank you for this discussion. It would be very nice to work on testing
> > > and fixing errseq propagation on overlayfs.
> > > Alas, this is not what I suggested.
> > >
> > > What I suggested was a solution only for the volatile overlay issue
> > > where data can vaporise without applications noticing:
> > > "...the very minimum is to check for errseq since mount on the fsync
> > > and syncfs calls."
> >
> > Yeah, I was confusing the checking that VFS does on our behalf and the checking
> > that we can do ourselves in the sync callback. If we return an error prior to
> > the VFS checking, it short-circuits that entirely.
> >
> > > Do you get it? There is no per-file state in the game, not for fsync and not
> > > for syncfs.
> > >
> > > Any single error, no matter how temporary it is and what damage it may
> > > or may not have caused to upper layer consistency, permanently
> > > invalidates the reliability of the volatile overlay, resulting in:
> > > Effective immediately: every fsync/syncfs returns EIO.
> > > Going forward: maybe implement overlay shutdown, so every access
> > > returns EIO.
> >
> > I think that this should be easy enough if the semantic is such that volatile
> > overlayfs mounts will return EIO on every syncfs call if the upperdir's
> > super block has experienced errors since the initial mount. I imagine we do not
> > want to make it such that if the upperdir has ever experienced errors, we return
> > EIO on syncfs.
> >
> > The one caveat that I see is that if the errseq wraps, we can silently begin
> > swallowing errors again. Thus, on the first failed syncfs we should just
> > store a flag indicating that the volatile fs is bad, and continue to return
> > EIO rather than go through the process of checking the errseq_t, but that's easy
> > enough to write.
>
> It turns out this is much more complicated than I initially thought. I'm not
> entirely sure why VFS even defines sync_fs (on the superblock) as returning an
> int. The way that sync_fs works is:
>
> sys_syncfs ->
>   sync_filesystem(sb) ->
>     __sync_filesystem(sb, N) ->
>       sb->s_op->sync_fs
>       /* __sync_blockdev has its own writeback error checking */
>       __sync_blockdev
>   errseq_check_and_advance(sb...)
>
> __sync_filesystem calls the sync_fs callback on the superblock's superblock
> operations:
>
> static int __sync_filesystem(struct super_block *sb, int wait)
> {
> 	if (wait)
> 		sync_inodes_sb(sb);
> 	else
> 		writeback_inodes_sb(sb, WB_REASON_SYNC);
>
> 	if (sb->s_op->sync_fs)
> 		sb->s_op->sync_fs(sb, wait);
> 	return __sync_blockdev(sb->s_bdev, wait);
> }
>
> But it completely discards the result. On the other hand, fsync and fdatasync
> exhibit different behaviour:
>
> sys_fdatasync ->
>   do_fsync
>   (see below)
>
> sys_fsync ->
>   do_fsync ->
>     vfs_fsync ->
>       vfs_fsync_range ->
>         return file->f_op->fsync
> ----

This is mainly a result of the evolution of the code. In the beginning
there was only sync(), and it doesn't report errors. Eventually syncfs()
was added, but the only error it reliably reported was EBADF.
I added the machinery to record writeback errors of individual inodes at
the superblock level recently, but the sync_fs op was left alone (mainly
because changing this would be very invasive, and the value of doing so
wasn't completely clear).

> syncfs seems to enforce the semantics laid out by VFS[1]. Specifically the
> statement:
> When there is an error during writeback, they expect that error to be reported
> when a file sync request is made. After an error has been reported on one
> request, subsequent requests on the same file descriptor should return 0, unless
> further writeback errors have occurred since the previous file synchronization.
>
> This is enforced by the errseq_check_and_advance logic. We can hack around this
> logic by resetting the errseq (setting the error on it) every time we get the
> sync_fs callback, but that to me seems wrong. FWIW, implementing this behaviour
> for fdatasync and fsync is easier, because the error is bubbled up from the
> filesystem to the VFS. I don't actually think this is a good idea, because
> this sync_fs behaviour is not necessarily what all
> filesystems expect. For example, btrfs_sync_fs returns an error if it
> is unable to finish the current transaction. Nowhere in the btrfs code
> does it set the errseq on the superblock if this fails.
>
> I think we have a couple of paths forward:
> 1. Change the semantics of sync_fs so the error is always bubbled up if
> it returns a non-zero value. If we do this, we have to decide whether
> or not we would _also_ call errseq_check_and_advance on the SB,
> or leave that to a subsequent call.

You would want to call errseq_check_and_advance in that situation.
Otherwise a subsequent syncfs call would return an error even when an
error had not occurred since the last syncfs.

> 2. Have overlayfs forcefully set an error on the superblock on every
> callback to sync_fs.
This seems ugly, but I wrote a little patch, > and it seems to solve the problem for all the fsync / fdatasync / > sync / syncfs variants without having to do plumbing in VFS. It is ugly, but that's probably the easiest solution. > 3. Choose a different set of semantics for how we want to handle > errors in volatile mounts. > > [1]: https://www.kernel.org/doc/html/v5.9/filesystems/vfs.html?highlight=handling%20errors%20during%20writeback#handling-errors-during-writeback > -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-12-01 13:01 ` Jeff Layton @ 2020-12-01 15:24 ` Vivek Goyal 2020-12-01 16:10 ` Jeff Layton 0 siblings, 1 reply; 31+ messages in thread From: Vivek Goyal @ 2020-12-01 15:24 UTC (permalink / raw) To: Jeff Layton Cc: Sargun Dhillon, Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Tue, Dec 01, 2020 at 08:01:11AM -0500, Jeff Layton wrote: > On Tue, 2020-12-01 at 11:09 +0000, Sargun Dhillon wrote: > > On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote: > > > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote: > > > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > > > > > --- > > > > > > > > > Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > > > > > > This is enabled with the "volatile" mount option. Volatile mounts are not > > > > > > > > > guaranteed to survive a crash. It is strongly recommended that volatile > > > > > > > > > mounts are only used if data written to the overlay can be recreated > > > > > > > > > -without significant effort. > > > > > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic. > > > > > > > > > > > > > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. 
> > > > > > > > [...]
> [...]

Is the following patch enough to capture the ->sync_fs return value and
return it to user space on syncfs? Or is there more to it that I am
missing.

---
 fs/sync.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

Index: redhat-linux/fs/sync.c
===================================================================
--- redhat-linux.orig/fs/sync.c	2020-08-10 15:40:22.535349278 -0400
+++ redhat-linux/fs/sync.c	2020-12-01 10:19:03.624330578 -0500
@@ -30,14 +30,17 @@
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
+	int ret, ret2;
+
 	if (wait)
 		sync_inodes_sb(sb);
 	else
 		writeback_inodes_sb(sb, WB_REASON_SYNC);
 
 	if (sb->s_op->sync_fs)
-		sb->s_op->sync_fs(sb, wait);
-	return __sync_blockdev(sb->s_bdev, wait);
+		ret = sb->s_op->sync_fs(sb, wait);
+	ret2 = __sync_blockdev(sb->s_bdev, wait);
+	return ret ? ret : ret2;
 }
 
 /*

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-12-01 15:24 ` Vivek Goyal @ 2020-12-01 16:10 ` Jeff Layton 0 siblings, 0 replies; 31+ messages in thread From: Jeff Layton @ 2020-12-01 16:10 UTC (permalink / raw) To: Vivek Goyal Cc: Sargun Dhillon, Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Tue, 2020-12-01 at 10:24 -0500, Vivek Goyal wrote: > On Tue, Dec 01, 2020 at 08:01:11AM -0500, Jeff Layton wrote: > > On Tue, 2020-12-01 at 11:09 +0000, Sargun Dhillon wrote: > > > On Sat, Nov 28, 2020 at 08:52:27AM +0000, Sargun Dhillon wrote: > > > > On Sat, Nov 28, 2020 at 09:12:03AM +0200, Amir Goldstein wrote: > > > > > On Sat, Nov 28, 2020 at 6:45 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > > > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > > > > > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > > > > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > > > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > > > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > > > > > > --- > > > > > > > > > > Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > > > > > > > 1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > > > > > > > This is enabled with the "volatile" mount option. Volatile mounts are not > > > > > > > > > > guaranteed to survive a crash. It is strongly recommended that volatile > > > > > > > > > > mounts are only used if data written to the overlay can be recreated > > > > > > > > > > -without significant effort. > > > > > > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > > > > > > +example, upon detecting a fault, ext4 can be configured to panic.
> > > > > > > > > > > > > > > > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. > > > > > > > > > Specifically, when I wrote that the very minimum is to document the > > > > > > > > > issue, it was under the assumption that a proper fix is hard. > > > > > > > > > I think I was wrong and that the very minimum is to check for errseq > > > > > > > > > since mount on the fsync and syncfs calls. > > > > > > > > > > > > > > > > > > Why? first of all because it is very very simple and goes a long way to > > > > > > > > > fix the broken contract with applications, not the contract about durability > > > > > > > > > obviously, but the contract about write+fsync+read expects to find the written > > > > > > > > > data (during the same mount era). > > > > > > > > > > > > > > > > > > Second, because the sentence you added above is hard for users to understand > > > > > > > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > > > > > > > then this sentence becomes irrelevant. > > > > > > > > > The fact that ext4 can lose data if application did not fsync after > > > > > > > > > write is true > > > > > > > > > for volatile as well as non-volatile and it is therefore not relevant > > > > > > > > > in the context > > > > > > > > > of overlayfs documentation at all. > > > > > > > > > > > > > > > > > > Am I wrong saying that it is very very simple to fix? > > > > > > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > > > > > > be easily applied to stable kernels? > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Amir. > > > > > > > > > > > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > > > > > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > > > > > > > errseq of the real corresponding real file's superblock. 
One solution might be > > > > > > > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > > > > > > > that backs the upperdir, and then rely on VFS's checking. > > > > > > > > > > > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > > > > > > > error tracking. It seems like this infrastructure was only partially completed > > > > > > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > > > > > > > as not all of his patches landed. > > > > > > > > > > > > > > > > > > > > > > The patches in the series were all merged, but we ended up going with a > > > > > > > simpler solution [1] than the first series I posted. Instead of plumbing > > > > > > > the errseq_t handling down into sync_fs, we did it in the syscall > > > > > > > wrapper. > > > > > > Jeff, > > > > > > > > > > > > Thanks for replying. I'm still a little confused as to what the > > > > > > per-address_space wb_err is for. It seems like we should implement the > > > > > > flush method, like so: > > > > > > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > > > > > > index efccb7c1f9bc..32e5bc0aacd6 100644 > > > > > > --- a/fs/overlayfs/file.c > > > > > > +++ b/fs/overlayfs/file.c > > > > > > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, > > > > > > remap_flags, op); > > > > > > } > > > > > > > > > > > > +static int ovl_flush(struct file *file, fl_owner_t id) > > > > > > +{ > > > > > > + struct file *realfile = file->private_data; > > > > > > + > > > > > > + if (realfile->f_op->flush) > > > > > > + return realfile->f_op->flush(file, id); > > > > > > + > > > > > > + return 0; > > > > > > +} > > > > > > + > > > > > > const struct file_operations ovl_file_operations = { > > > > > > .open = ovl_open, > > > > > > .release = ovl_release, > > > > > > @@ -798,6 +808,7 @@ const struct
file_operations ovl_file_operations = { > > > > > > .fallocate = ovl_fallocate, > > > > > > .fadvise = ovl_fadvise, > > > > > > .unlocked_ioctl = ovl_ioctl, > > > > > > + .flush = ovl_flush, > > > > > > #ifdef CONFIG_COMPAT > > > > > > .compat_ioctl = ovl_compat_ioctl, > > > > > > #endif > > > > > > > > > > > > > > > > > > > > > > > > > > I think the tricky part here is that there is no struct file plumbed > > > > > > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > > > > > > > time that gets called. > > > > > > > > > > > > > > What may be easiest is to just propagate the s_wb_err value from the > > > > > > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get > > > > > > > called before the errseq_check_and_advance in the syncfs syscall wrapper > > > > > > > and should ensure that syncfs() calls against the overlayfs sb see any > > > > > > > errors that occurred on the upper_sb. > > > > > > > > > > > > > > Something like this maybe? Totally untested of course.
May also need to > > > > > > > think about how to ensure ordering between racing syncfs calls too > > > > > > > (don't really want the s_wb_err going "backward"): > > > > > > > > > > > > > > ----------------------------8<--------------------------------- > > > > > > > $ git diff > > > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > > > > > index 290983bcfbb3..d725705abdac 100644 > > > > > > > --- a/fs/overlayfs/super.c > > > > > > > +++ b/fs/overlayfs/super.c > > > > > > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > > > > ret = sync_filesystem(upper_sb); > > > > > > > up_read(&upper_sb->s_umount); > > > > > > > > > > > > > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > > > > > > > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > > > > > > > + > > > > > > > return ret; > > > > > > > } > > > > > > > ----------------------------8<--------------------------------- > > > > > > > > > > > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html > > > > > > > > > > > > So, > > > > > > I think this will have bad behaviour because syncfs() twice in a row will > > > > > > return the error twice. The reason is that normally in syncfs it calls > > > > > > errseq_check_and_advance marking the error on the super block as seen. If we > > > > > > copy-up the error value each time, it will break this semantic, as we do not set > > > > > > seen on the upperdir.
> > > > > > > > > > > > Either we need to set the seen flag on the upperdir's errseq_t, or have a sync > > > > > > method, like: > > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > > > > index 290983bcfbb3..4931d1797c03 100644 > > > > > > --- a/fs/overlayfs/super.c > > > > > > +++ b/fs/overlayfs/super.c > > > > > > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > > > { > > > > > > struct ovl_fs *ofs = sb->s_fs_info; > > > > > > struct super_block *upper_sb; > > > > > > + errseq_t src; > > > > > > int ret; > > > > > > > > > > > > if (!ovl_upper_mnt(ofs)) > > > > > > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > > > > > ret = sync_filesystem(upper_sb); > > > > > > up_read(&upper_sb->s_umount); > > > > > > > > > > > > + /* Propagate the error up from the upper_sb once */ > > > > > > + src = READ_ONCE(upper_sb->s_wb_err); > > > > > > + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) > > > > > > + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); > > > > > > + > > > > > > return ret; > > > > > > } > > > > > > > > > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) > > > > > > > > > > > > sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; > > > > > > sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; > > > > > > + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); > > > > > > > > > > > > diff --git a/lib/errseq.c b/lib/errseq.c > > > > > > index 81f9e33aa7e7..53275c168265 100644 > > > > > > --- a/lib/errseq.c > > > > > > +++ b/lib/errseq.c > > > > > > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > > > > > > return err; > > > > > > } > > > > > >
EXPORT_SYMBOL(errseq_check_and_advance); > > > > > > + > > > > > > +/** > > > > > > + * errseq_counter() - Get the value of the errseq counter > > > > > > + * @eseq: Value being checked > > > > > > + * > > > > > > + * This returns the errseq_t counter value. Reminder, it can wrap because > > > > > > + * there are only a limited number of counter bits. > > > > > > + * > > > > > > + * Return: The counter portion of the errseq_t. > > > > > > + */ > > > > > > +int errseq_counter(errseq_t eseq) > > > > > > +{ > > > > > > + return eseq >> (ERRSEQ_SHIFT + 1); > > > > > > +} > > > > > > > > > > > > This also needs some locking sprinkled on it because racing can occur with > > > > > > sync_fs. This would be much easier if the errseq_t was 64-bits, and didn't go > > > > > > backwards, because you could just use a simple cmpxchg64. I know that would make > > > > > > a bunch of structures larger, and if it's just overlayfs that has to pay the tax > > > > > > we can just sprinkle a mutex on it. > > > > > > > > > > > > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, > > > > > > and we haven't done a copy yet, we'll get it. Since this will be strictly > > > > > > serialized a simple equivalence check will work versus actually having to deal > > > > > > with happens-before behaviour. There is still a correctness flaw here though, > > > > > > which is that if exactly enough errors occur to result in it wrapping back to the same > > > > > > value, it will break. > > > > > > > > > > > > By the way, how do y'all test this error handling behaviour? I didn't > > > > > > find any automated testing for what currently exists. > > > > > > > > > > > > > > How about I split this into two patchsets? One, where I add the logic to copy > > > > > > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock, > > > > > > > and thus allowing VFS to bubble up errors, and the documentation.
We can CC > > > > > > > stable on those because I think it has an effect that's universal across > > > > > > > all filesystems. > > > > > > > > > > > > > > P.S. > > > > > > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > > > > > > > think for this kind of test, it requires in kernel code to artificially bump the > > > > > > > writeback error count on the upperdir, or it requires the failure injection > > > > > > > infrastructure. > > > > > > > > > > > > > > Simulating this behaviour is non-trivial without in-kernel support: > > > > > > > > > > > > > > P1: Open(f) -> p1.fd > > > > > > > P2: Open(f) -> p2.fd > > > > > > > P1: syncfs(p1.fd) -> errno > > > > > > > P2: syncfs(p1.fd) -> 0 # This should return an error > > > > > > > > > > > > > > > > > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175 > > > > > > > [2]: https://lwn.net/Articles/722250/ > > > > > > > > > > > > > > > > > > > > > > > Sargun (and Jeff), > > > > > > > > > > Thank you for this discussion. It would be very nice to work on testing > > > > > and fixing errseq propagation on overlayfs. > > > > > Alas, this is not what I suggested. > > > > > > > > > > What I suggested was a solution only for the volatile overlay issue > > > > > where data can vaporise without applications noticing: > > > > > "...the very minimum is to check for errseq since mount on the fsync > > > > > and syncfs calls." > > > > > > > > > Yeah, I was confusing the checking that VFS does on our behalf and the checking > > > > that we can do ourselves in the sync callback. If we return an error prior to > > > > the vfs checking it short-circuits that entirely. > > > > > > > > > Do you get it? There is no per-file state in the game, not for fsync and not > > > > > for syncfs.
> > > > > > > > > > Any single error, no matter how temporary it is and what damage it may > > > > > or may not have caused to upper layer consistency, permanently > > > > > invalidates the reliability of the volatile overlay, resulting in: > > > > > Effective immediately: every fsync/syncfs returns EIO. > > > > > Going forward: maybe implement overlay shutdown, so every access > > > > > returns EIO. > > > > > > > > > > So now that I hopefully explained myself better, I'll ask again: > > > > > Am I wrong saying that it is very very simple to fix? > > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > > be easily applied to stable kernels? > > > > > > > > > > Thanks, > > > > > Amir. > > > > I think that this should be easy enough if the semantic is such that volatile > > > > overlayfs mounts will return EIO on every syncfs call if the upperdir's > > > > super block has experienced errors since the initial mount. I imagine we do not > > > > want to make it such that if the upperdir has ever experienced errors, we return > > > > EIO on syncfs. > > > > > > > > The one caveat that I see is that if the errseq wraps, we can silently begin > > > > swallowing errors again. Thus, on the first failed syncfs we should just > > > > store a flag indicating that the volatile fs is bad, and continue to return > > > > EIO rather than go through the process of checking the errseq_t, but that's easy > > > > enough to write. > > > > > > It turns out this is much more complicated than I initially thought. I'm not > > > entirely sure why VFS even defines sync_fs (on the superblock) as returning an > > > int. The way that sync_fs works is: > > > > > > sys_syncfs -> > > > sync_filesystem(sb) -> > > > __sync_filesystem(sb, N) -> > > > sb->s_op->sync_fs > > > /* __sync_blockdev has its own writeback error checking */ > > > __sync_blockdev > > > errseq_check_and_advance(sb...)
> > > > > > __sync_filesystem calls the sync_fs callback on the superblock's super_operations: > > > > > > static int __sync_filesystem(struct super_block *sb, int wait) > > > { > > > if (wait) > > > sync_inodes_sb(sb); > > > else > > > writeback_inodes_sb(sb, WB_REASON_SYNC); > > > > > > if (sb->s_op->sync_fs) > > > sb->s_op->sync_fs(sb, wait); > > > return __sync_blockdev(sb->s_bdev, wait); > > > } > > > > > > But it completely discards the result. On the other hand, fsync, and fdatasync > > > exhibit different behaviour: > > > > > > sys_fdatasync -> > > > do_fsync > > > (see below) > > > > > > sys_fsync -> > > > do_fsync -> > > > vfs_fsync -> > > > vfs_fsync_range -> > > > return file->f_op->fsync > > > ---- > > > > > > > This is mainly a result of the evolution of the code. In the beginning > > there was only sync() and it doesn't report errors. Eventually syncfs() > > was added, but the only error it reliably reported was EBADF. > > > > I added the machinery to record writeback errors of individual inodes at > > the superblock level recently, but the sync_fs op was left alone (mainly > > because changing this would be very invasive, and the value of doing so > > wasn't completely clear). > > Is the following patch enough to capture the ->sync_fs return value and return > it to user space on syncfs? Or is there more to it that I am missing?
> > > --- > fs/sync.c | 7 +++++-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > Index: redhat-linux/fs/sync.c > =================================================================== > --- redhat-linux.orig/fs/sync.c 2020-08-10 15:40:22.535349278 -0400 > +++ redhat-linux/fs/sync.c 2020-12-01 10:19:03.624330578 -0500 > @@ -30,14 +30,17 @@ > */ > static int __sync_filesystem(struct super_block *sb, int wait) > { > + int ret = 0, ret2; > + > if (wait) > sync_inodes_sb(sb); > else > writeback_inodes_sb(sb, WB_REASON_SYNC); > > > if (sb->s_op->sync_fs) > - sb->s_op->sync_fs(sb, wait); > - return __sync_blockdev(sb->s_bdev, wait); > + ret = sb->s_op->sync_fs(sb, wait); > + ret2 = __sync_blockdev(sb->s_bdev, wait); > + return ret ? ret : ret2; > } > > > /* > Maybe. The question of course is what will regress once we make that change. Are there sync_fs routines in place today that regularly return errors that are just ignored? It's difficult to know without testing all the affected filesystems. If we did decide to do that, then there are some other callers of sync_fs -- dquot_quota_sync, for instance -- and we probably need to vet them carefully. Also, a sync_fs error could be transient. Suppose something calls sync() and it gets an error back from sync_fs (which is ignored), and then something else calls syncfs() and doesn't see any error because its sync_fs call succeeded. Arguably, we should return an error in that situation. What may work best is to just do an errseq_set on sb->s_wb_err when either sync_fs or __sync_blockdev returns an error. Then, the errseq_check_and_advance in syncfs can eventually report it. -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-11-28 4:45 ` Sargun Dhillon 2020-11-28 7:12 ` Amir Goldstein @ 2020-11-28 12:04 ` Jeff Layton 1 sibling, 0 replies; 31+ messages in thread From: Jeff Layton @ 2020-11-28 12:04 UTC (permalink / raw) To: Sargun Dhillon Cc: Amir Goldstein, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells On Sat, 2020-11-28 at 04:45 +0000, Sargun Dhillon wrote: > On Fri, Nov 27, 2020 at 09:01:07PM -0500, Jeff Layton wrote: > > On Fri, 2020-11-27 at 22:11 +0000, Sargun Dhillon wrote: > > > On Fri, Nov 27, 2020 at 02:52:52PM +0200, Amir Goldstein wrote: > > > > On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > > > > > > > > > This documents behaviour that was discussed in a thread about the volatile > > > > > feature. Specifically, how failures can go hidden from asynchronous writes > > > > > (such as from mmap, or writes that are not immediately flushed to the > > > > > filesystem). Although we pass through calls like msync, fallocate, and > > > > > write, and will still return errors on those, it doesn't guarantee all > > > > > kinds of errors will happen at those times, and thus may hide errors. > > > > > > > > > > In the future, we can add error checking to all interactions with the > > > > > upperdir, and pass through errseq_t from the upperdir on mappings, > > > > > and other interactions with the filesystem[1]. 
> > > > > > > > > > [1]: https://lore.kernel.org/linux-unionfs/20201116045758.21774-1-sargun@sargun.me/T/#m7d501f375e031056efad626e471a1392dd3aad33 > > > > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > > Cc: linux-fsdevel@vger.kernel.org > > > > > Cc: linux-unionfs@vger.kernel.org > > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > > --- > > > > >  Documentation/filesystems/overlayfs.rst | 6 +++++- > > > > >  1 file changed, 5 insertions(+), 1 deletion(-) > > > > > > > > > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > > > > > index 580ab9a0fe31..c6e30c1bc2f2 100644 > > > > > --- a/Documentation/filesystems/overlayfs.rst > > > > > +++ b/Documentation/filesystems/overlayfs.rst > > > > > @@ -570,7 +570,11 @@ Volatile mount > > > > >  This is enabled with the "volatile" mount option. Volatile mounts are not > > > > >  guaranteed to survive a crash. It is strongly recommended that volatile > > > > >  mounts are only used if data written to the overlay can be recreated > > > > > -without significant effort. > > > > > +without significant effort. In addition to this, the sync family of syscalls > > > > > +are not sufficient to determine whether a write failed as sync calls are > > > > > +omitted. For this reason, it is important that the filesystem used by the > > > > > +upperdir handles failure in a fashion that's suitable for the user. For > > > > > +example, upon detecting a fault, ext4 can be configured to panic. > > > > > > > > > > > > > Reading this now, I think I may have wrongly analysed the issue. > > > > Specifically, when I wrote that the very minimum is to document the > > > > issue, it was under the assumption that a proper fix is hard. > > > > I think I was wrong and that the very minimum is to check for errseq > > > > since mount on the fsync and syncfs calls. > > > > > > > > Why? 
First of all because it is very very simple and goes a long way to > > > > fix the broken contract with applications, not the contract about durability > > > > obviously, but the contract that write+fsync+read expects to find the written > > > > data (during the same mount era). > > > > > > > > Second, because the sentence you added above is hard for users to understand > > > > and out of context. If we properly handle the writeback error in fsync/syncfs, > > > > then this sentence becomes irrelevant. > > > > The fact that ext4 can lose data if the application did not fsync after > > > > write is true > > > > for volatile as well as non-volatile and it is therefore not relevant > > > > in the context > > > > of overlayfs documentation at all. > > > > > > > > Am I wrong saying that it is very very simple to fix? > > > > Would you mind making that fix at the bottom of the patch series, so it can > > > > be easily applied to stable kernels? > > > > > > > > Thanks, > > > > Amir. > > > > > > I'm not sure it's so easy. In VFS, there are places where the superblock's > > > errseq is checked[1]. AFAIK, that's going to check "our" errseq, and not the > > > errseq of the corresponding real file's superblock. One solution might be > > > as part of all these callbacks we set our errseq to the errseq of the filesystem > > > that backs the upperdir, and then rely on VFS's checking. > > > > > > I'm having a hard time figuring out how to deal with the per-mapping based > > > error tracking. It seems like this infrastructure was only partially completed > > > by Jeff Layton[2]. I don't know how it's actually supposed to work right now, > > > as not all of his patches landed. > > > > > > > The patches in the series were all merged, but we ended up going with a > > simpler solution [1] than the first series I posted. Instead of plumbing > > the errseq_t handling down into sync_fs, we did it in the syscall > > wrapper. > Jeff, > > Thanks for replying.
I'm still a little confused as to what the > per-address_space wb_err is for. It seems like we should implement the > flush method, like so: > diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c > index efccb7c1f9bc..32e5bc0aacd6 100644 > --- a/fs/overlayfs/file.c > +++ b/fs/overlayfs/file.c > @@ -787,6 +787,16 @@ static loff_t ovl_remap_file_range(struct file *file_in, loff_t pos_in, > remap_flags, op); > } > > > > > +static int ovl_flush(struct file *file, fl_owner_t id) > +{ > + struct file *realfile = file->private_data; > + > + if (realfile->f_op->flush) > + return realfile->f_op->flush(file, id); > + > + return 0; > +} > + > const struct file_operations ovl_file_operations = { > .open = ovl_open, > .release = ovl_release, > @@ -798,6 +808,7 @@ const struct file_operations ovl_file_operations = { > .fallocate = ovl_fallocate, > .fadvise = ovl_fadvise, > .unlocked_ioctl = ovl_ioctl, > + .flush = ovl_flush, > #ifdef CONFIG_COMPAT > .compat_ioctl = ovl_compat_ioctl, > #endif > > > > > > I think the tricky part here is that there is no struct file plumbed > > into ->sync_fs, so you don't have an errseq_t cursor to work with by the > > time that gets called. > > > > What may be easiest is to just propagate the s_wb_err value from the > > upper_sb to the overlayfs superblock in ovl_sync_fs(). That should get > > called before the errseq_check_and_advance in the syncfs syscall wrapper > > and should ensure that syncfs() calls against the overlayfs sb see any > > errors that occurred on the upper_sb. > > > > Something like this maybe? Totally untested of course.
May also need to > > think about how to ensure ordering between racing syncfs calls too > > (don't really want the s_wb_err going "backward"): > > > > ----------------------------8<--------------------------------- > > $ git diff > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > index 290983bcfbb3..d725705abdac 100644 > > --- a/fs/overlayfs/super.c > > +++ b/fs/overlayfs/super.c > > @@ -283,6 +283,9 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > > ret = sync_filesystem(upper_sb); > > up_read(&upper_sb->s_umount); > > > > > > > > > > + /* Propagate s_wb_err from upper_sb to overlayfs sb */ > > + WRITE_ONCE(sb->s_wb_err, READ_ONCE(upper_sb->s_wb_err)); > > + > > return ret; > > } > > ----------------------------8<--------------------------------- > > > > [1]: https://www.spinics.net/lists/linux-api/msg41276.html > > So, > I think this will have bad behaviour because syncfs() twice in a row will > return the error twice. The reason is that normally in syncfs it calls > errseq_check_and_advance marking the error on the super block as seen. If we > copy-up the error value each time, it will break this semantic, as we do not set > seen on the upperdir. > You're absolutely right. I thought about this after I had sent the mail last night. 
> Either we need to set the seen flag on the upperdir's errseq_t, or have a sync > method, like: > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > index 290983bcfbb3..4931d1797c03 100644 > --- a/fs/overlayfs/super.c > +++ b/fs/overlayfs/super.c > @@ -259,6 +259,7 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > { > struct ovl_fs *ofs = sb->s_fs_info; > struct super_block *upper_sb; > + errseq_t src; > int ret; > > > > > > > > > if (!ovl_upper_mnt(ofs)) > @@ -283,6 +284,11 @@ static int ovl_sync_fs(struct super_block *sb, int wait) > ret = sync_filesystem(upper_sb); > up_read(&upper_sb->s_umount); > > > > > > > > > + /* Propagate the error up from the upper_sb once */ > + src = READ_ONCE(upper_sb->s_wb_err); > + if (errseq_counter(src) != errseq_counter(sb->s_wb_err)) > + WRITE_ONCE(sb->s_wb_err, src & ~ERRSEQ_SEEN); > + > return ret; > } > > > > > > > > > @@ -1945,6 +1951,7 @@ static int ovl_fill_super(struct super_block *sb, void *data, int silent) > > > > > > > > > sb->s_stack_depth = ovl_upper_mnt(ofs)->mnt_sb->s_stack_depth; > sb->s_time_gran = ovl_upper_mnt(ofs)->mnt_sb->s_time_gran; > + sb->s_wb_err = READ_ONCE(ovl_upper_mnt(ofs)->mnt_sb->s_wb_err); > > > > > > > > > diff --git a/lib/errseq.c b/lib/errseq.c > index 81f9e33aa7e7..53275c168265 100644 > --- a/lib/errseq.c > +++ b/lib/errseq.c > @@ -204,3 +204,18 @@ int errseq_check_and_advance(errseq_t *eseq, errseq_t *since) > return err; > } > EXPORT_SYMBOL(errseq_check_and_advance); > + > +/** > + * errseq_counter() - Get the value of the errseq counter > + * @eseq: Value being checked > + * > + * This returns the errseq_t counter value. Reminder, it can wrap because > + * there are only a limited number of counter bits. > + * > + * Return: The counter portion of the errseq_t. > + */ > +int errseq_counter(errseq_t eseq) > +{ > + return eseq >> (ERRSEQ_SHIFT + 1); > +} > > This also needs some locking sprinkled on it because racing can occur with > sync_fs. 
This would be much easier if the errseq_t was 64-bits, and didn't go > backwards, because you could just use a simple cmpxchg64. I know that would make > a bunch of structures larger, and if it's just overlayfs that has to pay the tax > we can just sprinkle a mutex on it. > A spinlock would work here too. None of these are blocking operations. > We can always "copy-up" the errseq_t because if the upperdir's errseq_t is read, > and we haven't done a copy yet, we'll get it. Since this will be strictly > serialized a simple equivalence check will work versus actually having to deal > with happens-before behaviour. There is still a correctness flaw here though, > which is that if exactly enough errors occur to result in it wrapping back to the same > value, it will break. > > By the way, how do y'all test this error handling behaviour? I didn't > find any automated testing for what currently exists. There are a couple of xfstests that you may be able to use as a starting point. See: generic/441 generic/442 > > > > How about I split this into two patchsets? One, where I add the logic to copy > > the errseq on callbacks to fsync from the upperdir to the ovl fs superblock, > > and thus allowing VFS to bubble up errors, and the documentation. We can CC > > stable on those because I think it has an effect that's universal across > > all filesystems. > > > > P.S. > > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > > think for this kind of test, it requires in kernel code to artificially bump the > > writeback error count on the upperdir, or it requires the failure injection > > infrastructure.
> > > > Simulating this behaviour is non-trivial without in-kernel support: > > > > P1: Open(f) -> p1.fd > > P2: Open(f) -> p2.fd > > P1: syncfs(p1.fd) -> errno > > P2: syncfs(p1.fd) -> 0 # This should return an error > > > > > > [1]: https://elixir.bootlin.com/linux/latest/source/fs/sync.c#L175 > > [2]: https://lwn.net/Articles/722250/ > > > > > > -- > > Jeff Layton <jlayton@redhat.com> > > > -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile 2020-11-27 22:11 ` Sargun Dhillon 2020-11-28 2:01 ` Jeff Layton @ 2020-11-28 8:56 ` Amir Goldstein 2020-11-28 9:06 ` Amir Goldstein 1 sibling, 1 reply; 31+ messages in thread From: Amir Goldstein @ 2020-11-28 8:56 UTC (permalink / raw) To: Sargun Dhillon Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells, Jeff Layton > I notice you maintain overlay tests outside of the kernel. Unfortunately, I > think for this kind of test, it requires in kernel code to artificially bump the > writeback error count on the upperdir, or it requires the failure injection > infrastructure. > > Simulating this behaviour is non-trivial without in-kernel support: > > P1: Open(f) -> p1.fd > > P2: Open(f) -> p2.fd > > P1: syncfs(p1.fd) -> errno > > P2: syncfs(p1.fd) -> 0 # This should return an error > failure injection is an option. xfstest generic/019 is an example. generic/108 uses a different method and generic/361 uses a plain loop image over commit without any fault injection to trigger writeback errors. With a current xfstests check -overlay run (which runs all generic tests with overlayfs over the configured base fs), all three tests mentioned above will be skipped because of _require_block_device, but the requirement is there for a different reason for each of them. At first look, the loop device approach is the most generic one and could easily work also with overlayfs, so you could create an overlay specific test (non generic) based on generic/361, but it is not easy to use the helper _scratch_mkfs_sized, because check -overlay runs do not mkfs the base scratch fs.
My recommendation would be to fix generic/019 in a similar manner to the
way that tests that _require_scratch_shutdown were fixed to run with
check -overlay:

* Instead of _require_block_device, add a specific helper
  _require_scratch_fail_make_request, which like _require_scratch_shutdown
  knows how to deal with overlay FSTYP correctly

* Instead of `_sysfs_dev $SCRATCH_DEV` use a helper _scratch_sysfs_dev
  that knows how to deal with overlay FSTYP correctly

This will add test coverage to overlayfs fsync/syncfs and then you can
do one or both of:
1. Run 'check -overlay generic/019' with OVERLAY_MOUNT_OPTIONS="volatile"
2. Fork an overlay specific test from the generic test that will test the
   volatile error handling on every 'check -overlay -g quick' run

#2 will provide better coverage against regressions in volatile writeback
error handling and will be a good start for a test to reuse a volatile
mount after writeback errors.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile
  2020-11-28  8:56           ` Amir Goldstein
@ 2020-11-28  9:06             ` Amir Goldstein
  0 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2020-11-28 9:06 UTC (permalink / raw)
  To: Sargun Dhillon
  Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano,
	Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells, Jeff Layton

On Sat, Nov 28, 2020 at 10:56 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> > I notice you maintain overlay tests outside of the kernel. Unfortunately, I
> > think for this kind of test, it requires in-kernel code to artificially bump the
> > writeback error count on the upperdir, or it requires the failure injection
> > infrastructure.
> >
> > Simulating this behaviour is non-trivial without in-kernel support:
> >
> > P1: Open(f) -> p1.fd
> > P2: Open(f) -> p2.fd
> > P1: syncfs(p1.fd) -> errno
> > P2: syncfs(p2.fd) -> 0  # This should return an error
> >
>
> Failure injection is an option. xfstest generic/019 is an example.
> generic/108 uses a different method and generic/361 uses a plain
> loop image overcommit without any fault injection to trigger writeback
> errors.
>
> With current xfstests, a check -overlay run (which runs all generic tests
> with overlayfs over the configured base fs) will skip all three tests
> mentioned above because of _require_block_device, but the requirement
> is there for a different reason in each of them.
>
> At first look, the loop device approach is the most generic one and could
> easily work also with overlayfs, so you could create an overlay-specific
> test (non-generic) based on generic/361, but it is not easy to use the
> helper _scratch_mkfs_sized, because check -overlay runs do not mkfs
> the base scratch fs.
>
> My recommendation would be to fix generic/019 in a similar manner
> to the way that tests that _require_scratch_shutdown were fixed to run
> with check -overlay:
>
> * Instead of _require_block_device, add a specific helper
>   _require_scratch_fail_make_request, which like _require_scratch_shutdown
>   knows how to deal with overlay FSTYP correctly
>
> * Instead of `_sysfs_dev $SCRATCH_DEV` use a helper _scratch_sysfs_dev
>   that knows how to deal with overlay FSTYP correctly
>

I missed:

* Instead of `blockdev --getsz $SCRATCH_DEV` use helper
  _scratch_blockdev_getsz

> This will add test coverage to overlayfs fsync/syncfs and then you can
> do one or both of:
> 1. Run 'check -overlay generic/019' with OVERLAY_MOUNT_OPTIONS="volatile"
> 2. Fork an overlay specific test from the generic test that will test the
>    volatile error handling on every 'check -overlay -g quick' run
>
> #2 will provide better coverage against regressions in volatile writeback error
> handling and will be a good start for a test to reuse a volatile mount after
> writeback errors.
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe
  2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
  2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
@ 2020-11-27  9:20 ` Sargun Dhillon
  2020-11-27 11:09   ` kernel test robot
  ` (2 more replies)
  2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
  3 siblings, 3 replies; 31+ messages in thread
From: Sargun Dhillon @ 2020-11-27 9:20 UTC (permalink / raw)
  To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein
  Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh,
	linux-fsdevel, David Howells

Overlayfs added the ability to set up mounts where all syncs can be
short-circuited (2a99ddacee43: ovl: provide a mount option "volatile").

A user might want to remount this fs, but we do not let the user because
of the "incompat" detection feature. In the case of volatile, it is safe
to do something like[1]:

$ sync -f /root/upperdir
$ rm -rf /root/workdir/incompat/volatile

There are two ways to go about this. You can call sync on the underlying
filesystem, check the error code, and delete the dirty file if everything
is clean. If you're running lots of containers on the same filesystem, or
you want to avoid all unnecessary I/O, this may be suboptimal.

Alternatively, you can blindly delete the dirty file and "hope for the
best".

This patch introduces transparent functionality to check whether it is
(relatively) safe to reuse the upperdir. It ensures that the filesystem
hasn't been remounted, the system hasn't been rebooted, and the overlayfs
code hasn't changed. Since the structure is explicitly not meant to be
used between different versions of the code, its stability does not
matter much.
[1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: linux-fsdevel@vger.kernel.org Cc: linux-unionfs@vger.kernel.org Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> --- Documentation/filesystems/overlayfs.rst | 18 +++-- fs/overlayfs/overlayfs.h | 37 +++++++++- fs/overlayfs/readdir.c | 98 ++++++++++++++++++++++--- fs/overlayfs/super.c | 73 +++++++++++++----- fs/overlayfs/util.c | 2 + 5 files changed, 190 insertions(+), 38 deletions(-) diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index c6e30c1bc2f2..b485fdb65b85 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -579,13 +579,17 @@ example, upon detecting a fault, ext4 can be configured to panic. The advantage of mounting with the "volatile" option is that all forms of sync calls to the upper filesystem are omitted. -When overlay is mounted with "volatile" option, the directory -"$workdir/work/incompat/volatile" is created. During next mount, overlay -checks for this directory and refuses to mount if present. This is a strong -indicator that user should throw away upper and work directories and create -fresh one. In very limited cases where the user knows that the system has -not crashed and contents of upperdir are intact, The "volatile" directory -can be removed. +When overlay is mounted with the "volatile" option, the directory +"$workdir/work/incompat/volatile" is created. This acts as a indicator +that the user should throw away upper and work directories and create fresh +ones. In some cases, the overlayfs can detect if the upperdir can be +reused safely in a subsequent volatile mounts, and mounting will proceed as +normal. 
If the filesystem is unable to determine if this is safe (due to a +reboot, upgraded kernel code, or loss of checkpoint, etc...), the user may +bypass these safety checks and remove the "volatile" directory if they know +the system did not encounter a fault and the contents of the upperdir are +intact. Then, the user can remount the filesystem as normal. + Testsuite --------- diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h index f8880aa2ba0e..de694ee99d7c 100644 --- a/fs/overlayfs/overlayfs.h +++ b/fs/overlayfs/overlayfs.h @@ -32,8 +32,13 @@ enum ovl_xattr { OVL_XATTR_NLINK, OVL_XATTR_UPPER, OVL_XATTR_METACOPY, + OVL_XATTR_VOLATILE, }; +#define OVL_INCOMPATDIR_NAME "incompat" +#define OVL_VOLATILEDIR_NAME "volatile" +#define OVL_VOLATILE_DIRTY_NAME "dirty" + enum ovl_inode_flag { /* Pure upper dir that may contain non pure upper entries */ OVL_IMPURE, @@ -57,6 +62,31 @@ enum { OVL_XINO_ON, }; +/* + * This is copied into the volatile xattr, and the user does not interact with + * it. There is no stability requirement, as a reboot explicitly invalidates + * a volatile workdir. It is explicitly meant not to be a stable api. + * + * Although this structure isn't meant to be stable it is exposed to potentially + * unprivileged users. We don't do any kind of cryptographic operations with + * the structure, so it could be tampered with, or inspected. Don't put + * kernel memory pointers in it, or anything else that could cause problems, + * or information disclosure. + */ +struct ovl_volatile_info { + /* + * This uniquely identifies a boot, and is reset if overlayfs itself + * is reloaded. Therefore we check our current / known boot_id + * against this before looking at any other fields to validate: + * 1. Is this datastructure laid out in the way we expect? (Overlayfs + * module, reboot, etc...) + * 2. 
Could something have changed (like the s_instance_id counter + * resetting) + */ + uuid_t ovl_boot_id; /* Must stay first member */ + u64 s_instance_id; +} __packed; + /* * The tuple (fh,uuid) is a universal unique identifier for a copy up origin, * where: @@ -422,8 +452,8 @@ void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list); void ovl_cache_free(struct list_head *list); void ovl_dir_cache_free(struct inode *inode); int ovl_check_d_type_supported(struct path *realpath); -int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt, - struct dentry *dentry, int level); +int ovl_workdir_cleanup(struct ovl_fs *ofs, struct inode *dir, + struct vfsmount *mnt, struct dentry *dentry, int level); int ovl_indexdir_cleanup(struct ovl_fs *ofs); /* inode.c */ @@ -520,3 +550,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower, /* export.c */ extern const struct export_operations ovl_export_operations; + +/* super.c */ +extern uuid_t ovl_boot_id; diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c index 01620ebae1bd..7b66fbb20261 100644 --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -1080,10 +1080,78 @@ int ovl_check_d_type_supported(struct path *realpath) return rdd.d_type_supported; } +static int ovl_verify_volatile_info(struct ovl_fs *ofs, + struct dentry *volatiledir) +{ + int err; + struct ovl_volatile_info info; + + if (!volatiledir->d_inode) + return 0; + + if (!ofs->config.ovl_volatile) { + pr_debug("Mount is not volatile; upperdir is marked volatile\n"); + return -EINVAL; + } + + err = ovl_do_getxattr(ofs, volatiledir, OVL_XATTR_VOLATILE, &info, + sizeof(info)); + if (err < 0) { + pr_debug("Unable to read volatile xattr: %d\n", err); + return -EINVAL; + } + + if (err != sizeof(info)) { + pr_debug("%s xattr on-disk size is %d expected to read %zd\n", + ovl_xattr(ofs, OVL_XATTR_VOLATILE), err, sizeof(info)); + return -EINVAL; + } + + if (!uuid_equal(&ovl_boot_id, &info.ovl_boot_id)) { + pr_debug("boot id has 
changed (reboot or module reloaded)\n"); + return -EINVAL; + } + + if (volatiledir->d_sb->s_instance_id != info.s_instance_id) { + pr_debug("workdir has been unmounted and remounted\n"); + return -EINVAL; + } + + return 1; +} -#define OVL_INCOMPATDIR_NAME "incompat" +/* + * ovl_check_incompat checks this specific incompat entry for incompatibility. + * If it is found to be incompatible -EINVAL will be returned. + * + * If the directory should be preserved, then this function returns 1. + */ +static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p, + struct path *path) +{ + int err = -EINVAL; + struct dentry *d; + + if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) { + d = lookup_one_len(p->name, path->dentry, p->len); + if (IS_ERR(d)) + return PTR_ERR(d); + + err = ovl_verify_volatile_info(ofs, d); + dput(d); + } + + if (err == -EINVAL) + pr_err("incompat feature '%s' cannot be mounted\n", p->name); + else + pr_debug("incompat '%s' handled: %d\n", p->name, err); + + dput(d); + return err; +} -static int ovl_workdir_cleanup_recurse(struct path *path, int level) +static int ovl_workdir_cleanup_recurse(struct ovl_fs *ofs, struct path *path, + int level) { int err; struct inode *dir = path->dentry->d_inode; @@ -1125,16 +1193,19 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level) if (p->len == 2 && p->name[1] == '.') continue; } else if (incompat) { - pr_err("overlay with incompat feature '%s' cannot be mounted\n", - p->name); - err = -EINVAL; - break; + err = ovl_check_incompat(ofs, p, path); + if (err < 0) + break; + /* Skip cleaning this */ + if (err == 1) + continue; } dentry = lookup_one_len(p->name, path->dentry, p->len); if (IS_ERR(dentry)) continue; if (dentry->d_inode) - err = ovl_workdir_cleanup(dir, path->mnt, dentry, level); + err = ovl_workdir_cleanup(ofs, dir, path->mnt, dentry, + level); dput(dentry); if (err) break; @@ -1142,11 +1213,13 @@ static int ovl_workdir_cleanup_recurse(struct path *path, int level) 
inode_unlock(dir); out: ovl_cache_free(&list); + if (incompat && err >= 0) + return 1; return err; } -int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt, - struct dentry *dentry, int level) +int ovl_workdir_cleanup(struct ovl_fs *ofs, struct inode *dir, + struct vfsmount *mnt, struct dentry *dentry, int level) { int err; @@ -1159,7 +1232,7 @@ int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt, struct path path = { .mnt = mnt, .dentry = dentry }; inode_unlock(dir); - err = ovl_workdir_cleanup_recurse(&path, level + 1); + err = ovl_workdir_cleanup_recurse(ofs, &path, level + 1); inode_lock_nested(dir, I_MUTEX_PARENT); if (!err) err = ovl_cleanup(dir, dentry); @@ -1206,9 +1279,10 @@ int ovl_indexdir_cleanup(struct ovl_fs *ofs) } /* Cleanup leftover from index create/cleanup attempt */ if (index->d_name.name[0] == '#') { - err = ovl_workdir_cleanup(dir, path.mnt, index, 1); - if (err) + err = ovl_workdir_cleanup(ofs, dir, path.mnt, index, 1); + if (err < 0) break; + err = 0; goto next; } err = ovl_verify_index(ofs, index); diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index 290983bcfbb3..a8ee3ba4ebbd 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -15,6 +15,7 @@ #include <linux/seq_file.h> #include <linux/posix_acl_xattr.h> #include <linux/exportfs.h> +#include <linux/uuid.h> #include "overlayfs.h" MODULE_AUTHOR("Miklos Szeredi <miklos@szeredi.hu>"); @@ -23,6 +24,7 @@ MODULE_LICENSE("GPL"); struct ovl_dir_cache; +uuid_t ovl_boot_id; #define OVL_MAX_STACK 500 @@ -722,20 +724,24 @@ static struct dentry *ovl_workdir_create(struct ovl_fs *ofs, goto out_unlock; retried = true; - err = ovl_workdir_cleanup(dir, mnt, work, 0); - dput(work); - if (err == -EINVAL) { - work = ERR_PTR(err); - goto out_unlock; + err = ovl_workdir_cleanup(ofs, dir, mnt, work, 0); + /* check if we should reuse the workdir */ + if (err != 1) { + dput(work); + if (err == -EINVAL) { + work = ERR_PTR(err); + goto out_unlock; + } + goto retry; } - goto 
retry; + } else { + work = ovl_create_real(dir, work, + OVL_CATTR(attr.ia_mode)); + err = PTR_ERR(work); + if (IS_ERR(work)) + goto out_err; } - work = ovl_create_real(dir, work, OVL_CATTR(attr.ia_mode)); - err = PTR_ERR(work); - if (IS_ERR(work)) - goto out_err; - /* * Try to remove POSIX ACL xattrs from workdir. We are good if: * @@ -1237,26 +1243,58 @@ static struct dentry *ovl_lookup_or_create(struct dentry *parent, return child; } +static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir) +{ + int err; + struct ovl_volatile_info info = { + .s_instance_id = volatiledir->d_sb->s_instance_id, + }; + + uuid_copy(&info.ovl_boot_id, &ovl_boot_id); + err = ovl_do_setxattr(ofs, volatiledir, OVL_XATTR_VOLATILE, &info, + sizeof(info)); + + if (err == -EOPNOTSUPP) + return 0; + + return err; +} + /* * Creates $workdir/work/incompat/volatile/dirty file if it is not already * present. */ static int ovl_create_volatile_dirty(struct ovl_fs *ofs) { + int err; unsigned int ctr; - struct dentry *d = dget(ofs->workbasedir); + struct dentry *volatiledir, *d = dget(ofs->workbasedir); static const char *const volatile_path[] = { - OVL_WORKDIR_NAME, "incompat", "volatile", "dirty" + OVL_WORKDIR_NAME, + OVL_INCOMPATDIR_NAME, + OVL_VOLATILEDIR_NAME, }; const char *const *name = volatile_path; - for (ctr = ARRAY_SIZE(volatile_path); ctr; ctr--, name++) { - d = ovl_lookup_or_create(d, *name, ctr > 1 ? 
S_IFDIR : S_IFREG); + /* Create the volatile subdirectory that we put the xattr on */ + for (ctr = 0; ctr < ARRAY_SIZE(volatile_path); ctr++, name++) { + d = ovl_lookup_or_create(d, *name, S_IFDIR); if (IS_ERR(d)) return PTR_ERR(d); } - dput(d); - return 0; + volatiledir = dget(d); + + /* Create the dirty file exists before we set the xattr */ + d = ovl_lookup_or_create(d, OVL_VOLATILE_DIRTY_NAME, S_IFREG); + if (!IS_ERR(d)) { + dput(d); + err = ovl_set_volatile_info(ofs, volatiledir); + } else { + err = PTR_ERR(d); + } + + dput(volatiledir); + return err; } static int ovl_make_workdir(struct super_block *sb, struct ovl_fs *ofs, @@ -2044,6 +2082,7 @@ static int __init ovl_init(void) { int err; + uuid_gen(&ovl_boot_id); ovl_inode_cachep = kmem_cache_create("ovl_inode", sizeof(struct ovl_inode), 0, (SLAB_RECLAIM_ACCOUNT| diff --git a/fs/overlayfs/util.c b/fs/overlayfs/util.c index 23f475627d07..87c9f5a063ed 100644 --- a/fs/overlayfs/util.c +++ b/fs/overlayfs/util.c @@ -580,6 +580,7 @@ bool ovl_check_dir_xattr(struct super_block *sb, struct dentry *dentry, #define OVL_XATTR_NLINK_POSTFIX "nlink" #define OVL_XATTR_UPPER_POSTFIX "upper" #define OVL_XATTR_METACOPY_POSTFIX "metacopy" +#define OVL_XATTR_VOLATILE_POSTFIX "volatile" #define OVL_XATTR_TAB_ENTRY(x) \ [x] = OVL_XATTR_PREFIX x ## _POSTFIX @@ -592,6 +593,7 @@ const char *ovl_xattr_table[] = { OVL_XATTR_TAB_ENTRY(OVL_XATTR_NLINK), OVL_XATTR_TAB_ENTRY(OVL_XATTR_UPPER), OVL_XATTR_TAB_ENTRY(OVL_XATTR_METACOPY), + OVL_XATTR_TAB_ENTRY(OVL_XATTR_VOLATILE), }; int ovl_check_setxattr(struct dentry *dentry, struct dentry *upperdentry, -- 2.25.1 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe 2020-11-27 9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon @ 2020-11-27 11:09 ` kernel test robot 2020-11-27 13:04 ` Amir Goldstein 2020-12-07 11:39 ` Dan Carpenter 2 siblings, 0 replies; 31+ messages in thread From: kernel test robot @ 2020-11-27 11:09 UTC (permalink / raw) To: Sargun Dhillon, linux-unionfs, miklos, Alexander Viro, Amir Goldstein Cc: kbuild-all, clang-built-linux, Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells [-- Attachment #1: Type: text/plain, Size: 3273 bytes --] Hi Sargun, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on be4df0cea08a8b59eb38d73de988b7ba8022df41] url: https://github.com/0day-ci/linux/commits/Sargun-Dhillon/Make-overlayfs-volatile-mounts-reusable/20201127-172416 base: be4df0cea08a8b59eb38d73de988b7ba8022df41 config: powerpc-randconfig-r034-20201127 (attached as .config) compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project f095ac11a9550530a4a54298debb8b04b36422be) reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # install powerpc cross compiling tool for clang build # apt-get install binutils-powerpc-linux-gnu # https://github.com/0day-ci/linux/commit/a9871ab1de6c660e6a4c49fcffdb666851003b35 git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Sargun-Dhillon/Make-overlayfs-volatile-mounts-reusable/20201127-172416 git checkout a9871ab1de6c660e6a4c49fcffdb666851003b35 # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All warnings (new ones prefixed by >>): >> 
fs/overlayfs/readdir.c:1135:6: warning: variable 'd' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/overlayfs/readdir.c:1149:7: note: uninitialized use occurs here dput(d); ^ fs/overlayfs/readdir.c:1135:2: note: remove the 'if' if its condition is always true if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) { ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ fs/overlayfs/readdir.c:1133:18: note: initialize the variable 'd' to silence this warning struct dentry *d; ^ = NULL 1 warning generated. vim +1135 fs/overlayfs/readdir.c 1122 1123 /* 1124 * ovl_check_incompat checks this specific incompat entry for incompatibility. 1125 * If it is found to be incompatible -EINVAL will be returned. 1126 * 1127 * If the directory should be preserved, then this function returns 1. 1128 */ 1129 static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p, 1130 struct path *path) 1131 { 1132 int err = -EINVAL; 1133 struct dentry *d; 1134 > 1135 if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) { 1136 d = lookup_one_len(p->name, path->dentry, p->len); 1137 if (IS_ERR(d)) 1138 return PTR_ERR(d); 1139 1140 err = ovl_verify_volatile_info(ofs, d); 1141 dput(d); 1142 } 1143 1144 if (err == -EINVAL) 1145 pr_err("incompat feature '%s' cannot be mounted\n", p->name); 1146 else 1147 pr_debug("incompat '%s' handled: %d\n", p->name, err); 1148 1149 dput(d); 1150 return err; 1151 } 1152 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 24138 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe 2020-11-27 9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon 2020-11-27 11:09 ` kernel test robot @ 2020-11-27 13:04 ` Amir Goldstein 2020-12-07 11:39 ` Dan Carpenter 2 siblings, 0 replies; 31+ messages in thread From: Amir Goldstein @ 2020-11-27 13:04 UTC (permalink / raw) To: Sargun Dhillon Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells On Fri, Nov 27, 2020 at 11:21 AM Sargun Dhillon <sargun@sargun.me> wrote: > > Overlayfs added the ability to setup mounts where all syncs could be > short-circuted in (2a99ddacee43: ovl: provide a mount option "volatile"). > > A user might want to remount this fs, but we do not let the user because > of the "incompat" detection feature. In the case of volatile, it is safe > to do something like[1]: > > $ sync -f /root/upperdir > $ rm -rf /root/workdir/incompat/volatile > > There are two ways to go about this. You can call sync on the underlying > filesystem, check the error code, and delete the dirty file if everything > is clean. If you're running lots of containers on the same filesystem, or > you want to avoid all unnecessary I/O, this may be suboptimal. > > Alternatively, you can blindly delete the dirty file, and "hope for the > best". > > This patch introduces transparent functionality to check if it is > (relatively) safe to reuse the upperdir. It ensures that the filesystem > hasn't been remounted, the system hasn't been rebooted, nor has the > overlayfs code changed. Since the structure is explicitly not meant to be > used between different versions of the code, its stability does not matter > so much. 
> > [1]: https://lore.kernel.org/linux-unionfs/CAOQ4uxhKr+j5jFyEC2gJX8E8M19mQ3CqdTYaPZOvDQ9c0qLEzw@mail.gmail.com/T/#m6abe713e4318202ad57f301bf28a414e1d824f9c > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-unionfs@vger.kernel.org > Cc: Miklos Szeredi <miklos@szeredi.hu> > Cc: Amir Goldstein <amir73il@gmail.com> > Cc: Vivek Goyal <vgoyal@redhat.com> You may add: Reviewed-by: Amir Goldstein <amir73il@gmail.com> > --- > Documentation/filesystems/overlayfs.rst | 18 +++-- > fs/overlayfs/overlayfs.h | 37 +++++++++- > fs/overlayfs/readdir.c | 98 ++++++++++++++++++++++--- > fs/overlayfs/super.c | 73 +++++++++++++----- > fs/overlayfs/util.c | 2 + > 5 files changed, 190 insertions(+), 38 deletions(-) > > diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst > index c6e30c1bc2f2..b485fdb65b85 100644 > --- a/Documentation/filesystems/overlayfs.rst > +++ b/Documentation/filesystems/overlayfs.rst > @@ -579,13 +579,17 @@ example, upon detecting a fault, ext4 can be configured to panic. > The advantage of mounting with the "volatile" option is that all forms of > sync calls to the upper filesystem are omitted. > > -When overlay is mounted with "volatile" option, the directory > -"$workdir/work/incompat/volatile" is created. During next mount, overlay > -checks for this directory and refuses to mount if present. This is a strong > -indicator that user should throw away upper and work directories and create > -fresh one. In very limited cases where the user knows that the system has > -not crashed and contents of upperdir are intact, The "volatile" directory > -can be removed. > +When overlay is mounted with the "volatile" option, the directory > +"$workdir/work/incompat/volatile" is created. This acts as a indicator > +that the user should throw away upper and work directories and create fresh > +ones. 
In some cases, the overlayfs can detect if the upperdir can be > +reused safely in a subsequent volatile mounts, and mounting will proceed as > +normal. If the filesystem is unable to determine if this is safe (due to a > +reboot, upgraded kernel code, or loss of checkpoint, etc...), the user may > +bypass these safety checks and remove the "volatile" directory if they know > +the system did not encounter a fault and the contents of the upperdir are > +intact. Then, the user can remount the filesystem as normal. > + > > Testsuite > --------- > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > index f8880aa2ba0e..de694ee99d7c 100644 > --- a/fs/overlayfs/overlayfs.h > +++ b/fs/overlayfs/overlayfs.h > @@ -32,8 +32,13 @@ enum ovl_xattr { > OVL_XATTR_NLINK, > OVL_XATTR_UPPER, > OVL_XATTR_METACOPY, > + OVL_XATTR_VOLATILE, > }; > > +#define OVL_INCOMPATDIR_NAME "incompat" > +#define OVL_VOLATILEDIR_NAME "volatile" > +#define OVL_VOLATILE_DIRTY_NAME "dirty" > + > enum ovl_inode_flag { > /* Pure upper dir that may contain non pure upper entries */ > OVL_IMPURE, > @@ -57,6 +62,31 @@ enum { > OVL_XINO_ON, > }; > > +/* > + * This is copied into the volatile xattr, and the user does not interact with > + * it. There is no stability requirement, as a reboot explicitly invalidates > + * a volatile workdir. It is explicitly meant not to be a stable api. > + * > + * Although this structure isn't meant to be stable it is exposed to potentially > + * unprivileged users. We don't do any kind of cryptographic operations with > + * the structure, so it could be tampered with, or inspected. Don't put > + * kernel memory pointers in it, or anything else that could cause problems, > + * or information disclosure. > + */ > +struct ovl_volatile_info { > + /* > + * This uniquely identifies a boot, and is reset if overlayfs itself > + * is reloaded. Therefore we check our current / known boot_id > + * against this before looking at any other fields to validate: > + * 1. 
Is this datastructure laid out in the way we expect? (Overlayfs > + * module, reboot, etc...) > + * 2. Could something have changed (like the s_instance_id counter > + * resetting) > + */ > + uuid_t ovl_boot_id; /* Must stay first member */ > + u64 s_instance_id; > +} __packed; > + > /* > * The tuple (fh,uuid) is a universal unique identifier for a copy up origin, > * where: > @@ -422,8 +452,8 @@ void ovl_cleanup_whiteouts(struct dentry *upper, struct list_head *list); > void ovl_cache_free(struct list_head *list); > void ovl_dir_cache_free(struct inode *inode); > int ovl_check_d_type_supported(struct path *realpath); > -int ovl_workdir_cleanup(struct inode *dir, struct vfsmount *mnt, > - struct dentry *dentry, int level); > +int ovl_workdir_cleanup(struct ovl_fs *ofs, struct inode *dir, > + struct vfsmount *mnt, struct dentry *dentry, int level); > int ovl_indexdir_cleanup(struct ovl_fs *ofs); > > /* inode.c */ > @@ -520,3 +550,6 @@ int ovl_set_origin(struct dentry *dentry, struct dentry *lower, > > /* export.c */ > extern const struct export_operations ovl_export_operations; > + > +/* super.c */ > +extern uuid_t ovl_boot_id; > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > index 01620ebae1bd..7b66fbb20261 100644 > --- a/fs/overlayfs/readdir.c > +++ b/fs/overlayfs/readdir.c > @@ -1080,10 +1080,78 @@ int ovl_check_d_type_supported(struct path *realpath) > > return rdd.d_type_supported; > } > +static int ovl_verify_volatile_info(struct ovl_fs *ofs, > + struct dentry *volatiledir) > +{ > + int err; > + struct ovl_volatile_info info; > + > + if (!volatiledir->d_inode) > + return 0; > + > + if (!ofs->config.ovl_volatile) { > + pr_debug("Mount is not volatile; upperdir is marked volatile\n"); > + return -EINVAL; > + } > + > + err = ovl_do_getxattr(ofs, volatiledir, OVL_XATTR_VOLATILE, &info, > + sizeof(info)); > + if (err < 0) { > + pr_debug("Unable to read volatile xattr: %d\n", err); > + return -EINVAL; > + } > + > + if (err != sizeof(info)) { > + 
pr_debug("%s xattr on-disk size is %d expected to read %zd\n", > + ovl_xattr(ofs, OVL_XATTR_VOLATILE), err, sizeof(info)); > + return -EINVAL; > + } > + > + if (!uuid_equal(&ovl_boot_id, &info.ovl_boot_id)) { > + pr_debug("boot id has changed (reboot or module reloaded)\n"); > + return -EINVAL; > + } > + > + if (volatiledir->d_sb->s_instance_id != info.s_instance_id) { > + pr_debug("workdir has been unmounted and remounted\n"); > + return -EINVAL; > + } > + > + return 1; > +} > > -#define OVL_INCOMPATDIR_NAME "incompat" > +/* > + * ovl_check_incompat checks this specific incompat entry for incompatibility. > + * If it is found to be incompatible -EINVAL will be returned. > + * > + * If the directory should be preserved, then this function returns 1. > + */ > +static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p, > + struct path *path) > +{ > + int err = -EINVAL; > + struct dentry *d; > + > + if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) { > + d = lookup_one_len(p->name, path->dentry, p->len); > + if (IS_ERR(d)) > + return PTR_ERR(d); > + > + err = ovl_verify_volatile_info(ofs, d); > + dput(d); > + } > + > + if (err == -EINVAL) > + pr_err("incompat feature '%s' cannot be mounted\n", p->name); > + else > + pr_debug("incompat '%s' handled: %d\n", p->name, err); > + > + dput(d); Letfover. Thanks, Amir. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe 2020-11-27 9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon 2020-11-27 11:09 ` kernel test robot 2020-11-27 13:04 ` Amir Goldstein @ 2020-12-07 11:39 ` Dan Carpenter 2 siblings, 0 replies; 31+ messages in thread From: Dan Carpenter @ 2020-12-07 11:39 UTC (permalink / raw) To: kbuild, Sargun Dhillon, linux-unionfs, miklos, Alexander Viro, Amir Goldstein Cc: lkp, kbuild-all, Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells [-- Attachment #1: Type: text/plain, Size: 2555 bytes --] Hi Sargun, url: https://github.com/0day-ci/linux/commits/Sargun-Dhillon/Make-overlayfs-volatile-mounts-reusable/20201127-172416 base: be4df0cea08a8b59eb38d73de988b7ba8022df41 config: x86_64-randconfig-m001-20201127 (attached as .config) compiler: gcc-9 (Debian 9.3.0-15) 9.3.0 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> smatch warnings: fs/overlayfs/readdir.c:1149 ovl_check_incompat() error: uninitialized symbol 'd'. 
vim +/d +1149 fs/overlayfs/readdir.c a9871ab1de6c660 Sargun Dhillon 2020-11-27 1129 static int ovl_check_incompat(struct ovl_fs *ofs, struct ovl_cache_entry *p, a9871ab1de6c660 Sargun Dhillon 2020-11-27 1130 struct path *path) a9871ab1de6c660 Sargun Dhillon 2020-11-27 1131 { a9871ab1de6c660 Sargun Dhillon 2020-11-27 1132 int err = -EINVAL; a9871ab1de6c660 Sargun Dhillon 2020-11-27 1133 struct dentry *d; a9871ab1de6c660 Sargun Dhillon 2020-11-27 1134 a9871ab1de6c660 Sargun Dhillon 2020-11-27 1135 if (!strcmp(p->name, OVL_VOLATILEDIR_NAME)) { a9871ab1de6c660 Sargun Dhillon 2020-11-27 1136 d = lookup_one_len(p->name, path->dentry, p->len); a9871ab1de6c660 Sargun Dhillon 2020-11-27 1137 if (IS_ERR(d)) a9871ab1de6c660 Sargun Dhillon 2020-11-27 1138 return PTR_ERR(d); a9871ab1de6c660 Sargun Dhillon 2020-11-27 1139 a9871ab1de6c660 Sargun Dhillon 2020-11-27 1140 err = ovl_verify_volatile_info(ofs, d); a9871ab1de6c660 Sargun Dhillon 2020-11-27 1141 dput(d); ^^^^^^^ dput() d here. a9871ab1de6c660 Sargun Dhillon 2020-11-27 1142 } a9871ab1de6c660 Sargun Dhillon 2020-11-27 1143 a9871ab1de6c660 Sargun Dhillon 2020-11-27 1144 if (err == -EINVAL) a9871ab1de6c660 Sargun Dhillon 2020-11-27 1145 pr_err("incompat feature '%s' cannot be mounted\n", p->name); a9871ab1de6c660 Sargun Dhillon 2020-11-27 1146 else a9871ab1de6c660 Sargun Dhillon 2020-11-27 1147 pr_debug("incompat '%s' handled: %d\n", p->name, err); a9871ab1de6c660 Sargun Dhillon 2020-11-27 1148 a9871ab1de6c660 Sargun Dhillon 2020-11-27 @1149 dput(d); ^^^^^^ "d" is uninitialized and this is probably a double free. a9871ab1de6c660 Sargun Dhillon 2020-11-27 1150 return err; a9871ab1de6c660 Sargun Dhillon 2020-11-27 1151 } --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org [-- Attachment #2: .config.gz --] [-- Type: application/gzip, Size: 32026 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-11-27 9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon ` (2 preceding siblings ...) 2020-11-27 9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon @ 2020-11-27 9:20 ` Sargun Dhillon 2020-11-30 18:43 ` Vivek Goyal ` (2 more replies) 3 siblings, 3 replies; 31+ messages in thread From: Sargun Dhillon @ 2020-11-27 9:20 UTC (permalink / raw) To: linux-unionfs, miklos, Alexander Viro, Amir Goldstein Cc: Sargun Dhillon, Giuseppe Scrivano, Vivek Goyal, Daniel J Walsh, linux-fsdevel, David Howells Volatile remounts validate the following at the moment: * Has the module been reloaded / the system rebooted * Has the workdir been remounted This adds a new check for errors detected via the superblock's errseq_t. At mount time, the errseq_t is snapshotted to disk, and upon remount it's re-verified. This allows for kernel-level detection of errors without forcing userspace to perform a sync and allows for the hidden detection of writeback errors. 
Signed-off-by: Sargun Dhillon <sargun@sargun.me> Cc: linux-fsdevel@vger.kernel.org Cc: linux-unionfs@vger.kernel.org Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> --- fs/overlayfs/overlayfs.h | 1 + fs/overlayfs/readdir.c | 6 ++++++ fs/overlayfs/super.c | 1 + 3 files changed, 8 insertions(+) diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h index de694ee99d7c..e8a711953b64 100644 --- a/fs/overlayfs/overlayfs.h +++ b/fs/overlayfs/overlayfs.h @@ -85,6 +85,7 @@ struct ovl_volatile_info { */ uuid_t ovl_boot_id; /* Must stay first member */ u64 s_instance_id; + errseq_t errseq; /* Implemented as a u32 */ } __packed; /* diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c index 7b66fbb20261..5795b28bb4cf 100644 --- a/fs/overlayfs/readdir.c +++ b/fs/overlayfs/readdir.c @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, return -EINVAL; } + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); + if (err) { + pr_debug("Workdir filesystem reports errors: %d\n", err); + return -EINVAL; + } + return 1; } diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c index a8ee3ba4ebbd..2e473f8c75dd 100644 --- a/fs/overlayfs/super.c +++ b/fs/overlayfs/super.c @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir) int err; struct ovl_volatile_info info = { .s_instance_id = volatiledir->d_sb->s_instance_id, + .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err), }; uuid_copy(&info.ovl_boot_id, &ovl_boot_id); -- 2.25.1 ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-11-27 9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon @ 2020-11-30 18:43 ` Vivek Goyal 2020-11-30 19:15 ` Vivek Goyal 2020-11-30 19:33 ` Vivek Goyal 2 siblings, 0 replies; 31+ messages in thread From: Vivek Goyal @ 2020-11-30 18:43 UTC (permalink / raw) To: Sargun Dhillon Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > Volatile remounts validate the following at the moment: > * Has the module been reloaded / the system rebooted > * Has the workdir been remounted > > This adds a new check for errors detected via the superblock's > errseq_t. At mount time, the errseq_t is snapshotted to disk, > and upon remount it's re-verified. This allows for kernel-level > detection of errors without forcing userspace to perform a > sync and allows for the hidden detection of writeback errors. 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-unionfs@vger.kernel.org > Cc: Miklos Szeredi <miklos@szeredi.hu> > Cc: Amir Goldstein <amir73il@gmail.com> > Cc: Vivek Goyal <vgoyal@redhat.com> > --- > fs/overlayfs/overlayfs.h | 1 + > fs/overlayfs/readdir.c | 6 ++++++ > fs/overlayfs/super.c | 1 + > 3 files changed, 8 insertions(+) > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > index de694ee99d7c..e8a711953b64 100644 > --- a/fs/overlayfs/overlayfs.h > +++ b/fs/overlayfs/overlayfs.h > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > */ > uuid_t ovl_boot_id; /* Must stay first member */ > u64 s_instance_id; > + errseq_t errseq; /* Implemented as a u32 */ > } __packed; > > /* > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > index 7b66fbb20261..5795b28bb4cf 100644 > --- a/fs/overlayfs/readdir.c > +++ b/fs/overlayfs/readdir.c > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > return -EINVAL; > } > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); > + if (err) { > + pr_debug("Workdir filesystem reports errors: %d\n", err); It probably will make sense to make it pr_err() instead? Will be good to know if kernel logic detected errors. Thanks Vivek ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-11-27 9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon 2020-11-30 18:43 ` Vivek Goyal @ 2020-11-30 19:15 ` Vivek Goyal 2020-12-05 9:13 ` Amir Goldstein 2020-11-30 19:33 ` Vivek Goyal 2 siblings, 1 reply; 31+ messages in thread From: Vivek Goyal @ 2020-11-30 19:15 UTC (permalink / raw) To: Sargun Dhillon Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > Volatile remounts validate the following at the moment: > * Has the module been reloaded / the system rebooted > * Has the workdir been remounted > > This adds a new check for errors detected via the superblock's > errseq_t. At mount time, the errseq_t is snapshotted to disk, > and upon remount it's re-verified. This allows for kernel-level > detection of errors without forcing userspace to perform a > sync and allows for the hidden detection of writeback errors. 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-unionfs@vger.kernel.org > Cc: Miklos Szeredi <miklos@szeredi.hu> > Cc: Amir Goldstein <amir73il@gmail.com> > Cc: Vivek Goyal <vgoyal@redhat.com> > --- > fs/overlayfs/overlayfs.h | 1 + > fs/overlayfs/readdir.c | 6 ++++++ > fs/overlayfs/super.c | 1 + > 3 files changed, 8 insertions(+) > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > index de694ee99d7c..e8a711953b64 100644 > --- a/fs/overlayfs/overlayfs.h > +++ b/fs/overlayfs/overlayfs.h > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > */ > uuid_t ovl_boot_id; /* Must stay first member */ > u64 s_instance_id; > + errseq_t errseq; /* Implemented as a u32 */ > } __packed; > > /* > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > index 7b66fbb20261..5795b28bb4cf 100644 > --- a/fs/overlayfs/readdir.c > +++ b/fs/overlayfs/readdir.c > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > return -EINVAL; > } > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); > + if (err) { > + pr_debug("Workdir filesystem reports errors: %d\n", err); > + return -EINVAL; > + } > + > return 1; > } > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > index a8ee3ba4ebbd..2e473f8c75dd 100644 > --- a/fs/overlayfs/super.c > +++ b/fs/overlayfs/super.c > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir) > int err; > struct ovl_volatile_info info = { > .s_instance_id = volatiledir->d_sb->s_instance_id, > + .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err), errseq_sample() seems to return 0 if nobody has seen the error yet. That means on remount we will fail. It is a false failure from our perspective and we are not interested in knowing if somebody else has seen the failure or not. 
Maybe we need a flag in errseq_sample() to get us current value irrespective of the fact whether anybody has seen the error or not? If we end up making this change, then we probably will have to somehow mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we sampled ->s_wb_err when nobody saw it and later by the remount time say ERRSEQ_SEEN is set, we don't want remount to fail. Thanks Vivek > }; > > uuid_copy(&info.ovl_boot_id, &ovl_boot_id); > -- > 2.25.1 > ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-11-30 19:15 ` Vivek Goyal @ 2020-12-05 9:13 ` Amir Goldstein 2020-12-05 13:51 ` Jeff Layton 0 siblings, 1 reply; 31+ messages in thread From: Amir Goldstein @ 2020-12-05 9:13 UTC (permalink / raw) To: Vivek Goyal, Sargun Dhillon Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells, Jeff Layton On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > > Volatile remounts validate the following at the moment: > > * Has the module been reloaded / the system rebooted > > * Has the workdir been remounted > > > > This adds a new check for errors detected via the superblock's > > errseq_t. At mount time, the errseq_t is snapshotted to disk, > > and upon remount it's re-verified. This allows for kernel-level > > detection of errors without forcing userspace to perform a > > sync and allows for the hidden detection of writeback errors. 
> > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > Cc: linux-fsdevel@vger.kernel.org > > Cc: linux-unionfs@vger.kernel.org > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > Cc: Amir Goldstein <amir73il@gmail.com> > > Cc: Vivek Goyal <vgoyal@redhat.com> > > --- > > fs/overlayfs/overlayfs.h | 1 + > > fs/overlayfs/readdir.c | 6 ++++++ > > fs/overlayfs/super.c | 1 + > > 3 files changed, 8 insertions(+) > > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > > index de694ee99d7c..e8a711953b64 100644 > > --- a/fs/overlayfs/overlayfs.h > > +++ b/fs/overlayfs/overlayfs.h > > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > > */ > > uuid_t ovl_boot_id; /* Must stay first member */ > > u64 s_instance_id; > > + errseq_t errseq; /* Implemented as a u32 */ > > } __packed; > > > > /* > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > > index 7b66fbb20261..5795b28bb4cf 100644 > > --- a/fs/overlayfs/readdir.c > > +++ b/fs/overlayfs/readdir.c > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > > return -EINVAL; > > } > > > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); > > + if (err) { > > + pr_debug("Workdir filesystem reports errors: %d\n", err); > > + return -EINVAL; > > + } > > + > > return 1; > > } > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > index a8ee3ba4ebbd..2e473f8c75dd 100644 > > --- a/fs/overlayfs/super.c > > +++ b/fs/overlayfs/super.c > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir) > > int err; > > struct ovl_volatile_info info = { > > .s_instance_id = volatiledir->d_sb->s_instance_id, > > + .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err), > > errse_sample() seems to return 0 if nobody has seen the error yet. That > means on remount we will fail. It is a false failure from our perspective > and we are not interested in knowing if somebody else has seen the > failure or not. 
> > Maybe we need a flag in errseq_sample() to get us current value > irrespective of the fact whether anybody has seen the error or not? > > If we end up making this change, then we probably will have to somehow > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we > sampled ->s_wb_err when nobody saw it and later by the remount time > say ERRSEQ_SEEN is set, we don't want remount to fail. > Hopping back to this review, looks like for volatile mount we need something like (in this order): 1. check if re-use and get sampled errseq from volatiledir xattr 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr 3. errseq_check() since stored or sampled errseq (0 for fresh mount with unseen error) 4. fail volatile mount if errseq_check() failed 5. errseq_check() since stored errseq on fsync()/syncfs() For fresh volatile mount, syncfs can fix the temporary mount error. For re-used volatile mount, the mount error is permanent. Did I miss anything? Is the mount safe for both seen and unseen error cases? no error case? Are we safe if a syncfs on upper_sb sneaks in between 2 and 3? Thanks, Amir. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-12-05 9:13 ` Amir Goldstein @ 2020-12-05 13:51 ` Jeff Layton 2020-12-05 14:51 ` Amir Goldstein 0 siblings, 1 reply; 31+ messages in thread From: Jeff Layton @ 2020-12-05 13:51 UTC (permalink / raw) To: Amir Goldstein, Vivek Goyal, Sargun Dhillon Cc: overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Sat, 2020-12-05 at 11:13 +0200, Amir Goldstein wrote: > On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > > > Volatile remounts validate the following at the moment: > > > * Has the module been reloaded / the system rebooted > > > * Has the workdir been remounted > > > > > > This adds a new check for errors detected via the superblock's > > > errseq_t. At mount time, the errseq_t is snapshotted to disk, > > > and upon remount it's re-verified. This allows for kernel-level > > > detection of errors without forcing userspace to perform a > > > sync and allows for the hidden detection of writeback errors. 
> > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > Cc: linux-fsdevel@vger.kernel.org > > > Cc: linux-unionfs@vger.kernel.org > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > --- > > > fs/overlayfs/overlayfs.h | 1 + > > > fs/overlayfs/readdir.c | 6 ++++++ > > > fs/overlayfs/super.c | 1 + > > > 3 files changed, 8 insertions(+) > > > > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > > > index de694ee99d7c..e8a711953b64 100644 > > > --- a/fs/overlayfs/overlayfs.h > > > +++ b/fs/overlayfs/overlayfs.h > > > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > > > */ > > > uuid_t ovl_boot_id; /* Must stay first member */ > > > u64 s_instance_id; > > > + errseq_t errseq; /* Implemented as a u32 */ > > > } __packed; > > > > > > /* > > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > > > index 7b66fbb20261..5795b28bb4cf 100644 > > > --- a/fs/overlayfs/readdir.c > > > +++ b/fs/overlayfs/readdir.c > > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > > > return -EINVAL; > > > } > > > > > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); > > > + if (err) { > > > + pr_debug("Workdir filesystem reports errors: %d\n", err); > > > + return -EINVAL; > > > + } > > > + > > > return 1; > > > } > > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > index a8ee3ba4ebbd..2e473f8c75dd 100644 > > > --- a/fs/overlayfs/super.c > > > +++ b/fs/overlayfs/super.c > > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir) > > > int err; > > > struct ovl_volatile_info info = { > > > .s_instance_id = volatiledir->d_sb->s_instance_id, > > > + .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err), > > > > errse_sample() seems to return 0 if nobody has seen the error yet. That > > means on remount we will fail. 
It is a false failure from our perspective > > and we are not interested in knowing if somebody else has seen the > > failure or not. > > > > Maybe we need a flag in errseq_sample() to get us current value > > irrespective of the fact whether anybody has seen the error or not? > > > > If we end up making this change, then we probably will have to somehow > > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we > > sampled ->s_wb_err when nobody saw it and later by the remount time > > say ERRSEQ_SEEN is set, we don't want remount to fail. > > > > Hopping back to this review, looks like for volatile mount we need > something like (in this order): > 1. check if re-use and get sampled errseq from volatiledir xattr > 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr I'm not sure I follow. Why does this need to go into an xattr? errseq_t is never persisted on stable storage. It's an entirely in-memory thing. > 3. errseq_check() since stored or sampled errseq (0 for fresh mount > with unseen error) > 4. fail volatile mount if errseq_check() failed > 5. errseq_check() since stored errseq on fsync()/syncfs() > I think this is simpler than that. You just need a new errseq_t helper that only conditionally samples if the thing is 0 or if the error has already been seen. Something like this (hopefully with a better name): bool errseq_sample_no_unseen(errseq_t *eseq, errseq_t *sample) { errseq_t old = READ_ONCE(*eseq); if (old && !(old & ERRSEQ_SEEN)) return false; *sample = old; return true; } If that returns false, fail the mount. If it's true, then save off the sample and proceed. > For fresh volatile mount, syncfs can fix the temporary mount error. > For re-used volatile mount, the mount error is permanent. > > Did I miss anything? > Is the mount safe for both seen and unseen error cases? no error case? > Are we safe if a syncfs on upper_sb sneaks in between 2 and 3? > > Thanks, > Amir. 
> -- Jeff Layton <jlayton@redhat.com> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-12-05 13:51 ` Jeff Layton @ 2020-12-05 14:51 ` Amir Goldstein 0 siblings, 0 replies; 31+ messages in thread From: Amir Goldstein @ 2020-12-05 14:51 UTC (permalink / raw) To: Jeff Layton Cc: Vivek Goyal, Sargun Dhillon, overlayfs, Miklos Szeredi, Alexander Viro, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Sat, Dec 5, 2020 at 3:51 PM Jeff Layton <jlayton@redhat.com> wrote: > > On Sat, 2020-12-05 at 11:13 +0200, Amir Goldstein wrote: > > On Mon, Nov 30, 2020 at 9:15 PM Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > > > > Volatile remounts validate the following at the moment: > > > > * Has the module been reloaded / the system rebooted > > > > * Has the workdir been remounted > > > > > > > > This adds a new check for errors detected via the superblock's > > > > errseq_t. At mount time, the errseq_t is snapshotted to disk, > > > > and upon remount it's re-verified. This allows for kernel-level > > > > detection of errors without forcing userspace to perform a > > > > sync and allows for the hidden detection of writeback errors. 
> > > > > > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > > > Cc: linux-fsdevel@vger.kernel.org > > > > Cc: linux-unionfs@vger.kernel.org > > > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > > > Cc: Amir Goldstein <amir73il@gmail.com> > > > > Cc: Vivek Goyal <vgoyal@redhat.com> > > > > --- > > > > fs/overlayfs/overlayfs.h | 1 + > > > > fs/overlayfs/readdir.c | 6 ++++++ > > > > fs/overlayfs/super.c | 1 + > > > > 3 files changed, 8 insertions(+) > > > > > > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > > > > index de694ee99d7c..e8a711953b64 100644 > > > > --- a/fs/overlayfs/overlayfs.h > > > > +++ b/fs/overlayfs/overlayfs.h > > > > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > > > > */ > > > > uuid_t ovl_boot_id; /* Must stay first member */ > > > > u64 s_instance_id; > > > > + errseq_t errseq; /* Implemented as a u32 */ > > > > } __packed; > > > > > > > > /* > > > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > > > > index 7b66fbb20261..5795b28bb4cf 100644 > > > > --- a/fs/overlayfs/readdir.c > > > > +++ b/fs/overlayfs/readdir.c > > > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > > > > return -EINVAL; > > > > } > > > > > > > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); > > > > + if (err) { > > > > + pr_debug("Workdir filesystem reports errors: %d\n", err); > > > > + return -EINVAL; > > > > + } > > > > + > > > > return 1; > > > > } > > > > > > > > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c > > > > index a8ee3ba4ebbd..2e473f8c75dd 100644 > > > > --- a/fs/overlayfs/super.c > > > > +++ b/fs/overlayfs/super.c > > > > @@ -1248,6 +1248,7 @@ static int ovl_set_volatile_info(struct ovl_fs *ofs, struct dentry *volatiledir) > > > > int err; > > > > struct ovl_volatile_info info = { > > > > .s_instance_id = volatiledir->d_sb->s_instance_id, > > > > + .errseq = errseq_sample(&volatiledir->d_sb->s_wb_err), > > > > > > errse_sample() seems to return 0 if 
nobody has seen the error yet. That > > > means on remount we will fail. It is a false failure from our perspective > > > and we are not interested in knowing if somebody else has seen the > > > failure or not. > > > > > > Maybe we need a flag in errseq_sample() to get us current value > > > irrespective of the fact whether anybody has seen the error or not? > > > > > > If we end up making this change, then we probably will have to somehow > > > mask ERRSEQ_SEEN bit in errseq_check() comparison. Because if we > > > sampled ->s_wb_err when nobody saw it and later by the remount time > > > say ERRSEQ_SEEN is set, we don't want remount to fail. > > > > > > > Hopping back to this review, looks like for volatile mount we need > > something like (in this order): > > 1. check if re-use and get sampled errseq from volatiledir xattr > > 2. otherwise errseq_sample() upper_sb and store in volatiledir xattr > > I'm not sure I follow. Why does this need to go into an xattr? > > errseq_t is never persisted on stable storage. It's an entirely > in-memory thing. > We know, but that was the purpose of this patch series [1]. Reusing volatile overlay layers is not allowed in v5.9. Sargun is trying to allow that by verifying that since the first volatile mount there was: * no reboot * no overlay module reload * no underlying fs re-mount * [and with this patch] no writeback error on upper fs [1] https://lore.kernel.org/linux-unionfs/20201127092058.15117-1-sargun@sargun.me/T/#mb2f1c770a47898d8781e62a46fcc7526535e5dde > > > 3. errseq_check() since stored or sampled errseq (0 for fresh mount > > with unseen error) > > 4. fail volatile mount if errseq_check() failed > > 5. errseq_check() since stored errseq on fsync()/syncfs() > > > > I think this is simpler than that. You just need a new errseq_t helper > that only conditionally samples if the thing is 0 or if the error has > already been seen. 
Something like this (hopefully with a better name): > > bool errseq_sample_no_unseen(errseq_t *eseq, errseq_t *sample) > { > errseq_t old = READ_ONCE(*eseq); > > if (old && !(old & ERRSEQ_SEEN)) > return false; > *sample = old; > return true; > } > > If that returns false, fail the mount. If it's true, then save off the > sample and proceed. > Yes, but that wasn't the purpose of the original patch set; it was just something else that needed fixing, which we found during review of this patch and for which Sargun posted a separate patch. Thanks, Amir. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-11-27 9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon 2020-11-30 18:43 ` Vivek Goyal 2020-11-30 19:15 ` Vivek Goyal @ 2020-11-30 19:33 ` Vivek Goyal 2020-12-01 11:56 ` Sargun Dhillon 2 siblings, 1 reply; 31+ messages in thread From: Vivek Goyal @ 2020-11-30 19:33 UTC (permalink / raw) To: Sargun Dhillon Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > Volatile remounts validate the following at the moment: > * Has the module been reloaded / the system rebooted > * Has the workdir been remounted > > This adds a new check for errors detected via the superblock's > errseq_t. At mount time, the errseq_t is snapshotted to disk, > and upon remount it's re-verified. This allows for kernel-level > detection of errors without forcing userspace to perform a > sync and allows for the hidden detection of writeback errors. 
> > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > Cc: linux-fsdevel@vger.kernel.org > Cc: linux-unionfs@vger.kernel.org > Cc: Miklos Szeredi <miklos@szeredi.hu> > Cc: Amir Goldstein <amir73il@gmail.com> > Cc: Vivek Goyal <vgoyal@redhat.com> > --- > fs/overlayfs/overlayfs.h | 1 + > fs/overlayfs/readdir.c | 6 ++++++ > fs/overlayfs/super.c | 1 + > 3 files changed, 8 insertions(+) > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > index de694ee99d7c..e8a711953b64 100644 > --- a/fs/overlayfs/overlayfs.h > +++ b/fs/overlayfs/overlayfs.h > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > */ > uuid_t ovl_boot_id; /* Must stay first member */ > u64 s_instance_id; > + errseq_t errseq; /* Implemented as a u32 */ > } __packed; > > /* > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > index 7b66fbb20261..5795b28bb4cf 100644 > --- a/fs/overlayfs/readdir.c > +++ b/fs/overlayfs/readdir.c > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > return -EINVAL; > } > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); Might be a stupid question. Will ask anyway. But what protects against wrapping of counter. IOW, Say we stored info.errseq value as A. It is possible that bunch of errors occurred and at remount time ->s_wb_err is back to A and we pass the check. (Despite the fact lots of errors have occurred since we sampled). Thanks Vivek ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount 2020-11-30 19:33 ` Vivek Goyal @ 2020-12-01 11:56 ` Sargun Dhillon 2020-12-01 12:45 ` Jeff Layton 0 siblings, 1 reply; 31+ messages in thread From: Sargun Dhillon @ 2020-12-01 11:56 UTC (permalink / raw) To: Vivek Goyal Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein, Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells, Jeff Layton On Mon, Nov 30, 2020 at 02:33:42PM -0500, Vivek Goyal wrote: > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote: > > Volatile remounts validate the following at the moment: > > * Has the module been reloaded / the system rebooted > > * Has the workdir been remounted > > > > This adds a new check for errors detected via the superblock's > > errseq_t. At mount time, the errseq_t is snapshotted to disk, > > and upon remount it's re-verified. This allows for kernel-level > > detection of errors without forcing userspace to perform a > > sync and allows for the hidden detection of writeback errors. 
> > > > Signed-off-by: Sargun Dhillon <sargun@sargun.me> > > Cc: linux-fsdevel@vger.kernel.org > > Cc: linux-unionfs@vger.kernel.org > > Cc: Miklos Szeredi <miklos@szeredi.hu> > > Cc: Amir Goldstein <amir73il@gmail.com> > > Cc: Vivek Goyal <vgoyal@redhat.com> > > --- > > fs/overlayfs/overlayfs.h | 1 + > > fs/overlayfs/readdir.c | 6 ++++++ > > fs/overlayfs/super.c | 1 + > > 3 files changed, 8 insertions(+) > > > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h > > index de694ee99d7c..e8a711953b64 100644 > > --- a/fs/overlayfs/overlayfs.h > > +++ b/fs/overlayfs/overlayfs.h > > @@ -85,6 +85,7 @@ struct ovl_volatile_info { > > */ > > uuid_t ovl_boot_id; /* Must stay first member */ > > u64 s_instance_id; > > + errseq_t errseq; /* Implemented as a u32 */ > > } __packed; > > > > /* > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c > > index 7b66fbb20261..5795b28bb4cf 100644 > > --- a/fs/overlayfs/readdir.c > > +++ b/fs/overlayfs/readdir.c > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs, > > return -EINVAL; > > } > > > > + err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq); > > Might be a stupid question. Will ask anyway. > > But what protects against wrapping of counter. IOW, Say we stored info.errseq > value as A. It is possible that bunch of errors occurred and at remount > time ->s_wb_err is back to A and we pass the check. (Despite the fact lots > of errors have occurred since we sampled). > > Thanks > Vivek > +Jeff Layton <jlayton@redhat.com> Nothing. The current errseq API works like this today where if you have 2^20 (1048576) errors, and syncfs (or other calls that mark the errseq as seen), and the error that occurred 1048575 times ago was the same error as you just last had, and the error on the upperdir has already been marked as seen, the error will be swallowed up silently. This exists throughout all of VFS. I think we're potentially making this more likely by checkpointing to disk. 
The one aspect which is a little different about the use case in the patch is that it relies on this mechanism to determine if an error has occurred after the entire FS was constructed, so it's somewhat more consequential than the current issue in VFS, which will just bubble up errors in a few files. On my system syncfs takes about 2 milliseconds, so you have a chance to experience this every ~30 minutes if the syscalls align in the right way. If we expanded the errseq_t to u64, we would potentially get a collision every 4503599627370496 calls, or, assuming the 2 millisecond invariant holds, every ~285,000 years. Now, we probably don't want to make errseq_t into a u64 because of performance reasons (not all systems have native u64 cmpxchg), and the extra memory it'd take up. If we really want to avoid this case, I can think of one "simple" solution, which is something like laying out errseq_t as an errseq_t_src that's 64 bits, with all readers just looking at the lower 32 bits. The longer errseq_t would exist on super_blocks, but files would still get the shorter one. To potentially avoid the performance penalty of atomic longs, we could also do something like this: typedef struct { atomic_t overflow; u32 errseq; } errseq_t_big; And in errseq_set, do: /* Wraps */ if (new < old) atomic_inc(&eseq->overflow); *shrug* I don't think that the above scenario is likely though. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount
  2020-12-01 11:56             ` Sargun Dhillon
@ 2020-12-01 12:45               ` Jeff Layton
  0 siblings, 0 replies; 31+ messages in thread
From: Jeff Layton @ 2020-12-01 12:45 UTC (permalink / raw)
  To: Sargun Dhillon, Vivek Goyal
  Cc: linux-unionfs, miklos, Alexander Viro, Amir Goldstein,
	Giuseppe Scrivano, Daniel J Walsh, linux-fsdevel, David Howells

On Tue, 2020-12-01 at 11:56 +0000, Sargun Dhillon wrote:
> On Mon, Nov 30, 2020 at 02:33:42PM -0500, Vivek Goyal wrote:
> > On Fri, Nov 27, 2020 at 01:20:58AM -0800, Sargun Dhillon wrote:
> > > Volatile remounts validate the following at the moment:
> > >  * Has the module been reloaded / the system rebooted
> > >  * Has the workdir been remounted
> > >
> > > This adds a new check for errors detected via the superblock's
> > > errseq_t. At mount time, the errseq_t is snapshotted to disk,
> > > and upon remount it's re-verified. This allows for kernel-level
> > > detection of errors without forcing userspace to perform a
> > > sync and allows for the hidden detection of writeback errors.
> > >
> > > Signed-off-by: Sargun Dhillon <sargun@sargun.me>
> > > Cc: linux-fsdevel@vger.kernel.org
> > > Cc: linux-unionfs@vger.kernel.org
> > > Cc: Miklos Szeredi <miklos@szeredi.hu>
> > > Cc: Amir Goldstein <amir73il@gmail.com>
> > > Cc: Vivek Goyal <vgoyal@redhat.com>
> > > ---
> > >  fs/overlayfs/overlayfs.h | 1 +
> > >  fs/overlayfs/readdir.c   | 6 ++++++
> > >  fs/overlayfs/super.c     | 1 +
> > >  3 files changed, 8 insertions(+)
> > >
> > > diff --git a/fs/overlayfs/overlayfs.h b/fs/overlayfs/overlayfs.h
> > > index de694ee99d7c..e8a711953b64 100644
> > > --- a/fs/overlayfs/overlayfs.h
> > > +++ b/fs/overlayfs/overlayfs.h
> > > @@ -85,6 +85,7 @@ struct ovl_volatile_info {
> > >  	 */
> > >  	uuid_t ovl_boot_id;	/* Must stay first member */
> > >  	u64 s_instance_id;
> > > +	errseq_t errseq;	/* Implemented as a u32 */
> > >  } __packed;
> > >
> > >  /*
> > > diff --git a/fs/overlayfs/readdir.c b/fs/overlayfs/readdir.c
> > > index 7b66fbb20261..5795b28bb4cf 100644
> > > --- a/fs/overlayfs/readdir.c
> > > +++ b/fs/overlayfs/readdir.c
> > > @@ -1117,6 +1117,12 @@ static int ovl_verify_volatile_info(struct ovl_fs *ofs,
> > >  		return -EINVAL;
> > >  	}
> > >
> > > +	err = errseq_check(&volatiledir->d_sb->s_wb_err, info.errseq);
> >
> > Might be a stupid question. Will ask anyway.
> >
> > But what protects against wrapping of counter? IOW, say we stored info.errseq
> > value as A. It is possible that a bunch of errors occurred and at remount
> > time ->s_wb_err is back to A and we pass the check (despite the fact that
> > lots of errors have occurred since we sampled).
> >
> > Thanks
> > Vivek
>
> +Jeff Layton <jlayton@redhat.com>
>
> Nothing. The current errseq API works like this today: if you have 2^20
> (1048576) errors, and syncfs (or other calls that mark the errseq as seen)
> runs in between, and the error that occurred 1048575 errors ago was the same
> error as the one you just had, and the error on the upperdir has already been
> marked as seen, the error will be swallowed silently.
>
> This exists throughout all of VFS. I think we're potentially making this more
> likely by checkpointing to disk. The one aspect which is a little different
> about the use case in this patch is that it relies on this mechanism to
> determine whether an error occurred after the entire FS was constructed, so
> it's somewhat more consequential than the current issue in VFS, which will
> just bubble up errors in a few files.
>
> On my system syncfs takes about 2 milliseconds, so you have a chance to
> experience this every ~30 minutes if the syscalls align in the right way. If
> we expanded errseq_t to a u64, we would potentially get a collision every
> 4503599627370496 calls, or, assuming the 2 millisecond invariant holds, every
> 285 years. Now, we probably don't want to make errseq_t a u64 for performance
> reasons (not all systems have a native u64 cmpxchg) and because of the extra
> memory it would take up.
>
> If we really want to avoid this case, I can think of one "simple" solution,
> which is laying out something like an errseq_t_src that's 64 bits, where all
> readers just look at the lower 32 bits. The longer errseq_t would exist on
> super_blocks, but files would still get the shorter one. To potentially avoid
> the performance penalty of atomic longs, we could also do something like this:
>
> typedef struct {
> 	atomic_t overflow;
> 	u32 errseq;
> } errseq_t_big;
>
> And in errseq_set, do:
>
> 	/* Wraps */
> 	if (new < old)
> 		atomic_inc(&eseq->overflow);
>
> *shrug*
>
> I don't think the above scenario is likely, though.
The counter is only bumped if the seen flag is set, so you'd need to record
2^20 errors _and_ fetch them that many times, and just get unlucky once and
hit the exact counter value that you had previously sampled. If that's your
situation, then you probably have much bigger problems than the counter
wrapping.
--
Jeff Layton <jlayton@redhat.com>
end of thread, other threads: [~2020-12-07 11:41 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-27  9:20 [PATCH v2 0/4] Make overlayfs volatile mounts reusable Sargun Dhillon
2020-11-27  9:20 ` [PATCH v2 1/4] fs: Add s_instance_id field to superblock for unique identification Sargun Dhillon
2020-11-27  9:20 ` [PATCH v2 2/4] overlay: Document current outstanding shortcoming of volatile Sargun Dhillon
2020-11-27 12:52   ` Amir Goldstein
2020-11-27 22:11     ` Sargun Dhillon
2020-11-28  2:01       ` Jeff Layton
2020-11-28  4:45         ` Sargun Dhillon
2020-11-28  7:12           ` Amir Goldstein
2020-11-28  8:52             ` Sargun Dhillon
2020-11-28  9:04               ` Amir Goldstein
2020-12-01 11:09                 ` Sargun Dhillon
2020-12-01 11:29                   ` Amir Goldstein
2020-12-01 13:01                     ` Jeff Layton
2020-12-01 15:24                       ` Vivek Goyal
2020-12-01 16:10                         ` Jeff Layton
2020-11-28 12:04               ` Jeff Layton
2020-11-28  8:56             ` Amir Goldstein
2020-11-28  9:06               ` Amir Goldstein
2020-11-27  9:20 ` [PATCH v2 3/4] overlay: Add the ability to remount volatile directories when safe Sargun Dhillon
2020-11-27 11:09   ` kernel test robot
2020-11-27 13:04   ` Amir Goldstein
2020-12-07 11:39   ` Dan Carpenter
2020-11-27  9:20 ` [PATCH v2 4/4] overlay: Add rudimentary checking of writeback errseq on volatile remount Sargun Dhillon
2020-11-30 18:43   ` Vivek Goyal
2020-11-30 19:15     ` Vivek Goyal
2020-12-05  9:13       ` Amir Goldstein
2020-12-05 13:51         ` Jeff Layton
2020-12-05 14:51           ` Amir Goldstein
2020-11-30 19:33   ` Vivek Goyal
2020-12-01 11:56     ` Sargun Dhillon
2020-12-01 12:45       ` Jeff Layton