* [RFC] How to fix broken freezing? @ 2012-01-06 14:09 Jan Kara 2012-01-11 21:53 ` Eric Sandeen 0 siblings, 1 reply; 4+ messages in thread From: Jan Kara @ 2012-01-06 14:09 UTC (permalink / raw) To: linux-fsdevel Cc: Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Dave Chinner, Al Viro Hello, I was looking at what causes filesystem to have dirty data after it is frozen. After some thought I realized freezing code is inherently racy and all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem. The race is basically following: Task 1 Task 2 freeze_super() __generic_file_aio_write() ... vfs_check_frozen(sb, SB_FREEZE_WRITE) sb->s_frozen = SB_FREEZE_WRITE; sync_filesystem(sb); do the write /* Here we create dirty data * which is left on frozen fs */ sb->s_frozen = SB_FREEZE_TRANS; ... ->freeze_fs() The problem is that you can never make checking for frozen filesystem race-free with the current s_frozen scheme - the filesystem can always be frozen the instant after you check for it and you end up creating dirty data on frozen filesystem. The question is what to do with this problem. I outline the possibilities that come to my mind below: 1) Ignore the problem - depending on the exact fs details this could lead to fs snapshot being corrupted, also flusher thread can hang on the frozen filesystem (e.g. because of sync(1)) creating all sorts of secondary issues. So I don't think this is really an option. 2) Have a rwlock in the superblock that is held for writing while filesystem freezing is in progress and held for reading by the filesystem while a transaction is running except for transactions that are required to do writeback. This is kind of ugly but at least for ext3/4 relatively easy to implement. 3) Have the same rwlock but already VFS will take the lock in kernel entry points which modify a filesystem. Lot of these places is already guarded by mnt_want_write/mnt_drop_write pair so we could hook into it but there are entry points which use file descriptor and thus are not guarded by mnt_want_write/mnt_drop_write so we would have to modify these places. Note that this in particular also means ioctl calls and such so it won't be trivial to catch all the places. This approach looks the cleanest to me but it's quite some work and it's a bit fragile - requires all people adding an entry point modifying filesystem to think of fs freezing. What do people think about this? Any idea other idea how to solve the problem? Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC] How to fix broken freezing? 2012-01-06 14:09 [RFC] How to fix broken freezing? Jan Kara @ 2012-01-11 21:53 ` Eric Sandeen 2012-01-11 22:36 ` Dave Chinner 0 siblings, 1 reply; 4+ messages in thread From: Eric Sandeen @ 2012-01-11 21:53 UTC (permalink / raw) To: Jan Kara Cc: linux-fsdevel, Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Dave Chinner, Al Viro On 1/6/12 8:09 AM, Jan Kara wrote: > Hello, > > I was looking at what causes filesystem to have dirty data after it is > frozen. After some thought I realized freezing code is inherently racy and > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem. > > The race is basically following: > Task 1 Task 2 > freeze_super() __generic_file_aio_write() > ... vfs_check_frozen(sb, SB_FREEZE_WRITE) > sb->s_frozen = SB_FREEZE_WRITE; > sync_filesystem(sb); > do the write > /* Here we create dirty data > * which is left on frozen fs */ > sb->s_frozen = SB_FREEZE_TRANS; > ... > ->freeze_fs() > > The problem is that you can never make checking for frozen filesystem > race-free with the current s_frozen scheme - the filesystem can always be > frozen the instant after you check for it and you end up creating dirty > data on frozen filesystem. > > The question is what to do with this problem. I outline the possibilities > that come to my mind below: > 1) Ignore the problem - depending on the exact fs details this could lead to > fs snapshot being corrupted, also flusher thread can hang on the frozen > filesystem (e.g. because of sync(1)) creating all sorts of secondary > issues. So I don't think this is really an option. > 2) Have a rwlock in the superblock that is held for writing while > filesystem freezing is in progress and held for reading by the filesystem > while a transaction is running except for transactions that are required > to do writeback. This is kind of ugly but at least for ext3/4 relatively > easy to implement. This is as far as I had gotten while independently thinking about it ;) But talking with dchinner, he had concerns about the scalability of any rwlock, and I think we (ok, mostly Dave) came up with another idea. What if we had 2 counters in the superblock, one for the equivalent of SB_FREEZE_WRITE and one for SB_FREEZE_TRANS. These would use similar infrastructure to mnt_want_write et al. Everywhere we currently vfs_check_frozen() we'd have a better-named function which increments the counter, then checks the freeze level. If we are being frozen, we drop the counter & wait. If not frozen, we continue; like this pseudo-code: void super_freeze_wait(sb, level) { while (1) { level_ref++; if (!frozen(sb, level)) return; /* not freezing */ level_ref--; wait_unfrozen(sb, level); } } There would also be new code to drop the counter when the dirtying is complete. The freezing functions then just have to wait until the counters hit zero before they can consider themselves done, and freezing is complete. That way if someone sneaks in while the freeze level is being set, they have already notified their intent, and freeze can wait for it anyway before returning. Can anyone poke holes in that? -Eric > 3) Have the same rwlock but already VFS will take the lock in kernel entry > points which modify a filesystem. Lot of these places is already guarded > by mnt_want_write/mnt_drop_write pair so we could hook into it but there > are entry points which use file descriptor and thus are not guarded by > mnt_want_write/mnt_drop_write so we would have to modify these places. > Note that this in particular also means ioctl calls and such so it won't > be trivial to catch all the places. This approach looks the cleanest to > me but it's quite some work and it's a bit fragile - requires all people > adding an entry point modifying filesystem to think of fs freezing. > > What do people think about this? Any idea other idea how to solve the > problem? > Honza ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC] How to fix broken freezing? 2012-01-11 21:53 ` Eric Sandeen @ 2012-01-11 22:36 ` Dave Chinner 2012-01-12 1:09 ` Jan Kara 0 siblings, 1 reply; 4+ messages in thread From: Dave Chinner @ 2012-01-11 22:36 UTC (permalink / raw) To: Eric Sandeen Cc: Jan Kara, linux-fsdevel, Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Dave Chinner, Al Viro On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote: > On 1/6/12 8:09 AM, Jan Kara wrote: > > Hello, > > > > I was looking at what causes filesystem to have dirty data after it is > > frozen. After some thought I realized freezing code is inherently racy and > > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem. > > > > The race is basically following: > > Task 1 Task 2 > > freeze_super() __generic_file_aio_write() > > ... vfs_check_frozen(sb, SB_FREEZE_WRITE) > > sb->s_frozen = SB_FREEZE_WRITE; > > sync_filesystem(sb); > > do the write > > /* Here we create dirty data > > * which is left on frozen fs */ > > sb->s_frozen = SB_FREEZE_TRANS; > > ... > > ->freeze_fs() > > > > The problem is that you can never make checking for frozen filesystem > > race-free with the current s_frozen scheme - the filesystem can always be > > frozen the instant after you check for it and you end up creating dirty > > data on frozen filesystem. > > > > The question is what to do with this problem. I outline the possibilities > > that come to my mind below: > > 1) Ignore the problem - depending on the exact fs details this could lead to > > fs snapshot being corrupted, also flusher thread can hang on the frozen > > filesystem (e.g. because of sync(1)) creating all sorts of secondary > > issues. So I don't think this is really an option. > > 2) Have a rwlock in the superblock that is held for writing while > > filesystem freezing is in progress and held for reading by the filesystem > > while a transaction is running except for transactions that are required > > to do writeback. This is kind of ugly but at least for ext3/4 relatively > > easy to implement. > > This is as far as I had gotten while independently thinking about it ;) > > But talking with dchinner, he had concerns about the scalability of any > rwlock, and I think we (ok, mostly Dave) came up with another idea. > > What if we had 2 counters in the superblock, one for the equivalent of > SB_FREEZE_WRITE and one for SB_FREEZE_TRANS. These would use similar > infrastructure to mnt_want_write et al. > > Everywhere we currently vfs_check_frozen() we'd have a better-named function > which increments the counter, then checks the freeze level. If we are > being frozen, we drop the counter & wait. If not frozen, we continue; > like this pseudo-code: > > void super_freeze_wait(sb, level) { > while (1) { > level_ref++; > if (!frozen(sb, level)) > return; /* not freezing */ > level_ref--; > wait_unfrozen(sb, level); > } > } > > There would also be new code to drop the counter when the dirtying is complete. > > The freezing functions then just have to wait until the counters hit zero > before they can consider themselves done, and freezing is complete. That way if > someone sneaks in while the freeze level is being set, they have already > notified their intent, and freeze can wait for it anyway before returning. Just to clarify, freeze_super would do: sb->s_frozen = SB_FREEZE_WRITE; smp_wmb(); while (sb->s_active_write_cnt > 0) wait; /* no new or existing dirtying writers now, safe to sync */ sync_filesystem(sb); sb->s_frozen = SB_FREEZE_TRANS; smp_wmb(); while (sb->s_active_trans_cnt > 0) wait; /* no new or existing transactions in progress now, so freeze */ sb->s_op->freeze_fs(sb); The counter implemetations will need to scale (e.g. per-cpu counters) and we could probably use a generic waitqueue, but I think this can all be implemented at the superblock level and we only need to call the inc/dec helper functions in the correct places to make it all work. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC] How to fix broken freezing? 2012-01-11 22:36 ` Dave Chinner @ 2012-01-12 1:09 ` Jan Kara 0 siblings, 0 replies; 4+ messages in thread From: Jan Kara @ 2012-01-12 1:09 UTC (permalink / raw) To: Dave Chinner Cc: Eric Sandeen, Jan Kara, linux-fsdevel, Surbhi Palande, Kamal Mostafa, Christoph Hellwig, Dave Chinner, Al Viro On Thu 12-01-12 09:36:31, Dave Chinner wrote: > On Wed, Jan 11, 2012 at 03:53:24PM -0600, Eric Sandeen wrote: > > On 1/6/12 8:09 AM, Jan Kara wrote: > > > Hello, > > > > > > I was looking at what causes filesystem to have dirty data after it is > > > frozen. After some thought I realized freezing code is inherently racy and > > > all filesystems (ext3, ext4, xfs) can have dirty data on frozen filesystem. > > > > > > The race is basically following: > > > Task 1 Task 2 > > > freeze_super() __generic_file_aio_write() > > > ... vfs_check_frozen(sb, SB_FREEZE_WRITE) > > > sb->s_frozen = SB_FREEZE_WRITE; > > > sync_filesystem(sb); > > > do the write > > > /* Here we create dirty data > > > * which is left on frozen fs */ > > > sb->s_frozen = SB_FREEZE_TRANS; > > > ... > > > ->freeze_fs() > > > > > > The problem is that you can never make checking for frozen filesystem > > > race-free with the current s_frozen scheme - the filesystem can always be > > > frozen the instant after you check for it and you end up creating dirty > > > data on frozen filesystem. > > > > > > The question is what to do with this problem. I outline the possibilities > > > that come to my mind below: > > > 1) Ignore the problem - depending on the exact fs details this could lead to > > > fs snapshot being corrupted, also flusher thread can hang on the frozen > > > filesystem (e.g. because of sync(1)) creating all sorts of secondary > > > issues. So I don't think this is really an option. > > > 2) Have a rwlock in the superblock that is held for writing while > > > filesystem freezing is in progress and held for reading by the filesystem > > > while a transaction is running except for transactions that are required > > > to do writeback. This is kind of ugly but at least for ext3/4 relatively > > > easy to implement. > > > > This is as far as I had gotten while independently thinking about it ;) > > > > But talking with dchinner, he had concerns about the scalability of any > > rwlock, and I think we (ok, mostly Dave) came up with another idea. > > > > What if we had 2 counters in the superblock, one for the equivalent of > > SB_FREEZE_WRITE and one for SB_FREEZE_TRANS. These would use similar > > infrastructure to mnt_want_write et al. > > > > Everywhere we currently vfs_check_frozen() we'd have a better-named function > > which increments the counter, then checks the freeze level. If we are > > being frozen, we drop the counter & wait. If not frozen, we continue; > > like this pseudo-code: > > > > void super_freeze_wait(sb, level) { > > while (1) { > > level_ref++; > > if (!frozen(sb, level)) > > return; /* not freezing */ > > level_ref--; > > wait_unfrozen(sb, level); > > } > > } > > > > There would also be new code to drop the counter when the dirtying is complete. > > > > The freezing functions then just have to wait until the counters hit zero > > before they can consider themselves done, and freezing is complete. That way if > > someone sneaks in while the freeze level is being set, they have already > > notified their intent, and freeze can wait for it anyway before returning. > > Just to clarify, freeze_super would do: > > sb->s_frozen = SB_FREEZE_WRITE; > smp_wmb(); > > while (sb->s_active_write_cnt > 0) > wait; > > /* no new or existing dirtying writers now, safe to sync */ > sync_filesystem(sb); > > sb->s_frozen = SB_FREEZE_TRANS; > smp_wmb(); > > while (sb->s_active_trans_cnt > 0) > wait; > > /* no new or existing transactions in progress now, so freeze */ > sb->s_op->freeze_fs(sb); > > The counter implemetations will need to scale (e.g. per-cpu > counters) and we could probably use a generic waitqueue, but I think > this can all be implemented at the superblock level and we only need > to call the inc/dec helper functions in the correct places to make > it all work. Yeah, this is where my thoughts were going as well so I was already working on patches in this direction. I'll send a first version of them in a moment. Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2012-01-12 1:09 UTC | newest] Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2012-01-06 14:09 [RFC] How to fix broken freezing? Jan Kara 2012-01-11 21:53 ` Eric Sandeen 2012-01-11 22:36 ` Dave Chinner 2012-01-12 1:09 ` Jan Kara
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).