All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-05 16:00 ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:00 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, Alex Elder, Anton Altaparmakov,
	Ben Myers, Chris Mason, cluster-devel, David S. Miller,
	fuse-devel, J. Bruce Fields, Joel Becker, KONISHI Ryusuke,
	linux-btrfs, linux-ext4, linux-nfs, linux-nilfs, linux-ntfs-dev,
	Mark Fasheh, Miklos Szeredi, ocfs2-devel, OGAWA Hirofumi,
	Steven Whitehouse, Theodore Ts'o, xfs

  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder@kernel.org>
CC: Anton Altaparmakov <anton@tuxera.com>
CC: Ben Myers <bpm@sgi.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: cluster-devel@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: fuse-devel@lists.sourceforge.net
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Joel Becker <jlbec@evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-btrfs@vger.kernel.org
CC: linux-ext4@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-nilfs@vger.kernel.org
CC: linux-ntfs-dev@lists.sourceforge.net
CC: Mark Fasheh <mfasheh@suse.com>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: ocfs2-devel@oss.oracle.com
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: xfs@oss.sgi.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-05 16:00 ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:00 UTC (permalink / raw)
  To: LKML
  Cc: Jan Kara, J. Bruce Fields, cluster-devel-H+wXaHxf7aLQT0dZR+AlfA,
	ocfs2-devel-N0ozoZBvEnrZJqsBc5GL+g, KONISHI Ryusuke,
	OGAWA Hirofumi, sandeen-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nilfs-u79uwXL29TY76Z2rM5mHXA, Miklos Szeredi,
	Christoph Hellwig, Anton Altaparmakov,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Mark Fasheh,
	xfs-VZNHf3L845pBDgjK7y7TUQ, Ben Myers, Joel Becker,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA, Steven Whitehouse, Chris Mason,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, Alex Elder, Theodore Ts'o,
	linux-ntfs-dev-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Kamal Mostafa,
	Al Viro, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA,

  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
CC: Anton Altaparmakov <anton-yrGDUoBaLx3QT0dZR+AlfA@public.gmane.org>
CC: Ben Myers <bpm-sJ/iWh9BUns@public.gmane.org>
CC: Chris Mason <chris.mason-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
CC: cluster-devel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
CC: "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
CC: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
CC: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
CC: Joel Becker <jlbec-aKy9MeLSZ9dg9hUCZPvPmw@public.gmane.org>
CC: KONISHI Ryusuke <konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
CC: linux-btrfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-nilfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: linux-ntfs-dev-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
CC: Mark Fasheh <mfasheh-IBi9RG/b67k@public.gmane.org>
CC: Miklos Szeredi <miklos-sUDqSbJrdHQHWmgEVkV9KA@public.gmane.org>
CC: ocfs2-devel-N0ozoZBvEnrZJqsBc5GL+g@public.gmane.org
CC: OGAWA Hirofumi <hirofumi-UIVanBePwB70ZhReMnHkpc8NsWr+9BEh@public.gmane.org>
CC: Steven Whitehouse <swhiteho-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
CC: "Theodore Ts'o" <tytso-3s7WtUTddSA@public.gmane.org>
CC: xfs-VZNHf3L845pBDgjK7y7TUQ@public.gmane.org

------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-05 16:00 ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:00 UTC (permalink / raw)
  To: LKML
  Cc: Jan Kara, J. Bruce Fields, cluster-devel, ocfs2-devel,
	KONISHI Ryusuke, OGAWA Hirofumi, sandeen, linux-nilfs,
	Miklos Szeredi, Christoph Hellwig, Anton Altaparmakov,
	linux-ext4, fuse-devel, Mark Fasheh, xfs, Ben Myers, Joel Becker,
	dchinner, Steven Whitehouse, Chris Mason, linux-nfs, Alex Elder,
	Theodore Ts'o, linux-ntfs-dev, Kamal Mostafa, Al Viro,
	linux-fsdevel, David S. Miller, linux-btrfs

  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder@kernel.org>
CC: Anton Altaparmakov <anton@tuxera.com>
CC: Ben Myers <bpm@sgi.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: cluster-devel@redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: fuse-devel@lists.sourceforge.net
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Joel Becker <jlbec@evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-btrfs@vger.kernel.org
CC: linux-ext4@vger.kernel.org
CC: linux-nfs@vger.kernel.org
CC: linux-nilfs@vger.kernel.org
CC: linux-ntfs-dev@lists.sourceforge.net
CC: Mark Fasheh <mfasheh@suse.com>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: ocfs2-devel@oss.oracle.com
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: xfs@oss.sgi.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [Cluster-devel] [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-05 16:00 ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:00 UTC (permalink / raw)
  To: cluster-devel.redhat.com

  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder@kernel.org>
CC: Anton Altaparmakov <anton@tuxera.com>
CC: Ben Myers <bpm@sgi.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: cluster-devel at redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: fuse-devel at lists.sourceforge.net
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Joel Becker <jlbec@evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-btrfs at vger.kernel.org
CC: linux-ext4 at vger.kernel.org
CC: linux-nfs at vger.kernel.org
CC: linux-nilfs at vger.kernel.org
CC: linux-ntfs-dev at lists.sourceforge.net
CC: Mark Fasheh <mfasheh@suse.com>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: ocfs2-devel at oss.oracle.com
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: xfs at oss.sgi.com



^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-05 16:00 ` Jan Kara
                   ` (3 preceding siblings ...)
  (?)
@ 2012-03-05 16:00 ` Jan Kara
  2012-03-23 22:45   ` Andrew Morton
  -1 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:00 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

Make default vm_ops provide ->page_mkwrite handler. Currently it only updates
file's modification times and gets locked page but later it will also handle
filesystem freezing.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/mm.h |    1 +
 mm/filemap.c       |   20 ++++++++++++++++++++
 mm/filemap_xip.c   |    1 +
 3 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17b27cd..074c075 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1419,6 +1419,7 @@ extern void truncate_inode_pages_range(struct address_space *,
 
 /* generic vm_area_ops exported for stackable file systems */
 extern int filemap_fault(struct vm_area_struct *, struct vm_fault *);
+extern int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
 
 /* mm/page-writeback.c */
 int write_one_page(struct page *page, int wait);
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..b865c0b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1759,8 +1759,28 @@ page_not_uptodate:
 }
 EXPORT_SYMBOL(filemap_fault);
 
+int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+	int ret = VM_FAULT_LOCKED;
+
+	file_update_time(vma->vm_file);
+	lock_page(page);
+	if ((page->mapping != inode->i_mapping) ||
+	    (page_offset(page) > i_size_read(inode))) {
+		unlock_page(page);
+		ret = VM_FAULT_NOPAGE;
+		goto out;
+	}
+out:
+	return ret;
+}
+EXPORT_SYMBOL(filemap_page_mkwrite);
+
 const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
+	.page_mkwrite	= filemap_page_mkwrite,
 };
 
 /* This is used for a general mmap of a disk file */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index a4eb311..591dba6 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -304,6 +304,7 @@ out:
 
 static const struct vm_operations_struct xip_file_vm_ops = {
 	.fault	= xip_file_fault,
+	.page_mkwrite	= filemap_page_mkwrite,
 };
 
 int xip_file_mmap(struct file * file, struct vm_area_struct * vma)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 02/19] fs: Push mnt_want_write() outside of i_mutex
  2012-03-05 16:00 ` Jan Kara
@ 2012-03-05 16:01   ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, ocfs2-devel, Mark Fasheh, Joel Becker,
	David S. Miller

Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes
without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (essentially has a trylock semantics) but when the
function starts to handle also frozen filesystems, it will get a full lock
semantics and thus proper lock ordering has to be established. So move
all mnt_want_write() calls outside of i_mutex.

One non-trivial case needing conversion is kern_path_create() /
user_path_create() which didn't include mnt_want_write() but now needs to
because it acquires i_mutex.  Because there are virtual file systems which
don't bother with freeze / remount-ro protection we actually provide both
versions of the function - one which calls mnt_want_write() and one which does
not.

CC: ocfs2-devel@oss.oracle.com
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/namei.c              |  115 +++++++++++++++++++++++++++--------------------
 fs/ocfs2/refcounttree.c |   10 +---
 include/linux/namei.h   |    2 +
 net/unix/af_unix.c      |   13 ++----
 4 files changed, 74 insertions(+), 66 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a780ea5..575d3a4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2383,7 +2383,9 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 	return file;
 }
 
-struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+static struct dentry *do_kern_path_create(int dfd, const char *pathname,
+					  struct path *path, int is_dir,
+					  int freeze_protect)
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 	struct nameidata nd;
@@ -2401,6 +2403,14 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
 	nd.flags |= LOOKUP_CREATE | LOOKUP_EXCL;
 	nd.intent.open.flags = O_EXCL;
 
+	if (freeze_protect) {
+		error = mnt_want_write(nd.path.mnt);
+		if (error) {
+			dentry = ERR_PTR(error);
+			goto out;
+		}
+	}
+
 	/*
 	 * Do the final lookup.
 	 */
@@ -2429,24 +2439,49 @@ eexist:
 	dentry = ERR_PTR(-EEXIST);
 fail:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+	if (freeze_protect)
+		mnt_drop_write(nd.path.mnt);
 out:
 	path_put(&nd.path);
 	return dentry;
 }
+
+struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+	return do_kern_path_create(dfd, pathname, path, is_dir, 0);
+}
 EXPORT_SYMBOL(kern_path_create);
 
+struct dentry *kern_path_create_thawed(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+	return do_kern_path_create(dfd, pathname, path, is_dir, 1);
+}
+EXPORT_SYMBOL(kern_path_create_thawed);
+
 struct dentry *user_path_create(int dfd, const char __user *pathname, struct path *path, int is_dir)
 {
 	char *tmp = getname(pathname);
 	struct dentry *res;
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	res = kern_path_create(dfd, tmp, path, is_dir);
+	res = do_kern_path_create(dfd, tmp, path, is_dir, 0);
 	putname(tmp);
 	return res;
 }
 EXPORT_SYMBOL(user_path_create);
 
+struct dentry *user_path_create_thawed(int dfd, const char __user *pathname, struct path *path, int is_dir)
+{
+	char *tmp = getname(pathname);
+	struct dentry *res;
+	if (IS_ERR(tmp))
+		return ERR_CAST(tmp);
+	res = do_kern_path_create(dfd, tmp, path, is_dir, 1);
+	putname(tmp);
+	return res;
+}
+EXPORT_SYMBOL(user_path_create_thawed);
+
 int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 {
 	int error = may_create(dir, dentry);
@@ -2502,7 +2537,7 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 	if (S_ISDIR(mode))
 		return -EPERM;
 
-	dentry = user_path_create(dfd, filename, &path, 0);
+	dentry = user_path_create_thawed(dfd, filename, &path, 0);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -2511,12 +2546,9 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 	error = may_mknod(mode);
 	if (error)
 		goto out_dput;
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(path.dentry->d_inode,dentry,mode,NULL);
@@ -2529,11 +2561,10 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 			error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
 			break;
 	}
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 
 	return error;
@@ -2571,24 +2602,20 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
 	struct path path;
 	int error;
 
-	dentry = user_path_create(dfd, pathname, &path, 1);
+	dentry = user_path_create_thawed(dfd, pathname, &path, 1);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_mkdir(&path, dentry, mode);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 	return error;
 }
@@ -2683,6 +2710,9 @@ static long do_rmdir(int dfd, const char __user *pathname)
 	}
 
 	nd.flags &= ~LOOKUP_PARENT;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit1;
 
 	mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
@@ -2693,19 +2723,15 @@ static long do_rmdir(int dfd, const char __user *pathname)
 		error = -ENOENT;
 		goto exit3;
 	}
-	error = mnt_want_write(nd.path.mnt);
-	if (error)
-		goto exit3;
 	error = security_path_rmdir(&nd.path, dentry);
 	if (error)
-		goto exit4;
+		goto exit3;
 	error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
-exit4:
-	mnt_drop_write(nd.path.mnt);
 exit3:
 	dput(dentry);
 exit2:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+	mnt_drop_write(nd.path.mnt);
 exit1:
 	path_put(&nd.path);
 	putname(name);
@@ -2772,6 +2798,9 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		goto exit1;
 
 	nd.flags &= ~LOOKUP_PARENT;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit1;
 
 	mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
@@ -2784,21 +2813,17 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		if (!inode)
 			goto slashes;
 		ihold(inode);
-		error = mnt_want_write(nd.path.mnt);
-		if (error)
-			goto exit2;
 		error = security_path_unlink(&nd.path, dentry);
 		if (error)
-			goto exit3;
+			goto exit2;
 		error = vfs_unlink(nd.path.dentry->d_inode, dentry);
-exit3:
-		mnt_drop_write(nd.path.mnt);
-	exit2:
+exit2:
 		dput(dentry);
 	}
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
+	mnt_drop_write(nd.path.mnt);
 exit1:
 	path_put(&nd.path);
 	putname(name);
@@ -2858,23 +2883,19 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	if (IS_ERR(from))
 		return PTR_ERR(from);
 
-	dentry = user_path_create(newdfd, newname, &path, 0);
+	dentry = user_path_create_thawed(newdfd, newname, &path, 0);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_symlink(&path, dentry, from);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_symlink(path.dentry->d_inode, dentry, from);
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 out_putname:
 	putname(from);
@@ -2964,7 +2985,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	if (error)
 		return error;
 
-	new_dentry = user_path_create(newdfd, newname, &new_path, 0);
+	new_dentry = user_path_create_thawed(newdfd, newname, &new_path, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out;
@@ -2972,18 +2993,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
-	error = mnt_want_write(new_path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_link(old_path.dentry, &new_path, new_dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
-out_drop_write:
-	mnt_drop_write(new_path.mnt);
 out_dput:
 	dput(new_dentry);
 	mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+	mnt_drop_write(new_path.mnt);
 	path_put(&new_path);
 out:
 	path_put(&old_path);
@@ -3174,6 +3191,10 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	if (newnd.last_type != LAST_NORM)
 		goto exit2;
 
+	error = mnt_want_write(oldnd.path.mnt);
+	if (error)
+		goto exit2;
+
 	oldnd.flags &= ~LOOKUP_PARENT;
 	newnd.flags &= ~LOOKUP_PARENT;
 	newnd.flags |= LOOKUP_RENAME_TARGET;
@@ -3209,23 +3230,19 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	if (new_dentry == trap)
 		goto exit5;
 
-	error = mnt_want_write(oldnd.path.mnt);
-	if (error)
-		goto exit5;
 	error = security_path_rename(&oldnd.path, old_dentry,
 				     &newnd.path, new_dentry);
 	if (error)
-		goto exit6;
+		goto exit5;
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
-exit6:
-	mnt_drop_write(oldnd.path.mnt);
 exit5:
 	dput(new_dentry);
 exit4:
 	dput(old_dentry);
 exit3:
 	unlock_rename(new_dir, old_dir);
+	mnt_drop_write(oldnd.path.mnt);
 exit2:
 	path_put(&newnd.path);
 	putname(to);
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index cf78233..a99b8e2 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4453,7 +4453,7 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 		return error;
 	}
 
-	new_dentry = user_path_create(AT_FDCWD, newname, &new_path, 0);
+	new_dentry = user_path_create_thawed(AT_FDCWD, newname, &new_path, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry)) {
 		mlog_errno(error);
@@ -4466,19 +4466,13 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 		goto out_dput;
 	}
 
-	error = mnt_want_write(new_path.mnt);
-	if (error) {
-		mlog_errno(error);
-		goto out_dput;
-	}
-
 	error = ocfs2_vfs_reflink(old_path.dentry,
 				  new_path.dentry->d_inode,
 				  new_dentry, preserve);
-	mnt_drop_write(new_path.mnt);
 out_dput:
 	dput(new_dentry);
 	mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+	mnt_drop_write(new_path.mnt);
 	path_put(&new_path);
 out:
 	path_put(&old_path);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ffc0213..432f6bb 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -77,7 +77,9 @@ extern int user_path_at_empty(int, const char __user *, unsigned, struct path *,
 extern int kern_path(const char *, unsigned, struct path *);
 
 extern struct dentry *kern_path_create(int, const char *, struct path *, int);
+extern struct dentry *kern_path_create_thawed(int, const char *, struct path *, int);
 extern struct dentry *user_path_create(int, const char __user *, struct path *, int);
+extern struct dentry *user_path_create_thawed(int, const char __user *, struct path *, int);
 extern int kern_path_parent(const char *, struct nameidata *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 85d3bb7..7888b09 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -856,7 +856,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		 * Get the parent directory, calculate the hash for last
 		 * component.
 		 */
-		dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
+		dentry = kern_path_create_thawed(AT_FDCWD, sun_path, &path, 0);
 		err = PTR_ERR(dentry);
 		if (IS_ERR(dentry))
 			goto out_mknod_parent;
@@ -866,19 +866,13 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		 */
 		mode = S_IFSOCK |
 		       (SOCK_INODE(sock)->i_mode & ~current_umask());
-		err = mnt_want_write(path.mnt);
-		if (err)
-			goto out_mknod_dput;
 		err = security_path_mknod(&path, dentry, mode, 0);
 		if (err)
-			goto out_mknod_drop_write;
-		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
-out_mknod_drop_write:
-		mnt_drop_write(path.mnt);
-		if (err)
 			goto out_mknod_dput;
+		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
 		mutex_unlock(&path.dentry->d_inode->i_mutex);
 		dput(path.dentry);
+		mnt_drop_write(path.mnt);
 		path.dentry = dentry;
 
 		addr->hash = UNIX_HASH_SIZE;
@@ -916,6 +910,7 @@ out:
 out_mknod_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 out_mknod_parent:
 	if (err == -EEXIST)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 03/19] fat: Push mnt_want_write() outside of i_mutex
  2012-03-05 16:00 ` Jan Kara
                   ` (5 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, OGAWA Hirofumi

When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
outside of i_mutex as in other places.

CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fat/file.c |   15 +++++++--------
 1 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/fat/file.c b/fs/fat/file.c
index a71fe37..e007b8b 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -43,10 +43,10 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
 	if (err)
 		goto out;
 
-	mutex_lock(&inode->i_mutex);
 	err = mnt_want_write_file(file);
 	if (err)
-		goto out_unlock_inode;
+		goto out;
+	mutex_lock(&inode->i_mutex);
 
 	/*
 	 * ATTR_VOLUME and ATTR_DIR cannot be changed; this also
@@ -73,14 +73,14 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
 	/* The root directory has no attributes */
 	if (inode->i_ino == MSDOS_ROOT_INO && attr != ATTR_DIR) {
 		err = -EINVAL;
-		goto out_drop_write;
+		goto out_unlock_inode;
 	}
 
 	if (sbi->options.sys_immutable &&
 	    ((attr | oldattr) & ATTR_SYS) &&
 	    !capable(CAP_LINUX_IMMUTABLE)) {
 		err = -EPERM;
-		goto out_drop_write;
+		goto out_unlock_inode;
 	}
 
 	/*
@@ -90,12 +90,12 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
 	 */
 	err = security_inode_setattr(file->f_path.dentry, &ia);
 	if (err)
-		goto out_drop_write;
+		goto out_unlock_inode;
 
 	/* This MUST be done before doing anything irreversible... */
 	err = fat_setattr(file->f_path.dentry, &ia);
 	if (err)
-		goto out_drop_write;
+		goto out_unlock_inode;
 
 	fsnotify_change(file->f_path.dentry, ia.ia_valid);
 	if (sbi->options.sys_immutable) {
@@ -107,10 +107,9 @@ static int fat_ioctl_set_attributes(struct file *file, u32 __user *user_attr)
 
 	fat_save_attrs(inode, attr);
 	mark_inode_dirty(inode);
-out_drop_write:
-	mnt_drop_write_file(file);
 out_unlock_inode:
 	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(file);
 out:
 	return err;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 04/19] btrfs: Push mnt_want_write() outside of i_mutex
  2012-03-05 16:00 ` Jan Kara
                   ` (6 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, Chris Mason, linux-btrfs

When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: Chris Mason <chris.mason@oracle.com>
CC: linux-btrfs@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/btrfs/ioctl.c |   23 +++++++++++------------
 1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 03bb62a..c855e55 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -192,6 +192,10 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 	if (!inode_owner_or_capable(inode))
 		return -EACCES;
 
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+
 	mutex_lock(&inode->i_mutex);
 
 	ip_oldflags = ip->flags;
@@ -206,10 +210,6 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		}
 	}
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		goto out_unlock;
-
 	if (flags & FS_SYNC_FL)
 		ip->flags |= BTRFS_INODE_SYNC;
 	else
@@ -271,9 +271,9 @@ static int btrfs_ioctl_setflags(struct file *file, void __user *arg)
 		inode->i_flags = i_oldflags;
 	}
 
-	mnt_drop_write_file(file);
  out_unlock:
 	mutex_unlock(&inode->i_mutex);
+	mnt_drop_write_file(file);
 	return ret;
 }
 
@@ -624,6 +624,10 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	struct dentry *dentry;
 	int error;
 
+	error = mnt_want_write(parent->mnt);
+	if (error)
+		return error;
+
 	mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
 	dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -635,13 +639,9 @@ static noinline int btrfs_mksubvol(struct path *parent,
 	if (dentry->d_inode)
 		goto out_dput;
 
-	error = mnt_want_write(parent->mnt);
-	if (error)
-		goto out_dput;
-
 	error = btrfs_may_create(dir, dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 
 	down_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
 
@@ -659,12 +659,11 @@ static noinline int btrfs_mksubvol(struct path *parent,
 		fsnotify_mkdir(dir, dentry);
 out_up_read:
 	up_read(&BTRFS_I(dir)->root->fs_info->subvol_sem);
-out_drop_write:
-	mnt_drop_write(parent->mnt);
 out_dput:
 	dput(dentry);
 out_unlock:
 	mutex_unlock(&dir->i_mutex);
+	mnt_drop_write(parent->mnt);
 	return error;
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 05/19] nfsd: Push mnt_want_write() outside of i_mutex
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, linux-nfs, J. Bruce Fields

When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: linux-nfs@vger.kernel.org
CC: "J. Bruce Fields" <bfields@fieldses.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/nfsd/nfs4recover.c      |    9 +++--
 fs/nfsd/nfsfh.c            |    1 +
 fs/nfsd/nfsproc.c          |    9 ++++-
 fs/nfsd/vfs.c              |   79 ++++++++++++++++++++++---------------------
 fs/nfsd/vfs.h              |   11 +++++-
 include/linux/nfsd/nfsfh.h |    1 +
 6 files changed, 64 insertions(+), 46 deletions(-)

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 0b3e875..8e599a6 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -135,6 +135,10 @@ void nfsd4_create_clid_dir(struct nfs4_client *clp)
 	if (status < 0)
 		return;
 
+	status = mnt_want_write_file(rec_file);
+	if (status)
+		return;
+
 	dir = rec_file->f_path.dentry;
 	/* lock the parent */
 	mutex_lock(&dir->d_inode->i_mutex);
@@ -154,11 +158,7 @@ void nfsd4_create_clid_dir(struct nfs4_client *clp)
 		 * as well be forgiving and just succeed silently.
 		 */
 		goto out_put;
-	status = mnt_want_write_file(rec_file);
-	if (status)
-		goto out_put;
 	status = vfs_mkdir(dir->d_inode, dentry, S_IRWXU);
-	mnt_drop_write_file(rec_file);
 out_put:
 	dput(dentry);
 out_unlock:
@@ -170,6 +170,7 @@ out_unlock:
 				" (err %d); please check that %s exists"
 				" and is writeable", status,
 				user_recovery_dirname);
+	mnt_drop_write_file(rec_file);
 	nfs4_reset_creds(original_cred);
 }
 
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 68454e7..8b93353 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -635,6 +635,7 @@ fh_put(struct svc_fh *fhp)
 		fhp->fh_post_saved = 0;
 #endif
 	}
+	fh_drop_write(fhp);
 	if (exp) {
 		cache_put(&exp->h, &svc_export_cache);
 		fhp->fh_export = NULL;
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index e15dc45..aad6d45 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -196,6 +196,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
 	struct dentry	*dchild;
 	int		type, mode;
 	__be32		nfserr;
+	int		hosterr;
 	dev_t		rdev = 0, wanted = new_decode_dev(attr->ia_size);
 
 	dprintk("nfsd: CREATE   %s %.*s\n",
@@ -214,6 +215,12 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
 	nfserr = nfserr_exist;
 	if (isdotent(argp->name, argp->len))
 		goto done;
+	hosterr = fh_want_write(dirfhp);
+	if (hosterr) {
+		nfserr = nfserrno(hosterr);
+		goto done;
+	}
+
 	fh_lock_nested(dirfhp, I_MUTEX_PARENT);
 	dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
 	if (IS_ERR(dchild)) {
@@ -330,7 +337,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
 out_unlock:
 	/* We don't really need to unlock, as fh_put does it. */
 	fh_unlock(dirfhp);
-
+	fh_drop_write(dirfhp);
 done:
 	fh_put(dirfhp);
 	return nfsd_return_dirop(nfserr, resp);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index edf6d3e..af49c63 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1268,6 +1268,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	 * If it has, the parent directory should already be locked.
 	 */
 	if (!resfhp->fh_dentry) {
+		host_err = fh_want_write(fhp);
+		if (host_err)
+			goto out_nfserr;
+
 		/* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
 		fh_lock_nested(fhp, I_MUTEX_PARENT);
 		dchild = lookup_one_len(fname, dentry, flen);
@@ -1311,14 +1315,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		goto out;
 	}
 
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_nfserr;
-
 	/*
 	 * Get the dir op function pointer.
 	 */
 	err = 0;
+	host_err = 0;
 	switch (type) {
 	case S_IFREG:
 		host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
@@ -1335,10 +1336,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
 		break;
 	}
-	if (host_err < 0) {
-		fh_drop_write(fhp);
+	if (host_err < 0)
 		goto out_nfserr;
-	}
 
 	err = nfsd_create_setattr(rqstp, resfhp, iap);
 
@@ -1350,7 +1349,6 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err2 = nfserrno(commit_metadata(fhp));
 	if (err2)
 		err = err2;
-	fh_drop_write(fhp);
 	/*
 	 * Update the file handle to get the new inode info.
 	 */
@@ -1409,6 +1407,11 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err = nfserr_notdir;
 	if (!dirp->i_op->lookup)
 		goto out;
+
+	host_err = fh_want_write(fhp);
+	if (host_err)
+		goto out_nfserr;
+
 	fh_lock_nested(fhp, I_MUTEX_PARENT);
 
 	/*
@@ -1441,9 +1444,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		v_atime = verifier[1]&0x7fffffff;
 	}
 	
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_nfserr;
 	if (dchild->d_inode) {
 		err = 0;
 
@@ -1514,7 +1514,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (!err)
 		err = nfserrno(commit_metadata(fhp));
 
-	fh_drop_write(fhp);
 	/*
 	 * Update the filehandle to get the new inode info.
 	 */
@@ -1525,6 +1524,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	fh_unlock(fhp);
 	if (dchild && !IS_ERR(dchild))
 		dput(dchild);
+	fh_drop_write(fhp);
  	return err;
  
  out_nfserr:
@@ -1604,6 +1604,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
 	if (err)
 		goto out;
+
+	host_err = fh_want_write(fhp);
+	if (host_err)
+		goto out_nfserr;
+
 	fh_lock(fhp);
 	dentry = fhp->fh_dentry;
 	dnew = lookup_one_len(fname, dentry, flen);
@@ -1611,10 +1616,6 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (IS_ERR(dnew))
 		goto out_nfserr;
 
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_nfserr;
-
 	if (unlikely(path[plen] != 0)) {
 		char *path_alloced = kmalloc(plen+1, GFP_KERNEL);
 		if (path_alloced == NULL)
@@ -1674,6 +1675,12 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 	if (isdotent(name, len))
 		goto out;
 
+	host_err = fh_want_write(tfhp);
+	if (host_err) {
+		err = nfserrno(host_err);
+		goto out;
+	}
+
 	fh_lock_nested(ffhp, I_MUTEX_PARENT);
 	ddir = ffhp->fh_dentry;
 	dirp = ddir->d_inode;
@@ -1685,18 +1692,13 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 
 	dold = tfhp->fh_dentry;
 
-	host_err = fh_want_write(tfhp);
-	if (host_err) {
-		err = nfserrno(host_err);
-		goto out_dput;
-	}
 	err = nfserr_noent;
 	if (!dold->d_inode)
-		goto out_drop_write;
+		goto out_dput;
 	host_err = nfsd_break_lease(dold->d_inode);
 	if (host_err) {
 		err = nfserrno(host_err);
-		goto out_drop_write;
+		goto out_dput;
 	}
 	host_err = vfs_link(dold, dirp, dnew);
 	if (!host_err) {
@@ -1709,12 +1711,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 		else
 			err = nfserrno(host_err);
 	}
-out_drop_write:
-	fh_drop_write(tfhp);
 out_dput:
 	dput(dnew);
 out_unlock:
 	fh_unlock(ffhp);
+	fh_drop_write(tfhp);
 out:
 	return err;
 
@@ -1757,6 +1758,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	if (!flen || isdotent(fname, flen) || !tlen || isdotent(tname, tlen))
 		goto out;
 
+	host_err = fh_want_write(ffhp);
+	if (host_err) {
+		err = nfserrno(host_err);
+		goto out;
+	}
+
 	/* cannot use fh_lock as we need deadlock protective ordering
 	 * so do it by hand */
 	trap = lock_rename(tdentry, fdentry);
@@ -1787,17 +1794,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	host_err = -EXDEV;
 	if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
 		goto out_dput_new;
-	host_err = fh_want_write(ffhp);
-	if (host_err)
-		goto out_dput_new;
 
 	host_err = nfsd_break_lease(odentry->d_inode);
 	if (host_err)
-		goto out_drop_write;
+		goto out_dput_new;
 	if (ndentry->d_inode) {
 		host_err = nfsd_break_lease(ndentry->d_inode);
 		if (host_err)
-			goto out_drop_write;
+			goto out_dput_new;
 	}
 	host_err = vfs_rename(fdir, odentry, tdir, ndentry);
 	if (!host_err) {
@@ -1805,8 +1809,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 		if (!host_err)
 			host_err = commit_metadata(ffhp);
 	}
-out_drop_write:
-	fh_drop_write(ffhp);
  out_dput_new:
 	dput(ndentry);
  out_dput_old:
@@ -1822,6 +1824,7 @@ out_drop_write:
 	fill_post_wcc(tfhp);
 	unlock_rename(tdentry, fdentry);
 	ffhp->fh_locked = tfhp->fh_locked = 0;
+	fh_drop_write(ffhp);
 
 out:
 	return err;
@@ -1847,6 +1850,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 	if (err)
 		goto out;
 
+	host_err = fh_want_write(fhp);
+	if (host_err)
+		goto out_nfserr;
+
 	fh_lock_nested(fhp, I_MUTEX_PARENT);
 	dentry = fhp->fh_dentry;
 	dirp = dentry->d_inode;
@@ -1865,21 +1872,15 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 	if (!type)
 		type = rdentry->d_inode->i_mode & S_IFMT;
 
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_put;
-
 	host_err = nfsd_break_lease(rdentry->d_inode);
 	if (host_err)
-		goto out_drop_write;
+		goto out_put;
 	if (type != S_IFDIR)
 		host_err = vfs_unlink(dirp, rdentry);
 	else
 		host_err = vfs_rmdir(dirp, rdentry);
 	if (!host_err)
 		host_err = commit_metadata(fhp);
-out_drop_write:
-	fh_drop_write(fhp);
 out_put:
 	dput(rdentry);
 
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 1dcd238..d5a50b4 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -108,12 +108,19 @@ int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);
 
 static inline int fh_want_write(struct svc_fh *fh)
 {
-	return mnt_want_write(fh->fh_export->ex_path.mnt);
+	int ret = mnt_want_write(fh->fh_export->ex_path.mnt);
+
+	if (!ret)
+		fh->fh_want_write = 1;
+	return ret;
 }
 
 static inline void fh_drop_write(struct svc_fh *fh)
 {
-	mnt_drop_write(fh->fh_export->ex_path.mnt);
+	if (fh->fh_want_write) {
+		fh->fh_want_write = 0;
+		mnt_drop_write(fh->fh_export->ex_path.mnt);
+	}
 }
 
 #endif /* LINUX_NFSD_VFS_H */
diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
index ce4743a..fa63048 100644
--- a/include/linux/nfsd/nfsfh.h
+++ b/include/linux/nfsd/nfsfh.h
@@ -143,6 +143,7 @@ typedef struct svc_fh {
 	int			fh_maxsize;	/* max size for fh_handle */
 
 	unsigned char		fh_locked;	/* inode locked by us */
+	unsigned char		fh_want_write;	/* remount protection taken */
 
 #ifdef CONFIG_NFSD_V3
 	unsigned char		fh_post_saved;	/* post-op attrs saved */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 05/19] nfsd: Push mnt_want_write() outside of i_mutex
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Al Viro, Christoph Hellwig,
	dchinner-H+wXaHxf7aLQT0dZR+AlfA, sandeen-H+wXaHxf7aLQT0dZR+AlfA,
	Kamal Mostafa, Jan Kara, linux-nfs-u79uwXL29TY76Z2rM5mHXA,
	J. Bruce Fields

When mnt_want_write() starts to handle freezing it will get a full lock
semantics requiring proper lock ordering. So push mnt_want_write() call
consistently outside of i_mutex.

CC: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
CC: "J. Bruce Fields" <bfields-uC3wQj2KruNg9hUCZPvPmw@public.gmane.org>
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/nfsd/nfs4recover.c      |    9 +++--
 fs/nfsd/nfsfh.c            |    1 +
 fs/nfsd/nfsproc.c          |    9 ++++-
 fs/nfsd/vfs.c              |   79 ++++++++++++++++++++++---------------------
 fs/nfsd/vfs.h              |   11 +++++-
 include/linux/nfsd/nfsfh.h |    1 +
 6 files changed, 64 insertions(+), 46 deletions(-)

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 0b3e875..8e599a6 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -135,6 +135,10 @@ void nfsd4_create_clid_dir(struct nfs4_client *clp)
 	if (status < 0)
 		return;
 
+	status = mnt_want_write_file(rec_file);
+	if (status)
+		return;
+
 	dir = rec_file->f_path.dentry;
 	/* lock the parent */
 	mutex_lock(&dir->d_inode->i_mutex);
@@ -154,11 +158,7 @@ void nfsd4_create_clid_dir(struct nfs4_client *clp)
 		 * as well be forgiving and just succeed silently.
 		 */
 		goto out_put;
-	status = mnt_want_write_file(rec_file);
-	if (status)
-		goto out_put;
 	status = vfs_mkdir(dir->d_inode, dentry, S_IRWXU);
-	mnt_drop_write_file(rec_file);
 out_put:
 	dput(dentry);
 out_unlock:
@@ -170,6 +170,7 @@ out_unlock:
 				" (err %d); please check that %s exists"
 				" and is writeable", status,
 				user_recovery_dirname);
+	mnt_drop_write_file(rec_file);
 	nfs4_reset_creds(original_cred);
 }
 
diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c
index 68454e7..8b93353 100644
--- a/fs/nfsd/nfsfh.c
+++ b/fs/nfsd/nfsfh.c
@@ -635,6 +635,7 @@ fh_put(struct svc_fh *fhp)
 		fhp->fh_post_saved = 0;
 #endif
 	}
+	fh_drop_write(fhp);
 	if (exp) {
 		cache_put(&exp->h, &svc_export_cache);
 		fhp->fh_export = NULL;
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index e15dc45..aad6d45 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -196,6 +196,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
 	struct dentry	*dchild;
 	int		type, mode;
 	__be32		nfserr;
+	int		hosterr;
 	dev_t		rdev = 0, wanted = new_decode_dev(attr->ia_size);
 
 	dprintk("nfsd: CREATE   %s %.*s\n",
@@ -214,6 +215,12 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
 	nfserr = nfserr_exist;
 	if (isdotent(argp->name, argp->len))
 		goto done;
+	hosterr = fh_want_write(dirfhp);
+	if (hosterr) {
+		nfserr = nfserrno(hosterr);
+		goto done;
+	}
+
 	fh_lock_nested(dirfhp, I_MUTEX_PARENT);
 	dchild = lookup_one_len(argp->name, dirfhp->fh_dentry, argp->len);
 	if (IS_ERR(dchild)) {
@@ -330,7 +337,7 @@ nfsd_proc_create(struct svc_rqst *rqstp, struct nfsd_createargs *argp,
 out_unlock:
 	/* We don't really need to unlock, as fh_put does it. */
 	fh_unlock(dirfhp);
-
+	fh_drop_write(dirfhp);
 done:
 	fh_put(dirfhp);
 	return nfsd_return_dirop(nfserr, resp);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index edf6d3e..af49c63 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1268,6 +1268,10 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	 * If it has, the parent directory should already be locked.
 	 */
 	if (!resfhp->fh_dentry) {
+		host_err = fh_want_write(fhp);
+		if (host_err)
+			goto out_nfserr;
+
 		/* called from nfsd_proc_mkdir, or possibly nfsd3_proc_create */
 		fh_lock_nested(fhp, I_MUTEX_PARENT);
 		dchild = lookup_one_len(fname, dentry, flen);
@@ -1311,14 +1315,11 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		goto out;
 	}
 
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_nfserr;
-
 	/*
 	 * Get the dir op function pointer.
 	 */
 	err = 0;
+	host_err = 0;
 	switch (type) {
 	case S_IFREG:
 		host_err = vfs_create(dirp, dchild, iap->ia_mode, NULL);
@@ -1335,10 +1336,8 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		host_err = vfs_mknod(dirp, dchild, iap->ia_mode, rdev);
 		break;
 	}
-	if (host_err < 0) {
-		fh_drop_write(fhp);
+	if (host_err < 0)
 		goto out_nfserr;
-	}
 
 	err = nfsd_create_setattr(rqstp, resfhp, iap);
 
@@ -1350,7 +1349,6 @@ nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err2 = nfserrno(commit_metadata(fhp));
 	if (err2)
 		err = err2;
-	fh_drop_write(fhp);
 	/*
 	 * Update the file handle to get the new inode info.
 	 */
@@ -1409,6 +1407,11 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err = nfserr_notdir;
 	if (!dirp->i_op->lookup)
 		goto out;
+
+	host_err = fh_want_write(fhp);
+	if (host_err)
+		goto out_nfserr;
+
 	fh_lock_nested(fhp, I_MUTEX_PARENT);
 
 	/*
@@ -1441,9 +1444,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 		v_atime = verifier[1]&0x7fffffff;
 	}
 	
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_nfserr;
 	if (dchild->d_inode) {
 		err = 0;
 
@@ -1514,7 +1514,6 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (!err)
 		err = nfserrno(commit_metadata(fhp));
 
-	fh_drop_write(fhp);
 	/*
 	 * Update the filehandle to get the new inode info.
 	 */
@@ -1525,6 +1524,7 @@ do_nfsd_create(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	fh_unlock(fhp);
 	if (dchild && !IS_ERR(dchild))
 		dput(dchild);
+	fh_drop_write(fhp);
  	return err;
  
  out_nfserr:
@@ -1604,6 +1604,11 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	err = fh_verify(rqstp, fhp, S_IFDIR, NFSD_MAY_CREATE);
 	if (err)
 		goto out;
+
+	host_err = fh_want_write(fhp);
+	if (host_err)
+		goto out_nfserr;
+
 	fh_lock(fhp);
 	dentry = fhp->fh_dentry;
 	dnew = lookup_one_len(fname, dentry, flen);
@@ -1611,10 +1616,6 @@ nfsd_symlink(struct svc_rqst *rqstp, struct svc_fh *fhp,
 	if (IS_ERR(dnew))
 		goto out_nfserr;
 
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_nfserr;
-
 	if (unlikely(path[plen] != 0)) {
 		char *path_alloced = kmalloc(plen+1, GFP_KERNEL);
 		if (path_alloced == NULL)
@@ -1674,6 +1675,12 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 	if (isdotent(name, len))
 		goto out;
 
+	host_err = fh_want_write(tfhp);
+	if (host_err) {
+		err = nfserrno(host_err);
+		goto out;
+	}
+
 	fh_lock_nested(ffhp, I_MUTEX_PARENT);
 	ddir = ffhp->fh_dentry;
 	dirp = ddir->d_inode;
@@ -1685,18 +1692,13 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 
 	dold = tfhp->fh_dentry;
 
-	host_err = fh_want_write(tfhp);
-	if (host_err) {
-		err = nfserrno(host_err);
-		goto out_dput;
-	}
 	err = nfserr_noent;
 	if (!dold->d_inode)
-		goto out_drop_write;
+		goto out_dput;
 	host_err = nfsd_break_lease(dold->d_inode);
 	if (host_err) {
 		err = nfserrno(host_err);
-		goto out_drop_write;
+		goto out_dput;
 	}
 	host_err = vfs_link(dold, dirp, dnew);
 	if (!host_err) {
@@ -1709,12 +1711,11 @@ nfsd_link(struct svc_rqst *rqstp, struct svc_fh *ffhp,
 		else
 			err = nfserrno(host_err);
 	}
-out_drop_write:
-	fh_drop_write(tfhp);
 out_dput:
 	dput(dnew);
 out_unlock:
 	fh_unlock(ffhp);
+	fh_drop_write(tfhp);
 out:
 	return err;
 
@@ -1757,6 +1758,12 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	if (!flen || isdotent(fname, flen) || !tlen || isdotent(tname, tlen))
 		goto out;
 
+	host_err = fh_want_write(ffhp);
+	if (host_err) {
+		err = nfserrno(host_err);
+		goto out;
+	}
+
 	/* cannot use fh_lock as we need deadlock protective ordering
 	 * so do it by hand */
 	trap = lock_rename(tdentry, fdentry);
@@ -1787,17 +1794,14 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 	host_err = -EXDEV;
 	if (ffhp->fh_export->ex_path.mnt != tfhp->fh_export->ex_path.mnt)
 		goto out_dput_new;
-	host_err = fh_want_write(ffhp);
-	if (host_err)
-		goto out_dput_new;
 
 	host_err = nfsd_break_lease(odentry->d_inode);
 	if (host_err)
-		goto out_drop_write;
+		goto out_dput_new;
 	if (ndentry->d_inode) {
 		host_err = nfsd_break_lease(ndentry->d_inode);
 		if (host_err)
-			goto out_drop_write;
+			goto out_dput_new;
 	}
 	host_err = vfs_rename(fdir, odentry, tdir, ndentry);
 	if (!host_err) {
@@ -1805,8 +1809,6 @@ nfsd_rename(struct svc_rqst *rqstp, struct svc_fh *ffhp, char *fname, int flen,
 		if (!host_err)
 			host_err = commit_metadata(ffhp);
 	}
-out_drop_write:
-	fh_drop_write(ffhp);
  out_dput_new:
 	dput(ndentry);
  out_dput_old:
@@ -1822,6 +1824,7 @@ out_drop_write:
 	fill_post_wcc(tfhp);
 	unlock_rename(tdentry, fdentry);
 	ffhp->fh_locked = tfhp->fh_locked = 0;
+	fh_drop_write(ffhp);
 
 out:
 	return err;
@@ -1847,6 +1850,10 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 	if (err)
 		goto out;
 
+	host_err = fh_want_write(fhp);
+	if (host_err)
+		goto out_nfserr;
+
 	fh_lock_nested(fhp, I_MUTEX_PARENT);
 	dentry = fhp->fh_dentry;
 	dirp = dentry->d_inode;
@@ -1865,21 +1872,15 @@ nfsd_unlink(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
 	if (!type)
 		type = rdentry->d_inode->i_mode & S_IFMT;
 
-	host_err = fh_want_write(fhp);
-	if (host_err)
-		goto out_put;
-
 	host_err = nfsd_break_lease(rdentry->d_inode);
 	if (host_err)
-		goto out_drop_write;
+		goto out_put;
 	if (type != S_IFDIR)
 		host_err = vfs_unlink(dirp, rdentry);
 	else
 		host_err = vfs_rmdir(dirp, rdentry);
 	if (!host_err)
 		host_err = commit_metadata(fhp);
-out_drop_write:
-	fh_drop_write(fhp);
 out_put:
 	dput(rdentry);
 
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 1dcd238..d5a50b4 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -108,12 +108,19 @@ int nfsd_set_posix_acl(struct svc_fh *, int, struct posix_acl *);
 
 static inline int fh_want_write(struct svc_fh *fh)
 {
-	return mnt_want_write(fh->fh_export->ex_path.mnt);
+	int ret = mnt_want_write(fh->fh_export->ex_path.mnt);
+
+	if (!ret)
+		fh->fh_want_write = 1;
+	return ret;
 }
 
 static inline void fh_drop_write(struct svc_fh *fh)
 {
-	mnt_drop_write(fh->fh_export->ex_path.mnt);
+	if (fh->fh_want_write) {
+		fh->fh_want_write = 0;
+		mnt_drop_write(fh->fh_export->ex_path.mnt);
+	}
 }
 
 #endif /* LINUX_NFSD_VFS_H */
diff --git a/include/linux/nfsd/nfsfh.h b/include/linux/nfsd/nfsfh.h
index ce4743a..fa63048 100644
--- a/include/linux/nfsd/nfsfh.h
+++ b/include/linux/nfsd/nfsfh.h
@@ -143,6 +143,7 @@ typedef struct svc_fh {
 	int			fh_maxsize;	/* max size for fh_handle */
 
 	unsigned char		fh_locked;	/* inode locked by us */
+	unsigned char		fh_want_write;	/* remount protection taken */
 
 #ifdef CONFIG_NFSD_V3
 	unsigned char		fh_post_saved;	/* post-op attrs saved */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 06/19] fs: Improve filesystem freezing handling
  2012-03-05 16:00 ` Jan Kara
                   ` (8 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

vfs_check_frozen() tests are racy since the filesystem can be frozen just after
the test is performed. Thus in write paths we can end up marking some pages or
inodes dirty even though the file system is already frozen. This creates
problems with flusher thread hanging on frozen filesystem.

Another problem is that exclusion between ->page_mkwrite() and filesystem
freezing has been handled by setting page dirty and then verifying s_frozen.
This guaranteed that either the freezing code sees the faulted page, writes it,
and writeprotects it again or we see s_frozen set and bail out of page fault.
This works to protect from page being marked writeable while filesystem
freezing is running but has an unpleasant artefact of leaving dirty (although
unmodified and writeprotected) pages on frozen filesystem resulting in similar
problems with flusher thread as the first problem.

This patch aims at providing exclusion between write paths and filesystem
freezing. We implement a writer-freeze read-write semaphore in the superblock.
Actually, there are two such semaphores because of lock ranking reasons - one
for page fault handlers (->page_mkwrite) and one for all other writers.  Write
paths which should block freezing (e.g. directory operations, ->aio_write(),
->page_mkwrite) hold reader side of the semaphore. Code freezing the filesystem
takes the writer side.

Only that we don't really want to bounce cachelines of the semaphores between
CPUs for each write happening. So we implement the reader side of the semaphore
as a per-cpu counter and the writer side is implemented using s_writers.frozen
superblock field.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c         |  262 +++++++++++++++++++++++++++++++++++++++++++++++-----
 include/linux/fs.h |   42 +++++++--
 2 files changed, 274 insertions(+), 30 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 6277ec6..8b0b40c 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -32,12 +32,15 @@
 #include <linux/backing-dev.h>
 #include <linux/rculist_bl.h>
 #include <linux/cleancache.h>
+#include <linux/lockdep.h>
 #include "internal.h"
 
 
 LIST_HEAD(super_blocks);
 DEFINE_SPINLOCK(sb_lock);
 
+static struct lock_class_key sb_writers_key[2];
+
 /*
  * One thing we have to be careful of with a per-sb shrinker is that we don't
  * drop the last active reference to the superblock from within the shrinker.
@@ -101,6 +104,33 @@ static int prune_super(struct shrinker *shrink, struct shrink_control *sc)
 	return total_objects;
 }
 
+static int init_sb_writers(struct super_block *s)
+{
+	int err;
+
+	err = percpu_counter_init(&s->s_writers.counter[0], 0);
+	if (err < 0)
+		return err;
+	err = percpu_counter_init(&s->s_writers.counter[1], 0);
+	if (err < 0) {
+		percpu_counter_destroy(&s->s_writers.counter[0]);
+		return err;
+	}
+	init_waitqueue_head(&s->s_writers.wait);
+	init_waitqueue_head(&s->s_writers.wait_unfrozen);
+	lockdep_init_map(&s->s_writers.lock_map[0], "sb_writers",
+			 &sb_writers_key[0], 0);
+	lockdep_init_map(&s->s_writers.lock_map[1], "sb_pagefaults",
+			 &sb_writers_key[1], 0);
+	return 0;
+}
+
+static void destroy_sb_writers(struct super_block *s)
+{
+	percpu_counter_destroy(&s->s_writers.counter[0]);
+	percpu_counter_destroy(&s->s_writers.counter[1]);
+}
+
 /**
  *	alloc_super	-	create new superblock
  *	@type:	filesystem type superblock should belong to
@@ -115,18 +145,19 @@ static struct super_block *alloc_super(struct file_system_type *type)
 
 	if (s) {
 		if (security_sb_alloc(s)) {
+			/*
+			 * We cannot call security_sb_free() without
+			 * security_sb_alloc() succeeding. So bail out manually
+			 */
 			kfree(s);
 			s = NULL;
 			goto out;
 		}
 #ifdef CONFIG_SMP
 		s->s_files = alloc_percpu(struct list_head);
-		if (!s->s_files) {
-			security_sb_free(s);
-			kfree(s);
-			s = NULL;
-			goto out;
-		} else {
+		if (!s->s_files)
+			goto err_out;
+		else {
 			int i;
 
 			for_each_possible_cpu(i)
@@ -135,6 +166,9 @@ static struct super_block *alloc_super(struct file_system_type *type)
 #else
 		INIT_LIST_HEAD(&s->s_files);
 #endif
+		if (init_sb_writers(s))
+			goto err_out;
+
 		s->s_bdi = &default_backing_dev_info;
 		INIT_HLIST_NODE(&s->s_instances);
 		INIT_HLIST_BL_HEAD(&s->s_anon);
@@ -187,6 +221,16 @@ static struct super_block *alloc_super(struct file_system_type *type)
 	}
 out:
 	return s;
+err_out:
+	security_sb_free(s);
+#ifdef CONFIG_SMP
+	if (s->s_files)
+		free_percpu(s->s_files);
+#endif
+	destroy_sb_writers(s);
+	kfree(s);
+	s = NULL;
+	goto out;
 }
 
 /**
@@ -200,6 +244,7 @@ static inline void destroy_super(struct super_block *s)
 #ifdef CONFIG_SMP
 	free_percpu(s->s_files);
 #endif
+	destroy_sb_writers(s);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -646,10 +691,11 @@ struct super_block *get_super_thawed(struct block_device *bdev)
 {
 	while (1) {
 		struct super_block *s = get_super(bdev);
-		if (!s || s->s_frozen == SB_UNFROZEN)
+		if (!s || s->s_writers.frozen == SB_UNFROZEN)
 			return s;
 		up_read(&s->s_umount);
-		vfs_check_frozen(s, SB_FREEZE_WRITE);
+		wait_event(s->s_writers.wait_unfrozen,
+			   s->s_writers.frozen == SB_UNFROZEN);
 		put_super(s);
 	}
 }
@@ -727,7 +773,7 @@ int do_remount_sb(struct super_block *sb, int flags, void *data, int force)
 	int retval;
 	int remount_ro;
 
-	if (sb->s_frozen != SB_UNFROZEN)
+	if (sb->s_writers.frozen != SB_UNFROZEN)
 		return -EBUSY;
 
 #ifdef CONFIG_BLOCK
@@ -1162,6 +1208,164 @@ out:
 	return ERR_PTR(error);
 }
 
+static void __sb_end_write(struct super_block *sb, int level)
+{
+	percpu_counter_dec(&sb->s_writers.counter[level-1]);
+	/*
+	 * Make sure s_writers are updated before we wake up waiters in
+	 * freeze_super().
+	 */
+	smp_mb();
+	if (waitqueue_active(&sb->s_writers.wait))
+		wake_up(&sb->s_writers.wait);
+	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _RET_IP_);
+}
+
+/**
+ * sb_end_write - drop write access to a superblock
+ * @sb: the super we wrote to
+ *
+ * Decrement number of writers to the filesystem. Wake up possible waiters
+ * wanting to freeze the filesystem.
+ */
+void sb_end_write(struct super_block *sb)
+{
+	__sb_end_write(sb, SB_FREEZE_WRITE);
+}
+EXPORT_SYMBOL(sb_end_write);
+
+/**
+ * sb_end_pagefault - drop write access to a superblock from a page fault
+ * @sb: the super we wrote to
+ *
+ * Decrement number of processes handling write page fault to the filesystem.
+ * Wake up possible waiters wanting to freeze the filesystem.
+ */
+void sb_end_pagefault(struct super_block *sb)
+{
+	__sb_end_write(sb, SB_FREEZE_PAGEFAULT);
+}
+EXPORT_SYMBOL(sb_end_pagefault);
+
+static int __sb_start_write(struct super_block *sb, int level, bool wait)
+{
+retry:
+	if (wait) {
+		wait_event(sb->s_writers.wait_unfrozen,
+			   sb->s_writers.frozen < level);
+	} else if (sb->s_writers.frozen >= level)
+		return 0;
+
+	rwsem_acquire_read(&sb->s_writers.lock_map[level-1], 0, !wait,
+			   _RET_IP_);
+	percpu_counter_inc(&sb->s_writers.counter[level-1]);
+	/*
+	 * Make sure counter is updated before we check for frozen.
+	 * freeze_super() first sets frozen and then checks the counter.
+	 */
+	smp_mb();
+	if (sb->s_writers.frozen >= level) {
+		__sb_end_write(sb, level);
+		goto retry;
+	}
+	return 1;
+}
+
+/**
+ * sb_start_write - get write access to a superblock
+ * @sb: the super we write to
+ *
+ * When a process wants to write data or metadata to a file system (i.e. dirty
+ * a page or an inode), it should embed the operation in a sb_start_write() -
+ * sb_end_write() pair to get exclusion against file system freezing. This
+ * function increments number of writers preventing freezing. If the file
+ * system is already frozen, the function waits until the file system is
+ * thawed.
+ *
+ * Since freeze protection behaves as a lock, users have to preserve
+ * ordering of freeze protection and other filesystem locks. Generally,
+ * freeze protection should be the outermost lock. In particular, we have:
+ *
+ * sb_start_write
+ *   -> i_mutex			(write path, truncate, directory ops, ...)
+ *   -> s_umount		(freeze_super, thaw_super)
+ */
+void sb_start_write(struct super_block *sb)
+{
+	__sb_start_write(sb, SB_FREEZE_WRITE, true);
+}
+EXPORT_SYMBOL(sb_start_write);
+
+int sb_start_write_trylock(struct super_block *sb)
+{
+	return __sb_start_write(sb, SB_FREEZE_WRITE, false);
+}
+EXPORT_SYMBOL(sb_start_write_trylock);
+
+/**
+ * sb_start_write - get write access to a superblock from a page fault
+ * @sb: the super we write to
+ *
+ * When a process starts handling write page fault, it should embed the
+ * operation into sb_start_pagefault() - sb_end_pagefault() pair to get
+ * exclusion against file system freezing. This is needed since the page fault
+ * is going to dirty a page. This function increments number of running page
+ * faults preventing freezing. If the file system is already frozen, the
+ * function waits until the file system is thawed.
+ *
+ * Since page fault freeze protection behaves as a lock, users have to preserve
+ * ordering of freeze protection and other filesystem locks. It is advised to
+ * put sb_start_pagefault() close to mmap_sem in lock ordering. Page fault
+ * handling code implies lock dependency:
+ *
+ * mmap_sem
+ *   -> sb_start_pagefault
+ */
+void sb_start_pagefault(struct super_block *sb)
+{
+	__sb_start_write(sb, SB_FREEZE_PAGEFAULT, true);
+}
+EXPORT_SYMBOL(sb_start_pagefault);
+
+/**
+ * sb_wait_write - wait until all writers to given file system finish
+ * @sb: the super for which we wait
+ * @level: type of writers we wait for (normal vs page fault)
+ *
+ * This function waits until there are no writers of given type to given file
+ * system. Caller of this function should make sure there can be no new writers
+ * of type @level before calling this function. Otherwise this function can
+ * livelock.
+ */
+static void sb_wait_write(struct super_block *sb, int level)
+{
+	s64 writers;
+
+	/*
+	 * We just cycle-through lockdep here so that it does not complain
+	 * about returning with lock to userspace
+	 */
+	rwsem_acquire(&sb->s_writers.lock_map[level-1], 0, 0, _THIS_IP_);
+	rwsem_release(&sb->s_writers.lock_map[level-1], 1, _THIS_IP_);
+
+	do {
+		DEFINE_WAIT(wait);
+
+		/*
+		 * We use a barrier in prepare_to_wait() to separate setting
+		 * of frozen and checking of the counter
+		 */
+		prepare_to_wait(&sb->s_writers.wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+
+		writers = percpu_counter_sum(&sb->s_writers.counter[level-1]);
+		if (writers)
+			schedule();
+
+		finish_wait(&sb->s_writers.wait, &wait);
+	} while (writers);
+}
+
 /**
  * freeze_super - lock the filesystem and force it into a consistent state
  * @sb: the super to lock
@@ -1176,7 +1380,7 @@ int freeze_super(struct super_block *sb)
 
 	atomic_inc(&sb->s_active);
 	down_write(&sb->s_umount);
-	if (sb->s_frozen) {
+	if (sb->s_writers.frozen != SB_UNFROZEN) {
 		deactivate_locked_super(sb);
 		return -EBUSY;
 	}
@@ -1187,33 +1391,48 @@ int freeze_super(struct super_block *sb)
 	}
 
 	if (sb->s_flags & MS_RDONLY) {
-		sb->s_frozen = SB_FREEZE_TRANS;
-		smp_wmb();
+		/* Nothing to do really... */
+		sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 		up_write(&sb->s_umount);
 		return 0;
 	}
 
-	sb->s_frozen = SB_FREEZE_WRITE;
+	/* From now on, no new normal writers can start */
+	sb->s_writers.frozen = SB_FREEZE_WRITE;
 	smp_wmb();
 
-	sync_filesystem(sb);
+	/* Release s_umount to preserve sb_start_write -> s_umount ordering */
+	up_write(&sb->s_umount);
+
+	sb_wait_write(sb, SB_FREEZE_WRITE);
 
-	sb->s_frozen = SB_FREEZE_TRANS;
+	/* Now we go and block page faults... */
+	down_write(&sb->s_umount);
+	sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
 	smp_wmb();
 
-	sync_blockdev(sb->s_bdev);
+	sb_wait_write(sb, SB_FREEZE_PAGEFAULT);
+
+	/* All writers are done so after syncing there won't be dirty data */
+	sync_filesystem(sb);
+
 	if (sb->s_op->freeze_fs) {
 		ret = sb->s_op->freeze_fs(sb);
 		if (ret) {
 			printk(KERN_ERR
 				"VFS:Filesystem freeze failed\n");
-			sb->s_frozen = SB_UNFROZEN;
+			sb->s_writers.frozen = SB_UNFROZEN;
 			smp_wmb();
-			wake_up(&sb->s_wait_unfrozen);
+			wake_up(&sb->s_writers.wait_unfrozen);
 			deactivate_locked_super(sb);
 			return ret;
 		}
 	}
+	/*
+	 * This is just for debugging purposes so that fs can warn if it
+	 * sees write activity when frozen is set to SB_FREEZE_COMPLETE.
+	 */
+	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 	up_write(&sb->s_umount);
 	return 0;
 }
@@ -1230,7 +1449,7 @@ int thaw_super(struct super_block *sb)
 	int error;
 
 	down_write(&sb->s_umount);
-	if (sb->s_frozen == SB_UNFROZEN) {
+	if (sb->s_writers.frozen == SB_UNFROZEN) {
 		up_write(&sb->s_umount);
 		return -EINVAL;
 	}
@@ -1243,16 +1462,15 @@ int thaw_super(struct super_block *sb)
 		if (error) {
 			printk(KERN_ERR
 				"VFS:Filesystem thaw failed\n");
-			sb->s_frozen = SB_FREEZE_TRANS;
 			up_write(&sb->s_umount);
 			return error;
 		}
 	}
 
 out:
-	sb->s_frozen = SB_UNFROZEN;
+	sb->s_writers.frozen = SB_UNFROZEN;
 	smp_wmb();
-	wake_up(&sb->s_wait_unfrozen);
+	wake_up(&sb->s_writers.wait_unfrozen);
 	deactivate_locked_super(sb);
 
 	return 0;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..37667f3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -397,6 +397,7 @@ struct inodes_stat_t {
 #include <linux/atomic.h>
 #include <linux/shrinker.h>
 #include <linux/migrate_mode.h>
+#include <linux/lockdep.h>
 
 #include <asm/byteorder.h>
 
@@ -1392,6 +1393,14 @@ extern void f_delown(struct file *filp);
 extern pid_t f_getown(struct file *filp);
 extern int send_sigurg(struct fown_struct *fown);
 
+struct mm_struct;
+
+void sb_start_write(struct super_block *sb);
+int sb_start_write_trylock(struct super_block *sb);
+void sb_end_write(struct super_block *sb);
+void sb_start_pagefault(struct super_block *sb);
+void sb_end_pagefault(struct super_block *sb);
+
 /*
  *	Umount options
  */
@@ -1405,6 +1414,28 @@ extern int send_sigurg(struct fown_struct *fown);
 extern struct list_head super_blocks;
 extern spinlock_t sb_lock;
 
+/* Possible states of 'frozen' field */
+enum {
+	SB_UNFROZEN = 0,		/* FS is unfrozen */
+	SB_FREEZE_WRITE	= 1,		/* Writes, dir ops, ioctls frozen */
+	SB_FREEZE_TRANS = 2,
+	SB_FREEZE_PAGEFAULT = 2,	/* Page faults stopped as well */
+	SB_FREEZE_COMPLETE = 3,		/* ->freeze_fs finished successfully */
+};
+
+struct sb_writers {
+	struct percpu_counter	counter[2];	/* counter of running writes
+						 * and page faults  */
+	wait_queue_head_t	wait;		/* queue for waiting for
+						   writers / faults to finish */
+	int			frozen;		/* Is sb frozen? */
+	wait_queue_head_t	wait_unfrozen;	/* queue for waiting for
+						   sb to be thawed */
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	lock_map[2];
+#endif
+};
+
 struct super_block {
 	struct list_head	s_list;		/* Keep this first */
 	dev_t			s_dev;		/* search index; _not_ kdev_t */
@@ -1454,6 +1485,7 @@ struct super_block {
 
 	int			s_frozen;
 	wait_queue_head_t	s_wait_unfrozen;
+	struct sb_writers	s_writers;
 
 	char s_id[32];				/* Informational name */
 	u8 s_uuid[16];				/* UUID */
@@ -1507,14 +1539,8 @@ extern struct timespec current_fs_time(struct super_block *sb);
 /*
  * Snapshotting support.
  */
-enum {
-	SB_UNFROZEN = 0,
-	SB_FREEZE_WRITE	= 1,
-	SB_FREEZE_TRANS = 2,
-};
-
-#define vfs_check_frozen(sb, level) \
-	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))
+/* Will go away when all users are converted */
+#define vfs_check_frozen(sb, level) do { } while (0)
 
 /*
  * until VFS tracks user namespaces for inodes, just make all files
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 07/19] fs: Add freezing handling to mnt_want_write() / mnt_drop_write()
  2012-03-05 16:00 ` Jan Kara
                   ` (9 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

Most of places where we want freeze protection coincides with the places where
we also have remount-ro protection. So make mnt_want_write() and
mnt_drop_write() (and their _file alternative) prevent freezing as well.
For the few cases that are really interested only in remount-ro protection
provide new function variants.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/file_table.c       |    2 +-
 fs/inode.c            |    4 +-
 fs/namei.c            |   16 +++++++-
 fs/namespace.c        |  101 +++++++++++++++++++++++++++++++++++++++----------
 fs/open.c             |    2 +-
 include/linux/mount.h |    4 ++
 6 files changed, 103 insertions(+), 26 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 20002e3..31ab06f 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -216,7 +216,7 @@ void drop_file_write_access(struct file *file)
 		return;
 	if (file_check_writeable(file) != 0)
 		return;
-	mnt_drop_write(mnt);
+	__mnt_drop_write(mnt);
 	file_release_write(file);
 }
 EXPORT_SYMBOL_GPL(drop_file_write_access);
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..095828c 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1580,7 +1580,7 @@ void file_update_time(struct file *file)
 		return;
 
 	/* Finally allowed to write? Takes lock. */
-	if (mnt_want_write_file(file))
+	if (__mnt_want_write_file(file))
 		return;
 
 	/* Only change inode inside the lock region */
@@ -1591,7 +1591,7 @@ void file_update_time(struct file *file)
 	if (sync_it & S_MTIME)
 		inode->i_mtime = now;
 	mark_inode_dirty_sync(inode);
-	mnt_drop_write_file(file);
+	__mnt_drop_write_file(file);
 }
 EXPORT_SYMBOL(file_update_time);
 
diff --git a/fs/namei.c b/fs/namei.c
index 575d3a4..474b6b2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2166,12 +2166,21 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	if (nd->last.name[nd->last.len])
 		goto exit;
 
+	/*
+	 * We don't know whether write access will be needed or not. But we
+	 * must get the protection before getting i_mutex due to locking
+	 * constraints. Thus O_CREAT open will get blocked on frozen filesystem
+	 * even if it didn't need to create anything. Not that surprising...
+	 */
+	sb_start_write(dir->d_sb);
+
 	mutex_lock(&dir->d_inode->i_mutex);
 
 	dentry = lookup_hash(nd);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry)) {
 		mutex_unlock(&dir->d_inode->i_mutex);
+		sb_end_write(dir->d_sb);
 		goto exit;
 	}
 
@@ -2190,9 +2199,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 		 * a permanent write count is taken through
 		 * the 'struct file' in nameidata_to_filp().
 		 */
-		error = mnt_want_write(nd->path.mnt);
-		if (error)
+		error = __mnt_want_write(nd->path.mnt);
+		if (error) {
+			sb_end_write(dir->d_sb);
 			goto exit_mutex_unlock;
+		}
 		want_write = 1;
 		/* Don't check for write permission, don't truncate */
 		open_flag &= ~O_TRUNC;
@@ -2214,6 +2225,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	 * It already exists.
 	 */
 	mutex_unlock(&dir->d_inode->i_mutex);
+	sb_end_write(dir->d_sb);
 	audit_inode(pathname, path->dentry);
 
 	error = -EEXIST;
diff --git a/fs/namespace.c b/fs/namespace.c
index e608199..157afbe 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -283,24 +283,22 @@ static int mnt_is_readonly(struct vfsmount *mnt)
 }
 
 /*
- * Most r/o checks on a fs are for operations that take
- * discrete amounts of time, like a write() or unlink().
- * We must keep track of when those operations start
- * (for permission checks) and when they end, so that
- * we can determine when writes are able to occur to
- * a filesystem.
+ * Most r/o & frozen checks on a fs are for operations that take discrete
+ * amounts of time, like a write() or unlink().  We must keep track of when
+ * those operations start (for permission checks) and when they end, so that we
+ * can determine when writes are able to occur to a filesystem.
  */
 /**
- * mnt_want_write - get write access to a mount
+ * __mnt_want_write - get write access to a mount without freeze protection
  * @m: the mount on which to take a write
  *
- * This tells the low-level filesystem that a write is
- * about to be performed to it, and makes sure that
- * writes are allowed before returning success.  When
- * the write operation is finished, mnt_drop_write()
- * must be called.  This is effectively a refcount.
+ * This tells the low-level filesystem that a write is about to be performed to
+ * it, and makes sure that writes are allowed (mnt it read-write) before
+ * returning success. This operation does not protect against filesystem being
+ * frozen. When the write operation is finished, __mnt_drop_write() must be
+ * called. This is effectively a refcount.
  */
-int mnt_want_write(struct vfsmount *m)
+int __mnt_want_write(struct vfsmount *m)
 {
 	struct mount *mnt = real_mount(m);
 	int ret = 0;
@@ -326,6 +324,28 @@ int mnt_want_write(struct vfsmount *m)
 		ret = -EROFS;
 	}
 	preempt_enable();
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__mnt_want_write);
+
+/**
+ * mnt_want_write - get write access to a mount
+ * @m: the mount on which to take a write
+ *
+ * This tells the low-level filesystem that a write is about to be performed to
+ * it, and makes sure that writes are allowed (mount is read-write, filesystem
+ * is not frozen) before returning success.  When the write operation is
+ * finished, mnt_drop_write() must be called.  This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *m)
+{
+	int ret;
+
+	sb_start_write(m->mnt_sb);
+	ret = __mnt_want_write(m);
+	if (ret)
+		sb_end_write(m->mnt_sb);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mnt_want_write);
@@ -355,38 +375,79 @@ int mnt_clone_write(struct vfsmount *mnt)
 EXPORT_SYMBOL_GPL(mnt_clone_write);
 
 /**
- * mnt_want_write_file - get write access to a file's mount
+ * __mnt_want_write_file - get write access to a file's mount
  * @file: the file who's mount on which to take a write
  *
- * This is like mnt_want_write, but it takes a file and can
+ * This is like __mnt_want_write, but it takes a file and can
  * do some optimisations if the file is open for write already
  */
-int mnt_want_write_file(struct file *file)
+int __mnt_want_write_file(struct file *file)
 {
 	struct inode *inode = file->f_dentry->d_inode;
+
 	if (!(file->f_mode & FMODE_WRITE) || special_file(inode->i_mode))
-		return mnt_want_write(file->f_path.mnt);
+		return __mnt_want_write(file->f_path.mnt);
 	else
 		return mnt_clone_write(file->f_path.mnt);
 }
+EXPORT_SYMBOL_GPL(__mnt_want_write_file);
+
+/**
+ * mnt_want_write_file - get write access to a file's mount
+ * @file: the file who's mount on which to take a write
+ *
+ * This is like mnt_want_write, but it takes a file and can
+ * do some optimisations if the file is open for write already
+ */
+int mnt_want_write_file(struct file *file)
+{
+	int ret;
+
+	sb_start_write(file->f_path.mnt->mnt_sb);
+	ret = __mnt_want_write_file(file);
+	if (ret)
+		sb_end_write(file->f_path.mnt->mnt_sb);
+	return ret;
+}
 EXPORT_SYMBOL_GPL(mnt_want_write_file);
 
 /**
- * mnt_drop_write - give up write access to a mount
+ * __mnt_drop_write - give up write access to a mount
  * @mnt: the mount on which to give up write access
  *
  * Tells the low-level filesystem that we are done
  * performing writes to it.  Must be matched with
- * mnt_want_write() call above.
+ * __mnt_want_write() call above.
  */
-void mnt_drop_write(struct vfsmount *mnt)
+void __mnt_drop_write(struct vfsmount *mnt)
 {
 	preempt_disable();
 	mnt_dec_writers(real_mount(mnt));
 	preempt_enable();
 }
+EXPORT_SYMBOL_GPL(__mnt_drop_write);
+
+/**
+ * mnt_drop_write - give up write access to a mount
+ * @mnt: the mount on which to give up write access
+ *
+ * Tells the low-level filesystem that we are done performing writes to it and
+ * also allows filesystem to be frozen again.  Must be matched with
+ * mnt_want_write() call above.
+ */
+void mnt_drop_write(struct vfsmount *mnt)
+{
+	__mnt_drop_write(mnt);
+	sb_end_write(mnt->mnt_sb);
+}
 EXPORT_SYMBOL_GPL(mnt_drop_write);
 
+void __mnt_drop_write_file(struct file *file)
+{
+	__mnt_drop_write(file->f_path.mnt);
+}
+EXPORT_SYMBOL(__mnt_drop_write_file);
+
 void mnt_drop_write_file(struct file *file)
 {
 	mnt_drop_write(file->f_path.mnt);
diff --git a/fs/open.c b/fs/open.c
index 77becc0..456e415 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -637,7 +637,7 @@ static inline int __get_file_write_access(struct inode *inode,
 		/*
 		 * Balanced in __fput()
 		 */
-		error = mnt_want_write(mnt);
+		error = __mnt_want_write(mnt);
 		if (error)
 			put_write_access(inode);
 	}
diff --git a/include/linux/mount.h b/include/linux/mount.h
index d7029f4..d6ffc3d 100644
--- a/include/linux/mount.h
+++ b/include/linux/mount.h
@@ -55,10 +55,14 @@ struct vfsmount {
 
 struct file; /* forward dec */
 
+extern int __mnt_want_write(struct vfsmount *mnt);
 extern int mnt_want_write(struct vfsmount *mnt);
+extern int __mnt_want_write_file(struct file *file);
 extern int mnt_want_write_file(struct file *file);
 extern int mnt_clone_write(struct vfsmount *mnt);
+extern void __mnt_drop_write(struct vfsmount *mnt);
 extern void mnt_drop_write(struct vfsmount *mnt);
+extern void __mnt_drop_write_file(struct file *file);
 extern void mnt_drop_write_file(struct file *file);
 extern void mntput(struct vfsmount *mnt);
 extern struct vfsmount *mntget(struct vfsmount *mnt);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 08/19] fs: Skip atime update on frozen filesystem
  2012-03-05 16:00 ` Jan Kara
                   ` (10 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

It is unexpected to block reading of frozen filesystem because of atime update.
Also handling blocking on frozen filesystem because of atime update would make
locking more complex than it already is. So just skip atime update when
filesystem is frozen like we skip it when filesystem is remounted read-only.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/inode.c |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 095828c..ff3b385 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1535,12 +1535,17 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
 	if (timespec_equal(&inode->i_atime, &now))
 		return;
 
-	if (mnt_want_write(mnt))
+	if (!sb_start_write_trylock(inode->i_sb))
 		return;
 
+	if (__mnt_want_write(mnt))
+		goto skip_update;
+
 	inode->i_atime = now;
 	mark_inode_dirty_sync(inode);
-	mnt_drop_write(mnt);
+	__mnt_drop_write(mnt);
+skip_update:
+	sb_end_write(inode->i_sb);
 }
 EXPORT_SYMBOL(touch_atime);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 09/19] fs: Protect write paths by sb_start_write - sb_end_write
  2012-03-05 16:00 ` Jan Kara
                   ` (11 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

There are several entry points which dirty pages in a filesystem.  mmap
(handled by block_page_mkwrite()), buffered write (handled by
__generic_file_aio_write()), splice write (generic_file_splice_write),
truncate, and fallocate (these can dirty last partial page - handled inside
each filesystem separately). Protect these places with sb_start_write() and
sb_end_write().

->page_mkwrite() calls are particularly complex since they are called with
mmap_sem held and thus we cannot use standard sb_start_write() due to lock
ordering constraints. We solve the problem by using a special freeze protection
sb_start_pagefault() which ranks below mmap_sem.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/buffer.c      |   22 ++++------------------
 fs/open.c        |    7 ++++++-
 fs/splice.c      |    3 +++
 mm/filemap.c     |   12 ++++++++++--
 mm/filemap_xip.c |    5 +++--
 5 files changed, 26 insertions(+), 23 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 5294a33..89ed4af 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2288,8 +2288,8 @@ EXPORT_SYMBOL(block_commit_write);
  * beyond EOF, then the page is guaranteed safe against truncation until we
  * unlock the page.
  *
- * Direct callers of this function should call vfs_check_frozen() so that page
- * fault does not busyloop until the fs is thawed.
+ * Direct callers of this function should protect against filesystem freezing
+ * using sb_start_write() - sb_end_write() functions.
  */
 int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 			 get_block_t get_block)
@@ -2327,18 +2327,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 
 	if (unlikely(ret < 0))
 		goto out_unlock;
-	/*
-	 * Freezing in progress? We check after the page is marked dirty and
-	 * with page lock held so if the test here fails, we are sure freezing
-	 * code will wait during syncing until the page fault is done - at that
-	 * point page will be dirty and unlocked so freezing code will write it
-	 * and writeprotect it again.
-	 */
 	set_page_dirty(page);
-	if (inode->i_sb->s_frozen != SB_UNFROZEN) {
-		ret = -EAGAIN;
-		goto out_unlock;
-	}
 	wait_on_page_writeback(page);
 	return 0;
 out_unlock:
@@ -2353,12 +2342,9 @@ int block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	int ret;
 	struct super_block *sb = vma->vm_file->f_path.dentry->d_inode->i_sb;
 
-	/*
-	 * This check is racy but catches the common case. The check in
-	 * __block_page_mkwrite() is reliable.
-	 */
-	vfs_check_frozen(sb, SB_FREEZE_WRITE);
+	sb_start_pagefault(sb);
 	ret = __block_page_mkwrite(vma, vmf, get_block);
+	sb_end_pagefault(sb);
 	return block_page_mkwrite_return(ret);
 }
 EXPORT_SYMBOL(block_page_mkwrite);
diff --git a/fs/open.c b/fs/open.c
index 456e415..444ffbb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -164,11 +164,13 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
 	if (IS_APPEND(inode))
 		goto out_putf;
 
+	sb_start_write(inode->i_sb);
 	error = locks_verify_truncate(inode, file, length);
 	if (!error)
 		error = security_path_truncate(&file->f_path);
 	if (!error)
 		error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
+	sb_end_write(inode->i_sb);
 out_putf:
 	fput(file);
 out:
@@ -266,7 +268,10 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 	if (!file->f_op->fallocate)
 		return -EOPNOTSUPP;
 
-	return file->f_op->fallocate(file, mode, offset, len);
+	sb_start_write(inode->i_sb);
+	ret = file->f_op->fallocate(file, mode, offset, len);
+	sb_end_write(inode->i_sb);
+	return ret;
 }
 
 SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
diff --git a/fs/splice.c b/fs/splice.c
index 1ec0493..401297d 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -992,6 +992,8 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 	};
 	ssize_t ret;
 
+	sb_start_write(inode->i_sb);
+
 	pipe_lock(pipe);
 
 	splice_from_pipe_begin(&sd);
@@ -1028,6 +1030,7 @@ generic_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 			*ppos += ret;
 		balance_dirty_pages_ratelimited_nr(mapping, nr_pages);
 	}
+	sb_end_write(inode->i_sb);
 
 	return ret;
 }
diff --git a/mm/filemap.c b/mm/filemap.c
index b865c0b..8ff3111 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1765,6 +1765,7 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
 	int ret = VM_FAULT_LOCKED;
 
+	sb_start_pagefault(inode->i_sb);
 	file_update_time(vma->vm_file);
 	lock_page(page);
 	if ((page->mapping != inode->i_mapping) ||
@@ -1773,7 +1774,14 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		ret = VM_FAULT_NOPAGE;
 		goto out;
 	}
+	/*
+	 * We mark the page dirty already here so that when freeze is in
+	 * progress, we are guaranteed that writeback during freezing will
+	 * see the dirty page and writeprotect it again.
+	 */
+	set_page_dirty(page);
 out:
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 EXPORT_SYMBOL(filemap_page_mkwrite);
@@ -2537,8 +2545,6 @@ ssize_t __generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	count = ocount;
 	pos = *ppos;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
-
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
 	written = 0;
@@ -2635,6 +2641,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
 	blk_start_plug(&plug);
 	ret = __generic_file_aio_write(iocb, iov, nr_segs, &iocb->ki_pos);
@@ -2648,6 +2655,7 @@ ssize_t generic_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 			ret = err;
 	}
 	blk_finish_plug(&plug);
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 EXPORT_SYMBOL(generic_file_aio_write);
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
index 591dba6..b051f0d 100644
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -402,6 +402,8 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
 	loff_t pos;
 	ssize_t ret;
 
+	sb_start_write(inode->i_sb);
+
 	mutex_lock(&inode->i_mutex);
 
 	if (!access_ok(VERIFY_READ, buf, len)) {
@@ -412,8 +414,6 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
 	pos = *ppos;
 	count = len;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
-
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
 
@@ -435,6 +435,7 @@ xip_file_write(struct file *filp, const char __user *buf, size_t len,
 	current->backing_dev_info = NULL;
  out_up:
 	mutex_unlock(&inode->i_mutex);
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(xip_file_write);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 10/19] ext4: Convert to new freezing mechanism
  2012-03-05 16:00 ` Jan Kara
                   ` (12 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  2012-03-07 22:32   ` Kamal Mostafa
  -1 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, linux-ext4, Theodore Ts'o

We remove most of frozen checks since upper layer takes care
of blocking all writes. We only have to handle protection in
ext4_page_mkwrite() in a special way because we cannot use
generic block_page_mkwrite().

CC: linux-ext4@vger.kernel.org
CC: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |    7 ++-----
 fs/ext4/super.c |   29 +++++------------------------
 2 files changed, 7 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index feaa82f..c65baf9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4593,11 +4593,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	get_block_t *get_block;
 	int retries = 0;
 
-	/*
-	 * This check is racy but catches the common case. We rely on
-	 * __block_page_mkwrite() to do a reliable check.
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_pagefault(inode->i_sb);
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&
@@ -4665,5 +4661,6 @@ retry_alloc:
 out_ret:
 	ret = block_page_mkwrite_return(ret);
 out:
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 502c61f..0f1024a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -289,33 +289,17 @@ static void ext4_put_nojournal(handle_t *handle)
  * journal_end calls result in the superblock being marked dirty, so
  * that sync() will call the filesystem's write_super callback if
  * appropriate.
- *
- * To avoid j_barrier hold in userspace when a user calls freeze(),
- * ext4 prevents a new handle from being started by s_frozen, which
- * is in an upper layer.
  */
 handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
 {
 	journal_t *journal;
-	handle_t  *handle;
 
 	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
 	if (sb->s_flags & MS_RDONLY)
 		return ERR_PTR(-EROFS);
 
+	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
 	journal = EXT4_SB(sb)->s_journal;
-	handle = ext4_journal_current_handle();
-
-	/*
-	 * If a handle has been started, it should be allowed to
-	 * finish, otherwise deadlock could happen between freeze
-	 * and others(e.g. truncate) due to the restart of the
-	 * journal handle if the filesystem is forzen and active
-	 * handles are not stopped.
-	 */
-	if (!handle)
-		vfs_check_frozen(sb, SB_FREEZE_TRANS);
-
 	if (!journal)
 		return ext4_get_nojournal();
 	/*
@@ -4280,10 +4264,8 @@ int ext4_force_commit(struct super_block *sb)
 		return 0;
 
 	journal = EXT4_SB(sb)->s_journal;
-	if (journal) {
-		vfs_check_frozen(sb, SB_FREEZE_TRANS);
+	if (journal)
 		ret = ext4_journal_force_commit(journal);
-	}
 
 	return ret;
 }
@@ -4315,9 +4297,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
  * gives us a chance to flush the journal completely and mark the fs clean.
  *
  * Note that only this function cannot bring a filesystem to be in a clean
- * state independently, because ext4 prevents a new handle from being started
- * by @sb->s_frozen, which stays in an upper layer.  It thus needs help from
- * the upper layer.
+ * state independently. It relies on upper layer to stop all data & metadata
+ * modifications.
  */
 static int ext4_freeze(struct super_block *sb)
 {
@@ -4344,7 +4325,7 @@ static int ext4_freeze(struct super_block *sb)
 	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
 	error = ext4_commit_super(sb, 1);
 out:
-	/* we rely on s_frozen to stop further updates */
+	/* we rely on upper layer to stop further updates */
 	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
 	return error;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-05 16:00 ` Jan Kara
@ 2012-03-05 16:01   ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, Ben Myers, Alex Elder, xfs

Generic code now blocks all writers from standard write paths. So we block all
writers coming from ioctl and replace blocking of transactions on frozen
filesystem with a debugging check. As a bonus, we get a protection of ioctl
against racing remount read-only. We also convert xfs_file_aio_write() to a
non-racy freeze protection.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c    |   10 ++++++--
 fs/xfs/xfs_ioctl.c   |   55 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_ioctl32.c |   12 ++++++++++
 fs/xfs/xfs_iomap.c   |    1 -
 fs/xfs/xfs_mount.c   |    2 +-
 fs/xfs/xfs_mount.h   |    3 --
 fs/xfs/xfs_sync.c    |    2 +-
 fs/xfs/xfs_trans.c   |    2 +-
 8 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7e5bc87..57dd20e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -874,10 +874,12 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		ret = -EIO;
+		goto out;
+	}
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
@@ -896,6 +898,8 @@ xfs_file_aio_write(
 			ret = err;
 	}
 
+out:
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 76f3ca5..7890105 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -367,9 +367,15 @@ xfs_fssetdm_by_handle(
 	if (copy_from_user(&dmhreq, arg, sizeof(xfs_fsop_setdm_handlereq_t)))
 		return -XFS_ERROR(EFAULT);
 
+	error = mnt_want_write_file(parfilp);
+	if (error)
+		return error;
+
 	dentry = xfs_handlereq_to_dentry(parfilp, &dmhreq.hreq);
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry)) {
+		mnt_drop_write_file(parfilp);
 		return PTR_ERR(dentry);
+	}
 
 	if (IS_IMMUTABLE(dentry->d_inode) || IS_APPEND(dentry->d_inode)) {
 		error = -XFS_ERROR(EPERM);
@@ -385,6 +391,7 @@ xfs_fssetdm_by_handle(
 				 fsd.fsd_dmstate);
 
  out:
+	mnt_drop_write_file(parfilp);
 	dput(dentry);
 	return error;
 }
@@ -631,7 +638,11 @@ xfs_ioc_space(
 	if (ioflags & IO_INVIS)
 		attr_flags |= XFS_ATTR_DMI;
 
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
 	error = xfs_change_file_space(ip, cmd, bf, filp->f_pos, attr_flags);
+	mnt_drop_write_file(filp);
 	return -error;
 }
 
@@ -1160,6 +1171,7 @@ xfs_ioc_fssetxattr(
 {
 	struct fsxattr		fa;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&fa, arg, sizeof(fa)))
 		return -EFAULT;
@@ -1168,7 +1180,12 @@ xfs_ioc_fssetxattr(
 	if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
 		mask |= FSX_NONBLOCK;
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1193,6 +1210,7 @@ xfs_ioc_setxflags(
 	struct fsxattr		fa;
 	unsigned int		flags;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&flags, arg, sizeof(flags)))
 		return -EFAULT;
@@ -1207,7 +1225,12 @@ xfs_ioc_setxflags(
 		mask |= FSX_NONBLOCK;
 	fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1382,8 +1405,13 @@ xfs_file_ioctl(
 		if (copy_from_user(&dmi, arg, sizeof(dmi)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask,
 				dmi.fsd_dmstate);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1431,7 +1459,11 @@ xfs_file_ioctl(
 
 		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1460,9 +1492,14 @@ xfs_file_ioctl(
 		if (copy_from_user(&inout, arg, sizeof(inout)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		/* input parameter is passed in resblks field of structure */
 		in = inout.resblks;
 		error = xfs_reserve_blocks(mp, &in, &inout);
+		mnt_drop_write_file(filp);
 		if (error)
 			return -error;
 
@@ -1493,7 +1530,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1503,7 +1544,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_log(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1513,7 +1558,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index f9ccb7b..2012369 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -602,7 +602,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_data_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSGROWFSRT_32: {
@@ -610,7 +614,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_rt_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 #endif
@@ -629,7 +637,11 @@ xfs_file_compat_ioctl(
 				   offsetof(struct xfs_swapext, sx_stat)) ||
 		    xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSBULKSTAT_32:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..7ab98d9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,7 +679,6 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
 		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
 		tp->t_flags |= XFS_TRANS_RESERVE;
 		error = xfs_trans_reserve(tp, resblks,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..bead529 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1541,7 +1541,7 @@ xfs_unmountfs(
 int
 xfs_fs_writable(xfs_mount_t *mp)
 {
-	return !(xfs_test_for_freeze(mp) || XFS_FORCED_SHUTDOWN(mp) ||
+	return !(mp->m_super->s_writers.frozen || XFS_FORCED_SHUTDOWN(mp) ||
 		(mp->m_flags & XFS_MOUNT_RDONLY));
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 19f69e2..f0afeb9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -310,9 +310,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_REMOTE_REQ	0x0010	/* shutdown came from remote cell */
 #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
-#define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l)	vfs_check_frozen((mp)->m_super, (l))
-
 /*
  * Flags for xfs_mountfs
  */
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 40b75ee..c44d687 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -498,7 +498,7 @@ xfs_sync_worker(
 
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
-		if (mp->m_super->s_frozen == SB_UNFROZEN &&
+		if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
 			error = xfs_fs_log_dummy(mp);
 		else
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 329b06a..6468a2a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,7 +577,6 @@ xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type)
 {
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
 	return _xfs_trans_alloc(mp, type, KM_SLEEP);
 }
 
@@ -589,6 +588,7 @@ _xfs_trans_alloc(
 {
 	xfs_trans_t	*tp;
 
+	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

Generic code now blocks all writers from standard write paths. So we block all
writers coming from ioctl and replace blocking of transactions on frozen
filesystem with a debugging check. As a bonus, we get a protection of ioctl
against racing remount read-only. We also convert xfs_file_aio_write() to a
non-racy freeze protection.

CC: Ben Myers <bpm@sgi.com>
CC: Alex Elder <elder@kernel.org>
CC: xfs@oss.sgi.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/xfs/xfs_file.c    |   10 ++++++--
 fs/xfs/xfs_ioctl.c   |   55 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_ioctl32.c |   12 ++++++++++
 fs/xfs/xfs_iomap.c   |    1 -
 fs/xfs/xfs_mount.c   |    2 +-
 fs/xfs/xfs_mount.h   |    3 --
 fs/xfs/xfs_sync.c    |    2 +-
 fs/xfs/xfs_trans.c   |    2 +-
 8 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7e5bc87..57dd20e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -874,10 +874,12 @@ xfs_file_aio_write(
 	if (ocount == 0)
 		return 0;
 
-	xfs_wait_for_freeze(ip->i_mount, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
-		return -EIO;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		ret = -EIO;
+		goto out;
+	}
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos, ocount);
@@ -896,6 +898,8 @@ xfs_file_aio_write(
 			ret = err;
 	}
 
+out:
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 76f3ca5..7890105 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -367,9 +367,15 @@ xfs_fssetdm_by_handle(
 	if (copy_from_user(&dmhreq, arg, sizeof(xfs_fsop_setdm_handlereq_t)))
 		return -XFS_ERROR(EFAULT);
 
+	error = mnt_want_write_file(parfilp);
+	if (error)
+		return error;
+
 	dentry = xfs_handlereq_to_dentry(parfilp, &dmhreq.hreq);
-	if (IS_ERR(dentry))
+	if (IS_ERR(dentry)) {
+		mnt_drop_write_file(parfilp);
 		return PTR_ERR(dentry);
+	}
 
 	if (IS_IMMUTABLE(dentry->d_inode) || IS_APPEND(dentry->d_inode)) {
 		error = -XFS_ERROR(EPERM);
@@ -385,6 +391,7 @@ xfs_fssetdm_by_handle(
 				 fsd.fsd_dmstate);
 
  out:
+	mnt_drop_write_file(parfilp);
 	dput(dentry);
 	return error;
 }
@@ -631,7 +638,11 @@ xfs_ioc_space(
 	if (ioflags & IO_INVIS)
 		attr_flags |= XFS_ATTR_DMI;
 
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
 	error = xfs_change_file_space(ip, cmd, bf, filp->f_pos, attr_flags);
+	mnt_drop_write_file(filp);
 	return -error;
 }
 
@@ -1160,6 +1171,7 @@ xfs_ioc_fssetxattr(
 {
 	struct fsxattr		fa;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&fa, arg, sizeof(fa)))
 		return -EFAULT;
@@ -1168,7 +1180,12 @@ xfs_ioc_fssetxattr(
 	if (filp->f_flags & (O_NDELAY|O_NONBLOCK))
 		mask |= FSX_NONBLOCK;
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1193,6 +1210,7 @@ xfs_ioc_setxflags(
 	struct fsxattr		fa;
 	unsigned int		flags;
 	unsigned int		mask;
+	int error;
 
 	if (copy_from_user(&flags, arg, sizeof(flags)))
 		return -EFAULT;
@@ -1207,7 +1225,12 @@ xfs_ioc_setxflags(
 		mask |= FSX_NONBLOCK;
 	fa.fsx_xflags = xfs_merge_ioc_xflags(flags, xfs_ip2xflags(ip));
 
-	return -xfs_ioctl_setattr(ip, &fa, mask);
+	error = mnt_want_write_file(filp);
+	if (error)
+		return error;
+	error = xfs_ioctl_setattr(ip, &fa, mask);
+	mnt_drop_write_file(filp);
+	return -error;
 }
 
 STATIC int
@@ -1382,8 +1405,13 @@ xfs_file_ioctl(
 		if (copy_from_user(&dmi, arg, sizeof(dmi)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		error = xfs_set_dmattrs(ip, dmi.fsd_dmevmask,
 				dmi.fsd_dmstate);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1431,7 +1459,11 @@ xfs_file_ioctl(
 
 		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1460,9 +1492,14 @@ xfs_file_ioctl(
 		if (copy_from_user(&inout, arg, sizeof(inout)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
+
 		/* input parameter is passed in resblks field of structure */
 		in = inout.resblks;
 		error = xfs_reserve_blocks(mp, &in, &inout);
+		mnt_drop_write_file(filp);
 		if (error)
 			return -error;
 
@@ -1493,7 +1530,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1503,7 +1544,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_log(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
@@ -1513,7 +1558,11 @@ xfs_file_ioctl(
 		if (copy_from_user(&in, arg, sizeof(in)))
 			return -XFS_ERROR(EFAULT);
 
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index f9ccb7b..2012369 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -602,7 +602,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_data_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_data(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSGROWFSRT_32: {
@@ -610,7 +614,11 @@ xfs_file_compat_ioctl(
 
 		if (xfs_compat_growfs_rt_copyin(&in, arg))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_growfs_rt(mp, &in);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 #endif
@@ -629,7 +637,11 @@ xfs_file_compat_ioctl(
 				   offsetof(struct xfs_swapext, sx_stat)) ||
 		    xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
 			return -XFS_ERROR(EFAULT);
+		error = mnt_want_write_file(filp);
+		if (error)
+			return error;
 		error = xfs_swapext(&sxp);
+		mnt_drop_write_file(filp);
 		return -error;
 	}
 	case XFS_IOC_FSBULKSTAT_32:
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..7ab98d9 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -679,7 +679,6 @@ xfs_iomap_write_unwritten(
 		 * the same inode that we complete here and might deadlock
 		 * on the iolock.
 		 */
-		xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
 		tp = _xfs_trans_alloc(mp, XFS_TRANS_STRAT_WRITE, KM_NOFS);
 		tp->t_flags |= XFS_TRANS_RESERVE;
 		error = xfs_trans_reserve(tp, resblks,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index d06afbc..bead529 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1541,7 +1541,7 @@ xfs_unmountfs(
 int
 xfs_fs_writable(xfs_mount_t *mp)
 {
-	return !(xfs_test_for_freeze(mp) || XFS_FORCED_SHUTDOWN(mp) ||
+	return !(mp->m_super->s_writers.frozen || XFS_FORCED_SHUTDOWN(mp) ||
 		(mp->m_flags & XFS_MOUNT_RDONLY));
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 19f69e2..f0afeb9 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -310,9 +310,6 @@ void xfs_do_force_shutdown(struct xfs_mount *mp, int flags, char *fname,
 #define SHUTDOWN_REMOTE_REQ	0x0010	/* shutdown came from remote cell */
 #define SHUTDOWN_DEVICE_REQ	0x0020	/* failed all paths to the device */
 
-#define xfs_test_for_freeze(mp)		((mp)->m_super->s_frozen)
-#define xfs_wait_for_freeze(mp,l)	vfs_check_frozen((mp)->m_super, (l))
-
 /*
  * Flags for xfs_mountfs
  */
diff --git a/fs/xfs/xfs_sync.c b/fs/xfs/xfs_sync.c
index 40b75ee..c44d687 100644
--- a/fs/xfs/xfs_sync.c
+++ b/fs/xfs/xfs_sync.c
@@ -498,7 +498,7 @@ xfs_sync_worker(
 
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		/* dgc: errors ignored here */
-		if (mp->m_super->s_frozen == SB_UNFROZEN &&
+		if (mp->m_super->s_writers.frozen == SB_UNFROZEN &&
 		    xfs_log_need_covered(mp))
 			error = xfs_fs_log_dummy(mp);
 		else
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 329b06a..6468a2a 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -577,7 +577,6 @@ xfs_trans_alloc(
 	xfs_mount_t	*mp,
 	uint		type)
 {
-	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
 	return _xfs_trans_alloc(mp, type, KM_SLEEP);
 }
 
@@ -589,6 +588,7 @@ _xfs_trans_alloc(
 {
 	xfs_trans_t	*tp;
 
+	WARN_ON(mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
 	atomic_inc(&mp->m_active_trans);
 
 	tp = kmem_zone_zalloc(xfs_trans_zone, memflags);
-- 
1.7.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 12/19] ocfs2: Convert to new freezing mechanism
  2012-03-05 16:00 ` Jan Kara
@ 2012-03-05 16:01   ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, Mark Fasheh, Joel Becker, ocfs2-devel

Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new
freeze protection. We also protect several ioctl entry points which
were missing the protection.

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel@oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c  |   11 +++++++++--
 fs/ocfs2/ioctl.c |   14 ++++++++++++--
 fs/ocfs2/mmap.c  |    2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 061591a..9b1e3d4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1971,6 +1971,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	int ret;
 
 	if ((cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) &&
 	    !ocfs2_writes_unwritten_extents(osb))
@@ -1985,7 +1986,12 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
-	return __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+	ret = __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+	mnt_drop_write_file(file);
+	return ret;
 }
 
 static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
@@ -2261,7 +2267,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
 	if (iocb->ki_left == 0)
 		return 0;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	appending = file->f_flags & O_APPEND ? 1 : 0;
 	direct_io = file->f_flags & O_DIRECT ? 1 : 0;
@@ -2434,6 +2440,7 @@ out_sems:
 		ocfs2_iocb_clear_sem_locked(iocb);
 
 	mutex_unlock(&inode->i_mutex);
+	sb_end_write(inode->i_sb);
 
 	if (written)
 		ret = written;
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index a6fda3c..59de312 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -928,7 +928,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (get_user(new_clusters, (int __user *)arg))
 			return -EFAULT;
 
-		return ocfs2_group_extend(inode, new_clusters);
+		status = mnt_want_write_file(filp);
+		if (status)
+			return status;
+		status = ocfs2_group_extend(inode, new_clusters);
+		mnt_drop_write_file(filp);
+		return status;
 	case OCFS2_IOC_GROUP_ADD:
 	case OCFS2_IOC_GROUP_ADD64:
 		if (!capable(CAP_SYS_RESOURCE))
@@ -937,7 +942,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (copy_from_user(&input, (int __user *) arg, sizeof(input)))
 			return -EFAULT;
 
-		return ocfs2_group_add(inode, &input);
+		status = mnt_want_write_file(filp);
+		if (status)
+			return status;
+		status = ocfs2_group_add(inode, &input);
+		mnt_drop_write_file(filp);
+		return status;
 	case OCFS2_IOC_REFLINK:
 		if (copy_from_user(&args, (struct reflink_arguments *)arg,
 				   sizeof(args)))
diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
index 9cd4108..d150372 100644
--- a/fs/ocfs2/mmap.c
+++ b/fs/ocfs2/mmap.c
@@ -136,6 +136,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	sigset_t oldset;
 	int ret;
 
+	sb_start_pagefault(inode->i_sb);
 	ocfs2_block_signals(&oldset);
 
 	/*
@@ -165,6 +166,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 out:
 	ocfs2_unblock_signals(&oldset);
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/19] gfs2: Convert to new freezing mechanism
  2012-03-05 16:00 ` Jan Kara
  (?)
@ 2012-03-05 16:01   ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, cluster-devel, Steven Whitehouse

It is enough to update gfs2_page_mkwrite() to use new freeze protection.
Rest is handled by the generic code.

CC: cluster-devel@redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/gfs2/file.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 1f03531..806b7a4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -369,11 +369,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 
-	/* Wait if fs is frozen. This is racy so we check again later on
-	 * and retry if the fs has been frozen after the page lock has
-	 * been acquired
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_pagefault(inode->i_sb);
 
 	/* Update file times before taking page lock */
 	file_update_time(vma->vm_file);
@@ -457,14 +453,9 @@ out:
 	gfs2_holder_uninit(&gh);
 	if (ret == 0) {
 		set_page_dirty(page);
-		/* This check must be post dropping of transaction lock */
-		if (inode->i_sb->s_frozen == SB_UNFROZEN) {
-			wait_on_page_writeback(page);
-		} else {
-			ret = -EAGAIN;
-			unlock_page(page);
-		}
+		wait_on_page_writeback(page);
 	}
+	sb_end_pagefault(inode->i_sb);
 	return block_page_mkwrite_return(ret);
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/19] gfs2: Convert to new freezing mechanism
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: sandeen, cluster-devel, Jan Kara, Kamal Mostafa, Al Viro, linux-fsdevel

It is enough to update gfs2_page_mkwrite() to use new freeze protection.
Rest is handled by the generic code.

CC: cluster-devel@redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/gfs2/file.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 1f03531..806b7a4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -369,11 +369,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 
-	/* Wait if fs is frozen. This is racy so we check again later on
-	 * and retry if the fs has been frozen after the page lock has
-	 * been acquired
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_pagefault(inode->i_sb);
 
 	/* Update file times before taking page lock */
 	file_update_time(vma->vm_file);
@@ -457,14 +453,9 @@ out:
 	gfs2_holder_uninit(&gh);
 	if (ret == 0) {
 		set_page_dirty(page);
-		/* This check must be post dropping of transaction lock */
-		if (inode->i_sb->s_frozen == SB_UNFROZEN) {
-			wait_on_page_writeback(page);
-		} else {
-			ret = -EAGAIN;
-			unlock_page(page);
-		}
+		wait_on_page_writeback(page);
 	}
+	sb_end_pagefault(inode->i_sb);
 	return block_page_mkwrite_return(ret);
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [Cluster-devel] [PATCH 13/19] gfs2: Convert to new freezing mechanism
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: cluster-devel.redhat.com

It is enough to update gfs2_page_mkwrite() to use new freeze protection.
Rest is handled by the generic code.

CC: cluster-devel at redhat.com
CC: Steven Whitehouse <swhiteho@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/gfs2/file.c |   15 +++------------
 1 files changed, 3 insertions(+), 12 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 1f03531..806b7a4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -369,11 +369,7 @@ static int gfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	loff_t size;
 	int ret;
 
-	/* Wait if fs is frozen. This is racy so we check again later on
-	 * and retry if the fs has been frozen after the page lock has
-	 * been acquired
-	 */
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_pagefault(inode->i_sb);
 
 	/* Update file times before taking page lock */
 	file_update_time(vma->vm_file);
@@ -457,14 +453,9 @@ out:
 	gfs2_holder_uninit(&gh);
 	if (ret == 0) {
 		set_page_dirty(page);
-		/* This check must be post dropping of transaction lock */
-		if (inode->i_sb->s_frozen == SB_UNFROZEN) {
-			wait_on_page_writeback(page);
-		} else {
-			ret = -EAGAIN;
-			unlock_page(page);
-		}
+		wait_on_page_writeback(page);
 	}
+	sb_end_pagefault(inode->i_sb);
 	return block_page_mkwrite_return(ret);
 }
 
-- 
1.7.1



^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 14/19] fuse: Convert to new freezing mechanism
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, fuse-devel, Miklos Szeredi

Convert check in fuse_file_aio_write() to using new freeze protection.

CC: fuse-devel@lists.sourceforge.net
CC: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fuse/file.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eade72e..05f661f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -943,8 +943,8 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (err)
 		return err;
 
+	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
@@ -970,6 +970,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 out:
 	current->backing_dev_info = NULL;
 	mutex_unlock(&inode->i_mutex);
+	sb_end_write(inode->i_sb);
 
 	return written ? written : err;
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 14/19] fuse: Convert to new freezing mechanism
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: sandeen-H+wXaHxf7aLQT0dZR+AlfA, Jan Kara, Miklos Szeredi,
	fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f, Kamal Mostafa,
	Christoph Hellwig, Al Viro, dchinner-H+wXaHxf7aLQT0dZR+AlfA,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

Convert check in fuse_file_aio_write() to using new freeze protection.

CC: fuse-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f@public.gmane.org
CC: Miklos Szeredi <miklos-sUDqSbJrdHQHWmgEVkV9KA@public.gmane.org>
Signed-off-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
---
 fs/fuse/file.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index eade72e..05f661f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -943,8 +943,8 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 	if (err)
 		return err;
 
+	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
@@ -970,6 +970,7 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 out:
 	current->backing_dev_info = NULL;
 	mutex_unlock(&inode->i_mutex);
+	sb_end_write(inode->i_sb);
 
 	return written ? written : err;
 }
-- 
1.7.1


------------------------------------------------------------------------------
Try before you buy = See our experts in action!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-dev2

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 15/19] ntfs: Convert to new freezing mechanism
  2012-03-05 16:00 ` Jan Kara
                   ` (17 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, linux-ntfs-dev, Anton Altaparmakov

Move check in ntfs_file_aio_write_nolock() to ntfs_file_aio_write() and
use new freeze protection.

CC: linux-ntfs-dev@lists.sourceforge.net
CC: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ntfs/file.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/ntfs/file.c b/fs/ntfs/file.c
index c587e2d..0503e65 100644
--- a/fs/ntfs/file.c
+++ b/fs/ntfs/file.c
@@ -2084,7 +2084,6 @@ static ssize_t ntfs_file_aio_write_nolock(struct kiocb *iocb,
 	if (err)
 		return err;
 	pos = *ppos;
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
 	/* We can write back this queue in page reclaim. */
 	current->backing_dev_info = mapping->backing_dev_info;
 	written = 0;
@@ -2117,6 +2116,7 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 
 	BUG_ON(iocb->ki_pos != pos);
 
+	sb_start_write(inode->i_sb);
 	mutex_lock(&inode->i_mutex);
 	ret = ntfs_file_aio_write_nolock(iocb, iov, nr_segs, &iocb->ki_pos);
 	mutex_unlock(&inode->i_mutex);
@@ -2125,6 +2125,7 @@ static ssize_t ntfs_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
 		if (err < 0)
 			ret = err;
 	}
+	sb_end_write(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 16/19] nilfs2: Convert to new freezing mechanism
  2012-03-05 16:00 ` Jan Kara
                   ` (18 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, linux-nilfs, KONISHI Ryusuke

We change nilfs_page_mkwrite() to provide proper freeze protection for
writeable page faults (we must wait for frozen filesystem even if the
page is fully mapped).

We remove all vfs_check_frozen() checks since they are now handled by
the generic code.

CC: linux-nilfs@vger.kernel.org
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/nilfs2/file.c    |   18 +++++++++++-------
 fs/nilfs2/ioctl.c   |    2 --
 fs/nilfs2/segment.c |    2 --
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/fs/nilfs2/file.c b/fs/nilfs2/file.c
index 2660152..0cdd702 100644
--- a/fs/nilfs2/file.c
+++ b/fs/nilfs2/file.c
@@ -65,16 +65,18 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct inode *inode = vma->vm_file->f_dentry->d_inode;
 	struct nilfs_transaction_info ti;
-	int ret;
+	int ret = 0;
 
 	if (unlikely(nilfs_near_disk_full(inode->i_sb->s_fs_info)))
 		return VM_FAULT_SIGBUS; /* -ENOSPC */
 
+	sb_start_pagefault(inode->i_sb);
 	lock_page(page);
 	if (page->mapping != inode->i_mapping ||
 	    page_offset(page) >= i_size_read(inode) || !PageUptodate(page)) {
 		unlock_page(page);
-		return VM_FAULT_NOPAGE; /* make the VM retry the fault */
+		ret = -EFAULT;	/* make the VM retry the fault */
+		goto out;
 	}
 
 	/*
@@ -108,19 +110,21 @@ static int nilfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	ret = nilfs_transaction_begin(inode->i_sb, &ti, 1);
 	/* never returns -ENOMEM, but may return -ENOSPC */
 	if (unlikely(ret))
-		return VM_FAULT_SIGBUS;
+		goto out;
 
-	ret = block_page_mkwrite(vma, vmf, nilfs_get_block);
-	if (ret != VM_FAULT_LOCKED) {
+	ret = __block_page_mkwrite(vma, vmf, nilfs_get_block);
+	if (ret) {
 		nilfs_transaction_abort(inode->i_sb);
-		return ret;
+		goto out;
 	}
 	nilfs_set_file_dirty(inode, 1 << (PAGE_SHIFT - inode->i_blkbits));
 	nilfs_transaction_commit(inode->i_sb);
 
  mapped:
 	wait_on_page_writeback(page);
-	return VM_FAULT_LOCKED;
+ out:
+	sb_end_pagefault(inode->i_sb);
+	return block_page_mkwrite_return(ret);
 }
 
 static const struct vm_operations_struct nilfs_file_vm_ops = {
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 2a70fce..a2063c7 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -660,8 +660,6 @@ static int nilfs_ioctl_clean_segments(struct inode *inode, struct file *filp,
 		goto out_free;
 	}
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
-
 	ret = nilfs_ioctl_move_blocks(inode->i_sb, &argv[0], kbufs[0]);
 	if (ret < 0)
 		printk(KERN_ERR "NILFS: GC failed during preparation: "
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 0e72ad6..0891cef 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -189,8 +189,6 @@ int nilfs_transaction_begin(struct super_block *sb,
 	if (ret > 0)
 		return 0;
 
-	vfs_check_frozen(sb, SB_FREEZE_WRITE);
-
 	nilfs = sb->s_fs_info;
 	down_read(&nilfs->ns_segctor_sem);
 	if (vacancy_check && nilfs_near_disk_full(nilfs)) {
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 17/19] btrfs: Convert to new freezing mechanism
  2012-03-05 16:00 ` Jan Kara
                   ` (19 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, linux-btrfs, Chris Mason

We convert btrfs_file_aio_write() to use new freeze check.  We also add proper
freeze protection to btrfs_page_mkwrite(). Checks in cleaner_kthread() and
transaction_kthread() can be safely removed since btrfs_freeze() will lock
the mutexes and thus block the threads (and they shouldn't have anything to
do anyway).

CC: linux-btrfs@vger.kernel.org
CC: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/btrfs/disk-io.c |    3 ---
 fs/btrfs/file.c    |    3 ++-
 fs/btrfs/inode.c   |    6 +++++-
 3 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 811d9f9..fc0f74c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1586,8 +1586,6 @@ static int cleaner_kthread(void *arg)
 	struct btrfs_root *root = arg;
 
 	do {
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
-
 		if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 		    mutex_trylock(&root->fs_info->cleaner_mutex)) {
 			btrfs_run_delayed_iputs(root);
@@ -1618,7 +1616,6 @@ static int transaction_kthread(void *arg)
 
 	do {
 		delay = HZ * 30;
-		vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE);
 		mutex_lock(&root->fs_info->transaction_kthread_mutex);
 
 		spin_lock(&root->fs_info->trans_lock);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 859ba2d..1aac7ca 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1348,7 +1348,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 	ssize_t err = 0;
 	size_t count, ocount;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	mutex_lock(&inode->i_mutex);
 
@@ -1439,6 +1439,7 @@ static ssize_t btrfs_file_aio_write(struct kiocb *iocb,
 			num_written = err;
 	}
 out:
+	sb_end_write(inode->i_sb);
 	current->backing_dev_info = NULL;
 	return num_written ? num_written : err;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 32214fe..63c9006 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6405,6 +6405,7 @@ int btrfs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	u64 page_start;
 	u64 page_end;
 
+	sb_start_pagefault(inode->i_sb);
 	ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
 	if (!ret) {
 		ret = btrfs_update_time(vma->vm_file);
@@ -6495,12 +6496,15 @@ again:
 	unlock_extent_cached(io_tree, page_start, page_end, &cached_state, GFP_NOFS);
 
 out_unlock:
-	if (!ret)
+	if (!ret) {
+		sb_end_pagefault(inode->i_sb);
 		return VM_FAULT_LOCKED;
+	}
 	unlock_page(page);
 out:
 	btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
 out_noreserve:
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 18/19] fs: Remove old freezing mechanism
  2012-03-05 16:00 ` Jan Kara
                   ` (20 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

Now that all users are converted, we can remove functions, variables, and
constants defined by the old freezing mechanism.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c         |    1 -
 include/linux/fs.h |    8 --------
 2 files changed, 0 insertions(+), 9 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 8b0b40c..0830afb 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -209,7 +209,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
 		mutex_init(&s->s_dquot.dqio_mutex);
 		mutex_init(&s->s_dquot.dqonoff_mutex);
 		init_rwsem(&s->s_dquot.dqptr_sem);
-		init_waitqueue_head(&s->s_wait_unfrozen);
 		s->s_maxbytes = MAX_NON_LFS;
 		s->s_op = &default_op;
 		s->s_time_gran = 1000000000;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 37667f3..fcb4135 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1483,8 +1483,6 @@ struct super_block {
 	struct hlist_node	s_instances;
 	struct quota_info	s_dquot;	/* Diskquota specific options */
 
-	int			s_frozen;
-	wait_queue_head_t	s_wait_unfrozen;
 	struct sb_writers	s_writers;
 
 	char s_id[32];				/* Informational name */
@@ -1537,12 +1535,6 @@ extern void prune_dcache_sb(struct super_block *sb, int nr_to_scan);
 extern struct timespec current_fs_time(struct super_block *sb);
 
 /*
- * Snapshotting support.
- */
-/* Will go away when all users are converted */
-#define vfs_check_frozen(sb, level) do { } while (0)
-
-/*
  * until VFS tracks user namespaces for inodes, just make all files
  * belong to init_user_ns
  */
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 19/19] fs: Refuse to freeze filesystem with open but unlinked files
  2012-03-05 16:00 ` Jan Kara
                   ` (21 preceding siblings ...)
  (?)
@ 2012-03-05 16:01 ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara

Filesystem with frozen but unlinked files cannot be forced into
a fully consistent state. Also handling of closing such files on frozen
filesystem is problematic since fput() can be called with mmap_sem held
and protection against frozen filesystem ranks above it.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/super.c |   26 +++++++++++++++++++-------
 1 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 0830afb..e0f5992 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1415,16 +1415,22 @@ int freeze_super(struct super_block *sb)
 	/* All writers are done so after syncing there won't be dirty data */
 	sync_filesystem(sb);
 
+	/*
+	 * Check whether there are some open unlinked files (since all writers
+	 * are blocked, new ones cannot be created now so the check is
+	 * reliable). In that case filesystem cannot be forced into a
+	 * consistent state so we just bail out.
+	 */
+	if (atomic_long_read(&sb->s_remove_count)) {
+		ret = -EBUSY;
+		goto bail;
+	}
+
 	if (sb->s_op->freeze_fs) {
 		ret = sb->s_op->freeze_fs(sb);
 		if (ret) {
-			printk(KERN_ERR
-				"VFS:Filesystem freeze failed\n");
-			sb->s_writers.frozen = SB_UNFROZEN;
-			smp_wmb();
-			wake_up(&sb->s_writers.wait_unfrozen);
-			deactivate_locked_super(sb);
-			return ret;
+			printk(KERN_ERR "VFS: Filesystem freeze failed\n");
+			goto bail;
 		}
 	}
 	/*
@@ -1434,6 +1440,12 @@ int freeze_super(struct super_block *sb)
 	sb->s_writers.frozen = SB_FREEZE_COMPLETE;
 	up_write(&sb->s_umount);
 	return 0;
+bail:
+	sb->s_writers.frozen = SB_UNFROZEN;
+	smp_wmb();
+	wake_up(&sb->s_writers.wait_unfrozen);
+	deactivate_locked_super(sb);
+	return ret;
 }
 EXPORT_SYMBOL(freeze_super);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [Ocfs2-devel] [PATCH 02/19] fs: Push mnt_want_write() outside of i_mutex
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, ocfs2-devel, Mark Fasheh, Joel Becker,
	David S. Miller

Currently, mnt_want_write() is sometimes called with i_mutex held and sometimes
without it. This isn't really a problem because mnt_want_write() is a
non-blocking operation (essentially has a trylock semantics) but when the
function starts to handle also frozen filesystems, it will get a full lock
semantics and thus proper lock ordering has to be established. So move
all mnt_want_write() calls outside of i_mutex.

One non-trivial case needing conversion is kern_path_create() /
user_path_create() which didn't include mnt_want_write() but now needs to
because it acquires i_mutex.  Because there are virtual file systems which
don't bother with freeze / remount-ro protection we actually provide both
versions of the function - one which calls mnt_want_write() and one which does
not.

CC: ocfs2-devel at oss.oracle.com
CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/namei.c              |  115 +++++++++++++++++++++++++++--------------------
 fs/ocfs2/refcounttree.c |   10 +---
 include/linux/namei.h   |    2 +
 net/unix/af_unix.c      |   13 ++----
 4 files changed, 74 insertions(+), 66 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index a780ea5..575d3a4 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2383,7 +2383,9 @@ struct file *do_file_open_root(struct dentry *dentry, struct vfsmount *mnt,
 	return file;
 }
 
-struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+static struct dentry *do_kern_path_create(int dfd, const char *pathname,
+					  struct path *path, int is_dir,
+					  int freeze_protect)
 {
 	struct dentry *dentry = ERR_PTR(-EEXIST);
 	struct nameidata nd;
@@ -2401,6 +2403,14 @@ struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path
 	nd.flags |= LOOKUP_CREATE | LOOKUP_EXCL;
 	nd.intent.open.flags = O_EXCL;
 
+	if (freeze_protect) {
+		error = mnt_want_write(nd.path.mnt);
+		if (error) {
+			dentry = ERR_PTR(error);
+			goto out;
+		}
+	}
+
 	/*
 	 * Do the final lookup.
 	 */
@@ -2429,24 +2439,49 @@ eexist:
 	dentry = ERR_PTR(-EEXIST);
 fail:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+	if (freeze_protect)
+		mnt_drop_write(nd.path.mnt);
 out:
 	path_put(&nd.path);
 	return dentry;
 }
+
+struct dentry *kern_path_create(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+	return do_kern_path_create(dfd, pathname, path, is_dir, 0);
+}
 EXPORT_SYMBOL(kern_path_create);
 
+struct dentry *kern_path_create_thawed(int dfd, const char *pathname, struct path *path, int is_dir)
+{
+	return do_kern_path_create(dfd, pathname, path, is_dir, 1);
+}
+EXPORT_SYMBOL(kern_path_create_thawed);
+
 struct dentry *user_path_create(int dfd, const char __user *pathname, struct path *path, int is_dir)
 {
 	char *tmp = getname(pathname);
 	struct dentry *res;
 	if (IS_ERR(tmp))
 		return ERR_CAST(tmp);
-	res = kern_path_create(dfd, tmp, path, is_dir);
+	res = do_kern_path_create(dfd, tmp, path, is_dir, 0);
 	putname(tmp);
 	return res;
 }
 EXPORT_SYMBOL(user_path_create);
 
+struct dentry *user_path_create_thawed(int dfd, const char __user *pathname, struct path *path, int is_dir)
+{
+	char *tmp = getname(pathname);
+	struct dentry *res;
+	if (IS_ERR(tmp))
+		return ERR_CAST(tmp);
+	res = do_kern_path_create(dfd, tmp, path, is_dir, 1);
+	putname(tmp);
+	return res;
+}
+EXPORT_SYMBOL(user_path_create_thawed);
+
 int vfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
 {
 	int error = may_create(dir, dentry);
@@ -2502,7 +2537,7 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 	if (S_ISDIR(mode))
 		return -EPERM;
 
-	dentry = user_path_create(dfd, filename, &path, 0);
+	dentry = user_path_create_thawed(dfd, filename, &path, 0);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
@@ -2511,12 +2546,9 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 	error = may_mknod(mode);
 	if (error)
 		goto out_dput;
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_mknod(&path, dentry, mode, dev);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	switch (mode & S_IFMT) {
 		case 0: case S_IFREG:
 			error = vfs_create(path.dentry->d_inode,dentry,mode,NULL);
@@ -2529,11 +2561,10 @@ SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, umode_t, mode,
 			error = vfs_mknod(path.dentry->d_inode,dentry,mode,0);
 			break;
 	}
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 
 	return error;
@@ -2571,24 +2602,20 @@ SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
 	struct path path;
 	int error;
 
-	dentry = user_path_create(dfd, pathname, &path, 1);
+	dentry = user_path_create_thawed(dfd, pathname, &path, 1);
 	if (IS_ERR(dentry))
 		return PTR_ERR(dentry);
 
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_mkdir(&path, dentry, mode);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 	return error;
 }
@@ -2683,6 +2710,9 @@ static long do_rmdir(int dfd, const char __user *pathname)
 	}
 
 	nd.flags &= ~LOOKUP_PARENT;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit1;
 
 	mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
@@ -2693,19 +2723,15 @@ static long do_rmdir(int dfd, const char __user *pathname)
 		error = -ENOENT;
 		goto exit3;
 	}
-	error = mnt_want_write(nd.path.mnt);
-	if (error)
-		goto exit3;
 	error = security_path_rmdir(&nd.path, dentry);
 	if (error)
-		goto exit4;
+		goto exit3;
 	error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
-exit4:
-	mnt_drop_write(nd.path.mnt);
 exit3:
 	dput(dentry);
 exit2:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
+	mnt_drop_write(nd.path.mnt);
 exit1:
 	path_put(&nd.path);
 	putname(name);
@@ -2772,6 +2798,9 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		goto exit1;
 
 	nd.flags &= ~LOOKUP_PARENT;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit1;
 
 	mutex_lock_nested(&nd.path.dentry->d_inode->i_mutex, I_MUTEX_PARENT);
 	dentry = lookup_hash(&nd);
@@ -2784,21 +2813,17 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		if (!inode)
 			goto slashes;
 		ihold(inode);
-		error = mnt_want_write(nd.path.mnt);
-		if (error)
-			goto exit2;
 		error = security_path_unlink(&nd.path, dentry);
 		if (error)
-			goto exit3;
+			goto exit2;
 		error = vfs_unlink(nd.path.dentry->d_inode, dentry);
-exit3:
-		mnt_drop_write(nd.path.mnt);
-	exit2:
+exit2:
 		dput(dentry);
 	}
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
 	if (inode)
 		iput(inode);	/* truncate the inode here */
+	mnt_drop_write(nd.path.mnt);
 exit1:
 	path_put(&nd.path);
 	putname(name);
@@ -2858,23 +2883,19 @@ SYSCALL_DEFINE3(symlinkat, const char __user *, oldname,
 	if (IS_ERR(from))
 		return PTR_ERR(from);
 
-	dentry = user_path_create(newdfd, newname, &path, 0);
+	dentry = user_path_create_thawed(newdfd, newname, &path, 0);
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto out_putname;
 
-	error = mnt_want_write(path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_symlink(&path, dentry, from);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_symlink(path.dentry->d_inode, dentry, from);
-out_drop_write:
-	mnt_drop_write(path.mnt);
 out_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 out_putname:
 	putname(from);
@@ -2964,7 +2985,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	if (error)
 		return error;
 
-	new_dentry = user_path_create(newdfd, newname, &new_path, 0);
+	new_dentry = user_path_create_thawed(newdfd, newname, &new_path, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out;
@@ -2972,18 +2993,14 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	error = -EXDEV;
 	if (old_path.mnt != new_path.mnt)
 		goto out_dput;
-	error = mnt_want_write(new_path.mnt);
-	if (error)
-		goto out_dput;
 	error = security_path_link(old_path.dentry, &new_path, new_dentry);
 	if (error)
-		goto out_drop_write;
+		goto out_dput;
 	error = vfs_link(old_path.dentry, new_path.dentry->d_inode, new_dentry);
-out_drop_write:
-	mnt_drop_write(new_path.mnt);
 out_dput:
 	dput(new_dentry);
 	mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+	mnt_drop_write(new_path.mnt);
 	path_put(&new_path);
 out:
 	path_put(&old_path);
@@ -3174,6 +3191,10 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	if (newnd.last_type != LAST_NORM)
 		goto exit2;
 
+	error = mnt_want_write(oldnd.path.mnt);
+	if (error)
+		goto exit2;
+
 	oldnd.flags &= ~LOOKUP_PARENT;
 	newnd.flags &= ~LOOKUP_PARENT;
 	newnd.flags |= LOOKUP_RENAME_TARGET;
@@ -3209,23 +3230,19 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	if (new_dentry == trap)
 		goto exit5;
 
-	error = mnt_want_write(oldnd.path.mnt);
-	if (error)
-		goto exit5;
 	error = security_path_rename(&oldnd.path, old_dentry,
 				     &newnd.path, new_dentry);
 	if (error)
-		goto exit6;
+		goto exit5;
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
-exit6:
-	mnt_drop_write(oldnd.path.mnt);
 exit5:
 	dput(new_dentry);
 exit4:
 	dput(old_dentry);
 exit3:
 	unlock_rename(new_dir, old_dir);
+	mnt_drop_write(oldnd.path.mnt);
 exit2:
 	path_put(&newnd.path);
 	putname(to);
diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index cf78233..a99b8e2 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4453,7 +4453,7 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 		return error;
 	}
 
-	new_dentry = user_path_create(AT_FDCWD, newname, &new_path, 0);
+	new_dentry = user_path_create_thawed(AT_FDCWD, newname, &new_path, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry)) {
 		mlog_errno(error);
@@ -4466,19 +4466,13 @@ int ocfs2_reflink_ioctl(struct inode *inode,
 		goto out_dput;
 	}
 
-	error = mnt_want_write(new_path.mnt);
-	if (error) {
-		mlog_errno(error);
-		goto out_dput;
-	}
-
 	error = ocfs2_vfs_reflink(old_path.dentry,
 				  new_path.dentry->d_inode,
 				  new_dentry, preserve);
-	mnt_drop_write(new_path.mnt);
 out_dput:
 	dput(new_dentry);
 	mutex_unlock(&new_path.dentry->d_inode->i_mutex);
+	mnt_drop_write(new_path.mnt);
 	path_put(&new_path);
 out:
 	path_put(&old_path);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index ffc0213..432f6bb 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -77,7 +77,9 @@ extern int user_path_at_empty(int, const char __user *, unsigned, struct path *,
 extern int kern_path(const char *, unsigned, struct path *);
 
 extern struct dentry *kern_path_create(int, const char *, struct path *, int);
+extern struct dentry *kern_path_create_thawed(int, const char *, struct path *, int);
 extern struct dentry *user_path_create(int, const char __user *, struct path *, int);
+extern struct dentry *user_path_create_thawed(int, const char __user *, struct path *, int);
 extern int kern_path_parent(const char *, struct nameidata *);
 extern int vfs_path_lookup(struct dentry *, struct vfsmount *,
 			   const char *, unsigned int, struct path *);
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 85d3bb7..7888b09 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -856,7 +856,7 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		 * Get the parent directory, calculate the hash for last
 		 * component.
 		 */
-		dentry = kern_path_create(AT_FDCWD, sun_path, &path, 0);
+		dentry = kern_path_create_thawed(AT_FDCWD, sun_path, &path, 0);
 		err = PTR_ERR(dentry);
 		if (IS_ERR(dentry))
 			goto out_mknod_parent;
@@ -866,19 +866,13 @@ static int unix_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len)
 		 */
 		mode = S_IFSOCK |
 		       (SOCK_INODE(sock)->i_mode & ~current_umask());
-		err = mnt_want_write(path.mnt);
-		if (err)
-			goto out_mknod_dput;
 		err = security_path_mknod(&path, dentry, mode, 0);
 		if (err)
-			goto out_mknod_drop_write;
-		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
-out_mknod_drop_write:
-		mnt_drop_write(path.mnt);
-		if (err)
 			goto out_mknod_dput;
+		err = vfs_mknod(path.dentry->d_inode, dentry, mode, 0);
 		mutex_unlock(&path.dentry->d_inode->i_mutex);
 		dput(path.dentry);
+		mnt_drop_write(path.mnt);
 		path.dentry = dentry;
 
 		addr->hash = UNIX_HASH_SIZE;
@@ -916,6 +910,7 @@ out:
 out_mknod_dput:
 	dput(dentry);
 	mutex_unlock(&path.dentry->d_inode->i_mutex);
+	mnt_drop_write(path.mnt);
 	path_put(&path);
 out_mknod_parent:
 	if (err == -EEXIST)
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [Ocfs2-devel] [PATCH 12/19] ocfs2: Convert to new freezing mechanism
@ 2012-03-05 16:01   ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, Mark Fasheh, Joel Becker, ocfs2-devel

Protect ocfs2_page_mkwrite() and ocfs2_file_aio_write() using the new
freeze protection. We also protect several ioctl entry points which
were missing the protection.

CC: Mark Fasheh <mfasheh@suse.com>
CC: Joel Becker <jlbec@evilplan.org>
CC: ocfs2-devel at oss.oracle.com
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ocfs2/file.c  |   11 +++++++++--
 fs/ocfs2/ioctl.c |   14 ++++++++++++--
 fs/ocfs2/mmap.c  |    2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 061591a..9b1e3d4 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1971,6 +1971,7 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 {
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
+	int ret;
 
 	if ((cmd == OCFS2_IOC_RESVSP || cmd == OCFS2_IOC_RESVSP64) &&
 	    !ocfs2_writes_unwritten_extents(osb))
@@ -1985,7 +1986,12 @@ int ocfs2_change_file_space(struct file *file, unsigned int cmd,
 	if (!(file->f_mode & FMODE_WRITE))
 		return -EBADF;
 
-	return __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+	ret = mnt_want_write_file(file);
+	if (ret)
+		return ret;
+	ret = __ocfs2_change_file_space(file, inode, file->f_pos, cmd, sr, 0);
+	mnt_drop_write_file(file);
+	return ret;
 }
 
 static long ocfs2_fallocate(struct file *file, int mode, loff_t offset,
@@ -2261,7 +2267,7 @@ static ssize_t ocfs2_file_aio_write(struct kiocb *iocb,
 	if (iocb->ki_left == 0)
 		return 0;
 
-	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+	sb_start_write(inode->i_sb);
 
 	appending = file->f_flags & O_APPEND ? 1 : 0;
 	direct_io = file->f_flags & O_DIRECT ? 1 : 0;
@@ -2434,6 +2440,7 @@ out_sems:
 		ocfs2_iocb_clear_sem_locked(iocb);
 
 	mutex_unlock(&inode->i_mutex);
+	sb_end_write(inode->i_sb);
 
 	if (written)
 		ret = written;
diff --git a/fs/ocfs2/ioctl.c b/fs/ocfs2/ioctl.c
index a6fda3c..59de312 100644
--- a/fs/ocfs2/ioctl.c
+++ b/fs/ocfs2/ioctl.c
@@ -928,7 +928,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (get_user(new_clusters, (int __user *)arg))
 			return -EFAULT;
 
-		return ocfs2_group_extend(inode, new_clusters);
+		status = mnt_want_write_file(filp);
+		if (status)
+			return status;
+		status = ocfs2_group_extend(inode, new_clusters);
+		mnt_drop_write_file(filp);
+		return status;
 	case OCFS2_IOC_GROUP_ADD:
 	case OCFS2_IOC_GROUP_ADD64:
 		if (!capable(CAP_SYS_RESOURCE))
@@ -937,7 +942,12 @@ long ocfs2_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		if (copy_from_user(&input, (int __user *) arg, sizeof(input)))
 			return -EFAULT;
 
-		return ocfs2_group_add(inode, &input);
+		status = mnt_want_write_file(filp);
+		if (status)
+			return status;
+		status = ocfs2_group_add(inode, &input);
+		mnt_drop_write_file(filp);
+		return status;
 	case OCFS2_IOC_REFLINK:
 		if (copy_from_user(&args, (struct reflink_arguments *)arg,
 				   sizeof(args)))
diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
index 9cd4108..d150372 100644
--- a/fs/ocfs2/mmap.c
+++ b/fs/ocfs2/mmap.c
@@ -136,6 +136,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	sigset_t oldset;
 	int ret;
 
+	sb_start_pagefault(inode->i_sb);
 	ocfs2_block_signals(&oldset);
 
 	/*
@@ -165,6 +166,7 @@ static int ocfs2_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 
 out:
 	ocfs2_unblock_signals(&oldset);
+	sb_end_pagefault(inode->i_sb);
 	return ret;
 }
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [Ocfs2-devel] [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-05 16:00 ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-05 16:01 UTC (permalink / raw)
  To: LKML
  Cc: linux-fsdevel, Al Viro, Christoph Hellwig, dchinner, sandeen,
	Kamal Mostafa, Jan Kara, Alex Elder, Anton Altaparmakov,
	Ben Myers, Chris Mason, cluster-devel, David S. Miller,
	fuse-devel, J. Bruce Fields, Joel Becker, KONISHI Ryusuke,
	linux-btrfs, linux-ext4, linux-nfs, linux-nilfs, linux-ntfs-dev,
	Mark Fasheh, Miklos Szeredi, ocfs2-devel, OGAWA Hirofumi,
	Steven Whitehouse, Theodore Ts'o, xfs

  Hallelujah,

  after a couple of weeks and several rewrites, here comes the third iteration
of my patches to improve filesystem freezing.  Filesystem freezing is currently
racy and thus we can end up with dirty data on frozen filesystem (see changelog
patch 06 for detailed race description). This patch series aims at fixing this.

To be able to block all places where inodes get dirtied, I've moved filesystem
freeze handling in mnt_want_write() / mnt_drop_write(). This however required
some code shuffling and changes to kern_path_create() (see patches 02-05). I
think the result is OK but opinions may differ ;). The advantage of this change
also is that all filesystems get freeze protection almost for free - even ext2
can handle freezing well now.

Another potential contention point might be patch 19. In that patch we make
freeze_super() refuse to freeze the filesystem when there are open but unlinked
files which may be impractical in some cases. The main reason for this is the
problem with handling of file deletion from fput() called with mmap_sem held
(e.g. from munmap(2)), and then there's the fact that we cannot really force
such filesystem into a consistent state... But if people think that freezing
with open but unlinked files should happen, then I have some possible
solutions in mind (maybe as a separate patchset since this is large enough).

I'm not able to hit any deadlocks, lockdep warnings, or dirty data on frozen
filesystem despite beating it with fsstress and bash-shared-mapping while
freezing and unfreezing for several hours (using ext4 and xfs) so I'm
reasonably confident this could finally be the right solution.

And for people wanting to test - this patchset is based on patch series
"Push file_update_time() into .page_mkwrite" so you'll need to pull that one
in as well.

Changes since v2:
  * completely rewritten
  * freezing is now blocked at VFS entry points
  * two stage freezing to handle both mmapped writes and other IO

The biggest changes since v1:
  * have two counters to provide safe state transitions for SB_FREEZE_WRITE
    and SB_FREEZE_TRANS states
  * use percpu counters instead of own percpu structure
  * added documentation fixes from the old fs freezing series
  * converted XFS to use SB_FREEZE_TRANS counter instead of its private
    m_active_trans counter

								Honza

CC: Alex Elder <elder@kernel.org>
CC: Anton Altaparmakov <anton@tuxera.com>
CC: Ben Myers <bpm@sgi.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: cluster-devel at redhat.com
CC: "David S. Miller" <davem@davemloft.net>
CC: fuse-devel at lists.sourceforge.net
CC: "J. Bruce Fields" <bfields@fieldses.org>
CC: Joel Becker <jlbec@evilplan.org>
CC: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp>
CC: linux-btrfs at vger.kernel.org
CC: linux-ext4 at vger.kernel.org
CC: linux-nfs at vger.kernel.org
CC: linux-nilfs at vger.kernel.org
CC: linux-ntfs-dev at lists.sourceforge.net
CC: Mark Fasheh <mfasheh@suse.com>
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: ocfs2-devel at oss.oracle.com
CC: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
CC: Steven Whitehouse <swhiteho@redhat.com>
CC: "Theodore Ts'o" <tytso@mit.edu>
CC: xfs at oss.sgi.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 10/19] ext4: Convert to new freezing mechanism
  2012-03-05 16:01 ` [PATCH 10/19] ext4: Convert to new freezing mechanism Jan Kara
@ 2012-03-07 22:32   ` Kamal Mostafa
  2012-03-08  9:05     ` Jan Kara
  0 siblings, 1 reply; 58+ messages in thread
From: Kamal Mostafa @ 2012-03-07 22:32 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, linux-ext4, Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 5573 bytes --]

Re: the patch set:
    [PATCH 00/19] Fix filesystem freezing deadlocks

In my initial smoke testing of this, I find that if I freeze a newly
created ext4 filesystem immediately after mounting it for the very first
time, then I get the new SB_FREEZE_COMPLETE warning from
ext4_journal_start_sb() every 0.4 seconds from ext4lazyinit...

        # mkfs -t ext4 /dev/sdaX
        # mount /dev/sdaX /mnt
        # fsfreeze -f /mnt

         WARNING:
        at /home/kamal/src/linux/ubuntu-precise/fs/ext4/super.c:301
        ext4_journal_start_sb+0x159/0x160()
        
         Pid: 3252, comm: ext4lazyinit Tainted: G        W
        3.2.0-18-generic #28+kamal1+jankara1
        
         Call Trace:
          [<ffffffff8106724f>] warn_slowpath_common+0x7f/0xc0
          [<ffffffff810672aa>] warn_slowpath_null+0x1a/0x20
          [<ffffffff812352c9>] ext4_journal_start_sb+0x159/0x160
          [<ffffffff8121326b>] ? ext4_init_inode_table+0xab/0x370
          [<ffffffff8121326b>] ext4_init_inode_table+0xab/0x370
          [<ffffffff81659cb5>] ? schedule_timeout+0x175/0x320
          [<ffffffff81226905>] ext4_run_li_request+0x85/0xe0
          [<ffffffff812269fc>] ext4_lazyinit_thread+0x9c/0x1c0
          [<ffffffff81226960>] ? ext4_run_li_request+0xe0/0xe0
          [<ffffffff8108a39c>] kthread+0x8c/0xa0
          [<ffffffff81665e34>] kernel_thread_helper+0x4/0x10
          [<ffffffff8108a310>] ? flush_kthread_worker+0xa0/0xa0
          [<ffffffff81665e30>] ? gs_change+0x13/0x13

 -Kamal


On Mon, 2012-03-05 at 17:01 +0100, Jan Kara wrote:
> We remove most of frozen checks since upper layer takes care
> of blocking all writes. We only have to handle protection in
> ext4_page_mkwrite() in a special way because we cannot use
> generic block_page_mkwrite().
> 
> CC: linux-ext4@vger.kernel.org
> CC: "Theodore Ts'o" <tytso@mit.edu>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/inode.c |    7 ++-----
>  fs/ext4/super.c |   29 +++++------------------------
>  2 files changed, 7 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index feaa82f..c65baf9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4593,11 +4593,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
>  	get_block_t *get_block;
>  	int retries = 0;
>  
> -	/*
> -	 * This check is racy but catches the common case. We rely on
> -	 * __block_page_mkwrite() to do a reliable check.
> -	 */
> -	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> +	sb_start_pagefault(inode->i_sb);
>  	/* Delalloc case is easy... */
>  	if (test_opt(inode->i_sb, DELALLOC) &&
>  	    !ext4_should_journal_data(inode) &&
> @@ -4665,5 +4661,6 @@ retry_alloc:
>  out_ret:
>  	ret = block_page_mkwrite_return(ret);
>  out:
> +	sb_end_pagefault(inode->i_sb);
>  	return ret;
>  }
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 502c61f..0f1024a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -289,33 +289,17 @@ static void ext4_put_nojournal(handle_t *handle)
>   * journal_end calls result in the superblock being marked dirty, so
>   * that sync() will call the filesystem's write_super callback if
>   * appropriate.
> - *
> - * To avoid j_barrier hold in userspace when a user calls freeze(),
> - * ext4 prevents a new handle from being started by s_frozen, which
> - * is in an upper layer.
>   */
>  handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
>  {
>  	journal_t *journal;
> -	handle_t  *handle;
>  
>  	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
>  	if (sb->s_flags & MS_RDONLY)
>  		return ERR_PTR(-EROFS);
>  
> +	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
>  	journal = EXT4_SB(sb)->s_journal;
> -	handle = ext4_journal_current_handle();
> -
> -	/*
> -	 * If a handle has been started, it should be allowed to
> -	 * finish, otherwise deadlock could happen between freeze
> -	 * and others(e.g. truncate) due to the restart of the
> -	 * journal handle if the filesystem is forzen and active
> -	 * handles are not stopped.
> -	 */
> -	if (!handle)
> -		vfs_check_frozen(sb, SB_FREEZE_TRANS);
> -
>  	if (!journal)
>  		return ext4_get_nojournal();
>  	/*
> @@ -4280,10 +4264,8 @@ int ext4_force_commit(struct super_block *sb)
>  		return 0;
>  
>  	journal = EXT4_SB(sb)->s_journal;
> -	if (journal) {
> -		vfs_check_frozen(sb, SB_FREEZE_TRANS);
> +	if (journal)
>  		ret = ext4_journal_force_commit(journal);
> -	}
>  
>  	return ret;
>  }
> @@ -4315,9 +4297,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
>   * gives us a chance to flush the journal completely and mark the fs clean.
>   *
>   * Note that only this function cannot bring a filesystem to be in a clean
> - * state independently, because ext4 prevents a new handle from being started
> - * by @sb->s_frozen, which stays in an upper layer.  It thus needs help from
> - * the upper layer.
> + * state independently. It relies on upper layer to stop all data & metadata
> + * modifications.
>   */
>  static int ext4_freeze(struct super_block *sb)
>  {
> @@ -4344,7 +4325,7 @@ static int ext4_freeze(struct super_block *sb)
>  	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
>  	error = ext4_commit_super(sb, 1);
>  out:
> -	/* we rely on s_frozen to stop further updates */
> +	/* we rely on upper layer to stop further updates */
>  	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
>  	return error;
>  }


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 10/19] ext4: Convert to new freezing mechanism
  2012-03-07 22:32   ` Kamal Mostafa
@ 2012-03-08  9:05     ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-08  9:05 UTC (permalink / raw)
  To: Kamal Mostafa
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, linux-ext4, Theodore Ts'o

[-- Attachment #1: Type: text/plain, Size: 5998 bytes --]

On Wed 07-03-12 14:32:13, Kamal Mostafa wrote:
> Re: the patch set:
>     [PATCH 00/19] Fix filesystem freezing deadlocks
> 
> In my initial smoke testing of this, I find that if I freeze a newly
> created ext4 filesystem immediately after mounting it for the very first
> time, then I get the new SB_FREEZE_COMPLETE warning from
> ext4_journal_start_sb() every 0.4 seconds from ext4lazyinit...
> 
>         # mkfs -t ext4 /dev/sdaX
>         # mount /dev/sdaX /mnt
>         # fsfreeze -f /mnt
> 
>          WARNING:
>         at /home/kamal/src/linux/ubuntu-precise/fs/ext4/super.c:301
>         ext4_journal_start_sb+0x159/0x160()
>         
>          Pid: 3252, comm: ext4lazyinit Tainted: G        W
>         3.2.0-18-generic #28+kamal1+jankara1
>         
>          Call Trace:
>           [<ffffffff8106724f>] warn_slowpath_common+0x7f/0xc0
>           [<ffffffff810672aa>] warn_slowpath_null+0x1a/0x20
>           [<ffffffff812352c9>] ext4_journal_start_sb+0x159/0x160
>           [<ffffffff8121326b>] ? ext4_init_inode_table+0xab/0x370
>           [<ffffffff8121326b>] ext4_init_inode_table+0xab/0x370
>           [<ffffffff81659cb5>] ? schedule_timeout+0x175/0x320
>           [<ffffffff81226905>] ext4_run_li_request+0x85/0xe0
>           [<ffffffff812269fc>] ext4_lazyinit_thread+0x9c/0x1c0
>           [<ffffffff81226960>] ? ext4_run_li_request+0xe0/0xe0
>           [<ffffffff8108a39c>] kthread+0x8c/0xa0
>           [<ffffffff81665e34>] kernel_thread_helper+0x4/0x10
>           [<ffffffff8108a310>] ? flush_kthread_worker+0xa0/0xa0
>           [<ffffffff81665e30>] ? gs_change+0x13/0x13
  Ah, good point. Thanks for spotting this. I forgot about the lazyinit
thread. Attached patch fixes the problem (I've folded it into ext4
conversion patch in my series).

								Honza
> 
>  -Kamal
> 
> 
> On Mon, 2012-03-05 at 17:01 +0100, Jan Kara wrote:
> > We remove most of frozen checks since upper layer takes care
> > of blocking all writes. We only have to handle protection in
> > ext4_page_mkwrite() in a special way because we cannot use
> > generic block_page_mkwrite().
> > 
> > CC: linux-ext4@vger.kernel.org
> > CC: "Theodore Ts'o" <tytso@mit.edu>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/inode.c |    7 ++-----
> >  fs/ext4/super.c |   29 +++++------------------------
> >  2 files changed, 7 insertions(+), 29 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index feaa82f..c65baf9 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4593,11 +4593,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> >  	get_block_t *get_block;
> >  	int retries = 0;
> >  
> > -	/*
> > -	 * This check is racy but catches the common case. We rely on
> > -	 * __block_page_mkwrite() to do a reliable check.
> > -	 */
> > -	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
> > +	sb_start_pagefault(inode->i_sb);
> >  	/* Delalloc case is easy... */
> >  	if (test_opt(inode->i_sb, DELALLOC) &&
> >  	    !ext4_should_journal_data(inode) &&
> > @@ -4665,5 +4661,6 @@ retry_alloc:
> >  out_ret:
> >  	ret = block_page_mkwrite_return(ret);
> >  out:
> > +	sb_end_pagefault(inode->i_sb);
> >  	return ret;
> >  }
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 502c61f..0f1024a 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -289,33 +289,17 @@ static void ext4_put_nojournal(handle_t *handle)
> >   * journal_end calls result in the superblock being marked dirty, so
> >   * that sync() will call the filesystem's write_super callback if
> >   * appropriate.
> > - *
> > - * To avoid j_barrier hold in userspace when a user calls freeze(),
> > - * ext4 prevents a new handle from being started by s_frozen, which
> > - * is in an upper layer.
> >   */
> >  handle_t *ext4_journal_start_sb(struct super_block *sb, int nblocks)
> >  {
> >  	journal_t *journal;
> > -	handle_t  *handle;
> >  
> >  	trace_ext4_journal_start(sb, nblocks, _RET_IP_);
> >  	if (sb->s_flags & MS_RDONLY)
> >  		return ERR_PTR(-EROFS);
> >  
> > +	WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
> >  	journal = EXT4_SB(sb)->s_journal;
> > -	handle = ext4_journal_current_handle();
> > -
> > -	/*
> > -	 * If a handle has been started, it should be allowed to
> > -	 * finish, otherwise deadlock could happen between freeze
> > -	 * and others(e.g. truncate) due to the restart of the
> > -	 * journal handle if the filesystem is forzen and active
> > -	 * handles are not stopped.
> > -	 */
> > -	if (!handle)
> > -		vfs_check_frozen(sb, SB_FREEZE_TRANS);
> > -
> >  	if (!journal)
> >  		return ext4_get_nojournal();
> >  	/*
> > @@ -4280,10 +4264,8 @@ int ext4_force_commit(struct super_block *sb)
> >  		return 0;
> >  
> >  	journal = EXT4_SB(sb)->s_journal;
> > -	if (journal) {
> > -		vfs_check_frozen(sb, SB_FREEZE_TRANS);
> > +	if (journal)
> >  		ret = ext4_journal_force_commit(journal);
> > -	}
> >  
> >  	return ret;
> >  }
> > @@ -4315,9 +4297,8 @@ static int ext4_sync_fs(struct super_block *sb, int wait)
> >   * gives us a chance to flush the journal completely and mark the fs clean.
> >   *
> >   * Note that only this function cannot bring a filesystem to be in a clean
> > - * state independently, because ext4 prevents a new handle from being started
> > - * by @sb->s_frozen, which stays in an upper layer.  It thus needs help from
> > - * the upper layer.
> > + * state independently. It relies on upper layer to stop all data & metadata
> > + * modifications.
> >   */
> >  static int ext4_freeze(struct super_block *sb)
> >  {
> > @@ -4344,7 +4325,7 @@ static int ext4_freeze(struct super_block *sb)
> >  	EXT4_CLEAR_INCOMPAT_FEATURE(sb, EXT4_FEATURE_INCOMPAT_RECOVER);
> >  	error = ext4_commit_super(sb, 1);
> >  out:
> > -	/* we rely on s_frozen to stop further updates */
> > +	/* we rely on upper layer to stop further updates */
> >  	jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> >  	return error;
> >  }
> 


-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: ext4_lazyinit_freeze_fix.diff --]
[-- Type: text/x-patch, Size: 626 bytes --]

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0f1024a..039b1e0 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2756,6 +2756,7 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
 	sb = elr->lr_super;
 	ngroups = EXT4_SB(sb)->s_groups_count;
 
+	sb_start_write(sb);
 	for (group = elr->lr_next_group; group < ngroups; group++) {
 		gdp = ext4_get_group_desc(sb, group, NULL);
 		if (!gdp) {
@@ -2782,6 +2783,7 @@ static int ext4_run_li_request(struct ext4_li_request *elr)
 		elr->lr_next_sched = jiffies + elr->lr_timeout;
 		elr->lr_next_group = group + 1;
 	}
+	sb_end_write(sb);
 
 	return ret;
 }

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-05 16:01   ` Jan Kara
@ 2012-03-08 23:20     ` Dave Chinner
  -1 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-08 23:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> Generic code now blocks all writers from standard write paths. So we block all
> writers coming from ioctl and replace blocking of transactions on frozen
> filesystem with a debugging check. As a bonus, we get a protection of ioctl
> against racing remount read-only. We also convert xfs_file_aio_write() to a
> non-racy freeze protection.
....
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 329b06a..6468a2a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -577,7 +577,6 @@ xfs_trans_alloc(
>  	xfs_mount_t	*mp,
>  	uint		type)
>  {
> -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
>  }

So what is there to stop internal XFS threads from starting
transactions when the filesystem is frozen? Previously this
SB_FREEZE_TRANS would guarantee even internal fucntions would get
stopped, but now there's nothing?

I do beleive that ext4 has the same problem (the issue reported with
the lazy inode init background thread), and I can see that any other
filesystem that can make modifications via internal triggers will
see the same problem - freeze doesn't block them any more...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-08 23:20     ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-08 23:20 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, dchinner, linux-fsdevel

On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> Generic code now blocks all writers from standard write paths. So we block all
> writers coming from ioctl and replace blocking of transactions on frozen
> filesystem with a debugging check. As a bonus, we get a protection of ioctl
> against racing remount read-only. We also convert xfs_file_aio_write() to a
> non-racy freeze protection.
....
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 329b06a..6468a2a 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -577,7 +577,6 @@ xfs_trans_alloc(
>  	xfs_mount_t	*mp,
>  	uint		type)
>  {
> -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
>  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
>  }

So what is there to stop internal XFS threads from starting
transactions when the filesystem is frozen? Previously this
SB_FREEZE_TRANS would guarantee even internal fucntions would get
stopped, but now there's nothing?

I do beleive that ext4 has the same problem (the issue reported with
the lazy inode init background thread), and I can see that any other
filesystem that can make modifications via internal triggers will
see the same problem - freeze doesn't block them any more...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-08 23:20     ` Dave Chinner
@ 2012-03-09  8:23       ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-09  8:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> >  	xfs_mount_t	*mp,
> >  	uint		type)
> >  {
> > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> >  }
> 
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal fucntions would get
> stopped, but now there's nothing?
> 
> I do beleive that ext4 has the same problem (the issue reported with
> the lazy inode init background thread), and I can see that any other
> filesystem that can make modifications via internal triggers will
> see the same problem - freeze doesn't block them any more...
  Yes, I'll have to audit filesystem internal threads. I forgot about these
in the first round. Most of them shouldn't have anything to do because the
filesystem is clean but there are exceptions as ext4 lazyinit has shown.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-09  8:23       ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-09  8:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> >  	xfs_mount_t	*mp,
> >  	uint		type)
> >  {
> > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> >  }
> 
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal fucntions would get
> stopped, but now there's nothing?
> 
> I do beleive that ext4 has the same problem (the issue reported with
> the lazy inode init background thread), and I can see that any other
> filesystem that can make modifications via internal triggers will
> see the same problem - freeze doesn't block them any more...
  Yes, I'll have to audit filesystem internal threads. I forgot about these
in the first round. Most of them shouldn't have anything to do because the
filesystem is clean but there are exceptions as ext4 lazyinit has shown.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-08 23:20     ` Dave Chinner
@ 2012-03-09 14:22       ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-09 14:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> >  	xfs_mount_t	*mp,
> >  	uint		type)
> >  {
> > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> >  }
> 
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal fucntions would get
> stopped, but now there's nothing?
  I've checked XFS now and I didn't find any work that would be done on a
clean filesystem. I found:
_xfs_mru_cache_reap() - doesn't seem to do any IO
xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
  reads can happen on a frozen filesystem
xfs_reclaim_worker() - everything is clean so no IO should happen
xfs_flush_worker() - everything is clean so no IO should happen
xfs_sync_worker() - again the same if I understand the code right

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-09 14:22       ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-09 14:22 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > Generic code now blocks all writers from standard write paths. So we block all
> > writers coming from ioctl and replace blocking of transactions on frozen
> > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > non-racy freeze protection.
> ....
> > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > index 329b06a..6468a2a 100644
> > --- a/fs/xfs/xfs_trans.c
> > +++ b/fs/xfs/xfs_trans.c
> > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> >  	xfs_mount_t	*mp,
> >  	uint		type)
> >  {
> > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> >  }
> 
> So what is there to stop internal XFS threads from starting
> transactions when the filesystem is frozen? Previously this
> SB_FREEZE_TRANS would guarantee even internal fucntions would get
> stopped, but now there's nothing?
  I've checked XFS now and I didn't find any work that would be done on a
clean filesystem. I found:
_xfs_mru_cache_reap() - doesn't seem to do any IO
xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
  reads can happen on a frozen filesystem
xfs_reclaim_worker() - everything is clean so no IO should happen
xfs_flush_worker() - everything is clean so no IO should happen
xfs_sync_worker() - again the same if I understand the code right

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 00/19] Fix filesystem freezing deadlocks
  2012-03-05 16:00 ` Jan Kara
  (?)
@ 2012-03-11 20:22   ` Kamal Mostafa
  -1 siblings, 0 replies; 58+ messages in thread
From: Kamal Mostafa @ 2012-03-11 20:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Alex Elder, Anton Altaparmakov, Ben Myers, Chris Mason,
	cluster-devel, David S. Miller, fuse-devel, J. Bruce Fields,
	Joel Becker, KONISHI Ryusuke, linux-btrfs, linux-ext4, linux-nfs,
	linux-nilfs, linux-ntfs-dev, Mark Fasheh, Miklos Szeredi,
	ocfs2-devel, OGAWA Hirofumi, Steven Whitehouse, Th

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

On Mon, 2012-03-05 at 17:00 +0100, Jan Kara wrote:
> Hallelujah,
> 
>   after a couple of weeks and several rewrites, here comes the third iteration
> of my patches to improve filesystem freezing.  [...]

We've been testing this patch set at Canonical on the multipath failover
SAN configuration where we originally encountered the freeze deadlock.
We are happy to report that it does appear to fix the problem.  Thanks
Jan!

Please add the following endorsements for these patches (those actually
exercised by our test case):  01, 02, 03, 06, 07, 08, 09, 10, 14, 18, 19

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>

 -Kamal


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-11 20:22   ` Kamal Mostafa
  0 siblings, 0 replies; 58+ messages in thread
From: Kamal Mostafa @ 2012-03-11 20:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Alex Elder, Anton Altaparmakov, Ben Myers, Chris Mason,
	cluster-devel, David S. Miller, fuse-devel, J. Bruce Fields,
	Joel Becker, KONISHI Ryusuke, linux-btrfs, linux-ext4, linux-nfs,
	linux-nilfs, linux-ntfs-dev, Mark Fasheh, Miklos Szeredi,
	ocfs2-devel, OGAWA Hirofumi, Steven Whitehouse,
	Theodore Ts'o, xfs

[-- Attachment #1: Type: text/plain, Size: 869 bytes --]

On Mon, 2012-03-05 at 17:00 +0100, Jan Kara wrote:
> Hallelujah,
> 
>   after a couple of weeks and several rewrites, here comes the third iteration
> of my patches to improve filesystem freezing.  [...]

We've been testing this patch set at Canonical on the multipath failover
SAN configuration where we originally encountered the freeze deadlock.
We are happy to report that it does appear to fix the problem.  Thanks
Jan!

Please add the following endorsements for these patches (those actually
exercised by our test case):  01, 02, 03, 06, 07, 08, 09, 10, 14, 18, 19

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>

 -Kamal


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [Ocfs2-devel] [PATCH 00/19] Fix filesystem freezing deadlocks
@ 2012-03-11 20:22   ` Kamal Mostafa
  0 siblings, 0 replies; 58+ messages in thread
From: Kamal Mostafa @ 2012-03-11 20:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Alex Elder, Anton Altaparmakov, Ben Myers, Chris Mason,
	cluster-devel, David S. Miller, fuse-devel, J. Bruce Fields,
	Joel Becker, KONISHI Ryusuke, linux-btrfs, linux-ext4, linux-nfs,
	linux-nilfs, linux-ntfs-dev, Mark Fasheh, Miklos Szeredi,
	ocfs2-devel, OGAWA Hirofumi, Steven Whitehouse, Th

On Mon, 2012-03-05 at 17:00 +0100, Jan Kara wrote:
> Hallelujah,
> 
>   after a couple of weeks and several rewrites, here comes the third iteration
> of my patches to improve filesystem freezing.  [...]

We've been testing this patch set at Canonical on the multipath failover
SAN configuration where we originally encountered the freeze deadlock.
We are happy to report that it does appear to fix the problem.  Thanks
Jan!

Please add the following endorsements for these patches (those actually
exercised by our test case):  01, 02, 03, 06, 07, 08, 09, 10, 14, 18, 19

BugLink: https://bugs.launchpad.net/bugs/897421
Tested-by: Kamal Mostafa <kamal@canonical.com>
Tested-by: Peter M. Petrakis <peter.petrakis@canonical.com>
Tested-by: Dann Frazier <dann.frazier@canonical.com>
Tested-by: Massimo Morana <massimo.morana@canonical.com>

 -Kamal

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 836 bytes
Desc: This is a digitally signed message part
Url : http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20120311/9f92c086/attachment-0001.bin 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-09 14:22       ` Jan Kara
@ 2012-03-11 22:45         ` Dave Chinner
  -1 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-11 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > Generic code now blocks all writers from standard write paths. So we block all
> > > writers coming from ioctl and replace blocking of transactions on frozen
> > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > non-racy freeze protection.
> > ....
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 329b06a..6468a2a 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > >  	xfs_mount_t	*mp,
> > >  	uint		type)
> > >  {
> > > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > >  }
> > 
> > So what is there to stop internal XFS threads from starting
> > transactions when the filesystem is frozen? Previously this
> > SB_FREEZE_TRANS would guarantee even internal fucntions would get
> > stopped, but now there's nothing?
>   I've checked XFS now and I didn't find any work that would be done on a
> clean filesystem. I found:
> _xfs_mru_cache_reap() - doesn't seem to do any IO

Causes iput() on inodes, which if dropping the last reference to the
inode can cause them to enter reclaim and hence have transactions
run on them to either truncate blocks beyonds EOF or to do the
second phase of an unlink(). So it will definitely break the freeze
model.

> xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
>   reads can happen on a frozen filesystem
> xfs_reclaim_worker() - everything is clean so no IO should happen

It will do work as other things push inodes into reclaim. e.g.
filestreams inode expiry. And it will still run to reclaim clean
inodes...

> xfs_flush_worker() - everything is clean so no IO should happen
> xfs_sync_worker() - again the same if I understand the code right

xfs_sync_worker() will always trigger a log force, so if there is
anything that has dirtied the journal, it will trigger IO. We have
no protection against the journal being dirtied anymore, so no
guarantees can be given here.

Basically, your patchset creates a "shell" around the outside of the
filesystem that catches all the external modifications that can
occur through the VFS and ioctls. The "shell" is now the only layer
of defense because the patchset removes the layer of protection that
prevents internal modifications from occurring.

For example, the XFS patch added a bunch of protection to ioctls on
XFS that only modify file metadata and so never were protected
against data freezes - they were all implicitly protected by the inner
transaction subsystem freeze. All of the above cases were protected
by the inner layer of defense, which is now gone.

Not to mention the shrinkers. The inode cache shrinkers can
cause inodes to enter reclaim which can trigger EOF truncation just
like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
IO, and dquots will be dirtied by EOF truncation. Same with buffers,
and the buffer cache shrinker.

I'm sure other filesystems have just as complex or even more complex
internal paths that trigger dirtying and IO that the "freeze shell"
model does not prevent. I don't think auditing is good enough - the
shell model is, IMO, too easy to break and too hard to test for
breakage...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-11 22:45         ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-11 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, dchinner, linux-fsdevel

On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > Generic code now blocks all writers from standard write paths. So we block all
> > > writers coming from ioctl and replace blocking of transactions on frozen
> > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > non-racy freeze protection.
> > ....
> > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > index 329b06a..6468a2a 100644
> > > --- a/fs/xfs/xfs_trans.c
> > > +++ b/fs/xfs/xfs_trans.c
> > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > >  	xfs_mount_t	*mp,
> > >  	uint		type)
> > >  {
> > > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > >  }
> > 
> > So what is there to stop internal XFS threads from starting
> > transactions when the filesystem is frozen? Previously this
> > SB_FREEZE_TRANS would guarantee even internal fucntions would get
> > stopped, but now there's nothing?
>   I've checked XFS now and I didn't find any work that would be done on a
> clean filesystem. I found:
> _xfs_mru_cache_reap() - doesn't seem to do any IO

Causes iput() on inodes, which if dropping the last reference to the
inode can cause them to enter reclaim and hence have transactions
run on them to either truncate blocks beyonds EOF or to do the
second phase of an unlink(). So it will definitely break the freeze
model.

> xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
>   reads can happen on a frozen filesystem
> xfs_reclaim_worker() - everything is clean so no IO should happen

It will do work as other things push inodes into reclaim. e.g.
filestreams inode expiry. And it will still run to reclaim clean
inodes...

> xfs_flush_worker() - everything is clean so no IO should happen
> xfs_sync_worker() - again the same if I understand the code right

xfs_sync_worker() will always trigger a log force, so if there is
anything that has dirtied the journal, it will trigger IO. We have
no protection against the journal being dirtied anymore, so no
guarantees can be given here.

Basically, your patchset creates a "shell" around the outside of the
filesystem that catches all the external modifications that can
occur through the VFS and ioctls. The "shell" is now the only layer
of defense because the patchset removes the layer of protection that
prevents internal modifications from occurring.

For example, the XFS patch added a bunch of protection to ioctls on
XFS that only modify file metadata and so never were protected
against data freezes - they were all implicitly protected by the inner
transaction subsystem freeze. All of the above cases were protected
by the inner layer of defense, which is now gone.

Not to mention the shrinkers. The inode cache shrinkers can
cause inodes to enter reclaim which can trigger EOF truncation just
like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
IO, and dquots will be dirtied by EOF truncation. Same with buffers,
and the buffer cache shrinker.

I'm sure other filesystems have just as complex or even more complex
internal paths that trigger dirtying and IO that the "freeze shell"
model does not prevent. I don't think auditing is good enough - the
shell model is, IMO, too easy to break and too hard to test for
breakage...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-11 22:45         ` Dave Chinner
@ 2012-03-12 17:55           ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-12 17:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > > Generic code now blocks all writers from standard write paths. So we block all
> > > > writers coming from ioctl and replace blocking of transactions on frozen
> > > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > > non-racy freeze protection.
> > > ....
> > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > index 329b06a..6468a2a 100644
> > > > --- a/fs/xfs/xfs_trans.c
> > > > +++ b/fs/xfs/xfs_trans.c
> > > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > > >  	xfs_mount_t	*mp,
> > > >  	uint		type)
> > > >  {
> > > > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > > >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > > >  }
> > > 
> > > So what is there to stop internal XFS threads from starting
> > > transactions when the filesystem is frozen? Previously this
> > > SB_FREEZE_TRANS would guarantee even internal fucntions would get
> > > stopped, but now there's nothing?
> >   I've checked XFS now and I didn't find any work that would be done on a
> > clean filesystem. I found:
> > _xfs_mru_cache_reap() - doesn't seem to do any IO
> 
> Causes iput() on inodes, which if dropping the last reference to the
> inode can cause them to enter reclaim and hence have transactions
> run on them to either truncate blocks beyonds EOF or to do the
> second phase of an unlink(). So it will definitely break the freeze
> model.
  OK, I've missed this. I'll fix it.

> > xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
> >   reads can happen on a frozen filesystem
> > xfs_reclaim_worker() - everything is clean so no IO should happen
> 
> It will do work as other things push inodes into reclaim. e.g.
> filestreams inode expiry. And it will still run to reclaim clean
> inodes...
  Sure. But reclaim will start a transaction only if inode is dirty. And
inode must not be dirty on a frozen filesystem.
 
> > xfs_flush_worker() - everything is clean so no IO should happen
> > xfs_sync_worker() - again the same if I understand the code right
> 
> xfs_sync_worker() will always trigger a log force, so if there is
> anything that has dirtied the journal, it will trigger IO. We have
> no protection against the journal being dirtied anymore, so no
> guarantees can be given here.
  Yes, it would be a bug if something dirtied a journal while the
filesystem is frozen. To catch issues like this, I've added WARN_ON in
xfs_trans_alloc() to actually warn when transaction would be started on a
frozen filesystem. My testing never triggered it with the latest patch set.

> Basically, your patchset creates a "shell" around the outside of the
> filesystem that catches all the external modifications that can
> occur through the VFS and ioctls. The "shell" is now the only layer
> of defense because the patchset removes the layer of protection that
> prevents internal modifications from occurring.
  Yes. I'd just note that the "shell" is there already to provide reliable
remounting read only. I just had to change some properties of the shell to
be usable for freezing as well. But the shell has to be maintained
regardless of how we decide to do freezing.

Also believe me that I've initially tried to fix the freezing without the
external shell. But it just isn't enough to prevent dirty inodes from being
created (e.g. from directory operations) while sync during freezing is
running.

Sure I could keep the freezing mechanism on transaction start. But it
seemed like a cleaner approach to me to protect all the places properly
with the generic mechanism than having two mechanisms and unclear
(especially locking) interactions between them.

But in the end, if you guys really feel strongly about it and decide that
XFS wants to keep it's mechanism of freezing on transaction start, then I
won't stop you. Although it would mean XFS would have to have three
counters to protect freezing while other filesystems will need only two.
Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
but I don't know XFS internals enough to really argue...

> For example, the XFS patch added a bunch of protection to ioctls on
> XFS that only modify file metadata and so never were protected
> against data freezes - they were all implicitly protected by the inner
> transaction subsystem freeze. All of the above cases were protected
> by the inner layer of defense, which is now gone.
  Yes, but all these ioctls are currently buggy with respect to remounting
read only (not the whole filesystem, just one mountpoint of many where
filesystem may be mounted). So I'd argue that these particular changes are
actually bug fixes.

> Not to mention the shrinkers. The inode cache shrinkers can
> cause inodes to enter reclaim which can trigger EOF truncation just
> like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
> IO, and dquots will be dirtied by EOF truncation. Same with buffers,
> and the buffer cache shrinker.
  OK, now I see that even iput() of a clean inode can cause truncation to
happen for XFS. Thanks for the pointer. I'll see what we could do about
this.

> I'm sure other filesystems have just as complex or even more complex
> internal paths that trigger dirtying and IO that the "freeze shell"
> model does not prevent. I don't think auditing is good enough - the
> shell model is, IMO, too easy to break and too hard to test for
> breakage...
  Well, actually I don't know filesystem which would have as complex
interactions as XFS has. But maybe there are some. In either case I don't
think maintaining the "shell" is that hard. You have to maintain the one
facing to userspace anyway due to remounting. For the internal filesystem
threads, you have to be aware when you can cause write and be protected
against freezing in that case. I don't think it's that complex, although I
agree that the current test coverage will be small (freezing isn't used
very often) so bugs can go unnoticed for a long time.

Maybe we could add a test in xfs_trans_alloc() checking that fs is
protected against freezing? That would increase the test coverage a lot.
Although the test should be probably enabled only for XFS_DEBUG because it
will be an expensive one.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-12 17:55           ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-12 17:55 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > On Fri 09-03-12 10:20:41, Dave Chinner wrote:
> > > On Mon, Mar 05, 2012 at 05:01:09PM +0100, Jan Kara wrote:
> > > > Generic code now blocks all writers from standard write paths. So we block all
> > > > writers coming from ioctl and replace blocking of transactions on frozen
> > > > filesystem with a debugging check. As a bonus, we get a protection of ioctl
> > > > against racing remount read-only. We also convert xfs_file_aio_write() to a
> > > > non-racy freeze protection.
> > > ....
> > > > diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> > > > index 329b06a..6468a2a 100644
> > > > --- a/fs/xfs/xfs_trans.c
> > > > +++ b/fs/xfs/xfs_trans.c
> > > > @@ -577,7 +577,6 @@ xfs_trans_alloc(
> > > >  	xfs_mount_t	*mp,
> > > >  	uint		type)
> > > >  {
> > > > -	xfs_wait_for_freeze(mp, SB_FREEZE_TRANS);
> > > >  	return _xfs_trans_alloc(mp, type, KM_SLEEP);
> > > >  }
> > > 
> > > So what is there to stop internal XFS threads from starting
> > > transactions when the filesystem is frozen? Previously this
> > > SB_FREEZE_TRANS would guarantee even internal fucntions would get
> > > stopped, but now there's nothing?
> >   I've checked XFS now and I didn't find any work that would be done on a
> > clean filesystem. I found:
> > _xfs_mru_cache_reap() - doesn't seem to do any IO
> 
> Causes iput() on inodes, which if dropping the last reference to the
> inode can cause them to enter reclaim and hence have transactions
> run on them to either truncate blocks beyonds EOF or to do the
> second phase of an unlink(). So it will definitely break the freeze
> model.
  OK, I've missed this. I'll fix it.

> > xfs_end_io(), xfs_buf_iodone_work() - shouldn't be a problem either since only
> >   reads can happen on a frozen filesystem
> > xfs_reclaim_worker() - everything is clean so no IO should happen
> 
> It will do work as other things push inodes into reclaim. e.g.
> filestreams inode expiry. And it will still run to reclaim clean
> inodes...
  Sure. But reclaim will start a transaction only if inode is dirty. And
inode must not be dirty on a frozen filesystem.
 
> > xfs_flush_worker() - everything is clean so no IO should happen
> > xfs_sync_worker() - again the same if I understand the code right
> 
> xfs_sync_worker() will always trigger a log force, so if there is
> anything that has dirtied the journal, it will trigger IO. We have
> no protection against the journal being dirtied anymore, so no
> guarantees can be given here.
  Yes, it would be a bug if something dirtied a journal while the
filesystem is frozen. To catch issues like this, I've added WARN_ON in
xfs_trans_alloc() to actually warn when transaction would be started on a
frozen filesystem. My testing never triggered it with the latest patch set.

> Basically, your patchset creates a "shell" around the outside of the
> filesystem that catches all the external modifications that can
> occur through the VFS and ioctls. The "shell" is now the only layer
> of defense because the patchset removes the layer of protection that
> prevents internal modifications from occurring.
  Yes. I'd just note that the "shell" is there already to provide reliable
remounting read only. I just had to change some properties of the shell to
be usable for freezing as well. But the shell has to be maintained
regardless of how we decide to do freezing.

Also believe me that I've initially tried to fix the freezing without the
external shell. But it just isn't enough to prevent dirty inodes from being
created (e.g. from directory operations) while sync during freezing is
running.

Sure I could keep the freezing mechanism on transaction start. But it
seemed like a cleaner approach to me to protect all the places properly
with the generic mechanism than having two mechanisms and unclear
(especially locking) interactions between them.

But in the end, if you guys really feel strongly about it and decide that
XFS wants to keep it's mechanism of freezing on transaction start, then I
won't stop you. Although it would mean XFS would have to have three
counters to protect freezing while other filesystems will need only two.
Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
but I don't know XFS internals enough to really argue...

> For example, the XFS patch added a bunch of protection to ioctls on
> XFS that only modify file metadata and so never were protected
> against data freezes - they were all implicitly protected by the inner
> transaction subsystem freeze. All of the above cases were protected
> by the inner layer of defense, which is now gone.
  Yes, but all these ioctls are currently buggy with respect to remounting
read only (not the whole filesystem, just one mountpoint of many where
filesystem may be mounted). So I'd argue that these particular changes are
actually bug fixes.

> Not to mention the shrinkers. The inode cache shrinkers can
> cause inodes to enter reclaim which can trigger EOF truncation just
> like _xfs_mru_cache_reap(). The dquot cache shrinker can also issue
> IO, and dquots will be dirtied by EOF truncation. Same with buffers,
> and the buffer cache shrinker.
  OK, now I see that even iput() of a clean inode can cause truncation to
happen for XFS. Thanks for the pointer. I'll see what we could do about
this.

> I'm sure other filesystems have just as complex or even more complex
> internal paths that trigger dirtying and IO that the "freeze shell"
> model does not prevent. I don't think auditing is good enough - the
> shell model is, IMO, too easy to break and too hard to test for
> breakage...
  Well, actually I don't know filesystem which would have as complex
interactions as XFS has. But maybe there are some. In either case I don't
think maintaining the "shell" is that hard. You have to maintain the one
facing to userspace anyway due to remounting. For the internal filesystem
threads, you have to be aware when you can cause write and be protected
against freezing in that case. I don't think it's that complex, although I
agree that the current test coverage will be small (freezing isn't used
very often) so bugs can go unnoticed for a long time.

Maybe we could add a test in xfs_trans_alloc() checking that fs is
protected against freezing? That would increase the test coverage a lot.
Although the test should be probably enabled only for XFS_DEBUG because it
will be an expensive one.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-12 17:55           ` Jan Kara
@ 2012-03-12 23:48             ` Dave Chinner
  -1 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-12 23:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > It will do work as other things push inodes into reclaim. e.g.
> > filestreams inode expiry. And it will still run to reclaim clean
> > inodes...
>   Sure. But reclaim will start a transaction only if inode is dirty. And
> inode must not be dirty on a frozen filesystem.

That's what I've been trying to tell you - even clean inodes can be
dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
enough guarantee.  And not just through the shrinkers - just closing
a file descriptor can trigger truncation as well via .release. See
xfs_release:

	/* If this is a read-only mount, don't do this (would generate I/O) */
	if (mp->m_flags & XFS_MOUNT_RDONLY)
		goto out;

> > > xfs_flush_worker() - everything is clean so no IO should happen
> > > xfs_sync_worker() - again the same if I understand the code right
> > 
> > xfs_sync_worker() will always trigger a log force, so if there is
> > anything that has dirtied the journal, it will trigger IO. We have
> > no protection against the journal being dirtied anymore, so no
> > guarantees can be given here.
>   Yes, it would be a bug if something dirtied a journal while the
> filesystem is frozen. To catch issues like this, I've added WARN_ON in
> xfs_trans_alloc() to actually warn when transaction would be started on a
> frozen filesystem. My testing never triggered it with the latest patch set.

Just hold a file descriptor open for a large write over the freeze
(don't get it caught in the freeze), and then close it after the
filesystem is frozen. Perhaps even drop the slab caches.

> > Basically, your patchset creates a "shell" around the outside of the
> > filesystem that catches all the external modifications that can
> > occur through the VFS and ioctls. The "shell" is now the only layer
> > of defense because the patchset removes the layer of protection that
> > prevents internal modifications from occurring.
>   Yes. I'd just note that the "shell" is there already to provide reliable
> remounting read only.

But it doesn't provide reliable read-only behaviour - internal
filesystem triggers still need to check and disarm to make read-only
mounts reliable. Indeed, XFS has many internal checks for read only
mounts, and the places we are talking about here are similar to the
places where you've removed freeze protection from by gutting the
transaction level freezing.

> I just had to change some properties of the shell to
> be usable for freezing as well. But the shell has to be maintained
> regardless of how we decide to do freezing.
> 
> Also believe me that I've initially tried to fix the freezing without the
> external shell. But it just isn't enough to prevent dirty inodes from being
> created (e.g. from directory operations) while sync during freezing is
> running.

Sure - the old mechanism was "freeze data" which only protected data
modification paths, followed by "freeze trans" which protected
metadata modification paths (like directory operations). You started
by removing the metadata modification path protection - it is any
wonder that those operations continued to dirty inodes until you
expanded the "freeze data" shell to mean "freeze data and metadata"?

> Sure I could keep the freezing mechanism on transaction start. But it
> seemed like a cleaner approach to me to protect all the places properly
> with the generic mechanism than having two mechanisms and unclear
> (especially locking) interactions between them.

Once again, that's a view that does not take into account that
modifcations don't all come through the VFS.

It's like the hard shelli/soft center security model - it protects
well from external attackers, but once they get inside, there's no
protection. Indeed, there is zero protection from inside jobs, and
thats where multiple layers of defense are needed.

Your freeze changes provide us with a hard outer shell, but it's got
a really, really soft center. We can't use the outer shell defenses
deep inside the shell because of the locking orders that have been
defined (freeze protection must be outermost), so we need another
layer....

> But in the end, if you guys really feel strongly about it and decide that
> XFS wants to keep it's mechanism of freezing on transaction start, then I
> won't stop you. Although it would mean XFS would have to have three
> counters to protect freezing while other filesystems will need only two.

There is two counters only to avoid lockdep problems because the
different entry points to the outer shell have different locking
orders. I fail to see why adding a third counter to provide an
innermost freeze lock that retains the current level of protection
is a big deal because all that it really needs to work is a
different lock order.

Also, ext4 could retain it's current outer/inner protection, so I
don't see that this behaviour is XFS specific.....

> Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
> but I don't know XFS internals enough to really argue...

... especially given the reported ext4 regression. I have much less
confidence than you that the outer shell is sufficient for ext4 or
any other complex filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-12 23:48             ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-12 23:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, dchinner, linux-fsdevel

On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > It will do work as other things push inodes into reclaim. e.g.
> > filestreams inode expiry. And it will still run to reclaim clean
> > inodes...
>   Sure. But reclaim will start a transaction only if inode is dirty. And
> inode must not be dirty on a frozen filesystem.

That's what I've been trying to tell you - even clean inodes can be
dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
enough guarantee.  And not just through the shrinkers - just closing
a file descriptor can trigger truncation as well via .release. See
xfs_release:

	/* If this is a read-only mount, don't do this (would generate I/O) */
	if (mp->m_flags & XFS_MOUNT_RDONLY)
		goto out;

> > > xfs_flush_worker() - everything is clean so no IO should happen
> > > xfs_sync_worker() - again the same if I understand the code right
> > 
> > xfs_sync_worker() will always trigger a log force, so if there is
> > anything that has dirtied the journal, it will trigger IO. We have
> > no protection against the journal being dirtied anymore, so no
> > guarantees can be given here.
>   Yes, it would be a bug if something dirtied a journal while the
> filesystem is frozen. To catch issues like this, I've added WARN_ON in
> xfs_trans_alloc() to actually warn when transaction would be started on a
> frozen filesystem. My testing never triggered it with the latest patch set.

Just hold a file descriptor open for a large write over the freeze
(don't get it caught in the freeze), and then close it after the
filesystem is frozen. Perhaps even drop the slab caches.

> > Basically, your patchset creates a "shell" around the outside of the
> > filesystem that catches all the external modifications that can
> > occur through the VFS and ioctls. The "shell" is now the only layer
> > of defense because the patchset removes the layer of protection that
> > prevents internal modifications from occurring.
>   Yes. I'd just note that the "shell" is there already to provide reliable
> remounting read only.

But it doesn't provide reliable read-only behaviour - internal
filesystem triggers still need to check and disarm to make read-only
mounts reliable. Indeed, XFS has many internal checks for read only
mounts, and the places we are talking about here are similar to the
places where you've removed freeze protection from by gutting the
transaction level freezing.

> I just had to change some properties of the shell to
> be usable for freezing as well. But the shell has to be maintained
> regardless of how we decide to do freezing.
> 
> Also believe me that I've initially tried to fix the freezing without the
> external shell. But it just isn't enough to prevent dirty inodes from being
> created (e.g. from directory operations) while sync during freezing is
> running.

Sure - the old mechanism was "freeze data" which only protected data
modification paths, followed by "freeze trans" which protected
metadata modification paths (like directory operations). You started
by removing the metadata modification path protection - it is any
wonder that those operations continued to dirty inodes until you
expanded the "freeze data" shell to mean "freeze data and metadata"?

> Sure I could keep the freezing mechanism on transaction start. But it
> seemed like a cleaner approach to me to protect all the places properly
> with the generic mechanism than having two mechanisms and unclear
> (especially locking) interactions between them.

Once again, that's a view that does not take into account that
modifcations don't all come through the VFS.

It's like the hard shelli/soft center security model - it protects
well from external attackers, but once they get inside, there's no
protection. Indeed, there is zero protection from inside jobs, and
thats where multiple layers of defense are needed.

Your freeze changes provide us with a hard outer shell, but it's got
a really, really soft center. We can't use the outer shell defenses
deep inside the shell because of the locking orders that have been
defined (freeze protection must be outermost), so we need another
layer....

> But in the end, if you guys really feel strongly about it and decide that
> XFS wants to keep it's mechanism of freezing on transaction start, then I
> won't stop you. Although it would mean XFS would have to have three
> counters to protect freezing while other filesystems will need only two.

There is two counters only to avoid lockdep problems because the
different entry points to the outer shell have different locking
orders. I fail to see why adding a third counter to provide an
innermost freeze lock that retains the current level of protection
is a big deal because all that it really needs to work is a
different lock order.

Also, ext4 could retain it's current outer/inner protection, so I
don't see that this behaviour is XFS specific.....

> Anyway I'm confident enough to remove that mechanism from ext3/4 and ocfs2
> but I don't know XFS internals enough to really argue...

... especially given the reported ext4 regression. I have much less
confidence than you that the outer shell is sufficient for ext4 or
any other complex filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-12 23:48             ` Dave Chinner
@ 2012-03-13 21:30               ` Jan Kara
  -1 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-13 21:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > > It will do work as other things push inodes into reclaim. e.g.
> > > filestreams inode expiry. And it will still run to reclaim clean
> > > inodes...
> >   Sure. But reclaim will start a transaction only if inode is dirty. And
> > inode must not be dirty on a frozen filesystem.
> 
> That's what I've been trying to tell you - even clean inodes can be
> dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
> enough guarantee.  And not just through the shrinkers - just closing
> a file descriptor can trigger truncation as well via .release. See
> xfs_release:
> 
> 	/* If this is a read-only mount, don't do this (would generate I/O) */
> 	if (mp->m_flags & XFS_MOUNT_RDONLY)
> 		goto out;
  Yes, I've noticed this while writing the email but forgot to update the
beginning. Sorry for confusion.

> > > Basically, your patchset creates a "shell" around the outside of the
> > > filesystem that catches all the external modifications that can
> > > occur through the VFS and ioctls. The "shell" is now the only layer
> > > of defense because the patchset removes the layer of protection that
> > > prevents internal modifications from occurring.
> >   Yes. I'd just note that the "shell" is there already to provide reliable
> > remounting read only.
> 
> But it doesn't provide reliable read-only behaviour - internal
> filesystem triggers still need to check and disarm to make read-only
> mounts reliable. Indeed, XFS has many internal checks for read only
> mounts, and the places we are talking about here are similar to the
> places where you've removed freeze protection from by gutting the
> transaction level freezing.
  And indeed your read-only checks are racy the same way freezing was...
But that's a different story. You have actually a good point that freezing
protection and read-only mount protection should be at the same places.

> > I just had to change some properties of the shell to
> > be usable for freezing as well. But the shell has to be maintained
> > regardless of how we decide to do freezing.
> > 
> > Also believe me that I've initially tried to fix the freezing without the
> > external shell. But it just isn't enough to prevent dirty inodes from being
> > created (e.g. from directory operations) while sync during freezing is
> > running.
> 
> Sure - the old mechanism was "freeze data" which only protected data
> modification paths, followed by "freeze trans" which protected
> metadata modification paths (like directory operations). You started
> by removing the metadata modification path protection - it is any
> wonder that those operations continued to dirty inodes until you
> expanded the "freeze data" shell to mean "freeze data and metadata"?
  That's just not true. Remember older versions of this patch set. Old
model is doing:
   freeze data
   sync
   freeze metadata (via freezing transations)
and it is broken mainly because while sync is running, new metadata are
dirtied and thus you end up with dirty metadata on a frozen filesystem.
Forcing the journal commit in ->freeze_fs() pushes changes to disk but
inodes still remain dirty and actually, it can be too late for flusher
thread anyway because it can be already hanging trying to start a
transaction.

Then there are also other less severe problems like that mark_inode_dirty()
could race with freezing in such a way that it was added to dirty list
after filesystem was fully frozen. 

> > Sure I could keep the freezing mechanism on transaction start. But it
> > seemed like a cleaner approach to me to protect all the places properly
> > with the generic mechanism than having two mechanisms and unclear
> > (especially locking) interactions between them.
> 
> Once again, that's a view that does not take into account that
> modifcations don't all come through the VFS.
  "all the places" in my sentence meant "including the internal sources of
modification"...

> It's like the hard shelli/soft center security model - it protects
> well from external attackers, but once they get inside, there's no
> protection. Indeed, there is zero protection from inside jobs, and
> thats where multiple layers of defense are needed.
> 
> Your freeze changes provide us with a hard outer shell, but it's got
> a really, really soft center. We can't use the outer shell defenses
> deep inside the shell because of the locking orders that have been
> defined (freeze protection must be outermost), so we need another
> layer....
  I think there are two different issues. One is your complaint that
extending the shell to cover also internal sources of fs modification will
be too fragile if not done implicitely on transaction start.

Another is a question whether we could reasonably fend off internal
modification using the current two counters. The MRU thread or ->release
can be easily handled with them (->release can be handled by the counter
ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
outer counter won't work. The inner one with lock ranking just below
mmap_sem might work - depends on whether there are any XFS locks under
mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
issues might be nasty enough to warrant another counter.

This is connected with another somewhat related issue: Do we want shrinkers
to block on frozen filesystem? That could result in unrelated processes
blocking on frozen filesystem. I'd find it nicer if we just skipped
the writes if the filesystem is frozen as it seems to me
xfs_free_eofblocks() is somewhat optional. Is it true?

> > But in the end, if you guys really feel strongly about it and decide that
> > XFS wants to keep it's mechanism of freezing on transaction start, then I
> > won't stop you. Although it would mean XFS would have to have three
> > counters to protect freezing while other filesystems will need only two.
> 
> There is two counters only to avoid lockdep problems because the
> different entry points to the outer shell have different locking
> orders. I fail to see why adding a third counter to provide an
> innermost freeze lock that retains the current level of protection
> is a big deal because all that it really needs to work is a
> different lock order.
  The two counters are needed to avoid deadlocks, not just because of
lockdep... The third counter can be added but per-cpu counters are not
exactly cheap (in terms of memory) so I don't want to add it without a good
reason. Lock ordering constraints in shrinkers might be this good reason.

Also fitting the XFS transaction system to freezing was a bit ugly (because
you need to log a dummy transaction on a frozen filesystem and you have
things like xfs_trans_dup() which requires us to bump the counter even
though the filesystem is already in freezing process) so I'd be happier if
we could avoid it...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-13 21:30               ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-13 21:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: sandeen, Alex Elder, Jan Kara, Kamal Mostafa, LKML, xfs,
	Christoph Hellwig, Ben Myers, Al Viro, dchinner, linux-fsdevel

On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > On Fri, Mar 09, 2012 at 03:22:53PM +0100, Jan Kara wrote:
> > > It will do work as other things push inodes into reclaim. e.g.
> > > filestreams inode expiry. And it will still run to reclaim clean
> > > inodes...
> >   Sure. But reclaim will start a transaction only if inode is dirty. And
> > inode must not be dirty on a frozen filesystem.
> 
> That's what I've been trying to tell you - even clean inodes can be
> dirtied in reclaim on XFS, so ensuring the FS is clean is not a good
> enough guarantee.  And not just through the shrinkers - just closing
> a file descriptor can trigger truncation as well via .release. See
> xfs_release:
> 
> 	/* If this is a read-only mount, don't do this (would generate I/O) */
> 	if (mp->m_flags & XFS_MOUNT_RDONLY)
> 		goto out;
  Yes, I've noticed this while writing the email but forgot to update the
beginning. Sorry for confusion.

> > > Basically, your patchset creates a "shell" around the outside of the
> > > filesystem that catches all the external modifications that can
> > > occur through the VFS and ioctls. The "shell" is now the only layer
> > > of defense because the patchset removes the layer of protection that
> > > prevents internal modifications from occurring.
> >   Yes. I'd just note that the "shell" is there already to provide reliable
> > remounting read only.
> 
> But it doesn't provide reliable read-only behaviour - internal
> filesystem triggers still need to check and disarm to make read-only
> mounts reliable. Indeed, XFS has many internal checks for read only
> mounts, and the places we are talking about here are similar to the
> places where you've removed freeze protection from by gutting the
> transaction level freezing.
  And indeed your read-only checks are racy the same way freezing was...
But that's a different story. You have actually a good point that freezing
protection and read-only mount protection should be at the same places.

> > I just had to change some properties of the shell to
> > be usable for freezing as well. But the shell has to be maintained
> > regardless of how we decide to do freezing.
> > 
> > Also believe me that I've initially tried to fix the freezing without the
> > external shell. But it just isn't enough to prevent dirty inodes from being
> > created (e.g. from directory operations) while sync during freezing is
> > running.
> 
> Sure - the old mechanism was "freeze data" which only protected data
> modification paths, followed by "freeze trans" which protected
> metadata modification paths (like directory operations). You started
> by removing the metadata modification path protection - it is any
> wonder that those operations continued to dirty inodes until you
> expanded the "freeze data" shell to mean "freeze data and metadata"?
  That's just not true. Remember older versions of this patch set. Old
model is doing:
   freeze data
   sync
   freeze metadata (via freezing transations)
and it is broken mainly because while sync is running, new metadata are
dirtied and thus you end up with dirty metadata on a frozen filesystem.
Forcing the journal commit in ->freeze_fs() pushes changes to disk but
inodes still remain dirty and actually, it can be too late for flusher
thread anyway because it can be already hanging trying to start a
transaction.

Then there are also other less severe problems like that mark_inode_dirty()
could race with freezing in such a way that it was added to dirty list
after filesystem was fully frozen. 

> > Sure I could keep the freezing mechanism on transaction start. But it
> > seemed like a cleaner approach to me to protect all the places properly
> > with the generic mechanism than having two mechanisms and unclear
> > (especially locking) interactions between them.
> 
> Once again, that's a view that does not take into account that
> modifcations don't all come through the VFS.
  "all the places" in my sentence meant "including the internal sources of
modification"...

> It's like the hard shelli/soft center security model - it protects
> well from external attackers, but once they get inside, there's no
> protection. Indeed, there is zero protection from inside jobs, and
> thats where multiple layers of defense are needed.
> 
> Your freeze changes provide us with a hard outer shell, but it's got
> a really, really soft center. We can't use the outer shell defenses
> deep inside the shell because of the locking orders that have been
> defined (freeze protection must be outermost), so we need another
> layer....
  I think there are two different issues. One is your complaint that
extending the shell to cover also internal sources of fs modification will
be too fragile if not done implicitely on transaction start.

Another is a question whether we could reasonably fend off internal
modification using the current two counters. The MRU thread or ->release
can be easily handled with them (->release can be handled by the counter
ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
outer counter won't work. The inner one with lock ranking just below
mmap_sem might work - depends on whether there are any XFS locks under
mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
issues might be nasty enough to warrant another counter.

This is connected with another somewhat related issue: Do we want shrinkers
to block on frozen filesystem? That could result in unrelated processes
blocking on frozen filesystem. I'd find it nicer if we just skipped
the writes if the filesystem is frozen as it seems to me
xfs_free_eofblocks() is somewhat optional. Is it true?

> > But in the end, if you guys really feel strongly about it and decide that
> > XFS wants to keep it's mechanism of freezing on transaction start, then I
> > won't stop you. Although it would mean XFS would have to have three
> > counters to protect freezing while other filesystems will need only two.
> 
> There is two counters only to avoid lockdep problems because the
> different entry points to the outer shell have different locking
> orders. I fail to see why adding a third counter to provide an
> innermost freeze lock that retains the current level of protection
> is a big deal because all that it really needs to work is a
> different lock order.
  The two counters are needed to avoid deadlocks, not just because of
lockdep... The third counter can be added but per-cpu counters are not
exactly cheap (in terms of memory) so I don't want to add it without a good
reason. Lock ordering constraints in shrinkers might be this good reason.

Also fitting the XFS transaction system to freezing was a bit ugly (because
you need to log a dummy transaction on a frozen filesystem and you have
things like xfs_trans_dup() which requires us to bump the counter even
though the filesystem is already in freezing process) so I'd be happier if
we could avoid it...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
  2012-03-13 21:30               ` Jan Kara
@ 2012-03-14  3:00                 ` Dave Chinner
  -1 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-14  3:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	sandeen, Kamal Mostafa, Ben Myers, Alex Elder, xfs

On Tue, Mar 13, 2012 at 10:30:20PM +0100, Jan Kara wrote:
> On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> > On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > I just had to change some properties of the shell to
> > > be usable for freezing as well. But the shell has to be maintained
> > > regardless of how we decide to do freezing.
> > > 
> > > Also believe me that I've initially tried to fix the freezing without the
> > > external shell. But it just isn't enough to prevent dirty inodes from being
> > > created (e.g. from directory operations) while sync during freezing is
> > > running.
> > 
> > Sure - the old mechanism was "freeze data" which only protected data
> > modification paths, followed by "freeze trans" which protected
> > metadata modification paths (like directory operations). You started
> > by removing the metadata modification path protection - it is any
> > wonder that those operations continued to dirty inodes until you
> > expanded the "freeze data" shell to mean "freeze data and metadata"?
>   That's just not true. Remember older versions of this patch set. Old
> model is doing:
>    freeze data
>    sync
>    freeze metadata (via freezing transations)
> and it is broken mainly because while sync is running, new metadata are
> dirtied and thus you end up with dirty metadata on a frozen filesystem.

But that exactly the issue that the filesystem .freeze_fs
method is supposed to handle.  It's supposed to finish the
freeze process by finishing all outstanding metadata operations and
cleaning all remaining dirty metadata. It still has to guarantee
this regardless of the model used prior to this.

> Forcing the journal commit in ->freeze_fs() pushes changes to disk but
> inodes still remain dirty and actually, it can be too late for flusher
> thread anyway because it can be already hanging trying to start a
> transaction.

If the changes are already in the journal, then there is no new
transaction work needed to clean the VFS state.  Indeed, .freezefs
needs to write back all dirty metadata objects that it is tracking
internally in the journal so that a snapshot of the frozen image is
consistent and can be mounted read-only without running recovery...

> > It's like the hard shelli/soft center security model - it protects
> > well from external attackers, but once they get inside, there's no
> > protection. Indeed, there is zero protection from inside jobs, and
> > thats where multiple layers of defense are needed.
> > 
> > Your freeze changes provide us with a hard outer shell, but it's got
> > a really, really soft center. We can't use the outer shell defenses
> > deep inside the shell because of the locking orders that have been
> > defined (freeze protection must be outermost), so we need another
> > layer....
>   I think there are two different issues. One is your complaint that
> extending the shell to cover also internal sources of fs modification will
> be too fragile if not done implicitely on transaction start.

Most definitely. It means that we have to have freeze behaviour in
mind whether doing any async/background work. That's usually the
furthest think from people's minds when implementing stuff like
background inode initialisation...

> Another is a question whether we could reasonably fend off internal
> modification using the current two counters. The MRU thread or ->release
> can be easily handled with them (->release can be handled by the counter
> ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
> outer counter won't work. The inner one with lock ranking just below
> mmap_sem might work - depends on whether there are any XFS locks under
> mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
> issues might be nasty enough to warrant another counter.

Might be nasty? There's no question about it in my mind that it
*would* be nasty....

> This is connected with another somewhat related issue: Do we want shrinkers
> to block on frozen filesystem? That could result in unrelated processes
> blocking on frozen filesystem. I'd find it nicer if we just skipped
> the writes if the filesystem is frozen as it seems to me
> xfs_free_eofblocks() is somewhat optional. Is it true?

Not really. It needs to be done on rw filesytems[*].

Besides, this isn't really an XFS specific problem - the shrinkers
need to operate on frozen filesystems. e.g. run a directory
traversal of a frozen filesystem, and if you stop the shrinkers from
running then the dcache/icache growth will run the system out of
memory.

Sure, blocking a shrinker can be considered antisocial, but if you
can't free memory because it will dirty inodes on a frozen
filesystem then it is better to block the shrinker until the
filesytem is unfrozen than it is to allow the shrinker to fail and
eventually OOM the machine. Freezing filesysetms is supposed to be a
short term, workload-invisible operation. Kicking the OOM killer
because you can't free inodes is far more antisocial than having
those applications pause while the fs is frozen.

[*] Right now we are discussing adding a new background/async worker
that that does speculative preallocation removal to make it occur
faster than it currently does. So skipping it is not really an
option and this is exactly the sort of "new feature needs to
consider how to deal with freeze" problem I'm talking about.

> > > But in the end, if you guys really feel strongly about it and decide that
> > > XFS wants to keep it's mechanism of freezing on transaction start, then I
> > > won't stop you. Although it would mean XFS would have to have three
> > > counters to protect freezing while other filesystems will need only two.
> > 
> > There is two counters only to avoid lockdep problems because the
> > different entry points to the outer shell have different locking
> > orders. I fail to see why adding a third counter to provide an
> > innermost freeze lock that retains the current level of protection
> > is a big deal because all that it really needs to work is a
> > different lock order.
>   The two counters are needed to avoid deadlocks, not just because of
> lockdep... The third counter can be added but per-cpu counters are not
> exactly cheap (in terms of memory) so I don't want to add it without a good
> reason. Lock ordering constraints in shrinkers might be this good reason.

An additional percpu counter is far, far cheaper than the cost of
always needing to consider how not to break filesystem freezing for
the forseeable future.

> Also fitting the XFS transaction system to freezing was a bit ugly (because
> you need to log a dummy transaction on a frozen filesystem

Sure - that's to ensure the log is dirty so that open-but-unlinked
inodes are handled correctly by recovery if the system crashes while the
filesystem is frozen. But the last patch in your series would prevent
freezing filesystems while there are open-but-unlinked files
present, so all that grot could be removed.

Indeed, when that patch goes to mainline, the XFS code for
handling remount,ro and freeze can be made identical....

> and you have
> things like xfs_trans_dup() which requires us to bump the counter even
> though the filesystem is already in freezing process) so I'd be happier if
> we could avoid it...

The only additional infrastructure that you need to provide is a
"start_nested" operation for xfs_trans_dup() that increments the
percpu counter. I don't think that's a difficult problem to solve.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 11/19] xfs: Convert to new freezing code
@ 2012-03-14  3:00                 ` Dave Chinner
  0 siblings, 0 replies; 58+ messages in thread
From: Dave Chinner @ 2012-03-14  3:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: sandeen, Alex Elder, Kamal Mostafa, LKML, xfs, Christoph Hellwig,
	Ben Myers, Al Viro, linux-fsdevel

On Tue, Mar 13, 2012 at 10:30:20PM +0100, Jan Kara wrote:
> On Tue 13-03-12 10:48:17, Dave Chinner wrote:
> > On Mon, Mar 12, 2012 at 06:55:51PM +0100, Jan Kara wrote:
> > > On Mon 12-03-12 09:45:37, Dave Chinner wrote:
> > > I just had to change some properties of the shell to
> > > be usable for freezing as well. But the shell has to be maintained
> > > regardless of how we decide to do freezing.
> > > 
> > > Also believe me that I've initially tried to fix the freezing without the
> > > external shell. But it just isn't enough to prevent dirty inodes from being
> > > created (e.g. from directory operations) while sync during freezing is
> > > running.
> > 
> > Sure - the old mechanism was "freeze data" which only protected data
> > modification paths, followed by "freeze trans" which protected
> > metadata modification paths (like directory operations). You started
> > by removing the metadata modification path protection - it is any
> > wonder that those operations continued to dirty inodes until you
> > expanded the "freeze data" shell to mean "freeze data and metadata"?
>   That's just not true. Remember older versions of this patch set. Old
> model is doing:
>    freeze data
>    sync
>    freeze metadata (via freezing transations)
> and it is broken mainly because while sync is running, new metadata are
> dirtied and thus you end up with dirty metadata on a frozen filesystem.

But that exactly the issue that the filesystem .freeze_fs
method is supposed to handle.  It's supposed to finish the
freeze process by finishing all outstanding metadata operations and
cleaning all remaining dirty metadata. It still has to guarantee
this regardless of the model used prior to this.

> Forcing the journal commit in ->freeze_fs() pushes changes to disk but
> inodes still remain dirty and actually, it can be too late for flusher
> thread anyway because it can be already hanging trying to start a
> transaction.

If the changes are already in the journal, then there is no new
transaction work needed to clean the VFS state.  Indeed, .freezefs
needs to write back all dirty metadata objects that it is tracking
internally in the journal so that a snapshot of the frozen image is
consistent and can be mounted read-only without running recovery...

> > It's like the hard shelli/soft center security model - it protects
> > well from external attackers, but once they get inside, there's no
> > protection. Indeed, there is zero protection from inside jobs, and
> > thats where multiple layers of defense are needed.
> > 
> > Your freeze changes provide us with a hard outer shell, but it's got
> > a really, really soft center. We can't use the outer shell defenses
> > deep inside the shell because of the locking orders that have been
> > defined (freeze protection must be outermost), so we need another
> > layer....
>   I think there are two different issues. One is your complaint that
> extending the shell to cover also internal sources of fs modification will
> be too fragile if not done implicitely on transaction start.

Most definitely. It means that we have to have freeze behaviour in
mind whether doing any async/background work. That's usually the
furthest think from people's minds when implementing stuff like
background inode initialisation...

> Another is a question whether we could reasonably fend off internal
> modification using the current two counters. The MRU thread or ->release
> can be easily handled with them (->release can be handled by the counter
> ranking below mmap_sem). For shrinkers, I'm not that certain. Certainly the
> outer counter won't work. The inner one with lock ranking just below
> mmap_sem might work - depends on whether there are any XFS locks under
> mmap_sem from within which we'd do GFP_KERNEL allocations. But here I agree
> issues might be nasty enough to warrant another counter.

Might be nasty? There's no question about it in my mind that it
*would* be nasty....

> This is connected with another somewhat related issue: Do we want shrinkers
> to block on frozen filesystem? That could result in unrelated processes
> blocking on frozen filesystem. I'd find it nicer if we just skipped
> the writes if the filesystem is frozen as it seems to me
> xfs_free_eofblocks() is somewhat optional. Is it true?

Not really. It needs to be done on rw filesytems[*].

Besides, this isn't really an XFS specific problem - the shrinkers
need to operate on frozen filesystems. e.g. run a directory
traversal of a frozen filesystem, and if you stop the shrinkers from
running then the dcache/icache growth will run the system out of
memory.

Sure, blocking a shrinker can be considered antisocial, but if you
can't free memory because it will dirty inodes on a frozen
filesystem then it is better to block the shrinker until the
filesytem is unfrozen than it is to allow the shrinker to fail and
eventually OOM the machine. Freezing filesysetms is supposed to be a
short term, workload-invisible operation. Kicking the OOM killer
because you can't free inodes is far more antisocial than having
those applications pause while the fs is frozen.

[*] Right now we are discussing adding a new background/async worker
that that does speculative preallocation removal to make it occur
faster than it currently does. So skipping it is not really an
option and this is exactly the sort of "new feature needs to
consider how to deal with freeze" problem I'm talking about.

> > > But in the end, if you guys really feel strongly about it and decide that
> > > XFS wants to keep it's mechanism of freezing on transaction start, then I
> > > won't stop you. Although it would mean XFS would have to have three
> > > counters to protect freezing while other filesystems will need only two.
> > 
> > There is two counters only to avoid lockdep problems because the
> > different entry points to the outer shell have different locking
> > orders. I fail to see why adding a third counter to provide an
> > innermost freeze lock that retains the current level of protection
> > is a big deal because all that it really needs to work is a
> > different lock order.
>   The two counters are needed to avoid deadlocks, not just because of
> lockdep... The third counter can be added but per-cpu counters are not
> exactly cheap (in terms of memory) so I don't want to add it without a good
> reason. Lock ordering constraints in shrinkers might be this good reason.

An additional percpu counter is far, far cheaper than the cost of
always needing to consider how not to break filesystem freezing for
the forseeable future.

> Also fitting the XFS transaction system to freezing was a bit ugly (because
> you need to log a dummy transaction on a frozen filesystem

Sure - that's to ensure the log is dirty so that open-but-unlinked
inodes are handled correctly by recovery if the system crashes while the
filesystem is frozen. But the last patch in your series would prevent
freezing filesystems while there are open-but-unlinked files
present, so all that grot could be removed.

Indeed, when that patch goes to mainline, the XFS code for
handling remount,ro and freeze can be made identical....

> and you have
> things like xfs_trans_dup() which requires us to bump the counter even
> though the filesystem is already in freezing process) so I'd be happier if
> we could avoid it...

The only additional infrastructure that you need to provide is a
"start_nested" operation for xfs_trans_dup() that increments the
percpu counter. I don't think that's a difficult problem to solve.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@redhat.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-05 16:00 ` [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler Jan Kara
@ 2012-03-23 22:45   ` Andrew Morton
  2012-03-27  7:55     ` Jan Kara
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2012-03-23 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Kamal Mostafa

On Mon,  5 Mar 2012 17:00:59 +0100
Jan Kara <jack@suse.cz> wrote:

> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1759,8 +1759,28 @@ page_not_uptodate:
>  }
>  EXPORT_SYMBOL(filemap_fault);
>  
> +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> +	struct page *page = vmf->page;
> +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> +	int ret = VM_FAULT_LOCKED;
> +
> +	file_update_time(vma->vm_file);
> +	lock_page(page);
> +	if ((page->mapping != inode->i_mapping) ||
> +	    (page_offset(page) > i_size_read(inode))) {

Would benefit from a comment explaining how the page can come to be
outside i_size, and why we fail in that case.

I don't think i_mutex is held here, so this test is rather meaningless
and racy anyway?

> +		unlock_page(page);
> +		ret = VM_FAULT_NOPAGE;
> +		goto out;
> +	}
> +out:
> +	return ret;
> +}
> +EXPORT_SYMBOL(filemap_page_mkwrite);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-23 22:45   ` Andrew Morton
@ 2012-03-27  7:55     ` Jan Kara
  2012-03-27 21:38       ` Andrew Morton
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2012-03-27  7:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa

On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> On Mon,  5 Mar 2012 17:00:59 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> >  }
> >  EXPORT_SYMBOL(filemap_fault);
> >  
> > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > +	struct page *page = vmf->page;
> > +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > +	int ret = VM_FAULT_LOCKED;
> > +
> > +	file_update_time(vma->vm_file);
> > +	lock_page(page);
> > +	if ((page->mapping != inode->i_mapping) ||
> > +	    (page_offset(page) > i_size_read(inode))) {
> 
> Would benefit from a comment explaining how the page can come to be
> outside i_size, and why we fail in that case.
> 
> I don't think i_mutex is held here, so this test is rather meaningless
> and racy anyway?
  i_size test is racy if that's what you mean by "this test". Just I did
the test this way because it's like this in other places and I figured
truncate_pagecache() can take relatively long time so the test has some
effect. But if you think it's not worth it, I can remove it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-27  7:55     ` Jan Kara
@ 2012-03-27 21:38       ` Andrew Morton
  2012-03-27 22:08         ` Jan Kara
  0 siblings, 1 reply; 58+ messages in thread
From: Andrew Morton @ 2012-03-27 21:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel, Al Viro, Christoph Hellwig, dchinner,
	sandeen, Kamal Mostafa

On Tue, 27 Mar 2012 09:55:27 +0200
Jan Kara <jack@suse.cz> wrote:

> On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > On Mon,  5 Mar 2012 17:00:59 +0100
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > >  }
> > >  EXPORT_SYMBOL(filemap_fault);
> > >  
> > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > +{
> > > +	struct page *page = vmf->page;
> > > +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > +	int ret = VM_FAULT_LOCKED;
> > > +
> > > +	file_update_time(vma->vm_file);
> > > +	lock_page(page);
> > > +	if ((page->mapping != inode->i_mapping) ||
> > > +	    (page_offset(page) > i_size_read(inode))) {
> > 
> > Would benefit from a comment explaining how the page can come to be
> > outside i_size, and why we fail in that case.

This?

> > I don't think i_mutex is held here, so this test is rather meaningless
> > and racy anyway?
>   i_size test is racy if that's what you mean by "this test". Just I did
> the test this way because it's like this in other places and I figured
> truncate_pagecache() can take relatively long time so the test has some
> effect. But if you think it's not worth it, I can remove it.

It bugs me when we copy-n-paste code without remembering why we had it
there in the first place :( iirc, mmapped pages outside i_size can and
do happen in some race situations, and are benign.  But it's several
years since I thought about it and all the details have evaporated and
it would take a lot of work to reinstantiate it all.  argh.

Also, it's off-by-one, isn't it?  Should be page_offset(page) >= i_size?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-27 21:38       ` Andrew Morton
@ 2012-03-27 22:08         ` Jan Kara
  2012-03-27 22:45           ` Dave Chinner
  0 siblings, 1 reply; 58+ messages in thread
From: Jan Kara @ 2012-03-27 22:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa

On Tue 27-03-12 14:38:15, Andrew Morton wrote:
> On Tue, 27 Mar 2012 09:55:27 +0200
> Jan Kara <jack@suse.cz> wrote:
> > On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > > On Mon,  5 Mar 2012 17:00:59 +0100
> > > Jan Kara <jack@suse.cz> wrote:
> > > 
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > >  }
> > > >  EXPORT_SYMBOL(filemap_fault);
> > > >  
> > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > +{
> > > > +	struct page *page = vmf->page;
> > > > +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > > +	int ret = VM_FAULT_LOCKED;
> > > > +
> > > > +	file_update_time(vma->vm_file);
> > > > +	lock_page(page);
> > > > +	if ((page->mapping != inode->i_mapping) ||
> > > > +	    (page_offset(page) > i_size_read(inode))) {
> > > 
> > > Would benefit from a comment explaining how the page can come to be
> > > outside i_size, and why we fail in that case.
> 
> This?
> 
> > > I don't think i_mutex is held here, so this test is rather meaningless
> > > and racy anyway?
> >   i_size test is racy if that's what you mean by "this test". Just I did
> > the test this way because it's like this in other places and I figured
> > truncate_pagecache() can take relatively long time so the test has some
> > effect. But if you think it's not worth it, I can remove it.
> 
> It bugs me when we copy-n-paste code without remembering why we had it
> there in the first place :( iirc, mmapped pages outside i_size can and
> do happen in some race situations, and are benign.
  Yeah. Certainly there can be pages beyond i_size because we first set
file size and then go and remove pages beyond new i_size one by one when we
do truncate. We must be careful not to create any new pages beyond i_size
but that's what filemap_fault() takes care of. So I think i_size check in
->page_mkwrite() isn't strictly needed.

> But it's several
> years since I thought about it and all the details have evaporated and
> it would take a lot of work to reinstantiate it all.  argh.
> 
> Also, it's off-by-one, isn't it?  Should be page_offset(page) >= i_size?
  Yes, it is. I'll just remove it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-27 22:08         ` Jan Kara
@ 2012-03-27 22:45           ` Dave Chinner
  2012-03-28  9:48             ` Jan Kara
  0 siblings, 1 reply; 58+ messages in thread
From: Dave Chinner @ 2012-03-27 22:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, LKML, linux-fsdevel, Al Viro, Christoph Hellwig,
	dchinner, sandeen, Kamal Mostafa

On Wed, Mar 28, 2012 at 12:08:19AM +0200, Jan Kara wrote:
> On Tue 27-03-12 14:38:15, Andrew Morton wrote:
> > On Tue, 27 Mar 2012 09:55:27 +0200
> > Jan Kara <jack@suse.cz> wrote:
> > > On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > > > On Mon,  5 Mar 2012 17:00:59 +0100
> > > > Jan Kara <jack@suse.cz> wrote:
> > > > 
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > > >  }
> > > > >  EXPORT_SYMBOL(filemap_fault);
> > > > >  
> > > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > +{
> > > > > +	struct page *page = vmf->page;
> > > > > +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > > > +	int ret = VM_FAULT_LOCKED;
> > > > > +
> > > > > +	file_update_time(vma->vm_file);
> > > > > +	lock_page(page);
> > > > > +	if ((page->mapping != inode->i_mapping) ||
> > > > > +	    (page_offset(page) > i_size_read(inode))) {
> > > > 
> > > > Would benefit from a comment explaining how the page can come to be
> > > > outside i_size, and why we fail in that case.
> > 
> > This?
> > 
> > > > I don't think i_mutex is held here, so this test is rather meaningless
> > > > and racy anyway?
> > >   i_size test is racy if that's what you mean by "this test". Just I did
> > > the test this way because it's like this in other places and I figured
> > > truncate_pagecache() can take relatively long time so the test has some
> > > effect. But if you think it's not worth it, I can remove it.
> > 
> > It bugs me when we copy-n-paste code without remembering why we had it
> > there in the first place :( iirc, mmapped pages outside i_size can and
> > do happen in some race situations, and are benign.
>   Yeah. Certainly there can be pages beyond i_size because we first set
> file size and then go and remove pages beyond new i_size one by one when we
> do truncate. We must be careful not to create any new pages beyond i_size
> but that's what filemap_fault() takes care of. So I think i_size check in
> ->page_mkwrite() isn't strictly needed.

Actually, I think it is. In __do_fault(), we drop the page lock
between the .fault call and the .page_mkwrite() call, so the size
checks in .fault for the given page being faulted are no longer
valid as truncate serialises only on the page lock. Hence we have to
repeat the truncate race checks again in .page_mkwrite() after we
relock the page.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler
  2012-03-27 22:45           ` Dave Chinner
@ 2012-03-28  9:48             ` Jan Kara
  0 siblings, 0 replies; 58+ messages in thread
From: Jan Kara @ 2012-03-28  9:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Andrew Morton, LKML, linux-fsdevel, Al Viro,
	Christoph Hellwig, dchinner, sandeen, Kamal Mostafa

On Wed 28-03-12 09:45:53, Dave Chinner wrote:
> On Wed, Mar 28, 2012 at 12:08:19AM +0200, Jan Kara wrote:
> > On Tue 27-03-12 14:38:15, Andrew Morton wrote:
> > > On Tue, 27 Mar 2012 09:55:27 +0200
> > > Jan Kara <jack@suse.cz> wrote:
> > > > On Fri 23-03-12 15:45:02, Andrew Morton wrote:
> > > > > On Mon,  5 Mar 2012 17:00:59 +0100
> > > > > Jan Kara <jack@suse.cz> wrote:
> > > > > 
> > > > > > --- a/mm/filemap.c
> > > > > > +++ b/mm/filemap.c
> > > > > > @@ -1759,8 +1759,28 @@ page_not_uptodate:
> > > > > >  }
> > > > > >  EXPORT_SYMBOL(filemap_fault);
> > > > > >  
> > > > > > +int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > > +{
> > > > > > +	struct page *page = vmf->page;
> > > > > > +	struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
> > > > > > +	int ret = VM_FAULT_LOCKED;
> > > > > > +
> > > > > > +	file_update_time(vma->vm_file);
> > > > > > +	lock_page(page);
> > > > > > +	if ((page->mapping != inode->i_mapping) ||
> > > > > > +	    (page_offset(page) > i_size_read(inode))) {
> > > > > 
> > > > > Would benefit from a comment explaining how the page can come to be
> > > > > outside i_size, and why we fail in that case.
> > > 
> > > This?
> > > 
> > > > > I don't think i_mutex is held here, so this test is rather meaningless
> > > > > and racy anyway?
> > > >   i_size test is racy if that's what you mean by "this test". Just I did
> > > > the test this way because it's like this in other places and I figured
> > > > truncate_pagecache() can take relatively long time so the test has some
> > > > effect. But if you think it's not worth it, I can remove it.
> > > 
> > > It bugs me when we copy-n-paste code without remembering why we had it
> > > there in the first place :( iirc, mmapped pages outside i_size can and
> > > do happen in some race situations, and are benign.
> >   Yeah. Certainly there can be pages beyond i_size because we first set
> > file size and then go and remove pages beyond new i_size one by one when we
> > do truncate. We must be careful not to create any new pages beyond i_size
> > but that's what filemap_fault() takes care of. So I think i_size check in
> > ->page_mkwrite() isn't strictly needed.
> 
> Actually, I think it is. In __do_fault(), we drop the page lock
> between the .fault call and the .page_mkwrite() call, so the size
> checks in .fault for the given page being faulted are no longer
> valid as truncate serialises only on the page lock. Hence we have to
> repeat the truncate race checks again in .page_mkwrite() after we
> relock the page.
  I was thinking about this scenario as well but I think doesn't cause any
problems. But maybe there's some noise in what the condition actually is.
So I currently have:
	if (page->mapping != inode->i_mapping)
		bail..
	do stuff.
And I think that is enough. Because only filemap_fault() creates new page,
->page_mkwrite() only uses the page reference from vmf->page. So the only
thing that can happen after __do_fault() drops page lock is that
truncate_pagecache() will go and truncate the page (thus page->mapping will
be NULL) and the condition I currently have should be enough to catch that.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2012-03-28  9:49 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-05 16:00 [PATCH 00/19] Fix filesystem freezing deadlocks Jan Kara
2012-03-05 16:01 ` [Ocfs2-devel] " Jan Kara
2012-03-05 16:00 ` [Cluster-devel] " Jan Kara
2012-03-05 16:00 ` Jan Kara
2012-03-05 16:00 ` Jan Kara
2012-03-05 16:00 ` [PATCH 01/19] mm: Make default vm_ops provide ->page_mkwrite handler Jan Kara
2012-03-23 22:45   ` Andrew Morton
2012-03-27  7:55     ` Jan Kara
2012-03-27 21:38       ` Andrew Morton
2012-03-27 22:08         ` Jan Kara
2012-03-27 22:45           ` Dave Chinner
2012-03-28  9:48             ` Jan Kara
2012-03-05 16:01 ` [PATCH 02/19] fs: Push mnt_want_write() outside of i_mutex Jan Kara
2012-03-05 16:01   ` [Ocfs2-devel] " Jan Kara
2012-03-05 16:01 ` [PATCH 03/19] fat: " Jan Kara
2012-03-05 16:01 ` [PATCH 04/19] btrfs: " Jan Kara
2012-03-05 16:01 ` [PATCH 05/19] nfsd: " Jan Kara
2012-03-05 16:01   ` Jan Kara
2012-03-05 16:01 ` [PATCH 06/19] fs: Improve filesystem freezing handling Jan Kara
2012-03-05 16:01 ` [PATCH 07/19] fs: Add freezing handling to mnt_want_write() / mnt_drop_write() Jan Kara
2012-03-05 16:01 ` [PATCH 08/19] fs: Skip atime update on frozen filesystem Jan Kara
2012-03-05 16:01 ` [PATCH 09/19] fs: Protect write paths by sb_start_write - sb_end_write Jan Kara
2012-03-05 16:01 ` [PATCH 10/19] ext4: Convert to new freezing mechanism Jan Kara
2012-03-07 22:32   ` Kamal Mostafa
2012-03-08  9:05     ` Jan Kara
2012-03-05 16:01 ` [PATCH 11/19] xfs: Convert to new freezing code Jan Kara
2012-03-05 16:01   ` Jan Kara
2012-03-08 23:20   ` Dave Chinner
2012-03-08 23:20     ` Dave Chinner
2012-03-09  8:23     ` Jan Kara
2012-03-09  8:23       ` Jan Kara
2012-03-09 14:22     ` Jan Kara
2012-03-09 14:22       ` Jan Kara
2012-03-11 22:45       ` Dave Chinner
2012-03-11 22:45         ` Dave Chinner
2012-03-12 17:55         ` Jan Kara
2012-03-12 17:55           ` Jan Kara
2012-03-12 23:48           ` Dave Chinner
2012-03-12 23:48             ` Dave Chinner
2012-03-13 21:30             ` Jan Kara
2012-03-13 21:30               ` Jan Kara
2012-03-14  3:00               ` Dave Chinner
2012-03-14  3:00                 ` Dave Chinner
2012-03-05 16:01 ` [PATCH 12/19] ocfs2: Convert to new freezing mechanism Jan Kara
2012-03-05 16:01   ` [Ocfs2-devel] " Jan Kara
2012-03-05 16:01 ` [PATCH 13/19] gfs2: " Jan Kara
2012-03-05 16:01   ` [Cluster-devel] " Jan Kara
2012-03-05 16:01   ` Jan Kara
2012-03-05 16:01 ` [PATCH 14/19] fuse: " Jan Kara
2012-03-05 16:01   ` Jan Kara
2012-03-05 16:01 ` [PATCH 15/19] ntfs: " Jan Kara
2012-03-05 16:01 ` [PATCH 16/19] nilfs2: " Jan Kara
2012-03-05 16:01 ` [PATCH 17/19] btrfs: " Jan Kara
2012-03-05 16:01 ` [PATCH 18/19] fs: Remove old " Jan Kara
2012-03-05 16:01 ` [PATCH 19/19] fs: Refuse to freeze filesystem with open but unlinked files Jan Kara
2012-03-11 20:22 ` [PATCH 00/19] Fix filesystem freezing deadlocks Kamal Mostafa
2012-03-11 20:23   ` [Ocfs2-devel] " Kamal Mostafa
2012-03-11 20:22   ` Kamal Mostafa

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.