[PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems
@ 2012-11-21  2:00 Darrick J. Wong
  2012-11-21  2:00 ` [PATCH 1/4] bdi: Track users that require stable page writes Darrick J. Wong
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21  2:00 UTC (permalink / raw)
  To: axboe, lucho, jack, ericvh, tytso, rminnich, viro
  Cc: martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

Hi all,

This patchset ("stable page writes, part 2") makes some key modifications to
the kernel's strategy to keep page contents intact during writeback.  First, it
provides users (devices and filesystems) of a backing_dev_info the ability to
declare whether or not it is necessary to ensure that page contents cannot
change during writeout, whereas the current code assumes that this is true.
Second, it relaxes the wait_on_page_writeback calls so that they only occur if
something needs it.  Third, it fixes up (most of) the remaining disk-based
filesystems to use this improved conditional-wait logic in the hopes of
providing stable page writes on all filesystems, when needed.

It is hoped that (for people not using checksumming devices, anyway) this
patchset will give back unnecessary performance decreases since the original
stable page write patchset went into 3.0.

Note: Even without this patchset, ext3 is broken on DIF/DIX checksumming
devices.  As a part of the discussion about part 1 of this patch set, I recall
that we reached a consensus that fixing ext3 was too invasive, and that new
deployments could use ext4 instead.  Since we can now test for devices that
want stable page writes, put a warning into ext3.

This patchset has been tested on 3.7.0-rc6 on x64 with significant speedups for
some hardware, and (afaict) no regressions.

For the next phase, I'll explore changing md-raid5 and iscsi to use stable page
writes, and figuring out how stable page writes intersects with the networked
filesystems.  In the meantime, this part 2 should alleviate some user pain.

--D

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/4] bdi: Track users that require stable page writes
  2012-11-21  2:00 [PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems Darrick J. Wong
@ 2012-11-21  2:00 ` Darrick J. Wong
  2012-11-21  7:54   ` Christoph Hellwig
  2012-11-21 10:56   ` Christoph Hellwig
  2012-11-21  2:00 ` [PATCH 2/4] mm: Only enforce stable page writes if the backing device requires it Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21  2:00 UTC (permalink / raw)
  To: axboe, lucho, jack, ericvh, tytso, rminnich, viro
  Cc: martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

This creates a per-backing-device counter that tracks the number of users which
require pages to be held immutable during writeout.  Eventually it will be used
to waive wait_for_page_writeback() if nobody requires stable pages.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/ABI/testing/sysfs-class-bdi |    7 ++++++
 block/blk-integrity.c                     |    4 +++
 include/linux/backing-dev.h               |   16 +++++++++++++
 include/linux/blkdev.h                    |   15 ++++++++++++
 mm/backing-dev.c                          |   36 +++++++++++++++++++++++++++++
 5 files changed, 78 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index 5f50097..218a618 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -48,3 +48,10 @@ max_ratio (read-write)
 	most of the write-back cache.  For example in case of an NFS
 	mount that is prone to get stuck, or a FUSE mount which cannot
 	be trusted to play fair.
+
+stable_pages_required (read-write)
+
+	If set, the backing device requires that all pages comprising a write
+	request must not be changed until writeout is complete.  The system
+	administrator can turn this on if the hardware does not do so already.
+	However, once enabled, this flag cannot be disabled.
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index da2a818..cf2dd95 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
 	} else
 		bi->name = bi_unsupported_name;
 
+	queue_require_stable_pages(disk->queue);
+
 	return 0;
 }
 EXPORT_SYMBOL(blk_integrity_register);
@@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk)
 	if (!disk || !disk->integrity)
 		return;
 
+	queue_unrequire_stable_pages(disk->queue);
+
 	bi = disk->integrity;
 
 	kobject_uevent(&bi->kobj, KOBJ_REMOVE);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..af19704 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -253,6 +253,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_EXEC_MAP	0x00000040
 #define BDI_CAP_NO_ACCT_WB	0x00000080
 #define BDI_CAP_SWAP_BACKED	0x00000100
+#define BDI_CAP_STABLE_WRITES	0x00000200
 
 #define BDI_CAP_VMFLAGS \
 	(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
+static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
+{
+	bdi->capabilities |= BDI_CAP_STABLE_WRITES;
+}
+
+static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
+{
+	bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
+}
+
+static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_STABLE_WRITES;
+}
+
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
 	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1756001..a1c6e91 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -458,6 +458,21 @@ struct request_queue {
 				 (1 << QUEUE_FLAG_SAME_COMP)	|	\
 				 (1 << QUEUE_FLAG_ADD_RANDOM))
 
+static inline void queue_require_stable_pages(struct request_queue *q)
+{
+	bdi_require_stable_pages(&q->backing_dev_info);
+}
+
+static inline void queue_unrequire_stable_pages(struct request_queue *q)
+{
+	bdi_unrequire_stable_pages(&q->backing_dev_info);
+}
+
+static inline int queue_requires_stable_pages(struct request_queue *q)
+{
+	return bdi_cap_stable_pages_required(&q->backing_dev_info);
+}
+
 static inline void queue_lockdep_assert_held(struct request_queue *q)
 {
 	if (q->queue_lock)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca2b3..fd6e5b5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev,
 }
 BDI_SHOW(max_ratio, bdi->max_ratio)
 
+static ssize_t stable_pages_required_store(struct device *dev,
+					   struct device_attribute *attr,
+					   const char *buf, size_t count)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+	unsigned int spw;
+	ssize_t ret;
+
+	ret = kstrtouint(buf, 10, &spw);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * SPW could be enabled due to hw requirement, so don't
+	 * let users disable it.
+	 */
+	if (bdi_cap_stable_pages_required(bdi) && spw == 0)
+		return -EINVAL;
+
+	if (spw != 0)
+		bdi_require_stable_pages(bdi);
+
+	return count;
+}
+
+static ssize_t stable_pages_required_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			bdi_cap_stable_pages_required(bdi) ? 1 : 0);
+}
+
 #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
 
 static struct device_attribute bdi_dev_attrs[] = {
 	__ATTR_RW(read_ahead_kb),
 	__ATTR_RW(min_ratio),
 	__ATTR_RW(max_ratio),
+	__ATTR_RW(stable_pages_required),
 	__ATTR_NULL,
 };
 


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/4] mm: Only enforce stable page writes if the backing device requires it
  2012-11-21  2:00 [PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems Darrick J. Wong
  2012-11-21  2:00 ` [PATCH 1/4] bdi: Track users that require stable page writes Darrick J. Wong
@ 2012-11-21  2:00 ` Darrick J. Wong
  2012-11-21 10:57   ` Christoph Hellwig
  2012-11-21  2:00 ` [PATCH 3/4] 9pfs: Fix filesystem to wait for stable page writeback Darrick J. Wong
  2012-11-21  2:00 ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Darrick J. Wong
  3 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21  2:00 UTC (permalink / raw)
  To: axboe, lucho, jack, ericvh, tytso, rminnich, viro
  Cc: martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

Create a helper function to check if a backing device requires stable page
writes and, if so, performs the necessary wait.  Then, make it so that all
points in the memory manager that handle making pages writable use the helper
function.  This should provide stable page write support to most filesystems,
while eliminating unnecessary waiting for devices that don't require the
feature.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/buffer.c             |    2 +-
 fs/ext4/inode.c         |    2 +-
 include/linux/pagemap.h |    1 +
 mm/filemap.c            |    3 ++-
 mm/page-writeback.c     |   11 +++++++++++
 5 files changed, 16 insertions(+), 3 deletions(-)


diff --git a/fs/buffer.c b/fs/buffer.c
index b5f0442..4cf0b78 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2334,7 +2334,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
 	if (unlikely(ret < 0))
 		goto out_unlock;
 	set_page_dirty(page);
-	wait_on_page_writeback(page);
+	wait_if_stable_page_write(page);
 	return 0;
 out_unlock:
 	unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..24c6193 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4814,7 +4814,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 		if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
 					ext4_bh_unmapped)) {
 			/* Wait so that we don't change page under IO */
-			wait_on_page_writeback(page);
+			wait_if_stable_page_write(page);
 			ret = VM_FAULT_LOCKED;
 			goto out;
 		}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e42c762..1d67f81 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -398,6 +398,7 @@ static inline void wait_on_page_writeback(struct page *page)
 }
 
 extern void end_page_writeback(struct page *page);
+void wait_if_stable_page_write(struct page *page);
 
 /*
  * Add an arbitrary waiter to a page's wait queue
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..34df185 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1728,6 +1728,7 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	 * see the dirty page and writeprotect it again.
 	 */
 	set_page_dirty(page);
+	wait_if_stable_page_write(page);
 out:
 	sb_end_pagefault(inode->i_sb);
 	return ret;
@@ -2274,7 +2275,7 @@ repeat:
 		return NULL;
 	}
 found:
-	wait_on_page_writeback(page);
+	wait_if_stable_page_write(page);
 	return page;
 }
 EXPORT_SYMBOL(grab_cache_page_write_begin);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..cbbaeca 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2275,3 +2275,14 @@ int mapping_tagged(struct address_space *mapping, int tag)
 	return radix_tree_tagged(&mapping->page_tree, tag);
 }
 EXPORT_SYMBOL(mapping_tagged);
+
+void wait_if_stable_page_write(struct page *page)
+{
+	struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+
+	if (!bdi_cap_stable_pages_required(bdi))
+		return;
+
+	wait_on_page_writeback(page);
+}
+EXPORT_SYMBOL_GPL(wait_if_stable_page_write);


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/4] 9pfs: Fix filesystem to wait for stable page writeback
  2012-11-21  2:00 [PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems Darrick J. Wong
  2012-11-21  2:00 ` [PATCH 1/4] bdi: Track users that require stable page writes Darrick J. Wong
  2012-11-21  2:00 ` [PATCH 2/4] mm: Only enforce stable page writes if the backing device requires it Darrick J. Wong
@ 2012-11-21  2:00 ` Darrick J. Wong
  2012-11-21  2:00 ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Darrick J. Wong
  3 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21  2:00 UTC (permalink / raw)
  To: axboe, lucho, jack, ericvh, tytso, rminnich, viro
  Cc: martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

Fix up the ->page_mkwrite handler to provide stable page writes if necessary.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/9p/vfs_file.c |    1 +
 1 file changed, 1 insertion(+)


diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index c2483e9..a2d3e2f 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
 	lock_page(page);
 	if (page->mapping != inode->i_mapping)
 		goto out_unlock;
+	wait_if_stable_page_write(page);
 
 	return VM_FAULT_LOCKED;
 out_unlock:


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-21  2:00 [PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems Darrick J. Wong
                   ` (2 preceding siblings ...)
  2012-11-21  2:00 ` [PATCH 3/4] 9pfs: Fix filesystem to wait for stable page writeback Darrick J. Wong
@ 2012-11-21  2:00 ` Darrick J. Wong
  2012-11-21  2:15   ` Jan Kara
  3 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21  2:00 UTC (permalink / raw)
  To: axboe, lucho, jack, ericvh, tytso, rminnich, viro
  Cc: martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

ext3 doesn't properly isolate pages from changes during writeback.  Since the
recommended fix is to use ext4, for now we'll just print a warning if the user
tries to mount in write mode.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ext3/super.c |    8 ++++++++
 1 file changed, 8 insertions(+)


diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5366393..5b3725d 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
 			"forcing read-only mode");
 		res = MS_RDONLY;
 	}
+	if (!read_only &&
+	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
+		ext3_msg(sb, KERN_ERR,
+			"error: ext3 cannot safely write data to a disk "
+			"requiring stable pages writes; forcing read-only "
+			"mode.  Upgrading to ext4 is recommended.");
+		res = MS_RDONLY;
+	}
 	if (read_only)
 		return res;
 	if (!(sbi->s_mount_state & EXT3_VALID_FS))


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-21  2:00 ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Darrick J. Wong
@ 2012-11-21  2:15   ` Jan Kara
  2012-11-21 21:13     ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2012-11-21  2:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, lucho, jack, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> ext3 doesn't properly isolate pages from changes during writeback.  Since the
> recommended fix is to use ext4, for now we'll just print a warning if the user
> tries to mount in write mode.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/ext3/super.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> 
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 5366393..5b3725d 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
>  			"forcing read-only mode");
>  		res = MS_RDONLY;
>  	}
> +	if (!read_only &&
> +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> +		ext3_msg(sb, KERN_ERR,
> +			"error: ext3 cannot safely write data to a disk "
> +			"requiring stable pages writes; forcing read-only "
> +			"mode.  Upgrading to ext4 is recommended.");
> +		res = MS_RDONLY;
> +	}
>  	if (read_only)
>  		return res;
>  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
  Why this? ext3 should be fixed by your change to
filemap_page_mkwrite()... Or does testing show otherwise?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] bdi: Track users that require stable page writes
  2012-11-21  2:00 ` [PATCH 1/4] bdi: Track users that require stable page writes Darrick J. Wong
@ 2012-11-21  7:54   ` Christoph Hellwig
  2012-11-21 10:52     ` Christoph Hellwig
  2012-11-21 10:56   ` Christoph Hellwig
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2012-11-21  7:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, lucho, jack, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Tue, Nov 20, 2012 at 06:00:34PM -0800, Darrick J. Wong wrote:
> This creates a per-backing-device counter that tracks the number of users which
> require pages to be held immutable during writeout.  Eventually it will be used
> to waive wait_for_page_writeback() if nobody requires stable pages.

Why are we going down this stupid route again now?  Just let the block
device say it needs stable writes and let the VM deal with it.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] bdi: Track users that require stable page writes
  2012-11-21  7:54   ` Christoph Hellwig
@ 2012-11-21 10:52     ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2012-11-21 10:52 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, lucho, jack, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed, Nov 21, 2012 at 02:54:35AM -0500, Christoph Hellwig wrote:
> On Tue, Nov 20, 2012 at 06:00:34PM -0800, Darrick J. Wong wrote:
> > This creates a per-backing-device counter that tracks the number of users which
> > require pages to be held immutable during writeout.  Eventually it will be used
> > to waive wait_for_page_writeback() if nobody requires stable pages.
> 
> Why are we going down this stupid route again now?  Just let the block
> device say it needs stable writes and let the VM deal with it.

Seems like the series actually does that, and this paragraph was just
left over from earlier versions.  Sorry for complaining too quickly, but
you really should update the description.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] bdi: Track users that require stable page writes
  2012-11-21  2:00 ` [PATCH 1/4] bdi: Track users that require stable page writes Darrick J. Wong
  2012-11-21  7:54   ` Christoph Hellwig
@ 2012-11-21 10:56   ` Christoph Hellwig
  2012-11-21 21:52     ` Darrick J. Wong
  1 sibling, 1 reply; 27+ messages in thread
From: Christoph Hellwig @ 2012-11-21 10:56 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, lucho, jack, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

> +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
> +{
> +	bdi->capabilities |= BDI_CAP_STABLE_WRITES;
> +}
> +
> +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
> +{
> +	bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
> +}

Any reason to provide these wrappers while other BDI_CAP_ values don't
have it/

Also what protects bdi->capabilities against concurrent updates now that
it gets modified at runtime?

> +static inline void queue_require_stable_pages(struct request_queue *q)
> +{
> +	bdi_require_stable_pages(&q->backing_dev_info);
> +}
> +
> +static inline void queue_unrequire_stable_pages(struct request_queue *q)
> +{
> +	bdi_unrequire_stable_pages(&q->backing_dev_info);
> +}
> +
> +static inline int queue_requires_stable_pages(struct request_queue *q)
> +{
> +	return bdi_cap_stable_pages_required(&q->backing_dev_info);
> +}

Independent of the above I see no point in these wrappers that just
provide a single dereference.

> +static ssize_t stable_pages_required_store(struct device *dev,
> +					   struct device_attribute *attr,
> +					   const char *buf, size_t count)

Can you add a rationale on why we'd want to allow users to change the
value?  I can't really think of any.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] mm: Only enforce stable page writes if the backing device requires it
  2012-11-21  2:00 ` [PATCH 2/4] mm: Only enforce stable page writes if the backing device requires it Darrick J. Wong
@ 2012-11-21 10:57   ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2012-11-21 10:57 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: axboe, lucho, jack, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Tue, Nov 20, 2012 at 06:00:41PM -0800, Darrick J. Wong wrote:
> Create a helper function to check if a backing device requires stable page
> writes and, if so, performs the necessary wait.  Then, make it so that all
> points in the memory manager that handle making pages writable use the helper
> function.  This should provide stable page write support to most filesystems,
> while eliminating unnecessary waiting for devices that don't require the
> feature.

> +void wait_if_stable_page_write(struct page *page)

wait_for_stable_page() ?

Also this really needs a kerneldoc comment describing the usage.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-21  2:15   ` Jan Kara
@ 2012-11-21 21:13     ` Darrick J. Wong
  2012-11-21 21:33       ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21 21:13 UTC (permalink / raw)
  To: Jan Kara
  Cc: axboe, lucho, ericvh, tytso, rminnich, viro, martin.petersen,
	neilb, david, linux-kernel, linux-fsdevel, adilger.kernel,
	bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > recommended fix is to use ext4, for now we'll just print a warning if the user
> > tries to mount in write mode.
> > 
> > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > ---
> >  fs/ext3/super.c |    8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > 
> > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > index 5366393..5b3725d 100644
> > --- a/fs/ext3/super.c
> > +++ b/fs/ext3/super.c
> > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> >  			"forcing read-only mode");
> >  		res = MS_RDONLY;
> >  	}
> > +	if (!read_only &&
> > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > +		ext3_msg(sb, KERN_ERR,
> > +			"error: ext3 cannot safely write data to a disk "
> > +			"requiring stable pages writes; forcing read-only "
> > +			"mode.  Upgrading to ext4 is recommended.");
> > +		res = MS_RDONLY;
> > +	}
> >  	if (read_only)
> >  		return res;
> >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
>   Why this? ext3 should be fixed by your change to
> filemap_page_mkwrite()... Or does testing show otherwise?

Yes, it's still broken even with this new set of changes.  Now that I think
about it a little more, I recall that writeback mode was actually fine, so this
is a little harsh.

Hm... looking at the ordered code a little more, it looks like
ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
to write mapped buffers back through the journal?  Taking it out seems to fix
ordered mode, though I have a suspicion that it might very well break ordered
mode too.

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-21 21:13     ` Darrick J. Wong
@ 2012-11-21 21:33       ` Jan Kara
  2012-11-21 21:47         ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2012-11-21 21:33 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > tries to mount in write mode.
> > > 
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > >  fs/ext3/super.c |    8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > 
> > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > index 5366393..5b3725d 100644
> > > --- a/fs/ext3/super.c
> > > +++ b/fs/ext3/super.c
> > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > >  			"forcing read-only mode");
> > >  		res = MS_RDONLY;
> > >  	}
> > > +	if (!read_only &&
> > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > +		ext3_msg(sb, KERN_ERR,
> > > +			"error: ext3 cannot safely write data to a disk "
> > > +			"requiring stable pages writes; forcing read-only "
> > > +			"mode.  Upgrading to ext4 is recommended.");
> > > +		res = MS_RDONLY;
> > > +	}
> > >  	if (read_only)
> > >  		return res;
> > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> >   Why this? ext3 should be fixed by your change to
> > filemap_page_mkwrite()... Or does testing show otherwise?
> 
> Yes, it's still broken even with this new set of changes.  Now that I think
> about it a little more, I recall that writeback mode was actually fine, so this
> is a little harsh.
> 
> Hm... looking at the ordered code a little more, it looks like
> ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> to write mapped buffers back through the journal?  Taking it out seems to fix
> ordered mode, though I have a suspicion that it might very well break ordered
> mode too.
  Oh, right. kjournald writing buffers directly (without setting
PageWriteback) will break things. So please, change warning to:

	/*
	 * In data=ordered mode, kjournald writes buffers without setting
	 * PageWriteback bit thus generic code does not properly wait for
	 * writeback of those buffers to finish.
	 */
	if (!read_only &&
	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
		ext3_msg(sb, KERN_ERR,
			"error: data=ordered mode does not support stable "
			"page writes required by the disk; forcing read-only "
			"mode. Upgrading to ext4 is recommended.");
		res = MS_RDONLY;
	}

then you need a similar check in ext3_remount() so that filesystem cannot
be remounted read-write.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-21 21:33       ` Jan Kara
@ 2012-11-21 21:47         ` NeilBrown
  2012-11-22  1:47           ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2012-11-21 21:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Darrick J. Wong, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 3334 bytes --]

On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:

> On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > tries to mount in write mode.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > ---
> > > >  fs/ext3/super.c |    8 ++++++++
> > > >  1 file changed, 8 insertions(+)
> > > > 
> > > > 
> > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > index 5366393..5b3725d 100644
> > > > --- a/fs/ext3/super.c
> > > > +++ b/fs/ext3/super.c
> > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > >  			"forcing read-only mode");
> > > >  		res = MS_RDONLY;
> > > >  	}
> > > > +	if (!read_only &&
> > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > +		ext3_msg(sb, KERN_ERR,
> > > > +			"error: ext3 cannot safely write data to a disk "
> > > > +			"requiring stable pages writes; forcing read-only "
> > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > +		res = MS_RDONLY;
> > > > +	}
> > > >  	if (read_only)
> > > >  		return res;
> > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > >   Why this? ext3 should be fixed by your change to
> > > filemap_page_mkwrite()... Or does testing show otherwise?
> > 
> > Yes, it's still broken even with this new set of changes.  Now that I think
> > about it a little more, I recall that writeback mode was actually fine, so this
> > is a little harsh.
> > 
> > Hm... looking at the ordered code a little more, it looks like
> > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > to write mapped buffers back through the journal?  Taking it out seems to fix
> > ordered mode, though I have a suspicion that it might very well break ordered
> > mode too.
>   Oh, right. kjournald writing buffers directly (without setting
> PageWriteback) will break things. So please, change warning to:
> 
> 	/*
> 	 * In data=ordered mode, kjournald writes buffers without setting
> 	 * PageWriteback bit thus generic code does not properly wait for
> 	 * writeback of those buffers to finish.
> 	 */
> 	if (!read_only &&
> 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> 	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> 		ext3_msg(sb, KERN_ERR,
> 			"error: data=ordered mode does not support stable "
> 			"page writes required by the disk; forcing read-only "
> 			"mode. Upgrading to ext4 is recommended.");
> 		res = MS_RDONLY;
> 	}
> 
> then you need a similar check in ext3_remount() so that filesystem cannot
> be remounted read-write.
> 
> 								Honza

Given this restriction, there is no way that I can change md/raid5 to set the
"stable pages" flag and stop copying pages into the stripe-cache.  ext3 on
raid5 will be much too common to allow this breakage.

I would really like to be able to say "I prefer stable pages, but they aren't
a requirement as long as I know which is which" ....

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] bdi: Track users that require stable page writes
  2012-11-21 10:56   ` Christoph Hellwig
@ 2012-11-21 21:52     ` Darrick J. Wong
  2012-11-21 22:06       ` NeilBrown
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-21 21:52 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: axboe, lucho, jack, ericvh, tytso, rminnich, viro,
	martin.petersen, neilb, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

Ok, I'll update the description a bit.

On Wed, Nov 21, 2012 at 05:56:24AM -0500, Christoph Hellwig wrote:
> > +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
> > +{
> > +	bdi->capabilities |= BDI_CAP_STABLE_WRITES;
> > +}
> > +
> > +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
> > +{
> > +	bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
> > +}
> 
> Any reason to provide these wrappers while other BDI_CAP_ values don't
> have it/
> 
> Also what protects bdi->capabilities against concurrent updates now that
> it gets modified at runtime?

Nothing seems to update ->capabilities at run time.

That said, if you're really worried about concurrent updates, I can always put
a spinlock around all the updates.

(Or revert to the atomic_t counter, but that seemed unpopular...)

I think I can drop the wrappers.

> > +static inline void queue_require_stable_pages(struct request_queue *q)
> > +{
> > +	bdi_require_stable_pages(&q->backing_dev_info);
> > +}
> > +
> > +static inline void queue_unrequire_stable_pages(struct request_queue *q)
> > +{
> > +	bdi_unrequire_stable_pages(&q->backing_dev_info);
> > +}
> > +
> > +static inline int queue_requires_stable_pages(struct request_queue *q)
> > +{
> > +	return bdi_cap_stable_pages_required(&q->backing_dev_info);
> > +}
> 
> Independent of the above I see no point in these wrappers that just
> provide a single dereference.
> 
> > +static ssize_t stable_pages_required_store(struct device *dev,
> > +					   struct device_attribute *attr,
> > +					   const char *buf, size_t count)
> 
> Can you add a rationale on why we'd want to allow users to change the
> value?  I can't really think of any.

I dislike the idea that if a program is dirtying pages that are being written
out, then I don't really know whether the disk will write the before or after
version.  If the power goes out before the inevitable second write, how do you
know which version you get?  Sure would be nice if I could force on stable
writes if I'm feeling paranoid.

--D
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/4] bdi: Track users that require stable page writes
  2012-11-21 21:52     ` Darrick J. Wong
@ 2012-11-21 22:06       ` NeilBrown
  2012-11-22  2:33         ` [PATCH] " Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: NeilBrown @ 2012-11-21 22:06 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, axboe, lucho, jack, ericvh, tytso, rminnich,
	viro, martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]

On Wed, 21 Nov 2012 13:52:07 -0800 "Darrick J. Wong"
<darrick.wong@oracle.com> wrote:

> > Can you add a rationale on why we'd want to allow users to change the
> > value?  I can't really think of any.
> 
> I dislike the idea that if a program is dirtying pages that are being written
> out, then I don't really know whether the disk will write the before or after
> version.  If the power goes out before the inevitable second write, how do you
> know which version you get?  Sure would be nice if I could force on stable
> writes if I'm feeling paranoid.

I don't think this fear is at all rational (but then you did suggest
paranoia).

If the power goes out, then any write that has been requested, but for which
an 'fsync' hasn't completed, may - or may not - have been written.   Setting
this flag doesn't really change that.

The filesystem should provide some degree of certainty - i.e. either old
data or new data and I believe they mostly do - though ext3 with
journal=writeback explicitly doesn't promise very much.  Beyond that, if you
want any certainty then the app must provide that by using fsync.

So I'm with Christoph here: I don't think the flag should be user-settable.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-21 21:47         ` NeilBrown
@ 2012-11-22  1:47           ` Darrick J. Wong
  2012-11-22  2:36             ` [RFC PATCH 1/2] mm: Introduce page flag to indicate stable page status Darrick J. Wong
                               ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-22  1:47 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> 
> > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > tries to mount in write mode.
> > > > > 
> > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > ---
> > > > >  fs/ext3/super.c |    8 ++++++++
> > > > >  1 file changed, 8 insertions(+)
> > > > > 
> > > > > 
> > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > index 5366393..5b3725d 100644
> > > > > --- a/fs/ext3/super.c
> > > > > +++ b/fs/ext3/super.c
> > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > >  			"forcing read-only mode");
> > > > >  		res = MS_RDONLY;
> > > > >  	}
> > > > > +	if (!read_only &&
> > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > +		res = MS_RDONLY;
> > > > > +	}
> > > > >  	if (read_only)
> > > > >  		return res;
> > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > >   Why this? ext3 should be fixed by your change to
> > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > 
> > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > about it a little more, I recall that writeback mode was actually fine, so this
> > > is a little harsh.
> > > 
> > > Hm... looking at the ordered code a little more, it looks like
> > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > ordered mode, though I have a suspicion that it might very well break ordered
> > > mode too.
> >   Oh, right. kjournald writing buffers directly (without setting
> > PageWriteback) will break things. So please, change warning to:

Maybe we should just fix this anyway?

I still have the patch that adds PG_stable (and changes the
wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
the end_io processing.

It seems to get rid of the checksum-after-write errors, though I'm not
convinced it's correct.  But, I'll send both patches along.

> > 
> > 	/*
> > 	 * In data=ordered mode, kjournald writes buffers without setting
> > 	 * PageWriteback bit thus generic code does not properly wait for
> > 	 * writeback of those buffers to finish.
> > 	 */
> > 	if (!read_only &&
> > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&

test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA

since I bet data=journal mode is also borken wrt PageWriteback.

> > 	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > 		ext3_msg(sb, KERN_ERR,
> > 			"error: data=ordered mode does not support stable "
> > 			"page writes required by the disk; forcing read-only "
> > 			"mode. Upgrading to ext4 is recommended.");
> > 		res = MS_RDONLY;
> > 	}
> > 
> > then you need a similar check in ext3_remount() so that filesystem cannot
> > be remounted read-write.
> > 
> > 								Honza
> 
> Given this restriction, there is no way that I can change md/raid5 to set the
> "stable pages" flag and stop copying pages into the stripe-cache.  ext3 on
> raid5 will be much too common to allow this breakage.
> 
> I would really like to be able to say "I prefer stable pages, but they aren't
> a requirement as long as I know which is which" ....

I'd rather just fix ext3. :)

(or remove it, since ext4 can handle ext3)

--D

> NeilBrown



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] bdi: Track users that require stable page writes
  2012-11-21 22:06       ` NeilBrown
@ 2012-11-22  2:33         ` Darrick J. Wong
  2012-11-22  7:08           ` Christoph Hellwig
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-22  2:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: Christoph Hellwig, axboe, lucho, jack, ericvh, tytso, rminnich,
	viro, martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

How's this look to everyone?  No more user-writable flag, and most of the
helper functions are gone.

(Are these emails even getting through?)

--D
---
This creates a per-backing-device flag that tracks whether or not pages must be
held immutable during writeout.  Eventually it will be used to waive
wait_for_page_writeback() if nobody requires stable pages.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 Documentation/ABI/testing/sysfs-class-bdi |    5 +++++
 block/blk-integrity.c                     |    4 ++++
 include/linux/backing-dev.h               |    6 ++++++
 mm/backing-dev.c                          |   11 +++++++++++
 4 files changed, 26 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index 5f50097..d773d56 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -48,3 +48,8 @@ max_ratio (read-write)
 	most of the write-back cache.  For example in case of an NFS
 	mount that is prone to get stuck, or a FUSE mount which cannot
 	be trusted to play fair.
+
+stable_pages_required (read-only)
+
+	If set, the backing device requires that all pages comprising a write
+	request must not be changed until writeout is complete.
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index da2a818..dabd221 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
 	} else
 		bi->name = bi_unsupported_name;
 
+	disk->queue->backing_dev_info.capabilities |= BDI_CAP_STABLE_WRITES;
+
 	return 0;
 }
 EXPORT_SYMBOL(blk_integrity_register);
@@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk)
 	if (!disk || !disk->integrity)
 		return;
 
+	disk->queue->backing_dev_info.capabilities &= ~BDI_CAP_STABLE_WRITES;
+
 	bi = disk->integrity;
 
 	kobject_uevent(&bi->kobj, KOBJ_REMOVE);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..085501f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -253,6 +253,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_EXEC_MAP	0x00000040
 #define BDI_CAP_NO_ACCT_WB	0x00000080
 #define BDI_CAP_SWAP_BACKED	0x00000100
+#define BDI_CAP_STABLE_WRITES	0x00000200
 
 #define BDI_CAP_VMFLAGS \
 	(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -307,6 +308,11 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
 int pdflush_proc_obsolete(struct ctl_table *table, int write,
 		void __user *buffer, size_t *lenp, loff_t *ppos);
 
+static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_STABLE_WRITES;
+}
+
 static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
 {
 	return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca2b3..41733c5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -221,12 +221,23 @@ static ssize_t max_ratio_store(struct device *dev,
 }
 BDI_SHOW(max_ratio, bdi->max_ratio)
 
+static ssize_t stable_pages_required_show(struct device *dev,
+					  struct device_attribute *attr,
+					  char *page)
+{
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+	return snprintf(page, PAGE_SIZE-1, "%d\n",
+			bdi_cap_stable_pages_required(bdi) ? 1 : 0);
+}
+
 #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
 
 static struct device_attribute bdi_dev_attrs[] = {
 	__ATTR_RW(read_ahead_kb),
 	__ATTR_RW(min_ratio),
 	__ATTR_RW(max_ratio),
+	__ATTR_RO(stable_pages_required),
 	__ATTR_NULL,
 };
 

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 1/2] mm: Introduce page flag to indicate stable page status
  2012-11-22  1:47           ` Darrick J. Wong
@ 2012-11-22  2:36             ` Darrick J. Wong
  2012-11-22  2:36             ` [RFC PATCH 2/2] jbd: Stabilize pages during writes when in ordered mode Darrick J. Wong
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-22  2:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

This patch adds yet another page flag, PG_stable, to indicate that the page's
contents must not be changed because the page is undergoing some sort of
integrity operation (checksumming, digest calculation, etc.)  This change
should enable us to reduce page write latency in userspace apps, particularly
for things like networked filesystems where the page contents needn't be held
stable after a write request is transmitted, even though PG_writeback is still
set.

Also, convert wait_for_stable_page to use this new page flag.  By default,
PG_stable is set for the same duration as PG_writeback.

Signed-off-by: Darick J. Wong <darrick.wong@oracle.com>
---
 include/linux/page-flags.h |    2 ++
 include/linux/pagemap.h    |    1 +
 mm/filemap.c               |    8 ++++++++
 mm/page-writeback.c        |    7 ++++++-
 mm/page_alloc.c            |    3 +++
 5 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index b5d1384..83f41bf 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -109,6 +109,7 @@ enum pageflags {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	PG_compound_lock,
 #endif
+	PG_stable,		/* page is immutable during a writeout */
 	__NR_PAGEFLAGS,
 
 	/* Filesystems */
@@ -208,6 +209,7 @@ PAGEFLAG(Pinned, pinned) TESTSCFLAG(Pinned, pinned)	/* Xen */
 PAGEFLAG(SavePinned, savepinned);			/* Xen */
 PAGEFLAG(Reserved, reserved) __CLEARPAGEFLAG(Reserved, reserved)
 PAGEFLAG(SwapBacked, swapbacked) __CLEARPAGEFLAG(SwapBacked, swapbacked)
+PAGEFLAG(Stable, stable) TESTSETFLAG(Stable, stable)
 
 __PAGEFLAG(SlobFree, slob_free)
 
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index ea631ee..0e0a006 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -399,6 +399,7 @@ static inline void wait_on_page_writeback(struct page *page)
 
 extern void end_page_writeback(struct page *page);
 void wait_for_stable_page(struct page *page);
+void clear_page_stable(struct page *page);
 
 /*
  * Add an arbitrary waiter to a page's wait queue
diff --git a/mm/filemap.c b/mm/filemap.c
index 5577dc8..88dc3b7 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -534,6 +534,13 @@ static inline void wake_up_page(struct page *page, int bit)
 	__wake_up_bit(page_waitqueue(page), &page->flags, bit);
 }
 
+void clear_page_stable(struct page *page)
+{
+	ClearPageStable(page);
+	wake_up_page(page, PG_stable);
+}
+EXPORT_SYMBOL_GPL(clear_page_stable);
+
 void wait_on_page_bit(struct page *page, int bit_nr)
 {
 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
@@ -608,6 +615,7 @@ void end_page_writeback(struct page *page)
 
 	smp_mb__after_clear_bit();
 	wake_up_page(page, PG_writeback);
+	wake_up_page(page, PG_stable);
 }
 EXPORT_SYMBOL(end_page_writeback);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 1bc0edf..2cd8155 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2209,6 +2209,7 @@ int test_clear_page_writeback(struct page *page)
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
+		ClearPageStable(page);
 		ret = TestClearPageWriteback(page);
 		if (ret) {
 			radix_tree_tag_clear(&mapping->page_tree,
@@ -2221,6 +2222,7 @@ int test_clear_page_writeback(struct page *page)
 		}
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
+		ClearPageStable(page);
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
@@ -2240,6 +2242,7 @@ int test_set_page_writeback(struct page *page)
 		unsigned long flags;
 
 		spin_lock_irqsave(&mapping->tree_lock, flags);
+		SetPageStable(page);
 		ret = TestSetPageWriteback(page);
 		if (!ret) {
 			radix_tree_tag_set(&mapping->page_tree,
@@ -2257,6 +2260,7 @@ int test_set_page_writeback(struct page *page)
 				     PAGECACHE_TAG_TOWRITE);
 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
+		SetPageStable(page);
 		ret = TestSetPageWriteback(page);
 	}
 	if (!ret)
@@ -2291,6 +2295,7 @@ void wait_for_stable_page(struct page *page)
 	if (!bdi_cap_stable_pages_required(bdi))
 		return;
 
-	wait_on_page_writeback(page);
+	if (PageStable(page))
+		wait_on_page_bit(page, PG_stable);
 }
 EXPORT_SYMBOL_GPL(wait_for_stable_page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bb35ac..2b308d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6056,6 +6056,9 @@ static const struct trace_print_flags pageflag_names[] = {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	{1UL << PG_compound_lock,	"compound_lock"	},
 #endif
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	{1UL << PG_stable,		"stable" },
+#endif
 };
 
 static void dump_page_flags(unsigned long flags)

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 2/2] jbd: Stabilize pages during writes when in ordered mode
  2012-11-22  1:47           ` Darrick J. Wong
  2012-11-22  2:36             ` [RFC PATCH 1/2] mm: Introduce page flag to indicate stable page status Darrick J. Wong
@ 2012-11-22  2:36             ` Darrick J. Wong
  2012-11-22  9:19               ` Jan Kara
  2012-11-22  9:12             ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Jan Kara
  2012-11-22 23:15             ` Dave Chinner
  3 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-22  2:36 UTC (permalink / raw)
  To: NeilBrown
  Cc: Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

When writing buffers out to disk ahead of committing a transaction, set the
Stable bit on the page to prevent others from wandering in and modifying the
page.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/jbd/commit.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
index 86b39b1..b1f0eed 100644
--- a/fs/jbd/commit.c
+++ b/fs/jbd/commit.c
@@ -155,13 +155,29 @@ static int journal_write_commit_record(journal_t *journal,
 	return (ret == -EIO);
 }
 
+static void end_stable_write_sync(struct buffer_head *bh, int uptodate)
+{
+	clear_page_stable(bh->b_page);
+	end_buffer_write_sync(bh, uptodate);
+}
+
 static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
 				   int write_op)
 {
 	int i;
 
 	for (i = 0; i < bufs; i++) {
-		wbuf[i]->b_end_io = end_buffer_write_sync;
+		struct page *p = wbuf[i]->b_page;
+		if (TestSetPageStable(p))
+			wbuf[i]->b_end_io = end_buffer_write_sync;
+		else
+			wbuf[i]->b_end_io = end_stable_write_sync;
+
+		if (trylock_page(p)) {
+			clear_page_dirty_for_io(p);
+			unlock_page(p);
+		}
+
 		/* We use-up our safety reference in submit_bh() */
 		submit_bh(write_op, wbuf[i]);
 	}

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] bdi: Track users that require stable page writes
  2012-11-22  2:33         ` [PATCH] " Darrick J. Wong
@ 2012-11-22  7:08           ` Christoph Hellwig
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Hellwig @ 2012-11-22  7:08 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: NeilBrown, Christoph Hellwig, axboe, lucho, jack, ericvh, tytso,
	rminnich, viro, martin.petersen, david, linux-kernel,
	linux-fsdevel, adilger.kernel, bharrosh, jlayton, v9fs-developer,
	linux-ext4

On Wed, Nov 21, 2012 at 06:33:43PM -0800, Darrick J. Wong wrote:
> How's this look to everyone?  No more user-writable flag, and most of the
> helper functions are gone.

Now fix up the subject line as well and we might have a deal :)

> (Are these emails even getting through?)

At least this one is.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-22  1:47           ` Darrick J. Wong
  2012-11-22  2:36             ` [RFC PATCH 1/2] mm: Introduce page flag to indicate stable page status Darrick J. Wong
  2012-11-22  2:36             ` [RFC PATCH 2/2] jbd: Stabilize pages during writes when in ordered mode Darrick J. Wong
@ 2012-11-22  9:12             ` Jan Kara
  2012-11-27  2:17               ` Darrick J. Wong
  2012-11-22 23:15             ` Dave Chinner
  3 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2012-11-22  9:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: NeilBrown, Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > 
> > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > tries to mount in write mode.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > index 5366393..5b3725d 100644
> > > > > > --- a/fs/ext3/super.c
> > > > > > +++ b/fs/ext3/super.c
> > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > >  			"forcing read-only mode");
> > > > > >  		res = MS_RDONLY;
> > > > > >  	}
> > > > > > +	if (!read_only &&
> > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > +		res = MS_RDONLY;
> > > > > > +	}
> > > > > >  	if (read_only)
> > > > > >  		return res;
> > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > >   Why this? ext3 should be fixed by your change to
> > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > 
> > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > is a little harsh.
> > > > 
> > > > Hm... looking at the ordered code a little more, it looks like
> > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > mode too.
> > >   Oh, right. kjournald writing buffers directly (without setting
> > > PageWriteback) will break things. So please, change warning to:
> 
> Maybe we should just fix this anyway?
> 
> I still have the patch that adds PG_stable (and changes the
> wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> the end_io processing.
> 
> It seems to get rid of the checksum-after-write errors, though I'm not
> convinced it's correct.  But, I'll send both patches along.
  I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
doable due to lock ranking constraints - PageWriteback has to be set under
PageLocked but that ranks above transaction start so kjournald cannot grab
page locks so it cannot set PageWriteback... And changing the lock ordering
is a major surgery.

What could be doable is waiting for buffer locks from ext3's ->write_begin
and ->page_mkwrite implementations in case stable writes are required. If
your approach with a separate page bit doesn't work out (and I have some
doubts about that as mm people are *really* thrifty with page bits).

> > > 	/*
> > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > 	 * writeback of those buffers to finish.
> > > 	 */
> > > 	if (!read_only &&
> > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> 
> test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> 
> since I bet data=journal mode is also borken wrt PageWriteback.
  It is broken wrt PageWriteback but it actually waits for buffer locks in
->write_begin() so at least write path should be properly protected. But
mmap is not handled properly there (although that wouldn't be that hard to
fix). So I agree the condition should rather be what you suggest.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 2/2] jbd: Stabilize pages during writes when in ordered mode
  2012-11-22  2:36             ` [RFC PATCH 2/2] jbd: Stabilize pages during writes when in ordered mode Darrick J. Wong
@ 2012-11-22  9:19               ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2012-11-22  9:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: NeilBrown, Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed 21-11-12 18:36:45, Darrick J. Wong wrote:
> When writing buffers out to disk ahead of committing a transaction, set the
> Stable bit on the page to prevent others from wandering in and modifying the
> page.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/jbd/commit.c |   18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/jbd/commit.c b/fs/jbd/commit.c
> index 86b39b1..b1f0eed 100644
> --- a/fs/jbd/commit.c
> +++ b/fs/jbd/commit.c
> @@ -155,13 +155,29 @@ static int journal_write_commit_record(journal_t *journal,
>  	return (ret == -EIO);
>  }
>  
> +static void end_stable_write_sync(struct buffer_head *bh, int uptodate)
> +{
> +	clear_page_stable(bh->b_page);
> +	end_buffer_write_sync(bh, uptodate);
> +}
  This doesn't work when blocksize < pagesize... There can be more buffers
in one page under IO so you can clear the stable bit only after the last of
them is done. See how PageWriteback bit is handled in
end_buffer_async_write().

> +
>  static void journal_do_submit_data(struct buffer_head **wbuf, int bufs,
>  				   int write_op)
>  {
>  	int i;
>  
>  	for (i = 0; i < bufs; i++) {
> -		wbuf[i]->b_end_io = end_buffer_write_sync;
> +		struct page *p = wbuf[i]->b_page;
> +		if (TestSetPageStable(p))
> +			wbuf[i]->b_end_io = end_buffer_write_sync;
> +		else
> +			wbuf[i]->b_end_io = end_stable_write_sync;
  Umm, why is this? It presume it is some attempt to handle blocksize <
page size?

> +
> +		if (trylock_page(p)) {
> +			clear_page_dirty_for_io(p);
> +			unlock_page(p);
> +		}
> +
  And this is wrong again for blocksize < pagesize. There can be other
dirty buffers under the page...

>  		/* We use-up our safety reference in submit_bh() */
>  		submit_bh(write_op, wbuf[i]);
>  	}

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-22  1:47           ` Darrick J. Wong
                               ` (2 preceding siblings ...)
  2012-11-22  9:12             ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Jan Kara
@ 2012-11-22 23:15             ` Dave Chinner
  3 siblings, 0 replies; 27+ messages in thread
From: Dave Chinner @ 2012-11-22 23:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: NeilBrown, Jan Kara, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, linux-kernel, linux-fsdevel, adilger.kernel,
	bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed, Nov 21, 2012 at 05:47:55PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > 
> > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > tries to mount in write mode.
> > > > > > 
> > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > ---
> > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > >  1 file changed, 8 insertions(+)
> > > > > > 
> > > > > > 
> > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > index 5366393..5b3725d 100644
> > > > > > --- a/fs/ext3/super.c
> > > > > > +++ b/fs/ext3/super.c
> > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > >  			"forcing read-only mode");
> > > > > >  		res = MS_RDONLY;
> > > > > >  	}
> > > > > > +	if (!read_only &&
> > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > +		res = MS_RDONLY;
> > > > > > +	}
> > > > > >  	if (read_only)
> > > > > >  		return res;
> > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > >   Why this? ext3 should be fixed by your change to
> > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > 
> > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > is a little harsh.
> > > > 
> > > > Hm... looking at the ordered code a little more, it looks like
> > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > mode too.
> > >   Oh, right. kjournald writing buffers directly (without setting
> > > PageWriteback) will break things. So please, change warning to:
> 
> Maybe we should just fix this anyway?
> 
> I still have the patch that adds PG_stable (and changes the

Page flags are in short supply. If you need a page flag to fix a bug
in a single filesytem, there's a couple of reserved private page
flags that can be used. i.e PG_owner_priv_1. I don't think using a
glbal page flag for just this purpose will fly, though...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-22  9:12             ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Jan Kara
@ 2012-11-27  2:17               ` Darrick J. Wong
  2012-12-05 12:12                 ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-11-27  2:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: NeilBrown, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > > 
> > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > > tries to mount in write mode.
> > > > > > > 
> > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > ---
> > > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > 
> > > > > > > 
> > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > index 5366393..5b3725d 100644
> > > > > > > --- a/fs/ext3/super.c
> > > > > > > +++ b/fs/ext3/super.c
> > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > > >  			"forcing read-only mode");
> > > > > > >  		res = MS_RDONLY;
> > > > > > >  	}
> > > > > > > +	if (!read_only &&
> > > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > > +		res = MS_RDONLY;
> > > > > > > +	}
> > > > > > >  	if (read_only)
> > > > > > >  		return res;
> > > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > 
> > > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > > is a little harsh.
> > > > > 
> > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > > mode too.
> > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > PageWriteback) will break things. So please, change warning to:
> > 
> > Maybe we should just fix this anyway?
> > 
> > I still have the patch that adds PG_stable (and changes the
> > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> > around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> > the end_io processing.
> > 
> > It seems to get rid of the checksum-after-write errors, though I'm not
> > convinced it's correct.  But, I'll send both patches along.
>   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> doable due to lock ranking constraints - PageWriteback has to be set under
> PageLocked but that ranks above transaction start so kjournald cannot grab
> page locks so it cannot set PageWriteback... And changing the lock ordering
> is a major surgery.
> 
> What could be doable is waiting for buffer locks from ext3's ->write_begin
> and ->page_mkwrite implementations in case stable writes are required. If
> your approach with a separate page bit doesn't work out (and I have some
> doubts about that as mm people are *really* thrifty with page bits).
> 
> > > > 	/*
> > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > 	 * writeback of those buffers to finish.
> > > > 	 */
> > > > 	if (!read_only &&
> > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > 
> > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > 
> > since I bet data=journal mode is also borken wrt PageWriteback.
>   It is broken wrt PageWriteback but it actually waits for buffer locks in
> ->write_begin() so at least write path should be properly protected. But
> mmap is not handled properly there (although that wouldn't be that hard to
> fix). So I agree the condition should rather be what you suggest.

Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
buffer for a given page, and in turn, jbd's do_get_write_access calls
lock_buffer.  Is that what you're referring to by "actually waits for buffer
locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
set in that path, and I think it's a little early to be setting PG_writeback
anyway.

If the page has to be locked before the transaction starts, how much of a
problem is it to set PG_writeback?  Even though that seems a bit early to be
doing that?

Just for fun, I tried porting ext4_page_mkwrite into ext3 (removing all the
parts that don't exist in ext3) so that do_journal_get_write_access would also
get called here, but it didn't seem to fix journal mode.  

--D
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-11-27  2:17               ` Darrick J. Wong
@ 2012-12-05 12:12                 ` Jan Kara
  2012-12-08  1:09                   ` Darrick J. Wong
  0 siblings, 1 reply; 27+ messages in thread
From: Jan Kara @ 2012-12-05 12:12 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, NeilBrown, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > > > 
> > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > > > tries to mount in write mode.
> > > > > > > > 
> > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > ---
> > > > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > > 
> > > > > > > > 
> > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > > index 5366393..5b3725d 100644
> > > > > > > > --- a/fs/ext3/super.c
> > > > > > > > +++ b/fs/ext3/super.c
> > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > > > >  			"forcing read-only mode");
> > > > > > > >  		res = MS_RDONLY;
> > > > > > > >  	}
> > > > > > > > +	if (!read_only &&
> > > > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > > > +		res = MS_RDONLY;
> > > > > > > > +	}
> > > > > > > >  	if (read_only)
> > > > > > > >  		return res;
> > > > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > > 
> > > > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > > > is a little harsh.
> > > > > > 
> > > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > > > mode too.
> > > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > > PageWriteback) will break things. So please, change warning to:
> > > 
> > > Maybe we should just fix this anyway?
> > > 
> > > I still have the patch that adds PG_stable (and changes the
> > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> > > around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> > > the end_io processing.
> > > 
> > > It seems to get rid of the checksum-after-write errors, though I'm not
> > > convinced it's correct.  But, I'll send both patches along.
> >   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> > doable due to lock ranking constraints - PageWriteback has to be set under
> > PageLocked but that ranks above transaction start so kjournald cannot grab
> > page locks so it cannot set PageWriteback... And changing the lock ordering
> > is a major surgery.
> > 
> > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > and ->page_mkwrite implementations in case stable writes are required. If
> > your approach with a separate page bit doesn't work out (and I have some
> > doubts about that as mm people are *really* thrifty with page bits).
> > 
> > > > > 	/*
> > > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > > 	 * writeback of those buffers to finish.
> > > > > 	 */
> > > > > 	if (!read_only &&
> > > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > 
> > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > > 
> > > since I bet data=journal mode is also borken wrt PageWriteback.
> >   It is broken wrt PageWriteback but it actually waits for buffer locks in
> > ->write_begin() so at least write path should be properly protected. But
> > mmap is not handled properly there (although that wouldn't be that hard to
> > fix). So I agree the condition should rather be what you suggest.
  Sorry for late reply. I was on vacation...

> Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
> buffer for a given page, and in turn, jbd's do_get_write_access calls
> lock_buffer.  Is that what you're referring to by "actually waits for buffer
> locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
> set in that path, and I think it's a little early to be setting PG_writeback
> anyway.
  It does help us. In ext3 data writeback is done either by flusher thread,
that happens under PG_Writeback and generic code waits for that as need, or
by kjournald - that happens under buffer lock and as you properly observed
do_get_write_access() waits for that (and actually copies data that should
go to disk to a separate buffer if needed).

> If the page has to be locked before the transaction starts, how much of a
> problem is it to set PG_writeback?  Even though that seems a bit early to be
> doing that?
  Well, what you would need to make things consistent is to set
PG_writeback from kjournald so that all writeback happens with PG_writeback
set on the page. But setting has to happen while the page is locked and
kjournald can never block on page lock because that would cause
deadlocks...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-12-05 12:12                 ` Jan Kara
@ 2012-12-08  1:09                   ` Darrick J. Wong
  2012-12-10 10:41                     ` Jan Kara
  0 siblings, 1 reply; 27+ messages in thread
From: Darrick J. Wong @ 2012-12-08  1:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: NeilBrown, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote:
> On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > > On Thu, Nov 22, 2012 at 08:47:13AM +1100, NeilBrown wrote:
> > > > > On Wed, 21 Nov 2012 22:33:33 +0100 Jan Kara <jack@suse.cz> wrote:
> > > > > 
> > > > > > On Wed 21-11-12 13:13:19, Darrick J. Wong wrote:
> > > > > > > On Wed, Nov 21, 2012 at 03:15:43AM +0100, Jan Kara wrote:
> > > > > > > > On Tue 20-11-12 18:00:56, Darrick J. Wong wrote:
> > > > > > > > > ext3 doesn't properly isolate pages from changes during writeback.  Since the
> > > > > > > > > recommended fix is to use ext4, for now we'll just print a warning if the user
> > > > > > > > > tries to mount in write mode.
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > > > > > > > ---
> > > > > > > > >  fs/ext3/super.c |    8 ++++++++
> > > > > > > > >  1 file changed, 8 insertions(+)
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> > > > > > > > > index 5366393..5b3725d 100644
> > > > > > > > > --- a/fs/ext3/super.c
> > > > > > > > > +++ b/fs/ext3/super.c
> > > > > > > > > @@ -1325,6 +1325,14 @@ static int ext3_setup_super(struct super_block *sb, struct ext3_super_block *es,
> > > > > > > > >  			"forcing read-only mode");
> > > > > > > > >  		res = MS_RDONLY;
> > > > > > > > >  	}
> > > > > > > > > +	if (!read_only &&
> > > > > > > > > +	    queue_requires_stable_pages(bdev_get_queue(sb->s_bdev))) {
> > > > > > > > > +		ext3_msg(sb, KERN_ERR,
> > > > > > > > > +			"error: ext3 cannot safely write data to a disk "
> > > > > > > > > +			"requiring stable pages writes; forcing read-only "
> > > > > > > > > +			"mode.  Upgrading to ext4 is recommended.");
> > > > > > > > > +		res = MS_RDONLY;
> > > > > > > > > +	}
> > > > > > > > >  	if (read_only)
> > > > > > > > >  		return res;
> > > > > > > > >  	if (!(sbi->s_mount_state & EXT3_VALID_FS))
> > > > > > > >   Why this? ext3 should be fixed by your change to
> > > > > > > > filemap_page_mkwrite()... Or does testing show otherwise?
> > > > > > > 
> > > > > > > Yes, it's still broken even with this new set of changes.  Now that I think
> > > > > > > about it a little more, I recall that writeback mode was actually fine, so this
> > > > > > > is a little harsh.
> > > > > > > 
> > > > > > > Hm... looking at the ordered code a little more, it looks like
> > > > > > > ext3_ordered_write_end is calling journal_dirty_data_fn, which (I guess?) tries
> > > > > > > to write mapped buffers back through the journal?  Taking it out seems to fix
> > > > > > > ordered mode, though I have a suspicion that it might very well break ordered
> > > > > > > mode too.
> > > > > >   Oh, right. kjournald writing buffers directly (without setting
> > > > > > PageWriteback) will break things. So please, change warning to:
> > > > 
> > > > Maybe we should just fix this anyway?
> > > > 
> > > > I still have the patch that adds PG_stable (and changes the
> > > > wait_for_page_stable() test to use this flag instead of PG_writeback) kicking
> > > > around in my tree.  I wrote a patch to jbd that changes journal_do_submit_data
> > > > to set PG_stable, call clear_page_dirty_for_io(), and unsets the stable bit in
> > > > the end_io processing.
> > > > 
> > > > It seems to get rid of the checksum-after-write errors, though I'm not
> > > > convinced it's correct.  But, I'll send both patches along.
> > >   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> > > doable due to lock ranking constraints - PageWriteback has to be set under
> > > PageLocked but that ranks above transaction start so kjournald cannot grab
> > > page locks so it cannot set PageWriteback... And changing the lock ordering
> > > is a major surgery.
> > > 
> > > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > > and ->page_mkwrite implementations in case stable writes are required. If
> > > your approach with a separate page bit doesn't work out (and I have some
> > > doubts about that as mm people are *really* thrifty with page bits).
> > > 
> > > > > > 	/*
> > > > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > > > 	 * writeback of those buffers to finish.
> > > > > > 	 */
> > > > > > 	if (!read_only &&
> > > > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > > 
> > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > > > 
> > > > since I bet data=journal mode is also borken wrt PageWriteback.
> > >   It is broken wrt PageWriteback but it actually waits for buffer locks in
> > > ->write_begin() so at least write path should be properly protected. But
> > > mmap is not handled properly there (although that wouldn't be that hard to
> > > fix). So I agree the condition should rather be what you suggest.
>   Sorry for late reply. I was on vacation...

No worries; I have plenty of other patchsets to work on. :)

> > Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
> > buffer for a given page, and in turn, jbd's do_get_write_access calls
> > lock_buffer.  Is that what you're referring to by "actually waits for buffer
> > locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
> > set in that path, and I think it's a little early to be setting PG_writeback
> > anyway.
>   It does help us. In ext3 data writeback is done either by flusher thread,
> that happens under PG_Writeback and generic code waits for that as need, or
> by kjournald - that happens under buffer lock and as you properly observed
> do_get_write_access() waits for that (and actually copies data that should
> go to disk to a separate buffer if needed).

Hm.  jbd2 calls generic_writepages to flush those buffers out, which sets
PG_writeback like you'd expect.  It'd be nice if you could do that like ext4
does... but then the ext3 writepage functions can call journal start/stop.

Maybe we could call block_write_full_page directly?

> > If the page has to be locked before the transaction starts, how much of a
> > problem is it to set PG_writeback?  Even though that seems a bit early to be
> > doing that?
>   Well, what you would need to make things consistent is to set
> PG_writeback from kjournald so that all writeback happens with PG_writeback
> set on the page. But setting has to happen while the page is locked and
> kjournald can never block on page lock because that would cause
> deadlocks...

Right now we try to submit_bh() everything on commit->t_sync_datalist without
the page lock.  I guess we could try to lock (and set writeback on) every page
on that list, and anything that we can lock then goes on a new
t_locked_for_sync list.  If we find a page that's already locked, simply back
out to the main kjournald loop and try the whole mess again.  Once all the
pages are on the locked_for_sync list, then we can unlock the pages and
actually commit the transaction.

But, uck, that feels like courting disaster.  Do people sleep with a page
locked?  I'm unsure of the wisdom of making jbd do that.  Can we stall out
forever if someone keeps shoving pages onto t_sync_data?

I actually tried a dumb trylock_page loop in journal_do_submit_data, but it
quickly stalled.

Ok, enough rambling, I'll give a few of my harebrained suggestions a try and
see what results.

--D
> 
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes
  2012-12-08  1:09                   ` Darrick J. Wong
@ 2012-12-10 10:41                     ` Jan Kara
  0 siblings, 0 replies; 27+ messages in thread
From: Jan Kara @ 2012-12-10 10:41 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Jan Kara, NeilBrown, axboe, lucho, ericvh, tytso, rminnich, viro,
	martin.petersen, david, linux-kernel, linux-fsdevel,
	adilger.kernel, bharrosh, jlayton, v9fs-developer, linux-ext4

On Fri 07-12-12 17:09:58, Darrick J. Wong wrote:
> On Wed, Dec 05, 2012 at 01:12:28PM +0100, Jan Kara wrote:
> > On Mon 26-11-12 18:17:40, Darrick J. Wong wrote:
> > > On Thu, Nov 22, 2012 at 10:12:40AM +0100, Jan Kara wrote:
> > > > On Wed 21-11-12 17:47:55, Darrick J. Wong wrote:
> > > >   I'll check the patches. Fixing PageWriteback logic for ext3 is not easily
> > > > doable due to lock ranking constraints - PageWriteback has to be set under
> > > > PageLocked but that ranks above transaction start so kjournald cannot grab
> > > > page locks so it cannot set PageWriteback... And changing the lock ordering
> > > > is a major surgery.
> > > > 
> > > > What could be doable is waiting for buffer locks from ext3's ->write_begin
> > > > and ->page_mkwrite implementations in case stable writes are required. If
> > > > your approach with a separate page bit doesn't work out (and I have some
> > > > doubts about that as mm people are *really* thrifty with page bits).
> > > > 
> > > > > > > 	/*
> > > > > > > 	 * In data=ordered mode, kjournald writes buffers without setting
> > > > > > > 	 * PageWriteback bit thus generic code does not properly wait for
> > > > > > > 	 * writeback of those buffers to finish.
> > > > > > > 	 */
> > > > > > > 	if (!read_only &&
> > > > > > > 	    test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_ORDERED_DATA &&
> > > > > 
> > > > > test_opt(sb, DATA_FLAGS) != EXT3_MOUNT_WRITEBACK_DATA
> > > > > 
> > > > > since I bet data=journal mode is also borken wrt PageWriteback.
> > > >   It is broken wrt PageWriteback but it actually waits for buffer locks in
> > > > ->write_begin() so at least write path should be properly protected. But
> > > > mmap is not handled properly there (although that wouldn't be that hard to
> > > > fix). So I agree the condition should rather be what you suggest.
> >   Sorry for late reply. I was on vacation...
> 
> No worries; I have plenty of other patchsets to work on. :)
> 
> > > Hm.  In journal mode, write_begin calls do_journal_get_write_access on each
> > > buffer for a given page, and in turn, jbd's do_get_write_access calls
> > > lock_buffer.  Is that what you're referring to by "actually waits for buffer
> > > locks"?  I'm wondering how that helps us, since afaict PG_writeback doesn't get
> > > set in that path, and I think it's a little early to be setting PG_writeback
> > > anyway.
> >   It does help us. In ext3 data writeback is done either by flusher thread,
> > that happens under PG_Writeback and generic code waits for that as need, or
> > by kjournald - that happens under buffer lock and as you properly observed
> > do_get_write_access() waits for that (and actually copies data that should
> > go to disk to a separate buffer if needed).
> 
> Hm.  jbd2 calls generic_writepages to flush those buffers out, which sets
> PG_writeback like you'd expect.  It'd be nice if you could do that like ext4
> does... but then the ext3 writepage functions can call journal start/stop.
  Yes, it would be nice if ext3 did it the way ext4 does but as I already
said earlier, that's a major change to the way locking is done. ext3 has
lock ordering PageLock -> transaction start, while ext4 has lock ordering
transaction start -> PageLock. This influences a lot of places so it's too
big change to do to ext3 which is in maintenance only mode.

> Maybe we could call block_write_full_page directly?
  You have to have the page locked to call that function. Otherwise someone
could invalidate the page under your hands (or do other sort of things
function does not expect).

> > > If the page has to be locked before the transaction starts, how much of a
> > > problem is it to set PG_writeback?  Even though that seems a bit early to be
> > > doing that?
> >   Well, what you would need to make things consistent is to set
> > PG_writeback from kjournald so that all writeback happens with PG_writeback
> > set on the page. But setting has to happen while the page is locked and
> > kjournald can never block on page lock because that would cause
> > deadlocks...
> 
> Right now we try to submit_bh() everything on commit->t_sync_datalist without
> the page lock.  I guess we could try to lock (and set writeback on) every page
> on that list, and anything that we can lock then goes on a new
> t_locked_for_sync list.  If we find a page that's already locked, simply back
> out to the main kjournald loop and try the whole mess again.  Once all the
> pages are on the locked_for_sync list, then we can unlock the pages and
> actually commit the transaction.
  That's just more complicated way of implementing page lock ;) It won't
remove deadlocks.

> But, uck, that feels like courting disaster.  Do people sleep with a page
> locked?
  Yes, they do - for example in ext3_write_begin(), ext3_journal_start()
can sleep waiting for kjournald to commit a transaction :). That's the
deadlock I'm talking about...

>  I'm unsure of the wisdom of making jbd do that.  Can we stall out
> forever if someone keeps shoving pages onto t_sync_data?
  That's not an issue. Once transaction commit starts, no new buffers can
be added to it.

  Hum, I was thinking about how to implement proper exclusion in ext3 using
buffer locks and it won't be easy. Buffered writes are simple enough but
mmap is tricky - we'd have to writeprotect the page at the moment we add
buffer to t_sync_datalist (at that moment we hold the page lock so that is
doable). That page_mkwrite will have to wait for transaction commit (or
write the buffers) if underlying buffers are in t_sync_datalist. That will
make mixing of buffered IO and mmap writes to the same page very slow but
maybe that would be acceptable. I'll think about it some more...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-12-10 10:41 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-21  2:00 [PATCH v2.1 0/3] mm/fs: Implement faster stable page writes on filesystems Darrick J. Wong
2012-11-21  2:00 ` [PATCH 1/4] bdi: Track users that require stable page writes Darrick J. Wong
2012-11-21  7:54   ` Christoph Hellwig
2012-11-21 10:52     ` Christoph Hellwig
2012-11-21 10:56   ` Christoph Hellwig
2012-11-21 21:52     ` Darrick J. Wong
2012-11-21 22:06       ` NeilBrown
2012-11-22  2:33         ` [PATCH] " Darrick J. Wong
2012-11-22  7:08           ` Christoph Hellwig
2012-11-21  2:00 ` [PATCH 2/4] mm: Only enforce stable page writes if the backing device requires it Darrick J. Wong
2012-11-21 10:57   ` Christoph Hellwig
2012-11-21  2:00 ` [PATCH 3/4] 9pfs: Fix filesystem to wait for stable page writeback Darrick J. Wong
2012-11-21  2:00 ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Darrick J. Wong
2012-11-21  2:15   ` Jan Kara
2012-11-21 21:13     ` Darrick J. Wong
2012-11-21 21:33       ` Jan Kara
2012-11-21 21:47         ` NeilBrown
2012-11-22  1:47           ` Darrick J. Wong
2012-11-22  2:36             ` [RFC PATCH 1/2] mm: Introduce page flag to indicate stable page status Darrick J. Wong
2012-11-22  2:36             ` [RFC PATCH 2/2] jbd: Stabilize pages during writes when in ordered mode Darrick J. Wong
2012-11-22  9:19               ` Jan Kara
2012-11-22  9:12             ` [PATCH 4/4] ext3: Warn if mounting rw on a disk requiring stable page writes Jan Kara
2012-11-27  2:17               ` Darrick J. Wong
2012-12-05 12:12                 ` Jan Kara
2012-12-08  1:09                   ` Darrick J. Wong
2012-12-10 10:41                     ` Jan Kara
2012-11-22 23:15             ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).