* [resend PATCH 0/3] fs, bdev: handle end of life
From: Dan Williams @ 2016-01-04 18:20 UTC
  To: xfs
  Cc: linux-block, linux-nvdimm, Dave Chinner, Jens Axboe,
	linux-fsdevel, Jan Kara, Tejun Heo, Matthew Wilcox, Ross Zwisler

Per Dave, resend to include the xfs list on the full set.  These are
against v4.4-rc5.

---

As mentioned in [PATCH 1/3] "block, fs: reliably communicate bdev
end-of-life", historically we have waited for filesystem-specific
heuristics to attempt to guess when a block device is gone.  Sometimes
this works, but in other cases the system can hang waiting for the fs to
trigger its shutdown protocol.

Now with DAX we need new actions, like unmapping all inodes, to be taken
upon a shutdown event.  Those actions need to be taken whether the
shutdown event comes from the block device being torn down or from some
other filesystem-specific event.

For now, the approach taken in the following patches only affects xfs
and block drivers that are converted to use del_gendisk_queue().  We can
add more filesystems and driver support over time.

Note that 'bdi_gone' was chosen over 'shutdown' so as not to be confused
with generic_shutdown_super().

---

Dan Williams (3):
      block, fs: reliably communicate bdev end-of-life
      xfs: handle shutdown notifications
      writeback: fix false positive WARN in __mark_inode_dirty


 block/genhd.c                |   87 +++++++++++++++++++++++++++++++++++-------
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 +--
 fs/block_dev.c               |   79 +++++++++++++++++++++++++++++++++-----
 fs/xfs/xfs_super.c           |    9 ++++
 include/linux/fs.h           |    4 ++
 include/linux/genhd.h        |    1 +
 mm/backing-dev.c             |    7 +++
 9 files changed, 166 insertions(+), 33 deletions(-)


* [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
From: Dan Williams @ 2016-01-04 18:20 UTC
  To: xfs
  Cc: linux-block, linux-nvdimm, Dave Chinner, Jens Axboe, Jan Kara,
	linux-fsdevel, Matthew Wilcox, Ross Zwisler

Historically we have waited for filesystem-specific heuristics to
attempt to guess when a block device is gone.  Sometimes this works, but
in other cases the system can hang waiting for the fs to trigger its
shutdown protocol.

The initial motivation for this investigation was to prevent DAX
mappings (direct mmap access to persistent memory) from leaking past the
lifetime of the hosting block device.  However, Dave points out that
these shutdown operations are needed in other scenarios.  Quoting Dave:

    For example, if we detect a free space corruption during allocation,
    it is not safe to trust *any active mapping* because we can't trust
    that we haven't handed out the same block to multiple owners. Hence
    on such a filesystem shutdown, we have to prevent any new DAX
    mapping from occurring and invalidate all existing mappings as we
    cannot allow userspace to modify any data or metadata until we've
    resolved the corruption situation.

The current block device shutdown sequence of del_gendisk +
blk_cleanup_queue is problematic.  We want to tell the fs after
blk_cleanup_queue that there is no possibility of recovery, but by that
time we have deleted partitions and lost the ability to find all the
superblocks on a block device.

Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
notifications to all the filesystems hosted on the disk, where
->quiesce() covers 'shutdown' operations while the bdev may still be alive,
and ->bdi_gone() covers actions to take after the backing device
is known to be permanently dead.

generic_quiesce_super and generic_bdi_gone are the default operations
when a filesystem does not implement ->quiesce() or ->bdi_gone().  They
invalidate inodes and unmap DAX inodes, respectively.  For now only
->bdi_gone() has an associated super operation, as xfs will implement
->bdi_gone() in a later patch.

Cc: Jan Kara <jack@suse.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/genhd.c                |   87 +++++++++++++++++++++++++++++++++++-------
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 +--
 fs/block_dev.c               |   78 ++++++++++++++++++++++++++++++++------
 include/linux/fs.h           |    2 +
 include/linux/genhd.h        |    1 +
 7 files changed, 147 insertions(+), 33 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa51567c..967fcfd63c98 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -634,24 +634,14 @@ void add_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(add_disk);
 
-void del_gendisk(struct gendisk *disk)
+static void del_gendisk_start(struct gendisk *disk)
 {
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-
 	blk_integrity_del(disk);
 	disk_del_events(disk);
+}
 
-	/* invalidate stuff */
-	disk_part_iter_init(&piter, disk,
-			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
-	while ((part = disk_part_iter_next(&piter))) {
-		invalidate_partition(disk, part->partno);
-		delete_partition(disk, part->partno);
-	}
-	disk_part_iter_exit(&piter);
-
-	invalidate_partition(disk, 0);
+static void del_gendisk_end(struct gendisk *disk)
+{
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
@@ -670,9 +660,78 @@ void del_gendisk(struct gendisk *disk)
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 }
+
+#define for_each_part(part, piter) \
+	for (part = disk_part_iter_next(piter); part; \
+			part = disk_part_iter_next(piter))
+void del_gendisk(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* invalidate stuff */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		invalidate_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
+ * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue
+ * @disk: disk to delete, invalidate, unmap, and end i/o
+ *
+ * This alternative for open coded calls to:
+ *     del_gendisk()
+ *     blk_cleanup_queue()
+ * ...is for notifying the filesystem that the block device has gone
+ * away.  This distinction is important for triggering a filesystem to
+ * abort its error recovery and for (DAX) capable devices.  DAX bypasses
+ * page cache and mappings go directly to storage media.  When such a
+ * disk is removed the pfn backing a mapping may be invalid or removed
+ * from the system.  Upon return accessing DAX mappings of this disk
+ * will trigger SIGBUS.
+ */
+void del_gendisk_queue(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* pass1 sync fs + evict idle inodes */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter)
+		invalidate_partition(disk, part->partno);
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	blk_cleanup_queue(disk->queue);
+
+	/* pass2 the queue is dead, trigger fs shutdown via ->bdi_gone() */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		shutdown_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	shutdown_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
+EXPORT_SYMBOL(del_gendisk_queue);
+
+/**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
  * @partno: returned partition index
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a5880f4ab40e..f149dd57bba3 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -532,7 +532,6 @@ out:
 static void brd_free(struct brd_device *brd)
 {
 	put_disk(brd->brd_disk);
-	blk_cleanup_queue(brd->brd_queue);
 	brd_free_pages(brd);
 	kfree(brd);
 }
@@ -560,7 +559,7 @@ out:
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
-	del_gendisk(brd->brd_disk);
+	del_gendisk_queue(brd->brd_disk);
 	brd_free(brd);
 }
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8ee79893d2f5..6dd06e9d34b0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,9 +158,8 @@ static void pmem_detach_disk(struct pmem_device *pmem)
 	if (!pmem->pmem_disk)
 		return;
 
-	del_gendisk(pmem->pmem_disk);
+	del_gendisk_queue(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
 }
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 94a8f4ab57bc..0c3c968b57d9 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -388,8 +388,7 @@ removeseg:
 	}
 	list_del(&dev_info->lh);
 
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	up_write(&dcssblk_devices_sem);
@@ -751,8 +750,7 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	device_unregister(&dev_info->dev);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 44d4a1e9244e..739e43a37e64 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1778,27 +1778,83 @@ fail:
 }
 EXPORT_SYMBOL(lookup_bdev);
 
+static int generic_quiesce_super(struct super_block *sb, bool kill_dirty)
+{
+	/*
+	 * no need to lock the super, get_super holds the read mutex so
+	 * the filesystem cannot go away under us (->put_super runs with
+	 * the write lock held).
+	 */
+	shrink_dcache_sb(sb);
+	return invalidate_inodes(sb, kill_dirty);
+}
+
 int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 {
 	struct super_block *sb = get_super(bdev);
 	int res = 0;
 
-	if (sb) {
-		/*
-		 * no need to lock the super, get_super holds the
-		 * read mutex so the filesystem cannot go away
-		 * under us (->put_super runs with the write lock
-		 * hold).
-		 */
-		shrink_dcache_sb(sb);
-		res = invalidate_inodes(sb, kill_dirty);
-		drop_super(sb);
-	}
+	if (!sb)
+		goto out;
+
+	res = generic_quiesce_super(sb, kill_dirty);
+	drop_super(sb);
+ out:
 	invalidate_bdev(bdev);
+
 	return res;
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+static void generic_bdi_gone(struct super_block *sb)
+{
+	struct inode *inode, *_inode = NULL;
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+				|| !IS_DAX(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+		iput(_inode);
+		_inode = inode;
+		cond_resched();
+
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(_inode);
+}
+
+void shutdown_partition(struct gendisk *disk, int partno)
+{
+	struct block_device *bdev = bdget_disk(disk, partno);
+	struct super_block *sb = get_super(bdev);
+
+	if (!bdev)
+		return;
+
+	if (!sb) {
+		bdput(bdev);
+		return;
+	}
+
+	if (sb->s_op->bdi_gone)
+		sb->s_op->bdi_gone(sb);
+	else
+		generic_bdi_gone(sb);
+
+	drop_super(sb);
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa514254161..0e201ed38045 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1713,6 +1713,7 @@ struct super_operations {
 				  struct shrink_control *);
 	long (*free_cached_objects)(struct super_block *,
 				    struct shrink_control *);
+	void (*bdi_gone)(struct super_block *);
 };
 
 /*
@@ -2390,6 +2391,7 @@ extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
+extern void shutdown_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d91634..028cf15a8a57 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -431,6 +431,7 @@ extern void part_round_stats(int cpu, struct hd_struct *part);
 /* block/genhd.c */
 extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
+extern void del_gendisk_queue(struct gendisk *disk);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);
 



* [resend PATCH 2/3] xfs: handle shutdown notifications
From: Dan Williams @ 2016-01-04 18:20 UTC
  To: xfs; +Cc: linux-fsdevel, linux-block, Dave Chinner, linux-nvdimm

Force a filesystem shutdown when the backing device is known to be dead.
I.e. blk_queue_enter() permanently returns -ENODEV.
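
(Illustrative sketch, not part of the patch: once blk_cleanup_queue()
has marked the queue dying, a blocking caller sees a permanent failure,
roughly per the v4.4 block layer.)

	if (blk_queue_enter(q, GFP_KERNEL) == -ENODEV) {
		/* the queue is dead for good; shut down the fs */
	}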

Cc: xfs@oss.sgi.com
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/block_dev.c     |    3 ++-
 fs/xfs/xfs_super.c |    9 +++++++++
 include/linux/fs.h |    2 ++
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 739e43a37e64..7d6c66148948 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1806,7 +1806,7 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
-static void generic_bdi_gone(struct super_block *sb)
+void generic_bdi_gone(struct super_block *sb)
 {
 	struct inode *inode, *_inode = NULL;
 
@@ -1832,6 +1832,7 @@ static void generic_bdi_gone(struct super_block *sb)
 	spin_unlock(&sb->s_inode_list_lock);
 	iput(_inode);
 }
+EXPORT_SYMBOL(generic_bdi_gone);
 
 void shutdown_partition(struct gendisk *disk, int partno)
 {
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 36bd8825bfb0..63c36508e9db 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1618,6 +1618,14 @@ xfs_fs_free_cached_objects(
 	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
+static void
+xfs_fs_bdi_gone(
+	struct super_block *sb)
+{
+	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
+	generic_bdi_gone(sb);
+}
+
 static const struct super_operations xfs_super_operations = {
 	.alloc_inode		= xfs_fs_alloc_inode,
 	.destroy_inode		= xfs_fs_destroy_inode,
@@ -1632,6 +1640,7 @@ static const struct super_operations xfs_super_operations = {
 	.show_options		= xfs_fs_show_options,
 	.nr_cached_objects	= xfs_fs_nr_cached_objects,
 	.free_cached_objects	= xfs_fs_free_cached_objects,
+	.bdi_gone		= xfs_fs_bdi_gone,
 };
 
 static struct file_system_type xfs_fs_type = {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0e201ed38045..b1e8e049e4b8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2265,6 +2265,7 @@ extern struct super_block *freeze_bdev(struct block_device *);
 extern void emergency_thaw_all(void);
 extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
 extern int fsync_bdev(struct block_device *);
+extern void generic_bdi_gone(struct super_block *sb);
 
 extern struct super_block *blockdev_superblock;
 
@@ -2277,6 +2278,7 @@ static inline void bd_forget(struct inode *inode) {}
 static inline int sync_blockdev(struct block_device *bdev) { return 0; }
 static inline void kill_bdev(struct block_device *bdev) {}
 static inline void invalidate_bdev(struct block_device *bdev) {}
+static inline void generic_bdi_gone(struct super_block *sb) {}
 
 static inline struct super_block *freeze_bdev(struct block_device *sb)
 {



* [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty
From: Dan Williams @ 2016-01-04 18:20 UTC
  To: xfs
  Cc: linux-block, linux-nvdimm, Dave Chinner, Jens Axboe,
	linux-fsdevel, Jan Kara, Tejun Heo

This warning was added as a debugging aid way back in commit
500b067c5e6c "writeback: check for registered bdi in flusher add and
inode dirty" when we were switching over to per-bdi writeback.

Once the block device has been torn down it's no longer useful to
complain about unregistered bdis.  Clear the writeback capability under
wb->list_lock, so that __mark_inode_dirty has no opportunity to race
bdi_unregister() to this WARN() condition.
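
(For reference, the WARN being suppressed gates on the bdi's writeback
capability; a paraphrased sketch of v4.4 fs/fs-writeback.c:)

	wb = locked_inode_to_wb_and_lock_list(inode);
	WARN(bdi_cap_writeback_dirty(wb->bdi) &&
	     !test_bit(WB_registered, &wb->state),
	     "bdi-%s not registered\n", wb->bdi->name);

Setting BDI_CAP_NO_WRITEBACK makes bdi_cap_writeback_dirty() return
false, so the WARN can no longer fire once the capability change is
visible under the lock.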

Alternatively we could just delete the warning...

Found this while testing block device removal from underneath an active
fs, triggering traces like:

 WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
 bdi-block not registered
 [..]
 Call Trace:
  [<ffffffff81459f02>] dump_stack+0x44/0x62
  [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
  [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
  [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
  [<ffffffff8126d019>] generic_update_time+0x79/0xd0
  [<ffffffff8126d19d>] file_update_time+0xbd/0x110
  [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
  [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
  [<ffffffff811fc077>] handle_mm_fault+0x5e7/0x1b50
  [<ffffffff811fbae1>] ? handle_mm_fault+0x51/0x1b50
  [<ffffffff81068981>] __do_page_fault+0x191/0x3f0
  [<ffffffff81068caf>] trace_do_page_fault+0x4f/0x120
  [<ffffffff810630fa>] do_async_page_fault+0x1a/0xa0
  [<ffffffff819023f8>] async_page_fault+0x28/0x30

Cc: Jan Kara <jack@suse.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/backing-dev.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7340353f8aea..9ea843bca709 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -343,10 +343,17 @@ static void wb_shutdown(struct bdi_writeback *wb)
 {
 	/* Make sure nobody queues further work */
 	spin_lock_bh(&wb->work_lock);
+
 	if (!test_and_clear_bit(WB_registered, &wb->state)) {
 		spin_unlock_bh(&wb->work_lock);
 		return;
 	}
+
+	/* tell __mark_inode_dirty that writeback is no longer possible */
+	spin_lock(&wb->list_lock);
+	wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
+	spin_unlock(&wb->list_lock);
+
 	spin_unlock_bh(&wb->work_lock);
 
 	/*



* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
From: Dave Chinner @ 2016-01-05  3:51 UTC
  To: Dan Williams
  Cc: xfs, linux-block, linux-nvdimm, Jens Axboe, Jan Kara,
	linux-fsdevel, Matthew Wilcox, Ross Zwisler

On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> Historically we have waited for filesystem-specific heuristics to
> attempt to guess when a block device is gone.  Sometimes this works, but
> in other cases the system can hang waiting for the fs to trigger its
> shutdown protocol.
> 
> The initial motivation for this investigation was to prevent DAX
> mappings (direct mmap access to persistent memory) from leaking past the
> lifetime of the hosting block device.  However, Dave points out that
> these shutdown operations are needed in other scenarios.  Quoting Dave:
> 
>     For example, if we detect a free space corruption during allocation,
>     it is not safe to trust *any active mapping* because we can't trust
>     that we haven't handed out the same block to multiple owners. Hence
>     on such a filesystem shutdown, we have to prevent any new DAX
>     mapping from occurring and invalidate all existing mappings as we
>     cannot allow userspace to modify any data or metadata until we've
>     resolved the corruption situation.
> 
> The current block device shutdown sequence of del_gendisk +
> blk_cleanup_queue is problematic.  We want to tell the fs after
> blk_cleanup_queue that there is no possibility of recovery, but by that
> time we have deleted partitions and lost the ability to find all the
> superblocks on a block device.
> 
> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()

I don't see anything that introduces a ->quiesce() method.

> notifications to all the filesystems hosted on the disk, where
> ->quiesce() covers 'shutdown' operations while the bdev may still be alive,

So you are saying "quiesce == invalidation", which is in conflict
with what we typically think quiesce means.  I.e. "Quiesce" is what
we do when doing orderly writeback of all outstanding dirty objects
in a filesystem - what we do during freeze, remount-ro, and unmount.
E.g. see the functions xfs_log_quiesce() and xfs_attr_quiesce().

> and ->bdi_gone() covers actions to take after the backing device
> is known to be permanently dead.

In which case, bdi_gone == shutdown.

Operation methods should be named after what they do, not what their
calling context is. i.e. these are "invalidate" and "shutdown"
superblock operations, not "quiesce" and "bdi_gone".

> generic_quiesce_super and generic_bdi_gone are the default operations
> when a filesystem does not implement ->quiesce() or ->bdi_gone().  They
> invalidate inodes and unmap DAX inodes, respectively.  For now only
> ->bdi_gone() has an associated super operation, as xfs will implement
> ->bdi_gone() in a later patch.

I don't quite understand what the point of factoring
__invalidate_device() like this is - it's not used by anyone, so it
seems completely unnecessary to me.

And really, that points out that there are multiple changes in this
patch set that should be done separately.  The rework of
del_gendisk() into del_gendisk_start/del_gendisk_end should be the
first patch.  The combining of del_gendisk/blk_cleanup_queue into
blk_cleanup_queue() should be the second part, as it builds on the
factoring of del_gendisk().  Then, if we really need to keep it, the
factoring to introduce generic_quiesce_super.  And finally, the patch
that allows the shutdown callout can be introduced.

[snip]
>  
> +static void generic_bdi_gone(struct super_block *sb)
> +{
> +	struct inode *inode, *_inode = NULL;
> +
> +	spin_lock(&sb->s_inode_list_lock);
> +	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> +		spin_lock(&inode->i_lock);
> +		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> +				|| !IS_DAX(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}
> +		__iget(inode);
> +		spin_unlock(&inode->i_lock);
> +		spin_unlock(&sb->s_inode_list_lock);
> +
> +		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
> +		iput(_inode);
> +		_inode = inode;
> +		cond_resched();
> +
> +		spin_lock(&sb->s_inode_list_lock);
> +	}
> +	spin_unlock(&sb->s_inode_list_lock);
> +	iput(_inode);
> +}

This belongs in fs/inode.c, right next to invalidate_inodes(), and
with a name that describes what it does, not the context in which
it is called, e.g. unmap_dax_inodes().

> +void shutdown_partition(struct gendisk *disk, int partno)
> +{
> +	struct block_device *bdev = bdget_disk(disk, partno);
> +	struct super_block *sb = get_super(bdev);
> +
> +	if (!bdev)
> +		return;

Null pointer deref. Certainly a landmine waiting for someone to
tread on.
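
I.e. fetch the superblock only after the NULL check, something like
(sketch):

	struct block_device *bdev = bdget_disk(disk, partno);
	struct super_block *sb;

	if (!bdev)
		return;
	sb = get_super(bdev);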

> +	if (!sb) {
> +		bdput(bdev);
> +		return;
> +	}
> +
> +	if (sb->s_op->bdi_gone)
> +		sb->s_op->bdi_gone(sb);
> +	else
> +		generic_bdi_gone(sb);

	if (sb->s_op->shutdown)
		sb->s_op->shutdown(sb);
	else
		unmap_dax_inodes(sb);

Name things correctly, and the code documents itself.

-- 
Dave Chinner
david@fromorbit.com


* Re: [resend PATCH 2/3] xfs: handle shutdown notifications
From: Dave Chinner @ 2016-01-05  4:03 UTC
  To: Dan Williams; +Cc: xfs, linux-fsdevel, linux-block, linux-nvdimm

On Mon, Jan 04, 2016 at 10:20:11AM -0800, Dan Williams wrote:
> Force a filesystem shutdown when the backing device is known to be dead.
> I.e. blk_queue_enter() permanently returns -ENODEV.
> 
> Cc: xfs@oss.sgi.com
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/block_dev.c     |    3 ++-
>  fs/xfs/xfs_super.c |    9 +++++++++
>  include/linux/fs.h |    2 ++
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 739e43a37e64..7d6c66148948 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -1806,7 +1806,7 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty)
>  }
>  EXPORT_SYMBOL(__invalidate_device);
>  
> -static void generic_bdi_gone(struct super_block *sb)
> +void generic_bdi_gone(struct super_block *sb)
>  {
>  	struct inode *inode, *_inode = NULL;
>  
> @@ -1832,6 +1832,7 @@ static void generic_bdi_gone(struct super_block *sb)
>  	spin_unlock(&sb->s_inode_list_lock);
>  	iput(_inode);
>  }
> +EXPORT_SYMBOL(generic_bdi_gone);

That should be in the previous patch.

>  void shutdown_partition(struct gendisk *disk, int partno)
>  {
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 36bd8825bfb0..63c36508e9db 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -1618,6 +1618,14 @@ xfs_fs_free_cached_objects(
>  	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
>  }
>  
> +static void
> +xfs_fs_bdi_gone(

xfs_fs_shutdown

> +	struct super_block *sb)
> +{
> +	xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ);
> +	generic_bdi_gone(sb);
> +}
> +

This is wrong.  We have to unmap the DAX inodes during *every*
shutdown that XFS executes.  Hence it needs to be done inside
xfs_do_force_shutdown(), not in addition to the shutdown when the
block device is pulled.  I.e. something like this in
xfs_do_force_shutdown():

 	 */
 	if (xfs_log_force_umount(mp, logerror))
 		return;
+
+	/*
+	 * If DAX is in use, we have to unmap all direct access
+	 * virtual mappings to ensure nothing more gets written
+	 * directly from userspace. This will force them to refault
+	 * and that will result in them detecting the shutdown
+	 * condition and hence will fail appropriately.
+	 */
+	unmap_dax_inodes(mp->m_super);
 
 	if (flags & SHUTDOWN_CORRUPT_INCORE) {
 		xfs_alert_tag(mp, XFS_PTAG_SHUTDOWN_CORRUPT

>  static const struct super_operations xfs_super_operations = {
>  	.alloc_inode		= xfs_fs_alloc_inode,
>  	.destroy_inode		= xfs_fs_destroy_inode,
> @@ -1632,6 +1640,7 @@ static const struct super_operations xfs_super_operations = {
>  	.show_options		= xfs_fs_show_options,
>  	.nr_cached_objects	= xfs_fs_nr_cached_objects,
>  	.free_cached_objects	= xfs_fs_free_cached_objects,
> +	.bdi_gone		= xfs_fs_bdi_gone,
>  };
>  
>  static struct file_system_type xfs_fs_type = {
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0e201ed38045..b1e8e049e4b8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2265,6 +2265,7 @@ extern struct super_block *freeze_bdev(struct block_device *);
>  extern void emergency_thaw_all(void);
>  extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
>  extern int fsync_bdev(struct block_device *);
> +extern void generic_bdi_gone(struct super_block *sb);

previous patch.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty
  2016-01-04 18:20   ` Dan Williams
@ 2016-01-05  4:23     ` Dave Chinner
  -1 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2016-01-05  4:23 UTC (permalink / raw)
  To: Dan Williams
  Cc: xfs, linux-block, linux-nvdimm, Jens Axboe, linux-fsdevel,
	Jan Kara, Tejun Heo

On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
> This warning was added as a debugging aid way back in commit
> 500b067c5e6c "writeback: check for registered bdi in flusher add and
> inode dirty" when we were switching over to per-bdi writeback.
> 
> Once the block device has been torn down it's no longer useful to
> complain about unregistered bdi's.  Clear the writeback capability under
> the wb->list_lock(), so that __mark_inode_dirty has no opportunity
> to race bdi_unregister() to this WARN() condition.
> 
> Alternatively we could just delete the warning...

The warning is correct - the filesystem is trying to mark an inode
dirty on a device that can't do writeback anymore. Seems to me like
it is functioning as it should.

> Found this while testing block device remove from underneath an active
> fs triggering traces like:
> 
>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
>  bdi-block not registered
>  [..]
>  Call Trace:
>   [<ffffffff81459f02>] dump_stack+0x44/0x62
>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0

This seems like the problem to me - you haven't implemented a
shutdown hook for ext4, and so it continues to allow page faults to
make progress after the device has been removed. The DAX fault
should have been failed before the filesystem gets to the point of
marking the inode dirty....

> +	/* tell __mark_inode_dirty that writeback is no longer possible */
> +	spin_lock(&wb->list_lock);
> +	wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
> +	spin_unlock(&wb->list_lock);
> +
>  	spin_unlock_bh(&wb->work_lock);

Is that lock ordering safe? i.e. it's inside a section using bh-safe
locking, which tends to imply that it can run from interrupt
contexts. Can we get something like

	spin_lock(&wb->list_lock);
	.....
	<irq>
	.....
	wb_shutdown
	spin_lock_bh(&wb->work_lock);
	spin_lock(&wb->list_lock);

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-05  3:51     ` Dave Chinner
@ 2016-01-05  4:25       ` Dan Williams
  -1 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2016-01-05  4:25 UTC (permalink / raw)
  To: Dave Chinner
  Cc: XFS Developers, linux-block, linux-nvdimm, Jens Axboe, Jan Kara,
	linux-fsdevel, Matthew Wilcox, Ross Zwisler

On Mon, Jan 4, 2016 at 7:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>> Historically we have waited for filesystem specific heuristics to
>> attempt to guess when a block device is gone.  Sometimes this works, but
>> in other cases the system can hang waiting for the fs to trigger its
>> shutdown protocol.
>>
>> The initial motivation for this investigation was to prevent DAX
>> mappings (direct mmap access to persistent memory) from leaking past the
>> lifetime of the hosting block device.  However, Dave points out that
>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>
>>     For example, if we detect a free space corruption during allocation,
>>     it is not safe to trust *any active mapping* because we can't trust
>>     that we haven't handed out the same block to multiple owners. Hence
>>     on such a filesystem shutdown, we have to prevent any new DAX
>>     mapping from occurring and invalidate all existing mappings as we
>>     cannot allow userspace to modify any data or metadata until we've
>>     resolved the corruption situation.
>>
>> The current block device shutdown sequence of del_gendisk +
>> blk_cleanup_queue is problematic.  We want to tell the fs after
>> blk_cleanup_queue that there is no possibility of recovery, but by that
>> time we have deleted partitions and lost the ability to find all the
>> super-blocks on a block device.
>>
>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>
> I don't see anything that introduces a ->quiesce() method.
>

Right, we only have generic_quiesce_super() as no fs had a need to do
anything but the generic actions.

>> notifications to all the filesystems hosted on the disk.  Where
>> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
>
> So you are saying "quiesce == invalidation", which is in conflict
> with what we typically think quiesce means. i.e.  "Quiesce" is what
> we do when doing orderly writeback of all outstanding dirty objects
> in a filesystem - what we do during freeze, remount-ro, and unmount.
> e.g. See the functions xfs_log_quiesce() and xfs_attr_quiesce()

Fair enough, I should have called this one invalidate_super.

>
>> and ->bdi_gone() is a set of actions to take after the backing device
>> is known to be permanently dead.
>
> In which case, bdi_gone == shutdown.
>

True, and following this logic I think the existing
generic_shutdown_super() should be renamed generic_kill_super() to
match the fs actions, but see below...

> Operation methods should be named after what they do, not what their
> calling context is. i.e. these are "invalidate" and "shutdown"
> superblock operations, not "quiesce" and "bdi_gone".

I was running out of colors to paint the bike shed given
generic_shutdown_super() was already in use.  Also, since
generic_shutdown_super() has had its current meaning since forever I
didn't think it was worth the risk to change the meaning of such a
long standing symbol.  Other ideas, generic_stop_super?

>> generic_quiesce_super and generic_bdi_gone, are the default operations
>> when a filesystem does not implement ->quiesce(), ->bdi_gone().  They
>> invalidate inodes and unmap DAX-inodes respectively.  For now only
>> ->bdi_gone() has an associated super operation as xfs will implement
>> ->bdi_gone() in a later patch.
>
> I don't quite understand what the point of factoring
> __invalidate_device() like this is - it's not used by anyone, so it
> seems completely unnecessary to me.
>

You're right, without a need for an fs to intercept the 'invalidation'
event there's no need to do this refactor.

> And really, that points out that there are multiple changes in this
> patch set that should be done separately.  The rework of
> del_gendisk() into del_gendisk_start/del_gendisk_end should be the
> first patch.  The del_gendisk/blk_cleanup_queue to
> blk_cleanup_queue() combining should be the second part, as it
> builds on the factoring of del_gendisk(). Then, if we really need to
> keep it, the factoring to introduce generic_quiesce_super. And
> finally, the patch that allows the shutdown callout can be
> introduced.

Ok, will split.

>
> [snip]
>>
>> +static void generic_bdi_gone(struct super_block *sb)
>> +{
>> +     struct inode *inode, *_inode = NULL;
>> +
>> +     spin_lock(&sb->s_inode_list_lock);
>> +     list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
>> +             spin_lock(&inode->i_lock);
>> +             if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
>> +                             || !IS_DAX(inode)) {
>> +                     spin_unlock(&inode->i_lock);
>> +                     continue;
>> +             }
>> +             __iget(inode);
>> +             spin_unlock(&inode->i_lock);
>> +             spin_unlock(&sb->s_inode_list_lock);
>> +
>> +             unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>> +             iput(_inode);
>> +             _inode = inode;
>> +             cond_resched();
>> +
>> +             spin_lock(&sb->s_inode_list_lock);
>> +     }
>> +     spin_unlock(&sb->s_inode_list_lock);
>> +     iput(_inode);
>> +}
>
> This belongs in fs/inode.c, right next to invalidate_inodes(), and
> with a name that describes what it does, not the context in which
> it is called. e.g. unmap_dax_inodes().
>

Sounds good.
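
For the next spin that would look roughly like the following (a sketch
only -- the loop is the generic_bdi_gone() body quoted above, relocated
to fs/inode.c and renamed):

	/* fs/inode.c: walk the superblock's inodes and unmap any live
	 * DAX inode so userspace refaults and sees the failure state */
	void unmap_dax_inodes(struct super_block *sb)
	{
		struct inode *inode, *_inode = NULL;

		spin_lock(&sb->s_inode_list_lock);
		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
			spin_lock(&inode->i_lock);
			if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
					|| !IS_DAX(inode)) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			__iget(inode);
			spin_unlock(&inode->i_lock);
			spin_unlock(&sb->s_inode_list_lock);

			unmap_mapping_range(inode->i_mapping, 0, 0, 1);
			/* drop the previous inode's reference only after
			 * taking a new one, so the walk stays safe */
			iput(_inode);
			_inode = inode;
			cond_resched();

			spin_lock(&sb->s_inode_list_lock);
		}
		spin_unlock(&sb->s_inode_list_lock);
		iput(_inode);
	}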

>> +void shutdown_partition(struct gendisk *disk, int partno)
>> +{
>> +     struct block_device *bdev = bdget_disk(disk, partno);
>> +     struct super_block *sb = get_super(bdev);
>> +
>> +     if (!bdev)
>> +             return;
>
> Null pointer deref. Certainly a landmine waiting for someone to
> tread on.
>

Nope, get_super() checks for a NULL argument.

>> +     if (!sb) {
>> +             bdput(bdev);
>> +             return;
>> +     }
>> +
>> +     if (sb->s_op->bdi_gone)
>> +             sb->s_op->bdi_gone(sb);
>> +     else
>> +             generic_bdi_gone(sb);
>
>         if (sb->s_op->shutdown)
>                 sb->s_op->shutdown(sb);
>         else
>                 unmap_dax_inodes(sb);
>
> name things correctly, and the code documents itself.

How about 'stop' or 'halt' instead of 'shutdown' to preserve the
historical meaning of generic_shutdown_super?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty
  2016-01-05  4:23     ` Dave Chinner
@ 2016-01-05 19:59       ` Dan Williams
  -1 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2016-01-05 19:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: XFS Developers, linux-block, linux-nvdimm, Jens Axboe,
	linux-fsdevel, Jan Kara, Tejun Heo

On Mon, Jan 4, 2016 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
>> This warning was added as a debugging aid way back in commit
>> 500b067c5e6c "writeback: check for registered bdi in flusher add and
>> inode dirty" when we were switching over to per-bdi writeback.
>>
>> Once the block device has been torn down it's no longer useful to
>> complain about unregistered bdi's.  Clear the writeback capability under
>> the wb->list_lock(), so that __mark_inode_dirty has no opportunity
>> to race bdi_unregister() to this WARN() condition.
>>
>> Alternatively we could just delete the warning...
>
> The warning is correct - the filesystem is trying to mark an inode
> dirty on a device that can't do writeback anymore. Seems to me like
> it is functioning as it should.
>
>> Found this while testing block device remove from underneath an active
>> fs triggering traces like:
>>
>>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
>>  bdi-block not registered
>>  [..]
>>  Call Trace:
>>   [<ffffffff81459f02>] dump_stack+0x44/0x62
>>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
>>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
>>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
>>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
>>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
>>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
>>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
>
> This seems like the problem to me - you haven't implemented a
> shutdown hook for ext4, and so it continues to allow page faults to
> make progress after the device has been removed. The DAX fault
> should have been failed before the filesystem gets to the point of
> marking the inode dirty....

xfs doesn't check that the fs is still alive before calling
file_update_time().  Also, I don't think we need/want to sprinkle "is
fs alive?" checks to address this when the block device shutdown path
can simply turn off writeback.

>
>> +     /* tell __mark_inode_dirty that writeback is no longer possible */
>> +     spin_lock(&wb->list_lock);
>> +     wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
>> +     spin_unlock(&wb->list_lock);
>> +
>>       spin_unlock_bh(&wb->work_lock);
>
> Is that lock ordering safe? i.e. it's inside a section using bh-safe
> locking, which tends to imply that it can run from interrupt
> contexts. Can we get something like

This would be a problem if wb_shutdown() were called from softirq
context, but it is always called from process context.  So,
->list_lock doesn't currently need to be upgraded to spin_lock_bh and
lockdep will trigger if this situation changes.
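
i.e. the nesting in the patch is (a sketch, assuming wb_shutdown()
stays process-context only):

	/* in wb_shutdown(), process context only */
	spin_lock_bh(&wb->work_lock);		/* softirqs disabled */

	/* tell __mark_inode_dirty that writeback is no longer
	 * possible; safe because ->list_lock is never taken from
	 * irq/softirq context, so the inversion above cannot occur */
	spin_lock(&wb->list_lock);
	wb->bdi->capabilities |= BDI_CAP_NO_WRITEBACK;
	spin_unlock(&wb->list_lock);

	spin_unlock_bh(&wb->work_lock);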

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty
  2016-01-05 19:59       ` Dan Williams
@ 2016-01-05 21:10         ` Dave Chinner
  -1 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2016-01-05 21:10 UTC (permalink / raw)
  To: Dan Williams
  Cc: XFS Developers, linux-block, linux-nvdimm, Jens Axboe,
	linux-fsdevel, Jan Kara, Tejun Heo

On Tue, Jan 05, 2016 at 11:59:41AM -0800, Dan Williams wrote:
> On Mon, Jan 4, 2016 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
> >> This warning was added as a debugging aid way back in commit
> >> 500b067c5e6c "writeback: check for registered bdi in flusher add and
> >> inode dirty" when we were switching over to per-bdi writeback.
> >>
> >> Once the block device has been torn down it's no longer useful to
> >> complain about unregistered bdi's.  Clear the writeback capability under
> >> the wb->list_lock(), so that __mark_inode_dirty has no opportunity
> >> to race bdi_unregister() to this WARN() condition.
> >>
> >> Alternatively we could just delete the warning...
> >
> > The warning is correct - the filesystem is trying to mark an inode
> > dirty on a device that can't do writeback anymore. Seems to me like
> > it is functioning as it should.
> >
> >> Found this while testing block device remove from underneath an active
> >> fs triggering traces like:
> >>
> >>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
> >>  bdi-block not registered
> >>  [..]
> >>  Call Trace:
> >>   [<ffffffff81459f02>] dump_stack+0x44/0x62
> >>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
> >>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
> >>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
> >>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
> >>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
> >>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
> >>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
> >
> > This seems like the problem to me - you haven't implemented a
> > shutdown hook for ext4, and so it continues to allow page faults to
> > make progress after the device has been removed. The DAX fault
> > should have been failed before the filesystem gets to the point of
> > marking the inode dirty....
> 
> xfs doesn't check that the fs is still alive before calling
> file_update_time().  Also, I don't think we need/want to sprinkle "is
> fs alive?" checks to address this when the block device shutdown path
> can simply turn off writeback.

That's because the timestamp update is aborted in XFS during
transaction commit: xfs_trans_commit has an "is the filesystem
shutdown" check to prevent situations like this occurring. i.e. XFS
has shutdown checks sprinkled all through it in strategic places to
ensure that shutdown does what it is supposed to and does not
trigger warnings in generic/VFS code.

If you've removed a device, those inodes can *never* be written
back, and hence marking new inodes dirty for writeback after a
shutdown is completely wrong. e.g. if ext4 has had some other fatal
corruption error, it will continue to allow timestamp updates and
inode writeback through this mechanism. That's a bug in the
filesystem error handling, not a bug in the writeback code.
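
The gate in question is a one-liner that is cheap to sprinkle (a
sketch; mp is the struct xfs_mount for the filesystem):

	/* strategic paths such as transaction commit bail out early
	 * once the filesystem has been shut down */
	if (XFS_FORCED_SHUTDOWN(mp))
		return -EIO;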

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty
  2016-01-05 21:10         ` Dave Chinner
@ 2016-01-05 21:29           ` Dan Williams
  -1 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2016-01-05 21:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: XFS Developers, linux-block, linux-nvdimm, Jens Axboe,
	linux-fsdevel, Jan Kara, Tejun Heo

On Tue, Jan 5, 2016 at 1:10 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Jan 05, 2016 at 11:59:41AM -0800, Dan Williams wrote:
>> On Mon, Jan 4, 2016 at 8:23 PM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Mon, Jan 04, 2016 at 10:20:16AM -0800, Dan Williams wrote:
>> >> This warning was added as a debugging aid way back in commit
>> >> 500b067c5e6c "writeback: check for registered bdi in flusher add and
>> >> inode dirty" when we were switching over to per-bdi writeback.
>> >>
>> >> Once the block device has been torn down it's no longer useful to
>> >> complain about unregistered bdi's.  Clear the writeback capability under
>> >> the wb->list_lock(), so that __mark_inode_dirty has no opportunity
>> >> to race bdi_unregister() to this WARN() condition.
>> >>
>> >> Alternatively we could just delete the warning...
>> >
>> > The warning is correct - the filesystem is trying to mark an inode
>> > dirty on a device that can't do writeback anymore. Seems to me like
>> > it is functioning as it should.
>> >
>> >> Found this while testing block device remove from underneath an active
>> >> fs triggering traces like:
>> >>
>> >>  WARNING: CPU: 6 PID: 2129 at fs/fs-writeback.c:2065 __mark_inode_dirty+0x261/0x350()
>> >>  bdi-block not registered
>> >>  [..]
>> >>  Call Trace:
>> >>   [<ffffffff81459f02>] dump_stack+0x44/0x62
>> >>   [<ffffffff810a1f32>] warn_slowpath_common+0x82/0xc0
>> >>   [<ffffffff810a1fcc>] warn_slowpath_fmt+0x5c/0x80
>> >>   [<ffffffff812830b1>] __mark_inode_dirty+0x261/0x350
>> >>   [<ffffffff8126d019>] generic_update_time+0x79/0xd0
>> >>   [<ffffffff8126d19d>] file_update_time+0xbd/0x110
>> >>   [<ffffffff812e4ab8>] ext4_dax_fault+0x68/0x110
>> >>   [<ffffffff811f7f3e>] __do_fault+0x4e/0xf0
>> >
>> > This seems like the problem to me - you haven't implemented a
>> > shutdown hook for ext4, and so it continues to allow page faults to
>> > make progress after the device has been removed. The DAX fault
>> > should have been failed before the filesystem gets to the point of
>> > marking the inode dirty....
>>
>> xfs doesn't check that the fs is still alive before calling
>> file_update_time().  Also, I don't think we need/want to sprinkle "is
>> fs alive?" checks to address this when the block device shutdown path
>> can simply turn off writeback.
>
> That's because the timestamp update is aborted in XFS during
> transaction commit because xfs_trans_commit has a "is the filesystem
> shutdown" check to prevent situations like this occurring. i.e. XFS
> has shutdown checks sprinkled all through it in strategic places to
> ensure that shutdown does what it is supposed to and does not
> trigger warnings in generic/VFS code.
>
> If you've removed a device, those inodes can *never* be written
> back, and hence marking new inodes dirty for writeback after a
> shutdown is completely wrong. e.g. if ext4 has had some other fatal
> corruption error, it will continue to allow timestamp updates and
> inode writeback through this mechanism. That's a bug in the
> filesystem error handling, not a bug in the writeback code.

True, in that sense the warning is serving a valid purpose to say
"FIXME, implement shutdown checks".  I'll drop this patch.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-05  4:25       ` Dan Williams
@ 2016-01-05 22:32         ` Dave Chinner
  -1 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2016-01-05 22:32 UTC (permalink / raw)
  To: Dan Williams
  Cc: XFS Developers, linux-block, linux-nvdimm, Jens Axboe, Jan Kara,
	linux-fsdevel, Matthew Wilcox, Ross Zwisler

On Mon, Jan 04, 2016 at 08:25:16PM -0800, Dan Williams wrote:
> On Mon, Jan 4, 2016 at 7:51 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> >> Historically we have waited for filesystem specific heuristics to
> >> attempt to guess when a block device is gone.  Sometimes this works, but
> >> in other cases the system can hang waiting for the fs to trigger its
> >> shutdown protocol.
....
> True, and following this logic I think the existing
> generic_shutdown_super() should be renamed generic_kill_super() to
> match the fs actions, but see below...
> 
> > Operation methods should be named after what they do, not what their
> > calling context is. i.e. these are "invalidate" and "shutdown"
> > superblock operations, not "quiesce" and "bdi_gone".
> 
> I was running out of colors to paint the bike shed given
> generic_shutdown_super() was already in use.  Also, since

Ah, yeah, I forgot about that function. Too many different shades of
purple are already in use. :/

> >> +void shutdown_partition(struct gendisk *disk, int partno)
> >> +{
> >> +     struct block_device *bdev = bdget_disk(disk, partno);
> >> +     struct super_block *sb = get_super(bdev);
> >> +
> >> +     if (!bdev)
> >> +             return;
> >
> > Null pointer deref. Certainly a landmine waiting for someone to
> > tread on.
> >
> 
> Nope, get_super() checks for a NULL argument.

Hence my comment about it being a landmine. If get_super() is ever
changed, this code will explode on us when we least expect it. If I
have to go read other code in a completely different file just to
determine the code is actually safe, then I consider that bad code.
Code that is slightly more verbose but is obviously correct and
balanced is much, much easier to understand and maintain....
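
Something like the following is what I'd expect (a sketch only, using
the op and helper names suggested below):

	void shutdown_partition(struct gendisk *disk, int partno)
	{
		struct block_device *bdev = bdget_disk(disk, partno);
		struct super_block *sb;

		if (!bdev)
			return;

		sb = get_super(bdev);
		if (!sb) {
			bdput(bdev);
			return;
		}

		if (sb->s_op->shutdown)
			sb->s_op->shutdown(sb);
		else
			unmap_dax_inodes(sb);

		drop_super(sb);
		bdput(bdev);
	}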

> >> +     if (!sb) {
> >> +             bdput(bdev);
> >> +             return;
> >> +     }
> >> +
> >> +     if (sb->s_op->bdi_gone)
> >> +             sb->s_op->bdi_gone(sb);
> >> +     else
> >> +             generic_bdi_gone(sb);
> >
> >         if (sb->s_op->shutdown)
> >                 sb->s_op->shutdown(sb);
> >         else
> >                 unmap_dax_inodes(sb);
> >
> > name things correctly, and the code documents itself.
> 
> How about 'stop' or 'halt' instead of 'shutdown' to preserve the
> historical meaning of generic_shutdown_super?

It's hard choosing a different color that is appropriate. :/

Because it's a brute-force behavioural override, I think that needs
to be obvious from the op name. So something like force_stop or
force_failure seems best to me.

In fact, now that I've thought/repeated/written force_failure a
couple of times, it seems quite appropriate here, as what we want to
be able to do is force the filesystem into a (permanent) failure
state.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-04 18:20   ` Dan Williams
@ 2016-01-09  7:54     ` Al Viro
  -1 siblings, 0 replies; 34+ messages in thread
From: Al Viro @ 2016-01-09  7:54 UTC (permalink / raw)
  To: Dan Williams
  Cc: xfs, linux-block, linux-nvdimm, Dave Chinner, Jens Axboe,
	Jan Kara, linux-fsdevel, Matthew Wilcox, Ross Zwisler

On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> Historically we have waited for filesystem specific heuristics to
> attempt to guess when a block device is gone.  Sometimes this works, but
> in other cases the system can hang waiting for the fs to trigger its
> shutdown protocol.
> 
> The initial motivation for this investigation was to prevent DAX
> mappings (direct mmap access to persistent memory) from leaking past the
> lifetime of the hosting block device.  However, Dave points out that
> these shutdown operations are needed in other scenarios.  Quoting Dave:
> 
>     For example, if we detect a free space corruption during allocation,
>     it is not safe to trust *any active mapping* because we can't trust
>     that we haven't handed out the same block to multiple owners. Hence
>     on such a filesystem shutdown, we have to prevent any new DAX
>     mapping from occurring and invalidate all existing mappings as we
>     cannot allow userspace to modify any data or metadata until we've
>     resolved the corruption situation.
> 
> The current block device shutdown sequence of del_gendisk +
> blk_cleanup_queue is problematic.  We want to tell the fs after
> blk_cleanup_queue that there is no possibility of recovery, but by that
> time we have deleted partitions and lost the ability to find all the
> super-blocks on a block device.
> 
> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
> notifications to all the filesystems hosted on the disk.  Where
> ->quiesce() are 'shutdown' operations while the bdev may still be alive,
> and ->bdi_gone() is a set of actions to take after the backing device
> is known to be permanently dead.

	Would you mind explaining what the hell is _the_ backing device
of a filesystem?  What does that translate into in case of e.g. btrfs
spanning several disks?  Or ext4 with journal on a different device, for
that matter?

	If anything, I would argue that filesystem is out of place here -
general situation is "IO on X may require IO on device Y and X needs to do
something when Y goes away".  Consider e.g. /dev/loop backed by a device
that went away.  Or by a file on fs that has run down the curtain and joined
the bleedin choir invisible.  With another fs partially hosted by that
loopback device.  Or by RAID0 containing said device.

	You are given Y and attempt to locate the affected X.  _Then_
you assume that X is a filesystem and has "something to be done" independent
from the role Y played for it, so you can pick that action from superblock
method.

	IMO you are placing the burden in the wrong place.  _Recipient_
knows what it depends upon and what should be done for each source of
trouble.  So make it the recipient's responsibility to request notifications.
At which point the superblock method goes away, along with the requirement
to handle all sources of trouble the same way, etc.

	What's more, things like RAID5 (also interested in knowing when
a component has been ripped out) might or might not decide to propagate
the event further - after all, that's exactly the point of redundancy.

	I'd look into something along the lines of notifier chain per
gendisk, with potential victims registering a callback when they decide
that from now on such and such device might screw them over...
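
Roughly (a sketch only; the remove_notify head on the gendisk and the
GENDISK_GONE event code are made up here, the notifier-chain API is not):

	#include <linux/notifier.h>

	#define GENDISK_GONE	1	/* hypothetical event code */

	/* each potential victim (fs, loop, md component, ...) decides
	 * for itself what a dead device means */
	static int myfs_disk_gone(struct notifier_block *nb,
				  unsigned long event, void *data)
	{
		if (event == GENDISK_GONE) {
			/* force a permanent failure state, or absorb
			 * the event entirely (e.g. redundant RAID5) */
		}
		return NOTIFY_OK;
	}

	static struct notifier_block myfs_nb = {
		.notifier_call	= myfs_disk_gone,
	};

	/* the victim registers when the dependency is established
	 * (remove_notify: hypothetical blocking_notifier_head added
	 * to struct gendisk)... */
	blocking_notifier_chain_register(&disk->remove_notify, &myfs_nb);

	/* ...and del_gendisk() fires the chain on removal */
	blocking_notifier_call_chain(&disk->remove_notify,
				     GENDISK_GONE, disk);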

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-09  7:54     ` Al Viro
@ 2016-01-09 14:17       ` Dan Williams
  -1 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2016-01-09 14:17 UTC (permalink / raw)
  To: Al Viro
  Cc: XFS Developers, linux-block, linux-nvdimm, Dave Chinner,
	Jens Axboe, Jan Kara, linux-fsdevel, Matthew Wilcox,
	Ross Zwisler

On Fri, Jan 8, 2016 at 11:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
[..]
>         Would you mind explaining what the hell is _the_ backing device
> of a filesystem?  What does that translate into in case of e.g. btrfs
> spanning several disks?  Or ext4 with journal on a different device, for
> that matter?
>
>         If anything, I would argue that filesystem is out of place here -
> general situation is "IO on X may require IO on device Y and X needs to do
> something when Y goes away".  Consider e.g. /dev/loop backed by a device
> that went away.  Or by a file on fs that has run down the curtain and joined
> the bleedin choir invisible.  With another fs partially hosted by that
> loopback device.  Or by RAID0 containing said device.
>
>         You are given Y and attempt to locate the affected X.  _Then_
> you assume that X is a filesystem and has "something to be done" independent
> from the role Y played for it, so you can pick that action from superblock
> method.
>
>         IMO you are placing the burden in the wrong place.  _Recepient_
> knows what it depends upon and what should be done for each source of
> trouble.  So make it recepient's responsibility to request notifications.
> At which point the superblock method goes away, along with the requirement
> to handle all sources of trouble the same way, etc.
>
>         What's more, things like RAID5 (also interested in knowing when
> a component has been ripped out) might or might not decide to propagate
> the event further - after all, that's exactly the point of redundancy.
>
>         I'd look into something along the lines of notifier chain per
> gendisk, with potential victims registering a callback when they decide
> that from now on such and such device might screw them over...

Makes sense.  I'll drop this series for now and come back after
re-working it to use notifiers.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-09 14:17       ` Dan Williams
@ 2016-01-11  7:15         ` Hannes Reinecke
  -1 siblings, 0 replies; 34+ messages in thread
From: Hannes Reinecke @ 2016-01-11  7:15 UTC (permalink / raw)
  To: Dan Williams, Al Viro
  Cc: XFS Developers, linux-block, linux-nvdimm, Dave Chinner,
	Jens Axboe, Jan Kara, linux-fsdevel, Matthew Wilcox,
	Ross Zwisler

On 01/09/2016 03:17 PM, Dan Williams wrote:
> On Fri, Jan 8, 2016 at 11:54 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
> [..]
>>          Would you mind explaining what the hell is _the_ backing device
>> of a filesystem?  What does that translate into in case of e.g. btrfs
>> spanning several disks?  Or ext4 with journal on a different device, for
>> that matter?
>>
>>          If anything, I would argue that filesystem is out of place here -
>> general situation is "IO on X may require IO on device Y and X needs to do
>> something when Y goes away".  Consider e.g. /dev/loop backed by a device
>> that went away.  Or by a file on fs that has run down the curtain and joined
>> the bleedin choir invisible.  With another fs partially hosted by that
>> loopback device.  Or by RAID0 containing said device.
>>
>>          You are given Y and attempt to locate the affected X.  _Then_
>> you assume that X is a filesystem and has "something to be done" independent
>> from the role Y played for it, so you can pick that action from superblock
>> method.
>>
>>          IMO you are placing the burden in the wrong place.  _Recipient_
>> knows what it depends upon and what should be done for each source of
>> trouble.  So make it the recipient's responsibility to request notifications.
>> At which point the superblock method goes away, along with the requirement
>> to handle all sources of trouble the same way, etc.
>>
>>          What's more, things like RAID5 (also interested in knowing when
>> a component has been ripped out) might or might not decide to propagate
>> the event further - after all, that's exactly the point of redundancy.
>>
>>          I'd look into something along the lines of notifier chain per
>> gendisk, with potential victims registering a callback when they decide
>> that from now on such and such device might screw them over...
>
> Makes sense.  I'll drop this series for now and come back after
> re-working it to use notifiers.

Yes please. I need a similar thing for communicating device changes 
(resizing, topology changes), so I'd be very much interested in them.

And while you're at it, maybe we can fold the block device event 
handling into that, too.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 34+ messages in thread
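
The resize/topology interest above could ride the same hypothetical
chain by growing the event set; these codes and the call site are again
illustrative, not from any posted patch:

	/* further hypothetical event codes */
	#define GENDISK_RESIZED			3	/* capacity changed */
	#define GENDISK_TOPOLOGY_CHANGED	4	/* queue limits changed */

	/*
	 * e.g. a resize path could notify dependents after updating
	 * the capacity:
	 *
	 *	set_capacity(disk, new_sectors);
	 *	blocking_notifier_call_chain(&disk->notify_list,
	 *				     GENDISK_RESIZED, disk);
	 */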

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-09  7:54     ` Al Viro
@ 2016-01-11 15:24       ` Hannes Reinecke
  -1 siblings, 0 replies; 34+ messages in thread
From: Hannes Reinecke @ 2016-01-11 15:24 UTC (permalink / raw)
  To: Al Viro, Dan Williams
  Cc: xfs, linux-block, linux-nvdimm, Dave Chinner, Jens Axboe,
	Jan Kara, linux-fsdevel, Matthew Wilcox, Ross Zwisler

On 01/09/2016 08:54 AM, Al Viro wrote:
> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>> Historically we have waited for filesystem specific heuristics to
>> attempt to guess when a block device is gone.  Sometimes this works, but
>> in other cases the system can hang waiting for the fs to trigger its
>> shutdown protocol.
>>
>> The initial motivation for this investigation was to prevent DAX
>> mappings (direct mmap access to persistent memory) from leaking past the
>> lifetime of the hosting block device.  However, Dave points out that
>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>
>>      For example, if we detect a free space corruption during allocation,
>>      it is not safe to trust *any active mapping* because we can't trust
>>      that we haven't handed out the same block to multiple owners. Hence
>>      on such a filesystem shutdown, we have to prevent any new DAX
>>      mapping from occurring and invalidate all existing mappings as we
>>      cannot allow userspace to modify any data or metadata until we've
>>      resolved the corruption situation.
>>
>> The current block device shutdown sequence of del_gendisk +
>> blk_cleanup_queue is problematic.  We want to tell the fs after
>> blk_cleanup_queue that there is no possibility of recovery, but by that
>> time we have deleted partitions and lost the ability to find all the
>> super-blocks on a block device.
>>
>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>> notifications to all the filesystems hosted on the disk.  Where
>> ->quiesce() is a 'shutdown' operation while the bdev may still be alive,
>> and ->bdi_gone() is a set of actions to take after the backing device
>> is known to be permanently dead.
>
> 	Would you mind explaining what the hell is _the_ backing device
> of a filesystem?  What does that translate into in case of e.g. btrfs
> spanning several disks?  Or ext4 with journal on a different device, for
> that matter?
>
> 	If anything, I would argue that filesystem is out of place here -
> general situation is "IO on X may require IO on device Y and X needs to do
> something when Y goes away".  Consider e.g. /dev/loop backed by a device
> that went away.  Or by a file on fs that has run down the curtain and joined
> the bleedin choir invisible.  With another fs partially hosted by that
> loopback device.  Or by RAID0 containing said device.
>
> 	You are given Y and attempt to locate the affected X.  _Then_
> you assume that X is a filesystem and has "something to be done" independent
> from the role Y played for it, so you can pick that action from superblock
> method.
>
> 	IMO you are placing the burden in the wrong place.  _Recipient_
> knows what it depends upon and what should be done for each source of
> trouble.  So make it the recipient's responsibility to request notifications.
> At which point the superblock method goes away, along with the requirement
> to handle all sources of trouble the same way, etc.
>
> 	What's more, things like RAID5 (also interested in knowing when
> a component has been ripped out) might or might not decide to propagate
> the event further - after all, that's exactly the point of redundancy.
>
> 	I'd look into something along the lines of notifier chain per
> gendisk, with potential victims registering a callback when they decide
> that from now on such and such device might screw them over...

Fully support this. I was planning on something similar for
transporting device-change events (resizing, topology changes, etc.).

And it might even be an idea to convert the block device events to a 
notifier chain, too.

Dan, can you keep me in the loop here?
Thanks.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		               zSeries & Storage
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 34+ messages in thread
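
For contrast with the notifier suggestion, the interface the posted
series itself proposes, per the changelog quoted above, is a pair of
superblock methods.  The sketch below assumes the signatures; the
helpers walking the superblocks hosted on a disk are hypothetical:

	struct super_operations {
		/* ... existing methods ... */
		void (*quiesce)(struct super_block *sb);  /* bdev dying, I/O still works */
		void (*bdi_gone)(struct super_block *sb); /* bdev permanently dead */
	};

	/*
	 * del_gendisk_queue() would then bracket the teardown, roughly:
	 *
	 *	fs_quiesce_on_disk(disk);	// hypothetical helper:
	 *					// ->quiesce() on each sb
	 *	del_gendisk(disk);
	 *	blk_cleanup_queue(disk->queue);
	 *	fs_bdi_gone_on_disk(disk);	// ->bdi_gone() on each sb
	 */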

* Re: [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life
  2016-01-11 15:24       ` Hannes Reinecke
@ 2016-01-11 15:55         ` Dan Williams
  -1 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2016-01-11 15:55 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Al Viro, XFS Developers, linux-block, linux-nvdimm, Dave Chinner,
	Jens Axboe, Jan Kara, linux-fsdevel, Matthew Wilcox,
	Ross Zwisler

On Mon, Jan 11, 2016 at 7:24 AM, Hannes Reinecke <hare@suse.de> wrote:
> On 01/09/2016 08:54 AM, Al Viro wrote:
>>
>> On Mon, Jan 04, 2016 at 10:20:05AM -0800, Dan Williams wrote:
>>>
>>> Historically we have waited for filesystem specific heuristics to
>>> attempt to guess when a block device is gone.  Sometimes this works, but
>>> in other cases the system can hang waiting for the fs to trigger its
>>> shutdown protocol.
>>>
>>> The initial motivation for this investigation was to prevent DAX
>>> mappings (direct mmap access to persistent memory) from leaking past the
>>> lifetime of the hosting block device.  However, Dave points out that
>>> these shutdown operations are needed in other scenarios.  Quoting Dave:
>>>
>>>      For example, if we detect a free space corruption during allocation,
>>>      it is not safe to trust *any active mapping* because we can't trust
>>>      that we haven't handed out the same block to multiple owners. Hence
>>>      on such a filesystem shutdown, we have to prevent any new DAX
>>>      mapping from occurring and invalidate all existing mappings as we
>>>      cannot allow userspace to modify any data or metadata until we've
>>>      resolved the corruption situation.
>>>
>>> The current block device shutdown sequence of del_gendisk +
>>> blk_cleanup_queue is problematic.  We want to tell the fs after
>>> blk_cleanup_queue that there is no possibility of recovery, but by that
>>> time we have deleted partitions and lost the ability to find all the
>>> super-blocks on a block device.
>>>
>>> Introduce del_gendisk_queue to trigger ->quiesce() and ->bdi_gone()
>>> notifications to all the filesystems hosted on the disk.  Where
>>> ->quiesce() is a 'shutdown' operation while the bdev may still be alive,
>>> and ->bdi_gone() is a set of actions to take after the backing device
>>> is known to be permanently dead.
>>
>>
>>         Would you mind explaining what the hell is _the_ backing device
>> of a filesystem?  What does that translate into in case of e.g. btrfs
>> spanning several disks?  Or ext4 with journal on a different device, for
>> that matter?
>>
>>         If anything, I would argue that filesystem is out of place here -
>> general situation is "IO on X may require IO on device Y and X needs to do
>> something when Y goes away".  Consider e.g. /dev/loop backed by a device
>> that went away.  Or by a file on fs that has run down the curtain and
>> joined
>> the bleedin choir invisible.  With another fs partially hosted by that
>> loopback device.  Or by RAID0 containing said device.
>>
>>         You are given Y and attempt to locate the affected X.  _Then_
>> you assume that X is a filesystem and has "something to be done"
>> independent
>> from the role Y played for it, so you can pick that action from superblock
>> method.
>>
>>         IMO you are placing the burden in the wrong place.  _Recipient_
>> knows what it depends upon and what should be done for each source of
>> trouble.  So make it the recipient's responsibility to request notifications.
>> At which point the superblock method goes away, along with the requirement
>> to handle all sources of trouble the same way, etc.
>>
>>         What's more, things like RAID5 (also interested in knowing when
>> a component has been ripped out) might or might not decide to propagate
>> the event further - after all, that's exactly the point of redundancy.
>>
>>         I'd look into something along the lines of notifier chain per
>> gendisk, with potential victims registering a callback when they decide
>> that from now on such and such device might screw them over...
>
>
> Fully support this. I was planning on something similar for transporting
> device-change events (resizing, topology changes, etc.).
>
> And it might even be an idea to convert the block device events to a
> notifier chain, too.
>
> Dan, can you keep me in the loop here?

Yes, will do.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-01-11 15:55 UTC | newest]

Thread overview: 34+ messages
-- links below jump to the message on this page --
2016-01-04 18:20 [resend PATCH 0/3] fs, bdev: handle end of life Dan Williams
2016-01-04 18:20 ` [resend PATCH 1/3] block, fs: reliably communicate bdev end-of-life Dan Williams
2016-01-05  3:51   ` Dave Chinner
2016-01-05  4:25     ` Dan Williams
2016-01-05 22:32       ` Dave Chinner
2016-01-09  7:54   ` Al Viro
2016-01-09 14:17     ` Dan Williams
2016-01-11  7:15       ` Hannes Reinecke
2016-01-11 15:24     ` Hannes Reinecke
2016-01-11 15:55       ` Dan Williams
2016-01-04 18:20 ` [resend PATCH 2/3] xfs: handle shutdown notifications Dan Williams
2016-01-05  4:03   ` Dave Chinner
2016-01-04 18:20 ` [resend PATCH 3/3] writeback: fix false positive WARN in __mark_inode_dirty Dan Williams
2016-01-05  4:23   ` Dave Chinner
2016-01-05 19:59     ` Dan Williams
2016-01-05 21:10       ` Dave Chinner
2016-01-05 21:29         ` Dan Williams
