* [PATCH 0/2] block drivers + dax vs driver unbind
@ 2015-09-30  0:41 Dan Williams
  2015-09-30  0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams
  2015-09-30  0:41 ` [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
  0 siblings, 2 replies; 13+ messages in thread
From: Dan Williams @ 2015-09-30  0:41 UTC (permalink / raw)
  To: axboe
  Cc: Boaz Harrosh, linux-nvdimm, Dave Chinner, linux-kernel,
	Keith Busch, ross.zwisler, Christoph Hellwig

Auditing pmem driver teardown operations while developing
get_user_pages() support for dax [1] revealed that we can trivially
crash the kernel by triggering new i/o requests after unbinding the pmem
driver.  In fact, any bio-based driver is susceptible to this crash
because the queue draining done at shutdown relies on in-flight 'struct
request' objects to pin the queue active, and bio-based drivers have no
such requests.

Solve the problem generically for all drivers and export the new
blk_queue_enter() and blk_queue_exit() helpers for dax to indicate when
the "request queue" is busy (i.e. we are actively using an address
returned by ->direct_access()).

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
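
For illustration, the usage pattern on the dax side ends up looking
roughly like the sketch below, distilled from patch 2 (the helper name
dax_map_example() is made up for this summary, not part of the series):

static void __pmem *dax_map_example(struct block_device *bdev,
		sector_t sector, long size, unsigned long *pfn)
{
	struct request_queue *q = bdev->bd_queue;
	void __pmem *addr;
	long rc;

	/* pin the queue, or fail if blk_cleanup_queue() is underway */
	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
		return (void __pmem *) ERR_PTR(-EIO);

	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
	if (rc < 0) {
		blk_queue_exit(q);
		return (void __pmem *) ERR_PTR(rc);
	}

	/* caller must call blk_queue_exit(q) once done with 'addr' */
	return addr;
}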

---

Dan Williams (2):
      block: generic request_queue reference counting
      block, dax: fix lifetime of in-kernel dax mappings


 block/blk-core.c       |   71 +++++++++++++++++++++++---
 block/blk-mq-sysfs.c   |    6 --
 block/blk-mq.c         |   80 +++++++++---------------------
 block/blk-sysfs.c      |    3 -
 block/blk.h            |   12 ++++
 fs/dax.c               |  130 +++++++++++++++++++++++++++++++-----------------
 include/linux/blk-mq.h |    1 
 include/linux/blkdev.h |    4 +
 8 files changed, 185 insertions(+), 122 deletions(-)

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/2] block: generic request_queue reference counting
  2015-09-30  0:41 [PATCH 0/2] block drivers + dax vs driver unbind Dan Williams
@ 2015-09-30  0:41 ` Dan Williams
  2015-10-04  6:40   ` Christoph Hellwig
  2015-10-04  7:52   ` Ming Lei
  2015-09-30  0:41 ` [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
  1 sibling, 2 replies; 13+ messages in thread
From: Dan Williams @ 2015-09-30  0:41 UTC (permalink / raw)
  To: axboe
  Cc: Keith Busch, ross.zwisler, linux-nvdimm, Christoph Hellwig, linux-kernel

Allow pmem, and other synchronous/bio-based block drivers, to fall back
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       |   71 +++++++++++++++++++++++++++++++++++++------
 block/blk-mq-sysfs.c   |    6 ----
 block/blk-mq.c         |   80 ++++++++++++++----------------------------------
 block/blk-sysfs.c      |    3 +-
 block/blk.h            |   14 ++++++++
 include/linux/blk-mq.h |    1 -
 include/linux/blkdev.h |    2 +
 7 files changed, 102 insertions(+), 75 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d48773..6062550baaef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * Drain all requests queued before DYING marking. Set DEAD flag to
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		spin_lock_irq(lock);
-	} else {
-		spin_lock_irq(lock);
+	blk_freeze_queue(q);
+	spin_lock_irq(lock);
+	if (!q->mq_ops)
 		__blk_drain_queue(q, true);
-	}
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);
 
 	/* @q won't process any more request, flush async actions */
 	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
 	blk_sync_queue(q);
+	percpu_ref_exit(&q->q_usage_counter);
 
 	if (q->mq_ops)
 		blk_mq_free_queue(q);
@@ -629,6 +627,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+int blk_queue_enter(struct request_queue *q, gfp_t gfp)
+{
+	while (true) {
+		int ret;
+
+		if (percpu_ref_tryget_live(&q->q_usage_counter))
+			return 0;
+
+		if (!(gfp & __GFP_WAIT))
+			return -EBUSY;
+
+		ret = wait_event_interruptible(q->mq_freeze_wq,
+				!atomic_read(&q->mq_freeze_depth) ||
+				blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			return -ENODEV;
+		if (ret)
+			return ret;
+	}
+}
+
+void blk_queue_exit(struct request_queue *q)
+{
+	percpu_ref_put(&q->q_usage_counter);
+}
+
+static void blk_queue_usage_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q =
+		container_of(ref, struct request_queue, q_usage_counter);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
@@ -690,11 +722,22 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	init_waitqueue_head(&q->mq_freeze_wq);
 
-	if (blkcg_init_queue(q))
+	/*
+	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
+	 * See blk_register_queue() for details.
+	 */
+	if (percpu_ref_init(&q->q_usage_counter,
+				blk_queue_usage_counter_release,
+				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;
 
+	if (blkcg_init_queue(q))
+		goto fail_ref;
+
 	return q;
 
+fail_ref:
+	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
 fail_split:
@@ -1966,9 +2009,19 @@ void generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 
-		q->make_request_fn(q, bio);
+		if (likely(blk_queue_enter(q, __GFP_WAIT) == 0)) {
+
+			q->make_request_fn(q, bio);
+
+			blk_queue_exit(q);
 
-		bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(current->bio_list);
+		} else {
+			struct bio *bio_next = bio_list_pop(current->bio_list);
+
+			bio_io_error(bio);
+			bio = bio_next;
+		}
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 }
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 279c5d674edf..731b6eecce82 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -412,12 +412,6 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 		kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
 }
 
-/* see blk_register_queue() */
-void blk_mq_finish_init(struct request_queue *q)
-{
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
-}
-
 int blk_mq_register_disk(struct gendisk *disk)
 {
 	struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4047a0..6d91894cf85e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -77,47 +77,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }
 
-static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
-{
-	while (true) {
-		int ret;
-
-		if (percpu_ref_tryget_live(&q->mq_usage_counter))
-			return 0;
-
-		if (!(gfp & __GFP_WAIT))
-			return -EBUSY;
-
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
-				blk_queue_dying(q));
-		if (blk_queue_dying(q))
-			return -ENODEV;
-		if (ret)
-			return ret;
-	}
-}
-
-static void blk_mq_queue_exit(struct request_queue *q)
-{
-	percpu_ref_put(&q->mq_usage_counter);
-}
-
-static void blk_mq_usage_counter_release(struct percpu_ref *ref)
-{
-	struct request_queue *q =
-		container_of(ref, struct request_queue, mq_usage_counter);
-
-	wake_up_all(&q->mq_freeze_wq);
-}
-
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;
 
 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->mq_usage_counter);
+		percpu_ref_kill(&q->q_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -125,18 +91,34 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);
 
 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }
 
 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
  */
-void blk_mq_freeze_queue(struct request_queue *q)
+void blk_freeze_queue(struct request_queue *q)
 {
+	/*
+	 * In the !blk_mq case we are only calling this to kill the
+	 * q_usage_counter, otherwise this increases the freeze depth
+	 * and waits for it to return to zero.  For this reason there is
+	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
+	 * exported to drivers as the only user for unfreeze is blk_mq.
+	 */
 	blk_mq_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
+
+void blk_mq_freeze_queue(struct request_queue *q)
+{
+	/*
+	 * ...just an alias to keep freeze and unfreeze actions balanced
+	 * in the blk_mq_* namespace
+	 */
+	blk_freeze_queue(q);
+}
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);
 
 void blk_mq_unfreeze_queue(struct request_queue *q)
@@ -146,7 +128,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
-		percpu_ref_reinit(&q->mq_usage_counter);
+		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
@@ -255,7 +237,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	struct blk_mq_alloc_data alloc_data;
 	int ret;
 
-	ret = blk_mq_queue_enter(q, gfp);
+	ret = blk_queue_enter(q, gfp);
 	if (ret)
 		return ERR_PTR(ret);
 
@@ -278,7 +260,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	}
 	blk_mq_put_ctx(ctx);
 	if (!rq) {
-		blk_mq_queue_exit(q);
+		blk_queue_exit(q);
 		return ERR_PTR(-EWOULDBLOCK);
 	}
 	return rq;
@@ -297,7 +279,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,
 
 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, tag, &ctx->last_tag);
-	blk_mq_queue_exit(q);
+	blk_queue_exit(q);
 }
 
 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
@@ -1184,11 +1166,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	int rw = bio_data_dir(bio);
 	struct blk_mq_alloc_data alloc_data;
 
-	if (unlikely(blk_mq_queue_enter(q, GFP_KERNEL))) {
-		bio_io_error(bio);
-		return NULL;
-	}
-
+	blk_queue_enter_live(q);
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
 
@@ -1979,14 +1957,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		hctxs[i]->queue_num = i;
 	}
 
-	/*
-	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
-	 * See blk_register_queue() for details.
-	 */
-	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
-		goto err_hctxs;
-
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);
 
@@ -2062,8 +2032,6 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);
 
-	percpu_ref_exit(&q->mq_usage_counter);
-
 	kfree(q->mq_map);
 
 	q->mq_map = NULL;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e44a9da2a13..61fc2633bbea 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -599,9 +599,8 @@ int blk_register_queue(struct gendisk *disk)
 	 */
 	if (!blk_queue_init_done(q)) {
 		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+		percpu_ref_switch_to_percpu(&q->q_usage_counter);
 		blk_queue_bypass_end(q);
-		if (q->mq_ops)
-			blk_mq_finish_init(q);
 	}
 
 	ret = blk_trace_init_sysfs(dev);
diff --git a/block/blk.h b/block/blk.h
index 98614ad37c81..5b2cd393afbe 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,6 +72,20 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
+int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+void blk_queue_exit(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
+
+static inline void blk_queue_enter_live(struct request_queue *q)
+{
+	/*
+	 * Given that running in generic_make_request() context
+	 * guarantees that a live reference against q_usage_counter has
+	 * been established, further references under that same context
+	 * need not check that the queue has been frozen (marked dead).
+	 */
+	percpu_ref_get(&q->q_usage_counter);
+}
 
 void blk_rq_timed_out_timer(unsigned long data);
 unsigned long blk_rq_timeout(unsigned long timeout);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 37d1602c4f7a..25ce763fbb81 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -167,7 +167,6 @@ enum {
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q);
-void blk_mq_finish_init(struct request_queue *q);
 int blk_mq_register_disk(struct gendisk *);
 void blk_mq_unregister_disk(struct gendisk *);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99da9ebc7377..0a0def66c61e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -450,7 +450,7 @@ struct request_queue {
 #endif
 	struct rcu_head		rcu_head;
 	wait_queue_head_t	mq_freeze_wq;
-	struct percpu_ref	mq_usage_counter;
+	struct percpu_ref	q_usage_counter;
 	struct list_head	all_q_node;
 
 	struct blk_mq_tag_set	*tag_set;


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings
  2015-09-30  0:41 [PATCH 0/2] block drivers + dax vs driver unbind Dan Williams
  2015-09-30  0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams
@ 2015-09-30  0:41 ` Dan Williams
  2015-09-30 23:35   ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Dan Williams @ 2015-09-30  0:41 UTC (permalink / raw)
  To: axboe
  Cc: Boaz Harrosh, linux-nvdimm, Dave Chinner, linux-kernel,
	ross.zwisler, Christoph Hellwig

The DAX implementation needs to protect new calls to ->direct_access()
and usage of its return value against unbind of the underlying block
device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the
request_queue is being torn down.

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk.h            |    2 -
 fs/dax.c               |  130 +++++++++++++++++++++++++++++++-----------------
 include/linux/blkdev.h |    2 +
 3 files changed, 85 insertions(+), 49 deletions(-)

diff --git a/block/blk.h b/block/blk.h
index 5b2cd393afbe..0f8de0dda768 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
-int blk_queue_enter(struct request_queue *q, gfp_t gfp);
-void blk_queue_exit(struct request_queue *q);
 void blk_freeze_queue(struct request_queue *q);
 
 static inline void blk_queue_enter_live(struct request_queue *q)
diff --git a/fs/dax.c b/fs/dax.c
index bcfb14bfc1e4..7ce002bb60d0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 }
 EXPORT_SYMBOL_GPL(dax_clear_blocks);
 
-static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
-		unsigned blkbits)
+static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
+		unsigned long *pfn, long *len)
 {
-	unsigned long pfn;
+	long rc;
+	void __pmem *addr;
+	struct block_device *bdev = bh->b_bdev;
+	struct request_queue *q = bdev->bd_queue;
 	sector_t sector = bh->b_blocknr << (blkbits - 9);
-	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
+
+	if (blk_queue_enter(q, GFP_NOWAIT) != 0)
+		return (void __pmem *) ERR_PTR(-EIO);
+	rc = bdev_direct_access(bdev, sector, &addr, pfn, bh->b_size);
+	if (len)
+		*len = rc;
+	if (rc < 0) {
+		blk_queue_exit(q);
+		return (void __pmem *) ERR_PTR(rc);
+	}
+	return addr;
+}
+
+static void __pmem *dax_map_bh(const struct buffer_head *bh, unsigned blkbits)
+{
+	unsigned long pfn;
+
+	return __dax_map_bh(bh, blkbits, &pfn, NULL);
+}
+
+static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr)
+{
+	struct block_device *bdev = bh->b_bdev;
+	struct request_queue *q = bdev->bd_queue;
+
+	if (IS_ERR(addr))
+		return;
+	blk_queue_exit(q);
 }
 
 /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
@@ -104,15 +134,16 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 		      loff_t start, loff_t end, get_block_t get_block,
 		      struct buffer_head *bh)
 {
-	ssize_t retval = 0;
-	loff_t pos = start;
-	loff_t max = start;
-	loff_t bh_max = start;
-	void __pmem *addr;
+	loff_t pos = start, max = start, bh_max = start;
+	int rw = iov_iter_rw(iter), rc;
+	long map_len = 0;
+	unsigned long pfn;
+	void __pmem *addr = NULL;
+	void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO);
 	bool hole = false;
 	bool need_wmb = false;
 
-	if (iov_iter_rw(iter) != WRITE)
+	if (rw == READ)
 		end = min(end, i_size_read(inode));
 
 	while (pos < end) {
@@ -127,9 +158,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 			if (pos == bh_max) {
 				bh->b_size = PAGE_ALIGN(end - pos);
 				bh->b_state = 0;
-				retval = get_block(inode, block, bh,
-						   iov_iter_rw(iter) == WRITE);
-				if (retval)
+				rc = get_block(inode, block, bh, rw == WRITE);
+				if (rc)
 					break;
 				if (!buffer_size_valid(bh))
 					bh->b_size = 1 << blkbits;
@@ -141,21 +171,25 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 				bh->b_size -= done;
 			}
 
-			hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh);
+			hole = rw == READ && !buffer_written(bh);
 			if (hole) {
 				addr = NULL;
 				size = bh->b_size - first;
 			} else {
-				retval = dax_get_addr(bh, &addr, blkbits);
-				if (retval < 0)
+				dax_unmap_bh(bh, kmap);
+				kmap = __dax_map_bh(bh, blkbits, &pfn, &map_len);
+				if (IS_ERR(kmap)) {
+					rc = PTR_ERR(kmap);
 					break;
+				}
+				addr = kmap;
 				if (buffer_unwritten(bh) || buffer_new(bh)) {
-					dax_new_buf(addr, retval, first, pos,
-									end);
+					dax_new_buf(addr, map_len, first, pos,
+							end);
 					need_wmb = true;
 				}
 				addr += first;
-				size = retval - first;
+				size = map_len - first;
 			}
 			max = min(pos + size, end);
 		}
@@ -178,8 +212,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
 
 	if (need_wmb)
 		wmb_pmem();
+	dax_unmap_bh(bh, kmap);
 
-	return (pos == start) ? retval : pos - start;
+	return (pos == start) ? rc : pos - start;
 }
 
 /**
@@ -274,21 +309,22 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
 	void __pmem *vfrom;
 	void *vto;
 
-	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
-		return -EIO;
+	vfrom = dax_map_bh(bh, blkbits);
+	if (IS_ERR(vfrom))
+		return PTR_ERR(vfrom);
 	vto = kmap_atomic(to);
 	copy_user_page(vto, (void __force *)vfrom, vaddr, to);
 	kunmap_atomic(vto);
+	dax_unmap_bh(bh, vfrom);
 	return 0;
 }
 
 static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 			struct vm_area_struct *vma, struct vm_fault *vmf)
 {
-	sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9);
 	unsigned long vaddr = (unsigned long)vmf->virtual_address;
-	void __pmem *addr;
 	unsigned long pfn;
+	void __pmem *addr;
 	pgoff_t size;
 	int error;
 
@@ -305,11 +341,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		goto out;
 	}
 
-	error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size);
-	if (error < 0)
-		goto out;
-	if (error < PAGE_SIZE) {
-		error = -EIO;
+	addr = __dax_map_bh(bh, inode->i_blkbits, &pfn, NULL);
+	if (IS_ERR(addr)) {
+		error = PTR_ERR(addr);
 		goto out;
 	}
 
@@ -317,6 +351,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
 		clear_pmem(addr, PAGE_SIZE);
 		wmb_pmem();
 	}
+	dax_unmap_bh(bh, addr);
 
 	error = vm_insert_mixed(vma, vaddr, pfn);
 
@@ -528,11 +563,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	unsigned blkbits = inode->i_blkbits;
 	unsigned long pmd_addr = address & PMD_MASK;
 	bool write = flags & FAULT_FLAG_WRITE;
-	long length;
-	void __pmem *kaddr;
 	pgoff_t size, pgoff;
-	sector_t block, sector;
-	unsigned long pfn;
+	sector_t block;
 	int result = 0;
 
 	/* Fall back to PTEs if we're going to COW */
@@ -557,8 +589,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 
 	bh.b_size = PMD_SIZE;
 	i_mmap_lock_write(mapping);
-	length = get_block(inode, block, &bh, write);
-	if (length)
+	if (get_block(inode, block, &bh, write) != 0)
 		return VM_FAULT_SIGBUS;
 
 	/*
@@ -569,17 +600,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 	if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE)
 		goto fallback;
 
-	sector = bh.b_blocknr << (blkbits - 9);
-
 	if (buffer_unwritten(&bh) || buffer_new(&bh)) {
 		int i;
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
 
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
+
 		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
 			goto fallback;
 
@@ -589,6 +620,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		count_vm_event(PGMAJFAULT);
 		mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
 		result |= VM_FAULT_MAJOR;
+		dax_unmap_bh(&bh, kaddr);
 	}
 
 	/*
@@ -635,12 +667,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		result = VM_FAULT_NOPAGE;
 		spin_unlock(ptl);
 	} else {
-		length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn,
-						bh.b_size);
-		if (length < 0) {
+		long length;
+		unsigned long pfn;
+		void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length);
+
+		if (IS_ERR(kaddr)) {
 			result = VM_FAULT_SIGBUS;
 			goto out;
 		}
+		dax_unmap_bh(&bh, kaddr);
 		if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR))
 			goto fallback;
 
@@ -746,12 +781,13 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
 	if (err < 0)
 		return err;
 	if (buffer_written(&bh)) {
-		void __pmem *addr;
-		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
-		if (err < 0)
-			return err;
+		void __pmem *addr = dax_map_bh(&bh, inode->i_blkbits);
+
+		if (IS_ERR(addr))
+			return PTR_ERR(addr);
 		clear_pmem(addr + offset, length);
 		wmb_pmem();
+		dax_unmap_bh(&bh, addr);
 	}
 
 	return 0;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 0a0def66c61e..8127fff4d0f5 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -786,6 +786,8 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
 			 struct scsi_ioctl_command __user *);
 
+extern int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+extern void blk_queue_exit(struct request_queue *q);
 extern void blk_start_queue(struct request_queue *q);
 extern void blk_stop_queue(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings
  2015-09-30  0:41 ` [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
@ 2015-09-30 23:35   ` Dave Chinner
  2015-10-01  0:09     ` Dan Williams
  2015-10-06  1:57     ` Dan Williams
  0 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2015-09-30 23:35 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, Boaz Harrosh, linux-nvdimm, linux-kernel, ross.zwisler,
	Christoph Hellwig

On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote:
> The DAX implementation needs to protect new calls to ->direct_access()
> and usage of its return value against unbind of the underlying block
> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
> blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the
> request_queue is being torn down.
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Boaz Harrosh <boaz@plexistor.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  block/blk.h            |    2 -
>  fs/dax.c               |  130 +++++++++++++++++++++++++++++++-----------------
>  include/linux/blkdev.h |    2 +
>  3 files changed, 85 insertions(+), 49 deletions(-)
> 
> diff --git a/block/blk.h b/block/blk.h
> index 5b2cd393afbe..0f8de0dda768 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
>  void __blk_queue_free_tags(struct request_queue *q);
>  bool __blk_end_bidi_request(struct request *rq, int error,
>  			    unsigned int nr_bytes, unsigned int bidi_bytes);
> -int blk_queue_enter(struct request_queue *q, gfp_t gfp);
> -void blk_queue_exit(struct request_queue *q);
>  void blk_freeze_queue(struct request_queue *q);
>  
>  static inline void blk_queue_enter_live(struct request_queue *q)
> diff --git a/fs/dax.c b/fs/dax.c
> index bcfb14bfc1e4..7ce002bb60d0 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>  }
>  EXPORT_SYMBOL_GPL(dax_clear_blocks);
>  
> -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
> -		unsigned blkbits)
> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
> +		unsigned long *pfn, long *len)

Please don't use bufferheads for this. Please pass an inode, the
block and length to map, similar to dax_clear_blocks().
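
That is, something shaped roughly like the sketch below (the name
dax_map_blocks() and the exact arguments are hypothetical, just to show
the direction, not code from this series):

static void __pmem *dax_map_blocks(struct inode *inode, sector_t block,
		long size, unsigned long *pfn, long *len)
{
	struct block_device *bdev = inode->i_sb->s_bdev;
	sector_t sector = block << (inode->i_blkbits - 9);
	void __pmem *addr;
	long rc;

	if (blk_queue_enter(bdev->bd_queue, GFP_NOWAIT) != 0)
		return (void __pmem *) ERR_PTR(-EIO);

	rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
	if (len)
		*len = rc;
	if (rc < 0) {
		blk_queue_exit(bdev->bd_queue);
		return (void __pmem *) ERR_PTR(rc);
	}
	return addr;
}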

Why? Because dax_clear_blocks() needs to do this "mapping" too,
and it is called from contexts where there are no bufferheads.
There's a good chance we'll need more mapping contexts like this in
future, so let's not propagate bufferheads deeper into this code
than we absolutely need to.

We should be trying to limit/remove bufferheads in the DAX code, not
propagating them deeper into the code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings
  2015-09-30 23:35   ` Dave Chinner
@ 2015-10-01  0:09     ` Dan Williams
  2015-10-06  1:57     ` Dan Williams
  1 sibling, 0 replies; 13+ messages in thread
From: Dan Williams @ 2015-10-01  0:09 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm@lists.01.org,
	linux-kernel, Ross Zwisler, Christoph Hellwig

On Wed, Sep 30, 2015 at 4:35 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote:
>> The DAX implementation needs to protect new calls to ->direct_access()
>> and usage of its return value against unbind of the underlying block
>> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
>> blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the
>> request_queue is being torn down.
>>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Boaz Harrosh <boaz@plexistor.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  block/blk.h            |    2 -
>>  fs/dax.c               |  130 +++++++++++++++++++++++++++++++-----------------
>>  include/linux/blkdev.h |    2 +
>>  3 files changed, 85 insertions(+), 49 deletions(-)
>>
>> diff --git a/block/blk.h b/block/blk.h
>> index 5b2cd393afbe..0f8de0dda768 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
>>  void __blk_queue_free_tags(struct request_queue *q);
>>  bool __blk_end_bidi_request(struct request *rq, int error,
>>                           unsigned int nr_bytes, unsigned int bidi_bytes);
>> -int blk_queue_enter(struct request_queue *q, gfp_t gfp);
>> -void blk_queue_exit(struct request_queue *q);
>>  void blk_freeze_queue(struct request_queue *q);
>>
>>  static inline void blk_queue_enter_live(struct request_queue *q)
>> diff --git a/fs/dax.c b/fs/dax.c
>> index bcfb14bfc1e4..7ce002bb60d0 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  }
>>  EXPORT_SYMBOL_GPL(dax_clear_blocks);
>>
>> -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
>> -             unsigned blkbits)
>> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
>> +             unsigned long *pfn, long *len)
>
> Please don't use bufferheads for this. Please pass an inode, the
> block and length to map, similar to dax_clear_blocks().
>
> Why? Because dax_clear_blocks() needs to do this "mapping" too,
> and it is called from contexts where there are no bufferheads.
> There's a good chance we'll need more mapping contexts like this in
> future, so let's not propagate bufferheads deeper into this code
> than we absolutely need to.
>
> We should be trying to limit/remove bufferheads in the DAX code, not
> propagating them deeper into the code...
>

Sounds good, will do.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] block: generic request_queue reference counting
  2015-09-30  0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams
@ 2015-10-04  6:40   ` Christoph Hellwig
  2015-10-05 23:44     ` Dan Williams
  2015-10-04  7:52   ` Ming Lei
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2015-10-04  6:40 UTC (permalink / raw)
  To: Dan Williams
  Cc: axboe, Keith Busch, ross.zwisler, linux-nvdimm,
	Christoph Hellwig, linux-kernel

On Tue, Sep 29, 2015 at 08:41:31PM -0400, Dan Williams wrote:
> Allow pmem, and other synchronous/bio-based block drivers, to fall back
> on a per-cpu reference count managed by the core for tracking queue
> live/dead state.
> 
> The existing per-cpu reference count for the blk_mq case is promoted to
> be used in all block i/o scenarios.  This involves initializing it by
> default, waiting for it to drop to zero at exit, and holding a live
> reference over the invocation of q->make_request_fn() in
> generic_make_request().  The blk_mq code continues to take its own
> reference per blk_mq request and retains the ability to freeze the
> queue, but the check that the queue is frozen is moved to
> generic_make_request().
> 
> This fixes crash signatures like the following:
> 
>  BUG: unable to handle kernel paging request at ffff880140000000
>  [..]
>  Call Trace:
>   [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
>   [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
>   [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
>   [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
>   [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
>   [<ffffffff8141f966>] submit_bio+0x76/0x170
>   [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
>   [<ffffffff81286e62>] submit_bh+0x12/0x20
>   [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
>   [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
>   [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
>   [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
>   [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
>   [<ffffffff81303494>] ext4_put_super+0x64/0x360
>   [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0
> 
> Cc: Jens Axboe <axboe@kernel.dk>
> Cc: Keith Busch <keith.busch@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  block/blk-core.c       |   71 +++++++++++++++++++++++++++++++++++++------
>  block/blk-mq-sysfs.c   |    6 ----
>  block/blk-mq.c         |   80 ++++++++++++++----------------------------------
>  block/blk-sysfs.c      |    3 +-
>  block/blk.h            |   14 ++++++++
>  include/linux/blk-mq.h |    1 -
>  include/linux/blkdev.h |    2 +
>  7 files changed, 102 insertions(+), 75 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2eb722d48773..6062550baaef 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
>  	 * Drain all requests queued before DYING marking. Set DEAD flag to
>  	 * prevent that q->request_fn() gets invoked after draining finished.
>  	 */
> -	if (q->mq_ops) {
> -		blk_mq_freeze_queue(q);
> -		spin_lock_irq(lock);
> -	} else {
> -		spin_lock_irq(lock);
> +	blk_freeze_queue(q);
> +	spin_lock_irq(lock);
> +	if (!q->mq_ops)
>  		__blk_drain_queue(q, true);
> -	}

__blk_drain_queue really ought to be moved into blk_freeze_queue so it
has equivalent functionality for mq vs !mq.  But maybe that can be a
separate patch.
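
Something along these lines, perhaps (a rough sketch only; it glosses
over __blk_drain_queue() being static to blk-core.c and over the
queue_lock handling in blk_cleanup_queue()):

void blk_freeze_queue(struct request_queue *q)
{
	blk_mq_freeze_queue_start(q);
	blk_mq_freeze_queue_wait(q);

	if (!q->mq_ops) {
		/* legacy request_fn queues still need an explicit drain */
		spin_lock_irq(q->queue_lock);
		__blk_drain_queue(q, true);
		spin_unlock_irq(q->queue_lock);
	}
}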

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] block: generic request_queue reference counting
  2015-09-30  0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams
  2015-10-04  6:40   ` Christoph Hellwig
@ 2015-10-04  7:52   ` Ming Lei
  2015-10-05 23:23     ` Dan Williams
  1 sibling, 1 reply; 13+ messages in thread
From: Ming Lei @ 2015-10-04  7:52 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, ross.zwisler, linux-nvdimm,
	Christoph Hellwig, Linux Kernel Mailing List

On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> Allow pmem, and other synchronous/bio-based block drivers, to fall back

Just a bit curious: why not extend it to all (both synchronous and
asynchronous) bio-based drivers?  As you mentioned in the introductory
message, all bio-based drivers may have this kind of problem.

One idea I thought of is to hold the usage counter for the bio's
lifetime, instead of the request's lifetime as in blk-mq.
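
Roughly something like the sketch below (the *_sketch helpers are made
up just to illustrate the idea): generic_make_request() would take the
reference before calling ->make_request_fn() and keep it, and
bio_endio() would drop it.

static inline void bio_queue_enter_sketch(struct bio *bio)
{
	struct request_queue *q = bdev_get_queue(bio->bi_bdev);

	/* error handling elided; a dying queue would fail the bio */
	blk_queue_enter(q, __GFP_WAIT);
}

static inline void bio_queue_exit_sketch(struct bio *bio)
{
	/*
	 * Caveat: stacking drivers may remap bi_bdev before completion,
	 * so a real implementation would need to record the original
	 * queue (or flag the bio) at submission time.
	 */
	blk_queue_exit(bdev_get_queue(bio->bi_bdev));
}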

> on a per-cpu reference count managed by the core for tracking queue
> live/dead state.
>
> The existing per-cpu reference count for the blk_mq case is promoted to
> be used in all block i/o scenarios.  This involves initializing it by
> default, waiting for it to drop to zero at exit, and holding a live
> reference over the invocation of q->make_request_fn() in

It isn't enough for asynchronous bio drivers.

> [..]



-- 
Ming Lei

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] block: generic request_queue reference counting
  2015-10-04  7:52   ` Ming Lei
@ 2015-10-05 23:23     ` Dan Williams
  2015-10-06 10:46       ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Williams @ 2015-10-05 23:23 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org,
	Christoph Hellwig, Linux Kernel Mailing List

On Sun, Oct 4, 2015 at 12:52 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> Allow pmem, and other synchronous/bio-based block drivers, to fall back
>
> Just a bit curious: why not extend it to all (both synchronous and
> asynchronous) bio-based drivers?  As you mentioned in the introductory
> message, all bio-based drivers may have this kind of problem.
>
> One idea I thought of is to hold the usage counter for the bio's
> lifetime, instead of the request's lifetime as in blk-mq.
>
>> on a per-cpu reference count managed by the core for tracking queue
>> live/dead state.
>>
>> The existing per-cpu reference count for the blk_mq case is promoted to
>> be used in all block i/o scenarios.  This involves initializing it by
>> default, waiting for it to drop to zero at exit, and holding a live
>> reference over the invocation of q->make_request_fn() in
>
> It isn't enough for asynchronous bio drivers.

True, but I think that support is a follow-on extension of the
mechanism.  It does not seem to me as straightforward as holding a
per-request reference, or a reference over the submission path.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] block: generic request_queue reference counting
  2015-10-04  6:40   ` Christoph Hellwig
@ 2015-10-05 23:44     ` Dan Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2015-10-05 23:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org,
	linux-kernel

On Sat, Oct 3, 2015 at 11:40 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Sep 29, 2015 at 08:41:31PM -0400, Dan Williams wrote:
>> Allow pmem, and other synchronous/bio-based block drivers, to fall back
>> on a per-cpu reference count managed by the core for tracking queue
>> live/dead state.
>>
>> The existing per-cpu reference count for the blk_mq case is promoted to
>> be used in all block i/o scenarios.  This involves initializing it by
>> default, waiting for it to drop to zero at exit, and holding a live
>> reference over the invocation of q->make_request_fn() in
>> generic_make_request().  The blk_mq code continues to take its own
>> reference per blk_mq request and retains the ability to freeze the
>> queue, but the check that the queue is frozen is moved to
>> generic_make_request().
>>
>> This fixes crash signatures like the following:
>>
>>  BUG: unable to handle kernel paging request at ffff880140000000
>>  [..]
>>  Call Trace:
>>   [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
>>   [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
>>   [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
>>   [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
>>   [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
>>   [<ffffffff8141f966>] submit_bio+0x76/0x170
>>   [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
>>   [<ffffffff81286e62>] submit_bh+0x12/0x20
>>   [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
>>   [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
>>   [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
>>   [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
>>   [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
>>   [<ffffffff81303494>] ext4_put_super+0x64/0x360
>>   [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0
>>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Keith Busch <keith.busch@intel.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  block/blk-core.c       |   71 +++++++++++++++++++++++++++++++++++++------
>>  block/blk-mq-sysfs.c   |    6 ----
>>  block/blk-mq.c         |   80 ++++++++++++++----------------------------------
>>  block/blk-sysfs.c      |    3 +-
>>  block/blk.h            |   14 ++++++++
>>  include/linux/blk-mq.h |    1 -
>>  include/linux/blkdev.h |    2 +
>>  7 files changed, 102 insertions(+), 75 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 2eb722d48773..6062550baaef 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
>>        * Drain all requests queued before DYING marking. Set DEAD flag to
>>        * prevent that q->request_fn() gets invoked after draining finished.
>>        */
>> -     if (q->mq_ops) {
>> -             blk_mq_freeze_queue(q);
>> -             spin_lock_irq(lock);
>> -     } else {
>> -             spin_lock_irq(lock);
>> +     blk_freeze_queue(q);
>> +     spin_lock_irq(lock);
>> +     if (!q->mq_ops)
>>               __blk_drain_queue(q, true);
>> -     }
>
> __blk_drain_queue really ought to be moved into blk_freeze_queue so it
> has equivlaent functionality for mq vs !mq.  But maybe that can be a
> separate patch.

It's not clear why the cleanup path gates __blk_drain_queue() on
->mq_ops in the first place; __blk_drain_queue() already gets called
from blkcg_init_queue().  Also, it seems blk_get_flush_queue() already
handles the mq_ops vs !mq_ops case, but I'd have to take a closer
look.  Definitely seems like follow-on patch material...
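
If it checks out, that follow-on would be something like the below
(untested sketch; __blk_freeze_queue() is just a placeholder name for
the current kill-the-percpu-ref-and-wait body):

	void blk_freeze_queue(struct request_queue *q)
	{
		__blk_freeze_queue(q);	/* placeholder: kill percpu ref and wait */

		if (!q->mq_ops) {
			spin_lock_irq(q->queue_lock);
			__blk_drain_queue(q, true);
			spin_unlock_irq(q->queue_lock);
		}
	}

...and blk_cleanup_queue() would then lose its ->mq_ops special
casing entirely.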

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings
  2015-09-30 23:35   ` Dave Chinner
  2015-10-01  0:09     ` Dan Williams
@ 2015-10-06  1:57     ` Dan Williams
  2015-10-06 21:15       ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Dan Williams @ 2015-10-06  1:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm@lists.01.org,
	linux-kernel, Ross Zwisler, Christoph Hellwig

On Wed, Sep 30, 2015 at 4:35 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote:
>> The DAX implementation needs to protect new calls to ->direct_access()
>> and usage of its return value against unbind of the underlying block
>> device.  Use blk_queue_enter()/blk_queue_exit() to either prevent
>> blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the
>> request_queue is being torn down.
>>
>> Cc: Jens Axboe <axboe@kernel.dk>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Boaz Harrosh <boaz@plexistor.com>
>> Cc: Dave Chinner <david@fromorbit.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  block/blk.h            |    2 -
>>  fs/dax.c               |  130 +++++++++++++++++++++++++++++++-----------------
>>  include/linux/blkdev.h |    2 +
>>  3 files changed, 85 insertions(+), 49 deletions(-)
>>
>> diff --git a/block/blk.h b/block/blk.h
>> index 5b2cd393afbe..0f8de0dda768 100644
>> --- a/block/blk.h
>> +++ b/block/blk.h
>> @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq);
>>  void __blk_queue_free_tags(struct request_queue *q);
>>  bool __blk_end_bidi_request(struct request *rq, int error,
>>                           unsigned int nr_bytes, unsigned int bidi_bytes);
>> -int blk_queue_enter(struct request_queue *q, gfp_t gfp);
>> -void blk_queue_exit(struct request_queue *q);
>>  void blk_freeze_queue(struct request_queue *q);
>>
>>  static inline void blk_queue_enter_live(struct request_queue *q)
>> diff --git a/fs/dax.c b/fs/dax.c
>> index bcfb14bfc1e4..7ce002bb60d0 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size)
>>  }
>>  EXPORT_SYMBOL_GPL(dax_clear_blocks);
>>
>> -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr,
>> -             unsigned blkbits)
>> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
>> +             unsigned long *pfn, long *len)
>
> Please don't use bufferheads for this. Please pass an inode, the
> block and length to map, similar to dax_clear_blocks().
>
> Why? Because dax_clear_blocks() needs to do this "mapping" too,
> and it is called from contexts where there are no bufferheads.
> There's a good chance we'll need more mapping contexts like this in
> future, so lets not propagate bufferheads deeper into this code
> than we absilutely need to.
>
> We should be trying to limit/remove bufferheads in the DAX code, not
> propagating them deeper into the code...

So I gave this a try but ran into the roadblock that get_block()
performs the inode-to-bdev conversion, and that is filesystem
specific.  However, I'll at least stop passing the bh into this map
routine and will call it dax_map_atomic(), which is a more accurate
name.
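
Something along these lines (rough sketch; the final signature, gfp
flags, and error returns are still TBD):

	static void __pmem *dax_map_atomic(struct block_device *bdev,
			sector_t sector, long size, unsigned long *pfn)
	{
		struct request_queue *q = bdev->bd_queue;
		void __pmem *addr;
		long rc;

		/* fail new mappings if the queue is being torn down */
		if (blk_queue_enter(q, GFP_NOWAIT) != 0)
			return (void __pmem *) ERR_PTR(-EIO);

		rc = bdev_direct_access(bdev, sector, &addr, pfn, size);
		if (rc < 0) {
			blk_queue_exit(q);
			return (void __pmem *) ERR_PTR(rc);
		}
		return addr;
	}

	static void dax_unmap_atomic(struct block_device *bdev, void __pmem *addr)
	{
		if (IS_ERR(addr))
			return;
		blk_queue_exit(bdev->bd_queue);
	}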

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] block: generic request_queue reference counting
  2015-10-05 23:23     ` Dan Williams
@ 2015-10-06 10:46       ` Ming Lei
  2015-10-06 16:04         ` Dan Williams
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2015-10-06 10:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org,
	Christoph Hellwig, Linux Kernel Mailing List

On Tue, Oct 6, 2015 at 7:23 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Sun, Oct 4, 2015 at 12:52 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>> On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> Allow pmem, and other synchronous/bio-based block drivers, to fallback
>>
>> Just a bit curious, why not extend it for all(both synchronous and
>> asynchrounous) bio-based drivers? As you mentioned in introductory
>> message, all bio based drivers may have this kind of problem.
>>
>> One idea I thought of is to hold the usage counter in bio life time,
>> instead of request's life time like in blk-mq.
>>
>>> on a per-cpu reference count managed by the core for tracking queue
>>> live/dead state.
>>>
>>> The existing per-cpu reference count for the blk_mq case is promoted to
>>> be used in all block i/o scenarios.  This involves initializing it by
>>> default, waiting for it to drop to zero at exit, and holding a live
>>> reference over the invocation of q->make_request_fn() in
>>
>> It isn't enough for asynchrounous bio drivers.
>
> True, but I think that support is a follow on extension of the
> mechanism.  It seems to me not as straightforward as holding a
> per-request reference or a reference over the submission path.

Given that this is a general problem for all bio-based drivers, it
seems better to figure out one general solution instead of a partial
fix that only covers synchronous bio-based drivers.

IMO, it should work by holding the usage counter for the bio's
lifetime: take it before calling q->make_request_fn(q, bio) and put
it on the bio's completion (in bio_endio()).  There is still a small
window between all bios completing and all requests completing, but
that should be easy to address for both !mq_ops and mq_ops.
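
Roughly (untested sketch; BIO_QUEUE_REF is a made-up flag name, just
to keep the failure path balanced):

	/* generic_make_request(): take the reference on behalf of the bio */
	struct request_queue *q = bdev_get_queue(bio->bi_bdev);

	if (blk_queue_enter(q, GFP_NOIO)) {
		bio_io_error(bio);		/* never entered, nothing to drop */
		return;
	}
	bio_set_flag(bio, BIO_QUEUE_REF);	/* made-up flag name */
	q->make_request_fn(q, bio);
	/* no blk_queue_exit() here, the reference follows the bio */

	/* bio_endio(): drop the reference when the bio finally completes */
	if (bio_flagged(bio, BIO_QUEUE_REF))
		blk_queue_exit(bdev_get_queue(bio->bi_bdev));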

thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/2] block: generic request_queue reference counting
  2015-10-06 10:46       ` Ming Lei
@ 2015-10-06 16:04         ` Dan Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2015-10-06 16:04 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org,
	Christoph Hellwig, Linux Kernel Mailing List

On Tue, Oct 6, 2015 at 3:46 AM, Ming Lei <tom.leiming@gmail.com> wrote:
> On Tue, Oct 6, 2015 at 7:23 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Sun, Oct 4, 2015 at 12:52 AM, Ming Lei <tom.leiming@gmail.com> wrote:
>>> On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> Allow pmem, and other synchronous/bio-based block drivers, to fallback
>>>
>>> Just a bit curious, why not extend it for all(both synchronous and
>>> asynchrounous) bio-based drivers? As you mentioned in introductory
>>> message, all bio based drivers may have this kind of problem.
>>>
>>> One idea I thought of is to hold the usage counter in bio life time,
>>> instead of request's life time like in blk-mq.
>>>
>>>> on a per-cpu reference count managed by the core for tracking queue
>>>> live/dead state.
>>>>
>>>> The existing per-cpu reference count for the blk_mq case is promoted to
>>>> be used in all block i/o scenarios.  This involves initializing it by
>>>> default, waiting for it to drop to zero at exit, and holding a live
>>>> reference over the invocation of q->make_request_fn() in
>>>
>>> It isn't enough for asynchrounous bio drivers.
>>
>> True, but I think that support is a follow on extension of the
>> mechanism.  It seems to me not as straightforward as holding a
>> per-request reference or a reference over the submission path.
>
> Given that this is a general problem for all bio-based drivers, it
> seems better to figure out one general solution instead of a partial
> fix that only covers synchronous bio-based drivers.
>
> IMO, it should work by holding the usage counter for the bio's
> lifetime: take it before calling q->make_request_fn(q, bio) and put
> it on the bio's completion (in bio_endio()).  There is still a small
> window between all bios completing and all requests completing, but
> that should be easy to address for both !mq_ops and mq_ops.

I'm not convinced it is that simple, i.e. that all bio_endio()
invocations are guaranteed to have a 1:1 matching invocation through
generic_make_request().  If you want to help with that audit I'd be
appreciative and open to putting together a follow-on patch.  In the
meantime this starting point already makes things better for pmem,
nd_blk, brd, etc...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings
  2015-10-06  1:57     ` Dan Williams
@ 2015-10-06 21:15       ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2015-10-06 21:15 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm@lists.01.org,
	linux-kernel, Ross Zwisler, Christoph Hellwig

On Mon, Oct 05, 2015 at 06:57:19PM -0700, Dan Williams wrote:
> On Wed, Sep 30, 2015 at 4:35 PM, Dave Chinner <david@fromorbit.com> wrote:
> > On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote:
> >> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits,
> >> +             unsigned long *pfn, long *len)
> >
> > Please don't use bufferheads for this. Please pass an inode, the
> > block and length to map, similar to dax_clear_blocks().
> >
> > Why? Because dax_clear_blocks() needs to do this "mapping" too,
> > and it is called from contexts where there are no bufferheads.
> > There's a good chance we'll need more mapping contexts like this in
> > future, so lets not propagate bufferheads deeper into this code
> > than we absilutely need to.
> >
> > We should be trying to limit/remove bufferheads in the DAX code, not
> > propagating them deeper into the code...
> 
> So I gave this a try but ran into the roadblock that get_block()
> performs the inode-to-bdev conversion, and that is filesystem
> specific.  However, I'll at least stop passing the bh into this map
> routine and will call it dax_map_atomic(), which is a more accurate
> name.

Right, we still need the bh for the get_block call - that much is
unavoidable right now - but I'm angling towards converting XFS to use
a struct iomap and a ->map_blocks method in future, similar to the
exportfs calls for pNFS.  Having a dedicated structure and set of
methods for obtaining and manipulating extent mappings on inodes will
make all these bufferhead issues go away...
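
i.e. a dedicated mapping object and per-fs method along these lines
(very rough sketch; names and fields are not final):

	struct iomap {
		struct block_device	*bdev;		/* device the extent lives on */
		sector_t		blkno;		/* first sector of the extent */
		loff_t			offset;		/* file offset the extent maps */
		u64			length;		/* length of the extent in bytes */
		int			type;		/* hole, delalloc, mapped, ... */
	};

	/* what DAX would call instead of get_block() + bufferhead decoding */
	int (*map_blocks)(struct inode *inode, loff_t offset, u64 length,
			  struct iomap *iomap, bool write);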

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-10-06 21:16 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-30  0:41 [PATCH 0/2] block drivers + dax vs driver unbind Dan Williams
2015-09-30  0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams
2015-10-04  6:40   ` Christoph Hellwig
2015-10-05 23:44     ` Dan Williams
2015-10-04  7:52   ` Ming Lei
2015-10-05 23:23     ` Dan Williams
2015-10-06 10:46       ` Ming Lei
2015-10-06 16:04         ` Dan Williams
2015-09-30  0:41 ` [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings Dan Williams
2015-09-30 23:35   ` Dave Chinner
2015-10-01  0:09     ` Dan Williams
2015-10-06  1:57     ` Dan Williams
2015-10-06 21:15       ` Dave Chinner
