* [PATCH 0/2] block drivers + dax vs driver unbind
From: Dan Williams @ 2015-09-30  0:41 UTC
To: axboe
Cc: Boaz Harrosh, linux-nvdimm, Dave Chinner, linux-kernel, Keith Busch, ross.zwisler, Christoph Hellwig

Auditing pmem driver teardown operations, while developing
get_user_pages() support for dax [1], revealed that we can trivially
crash the kernel by triggering new i/o requests after unbinding the pmem
driver.  In fact, any bio-based driver is susceptible to this crash
because the queue draining done at shutdown uses in-flight 'struct
request' objects to pin the queue active.

Solve the problem generically for all drivers and export the new
blk_queue_enter() and blk_queue_exit() helpers for dax to indicate when
the "request queue" is busy (i.e. we are actively using an address
returned by ->direct_access()).

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html

---

Dan Williams (2):
      block: generic request_queue reference counting
      block, dax: fix lifetime of in-kernel dax mappings

 block/blk-core.c       |  71 +++++++++++++++++++++++---
 block/blk-mq-sysfs.c   |   6 --
 block/blk-mq.c         |  80 +++++++++--------------------
 block/blk-sysfs.c      |   3 -
 block/blk.h            |  12 ++++
 fs/dax.c               | 130 +++++++++++++++++++++++++++++++-----------------
 include/linux/blk-mq.h |   1
 include/linux/blkdev.h |   4 +

 8 files changed, 185 insertions(+), 122 deletions(-)
* [PATCH 1/2] block: generic request_queue reference counting
From: Dan Williams @ 2015-09-30  0:41 UTC
To: axboe
Cc: Keith Busch, ross.zwisler, linux-nvdimm, Christoph Hellwig, linux-kernel

Allow pmem, and other synchronous/bio-based block drivers, to fall back
on a per-cpu reference count managed by the core for tracking queue
live/dead state.

The existing per-cpu reference count for the blk_mq case is promoted to
be used in all block i/o scenarios.  This involves initializing it by
default, waiting for it to drop to zero at exit, and holding a live
reference over the invocation of q->make_request_fn() in
generic_make_request().  The blk_mq code continues to take its own
reference per blk_mq request and retains the ability to freeze the
queue, but the check that the queue is frozen is moved to
generic_make_request().

This fixes crash signatures like the following:

 BUG: unable to handle kernel paging request at ffff880140000000
 [..]
 Call Trace:
  [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70
  [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
  [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem]
  [<ffffffff811c3162>] ? mempool_alloc+0x72/0x1a0
  [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110
  [<ffffffff8141f966>] submit_bio+0x76/0x170
  [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160
  [<ffffffff81286e62>] submit_bh+0x12/0x20
  [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170
  [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90
  [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270
  [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30
  [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250
  [<ffffffff81303494>] ext4_put_super+0x64/0x360
  [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0

Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-core.c       | 71 +++++++++++++++++++++++++++++++++++++------
 block/blk-mq-sysfs.c   |  6 ----
 block/blk-mq.c         | 80 ++++++++++++++----------------------------------
 block/blk-sysfs.c      |  3 +-
 block/blk.h            | 14 ++++++++
 include/linux/blk-mq.h |  1 -
 include/linux/blkdev.h |  2 +
 7 files changed, 102 insertions(+), 75 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d48773..6062550baaef 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
 	 * Drain all requests queued before DYING marking. Set DEAD flag to
 	 * prevent that q->request_fn() gets invoked after draining finished.
 	 */
-	if (q->mq_ops) {
-		blk_mq_freeze_queue(q);
-		spin_lock_irq(lock);
-	} else {
-		spin_lock_irq(lock);
+	blk_freeze_queue(q);
+	spin_lock_irq(lock);
+	if (!q->mq_ops)
 		__blk_drain_queue(q, true);
-	}
 	queue_flag_set(QUEUE_FLAG_DEAD, q);
 	spin_unlock_irq(lock);

 	/* @q won't process any more request, flush async actions */
 	del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer);
 	blk_sync_queue(q);
+	percpu_ref_exit(&q->q_usage_counter);

 	if (q->mq_ops)
 		blk_mq_free_queue(q);
@@ -629,6 +627,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);

+int blk_queue_enter(struct request_queue *q, gfp_t gfp)
+{
+	while (true) {
+		int ret;
+
+		if (percpu_ref_tryget_live(&q->q_usage_counter))
+			return 0;
+
+		if (!(gfp & __GFP_WAIT))
+			return -EBUSY;
+
+		ret = wait_event_interruptible(q->mq_freeze_wq,
+				!atomic_read(&q->mq_freeze_depth) ||
+				blk_queue_dying(q));
+		if (blk_queue_dying(q))
+			return -ENODEV;
+		if (ret)
+			return ret;
+	}
+}
+
+void blk_queue_exit(struct request_queue *q)
+{
+	percpu_ref_put(&q->q_usage_counter);
+}
+
+static void blk_queue_usage_counter_release(struct percpu_ref *ref)
+{
+	struct request_queue *q =
+		container_of(ref, struct request_queue, q_usage_counter);
+
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 {
 	struct request_queue *q;
@@ -690,11 +722,22 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

 	init_waitqueue_head(&q->mq_freeze_wq);

-	if (blkcg_init_queue(q))
+	/*
+	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
+	 * See blk_register_queue() for details.
+	 */
+	if (percpu_ref_init(&q->q_usage_counter,
+				blk_queue_usage_counter_release,
+				PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
 		goto fail_bdi;

+	if (blkcg_init_queue(q))
+		goto fail_ref;
+
 	return q;

+fail_ref:
+	percpu_ref_exit(&q->q_usage_counter);
 fail_bdi:
 	bdi_destroy(&q->backing_dev_info);
 fail_split:
@@ -1966,9 +2009,19 @@ void generic_make_request(struct bio *bio)
 	do {
 		struct request_queue *q = bdev_get_queue(bio->bi_bdev);

-		q->make_request_fn(q, bio);
+		if (likely(blk_queue_enter(q, __GFP_WAIT) == 0)) {
+
+			q->make_request_fn(q, bio);
+
+			blk_queue_exit(q);

-		bio = bio_list_pop(current->bio_list);
+			bio = bio_list_pop(current->bio_list);
+		} else {
+			struct bio *bio_next = bio_list_pop(current->bio_list);
+
+			bio_io_error(bio);
+			bio = bio_next;
+		}
 	} while (bio);
 	current->bio_list = NULL; /* deactivate */
 }
diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
index 279c5d674edf..731b6eecce82 100644
--- a/block/blk-mq-sysfs.c
+++ b/block/blk-mq-sysfs.c
@@ -412,12 +412,6 @@ static void blk_mq_sysfs_init(struct request_queue *q)
 		kobject_init(&ctx->kobj, &blk_mq_ctx_ktype);
 }

-/* see blk_register_queue() */
-void blk_mq_finish_init(struct request_queue *q)
-{
-	percpu_ref_switch_to_percpu(&q->mq_usage_counter);
-}
-
 int blk_mq_register_disk(struct gendisk *disk)
 {
 	struct device *dev = disk_to_dev(disk);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4047a0..6d91894cf85e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -77,47 +77,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
 	clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word);
 }

-static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
-{
-	while (true) {
-		int ret;
-
-		if (percpu_ref_tryget_live(&q->mq_usage_counter))
-			return 0;
-
-		if (!(gfp & __GFP_WAIT))
-			return -EBUSY;
-
-		ret = wait_event_interruptible(q->mq_freeze_wq,
-				!atomic_read(&q->mq_freeze_depth) ||
-				blk_queue_dying(q));
-		if (blk_queue_dying(q))
-			return -ENODEV;
-		if (ret)
-			return ret;
-	}
-}
-
-static void blk_mq_queue_exit(struct request_queue *q)
-{
-	percpu_ref_put(&q->mq_usage_counter);
-}
-
-static void blk_mq_usage_counter_release(struct percpu_ref *ref)
-{
-	struct request_queue *q =
-		container_of(ref, struct request_queue, mq_usage_counter);
-
-	wake_up_all(&q->mq_freeze_wq);
-}
-
 void blk_mq_freeze_queue_start(struct request_queue *q)
 {
 	int freeze_depth;

 	freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
 	if (freeze_depth == 1) {
-		percpu_ref_kill(&q->mq_usage_counter);
+		percpu_ref_kill(&q->q_usage_counter);
 		blk_mq_run_hw_queues(q, false);
 	}
 }
@@ -125,18 +91,34 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start);

 static void blk_mq_freeze_queue_wait(struct request_queue *q)
 {
-	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
+	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
 }

 /*
  * Guarantee no request is in use, so we can change any data structure of
  * the queue afterward.
  */
-void blk_mq_freeze_queue(struct request_queue *q)
+void blk_freeze_queue(struct request_queue *q)
 {
+	/*
+	 * In the !blk_mq case we are only calling this to kill the
+	 * q_usage_counter, otherwise this increases the freeze depth
+	 * and waits for it to return to zero.  For this reason there is
+	 * no blk_unfreeze_queue(), and blk_freeze_queue() is not
+	 * exported to drivers as the only user for unfreeze is blk_mq.
+	 */
 	blk_mq_freeze_queue_start(q);
 	blk_mq_freeze_queue_wait(q);
 }
+
+void blk_mq_freeze_queue(struct request_queue *q)
+{
+	/*
+	 * ...just an alias to keep freeze and unfreeze actions balanced
+	 * in the blk_mq_* namespace
+	 */
+	blk_freeze_queue(q);
+}
 EXPORT_SYMBOL_GPL(blk_mq_freeze_queue);

 void blk_mq_unfreeze_queue(struct request_queue *q)
@@ -146,7 +128,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q)
 	freeze_depth = atomic_dec_return(&q->mq_freeze_depth);
 	WARN_ON_ONCE(freeze_depth < 0);
 	if (!freeze_depth) {
-		percpu_ref_reinit(&q->mq_usage_counter);
+		percpu_ref_reinit(&q->q_usage_counter);
 		wake_up_all(&q->mq_freeze_wq);
 	}
 }
@@ -255,7 +237,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	struct blk_mq_alloc_data alloc_data;
 	int ret;

-	ret = blk_mq_queue_enter(q, gfp);
+	ret = blk_queue_enter(q, gfp);
 	if (ret)
 		return ERR_PTR(ret);

@@ -278,7 +260,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 	}
 	blk_mq_put_ctx(ctx);
 	if (!rq) {
-		blk_mq_queue_exit(q);
+		blk_queue_exit(q);
 		return ERR_PTR(-EWOULDBLOCK);
 	}
 	return rq;
@@ -297,7 +279,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx,

 	clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags);
 	blk_mq_put_tag(hctx, tag, &ctx->last_tag);
-	blk_mq_queue_exit(q);
+	blk_queue_exit(q);
 }

 void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq)
@@ -1184,11 +1166,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 	int rw = bio_data_dir(bio);
 	struct blk_mq_alloc_data alloc_data;

-	if (unlikely(blk_mq_queue_enter(q, GFP_KERNEL))) {
-		bio_io_error(bio);
-		return NULL;
-	}
-
+	blk_queue_enter_live(q);
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);

@@ -1979,14 +1957,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 		hctxs[i]->queue_num = i;
 	}

-	/*
-	 * Init percpu_ref in atomic mode so that it's faster to shutdown.
-	 * See blk_register_queue() for details.
-	 */
-	if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release,
-			    PERCPU_REF_INIT_ATOMIC, GFP_KERNEL))
-		goto err_hctxs;
-
 	setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
 	blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ);

@@ -2062,8 +2032,6 @@ void blk_mq_free_queue(struct request_queue *q)
 	blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
 	blk_mq_free_hw_queues(q, set);

-	percpu_ref_exit(&q->mq_usage_counter);
-
 	kfree(q->mq_map);

 	q->mq_map = NULL;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 3e44a9da2a13..61fc2633bbea 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -599,9 +599,8 @@ int blk_register_queue(struct gendisk *disk)
 	 */
 	if (!blk_queue_init_done(q)) {
 		queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+		percpu_ref_switch_to_percpu(&q->q_usage_counter);
 		blk_queue_bypass_end(q);
-		if (q->mq_ops)
-			blk_mq_finish_init(q);
 	}

 	ret = blk_trace_init_sysfs(dev);
diff --git a/block/blk.h b/block/blk.h
index 98614ad37c81..5b2cd393afbe 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -72,6 +72,20 @@ void blk_dequeue_request(struct request *rq);
 void __blk_queue_free_tags(struct request_queue *q);
 bool __blk_end_bidi_request(struct request *rq, int error,
 			    unsigned int nr_bytes, unsigned int bidi_bytes);
+int blk_queue_enter(struct request_queue *q, gfp_t gfp);
+void blk_queue_exit(struct request_queue *q);
+void blk_freeze_queue(struct request_queue *q);
+
+static inline void blk_queue_enter_live(struct request_queue *q)
+{
+	/*
+	 * Given that running in generic_make_request() context
+	 * guarantees that a live reference against q_usage_counter has
+	 * been established, further references under that same context
+	 * need not check that the queue has been frozen (marked dead).
+	 */
+	percpu_ref_get(&q->q_usage_counter);
+}

 void blk_rq_timed_out_timer(unsigned long data);
 unsigned long blk_rq_timeout(unsigned long timeout);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 37d1602c4f7a..25ce763fbb81 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -167,7 +167,6 @@ enum {
 struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *);
 struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
 						  struct request_queue *q);
-void blk_mq_finish_init(struct request_queue *q);
 int blk_mq_register_disk(struct gendisk *);
 void blk_mq_unregister_disk(struct gendisk *);

diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99da9ebc7377..0a0def66c61e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -450,7 +450,7 @@ struct request_queue {
 #endif
 	struct rcu_head rcu_head;
 	wait_queue_head_t mq_freeze_wq;
-	struct percpu_ref mq_usage_counter;
+	struct percpu_ref q_usage_counter;
 	struct list_head all_q_node;

 	struct blk_mq_tag_set *tag_set;
* Re: [PATCH 1/2] block: generic request_queue reference counting
From: Christoph Hellwig @ 2015-10-04  6:40 UTC
To: Dan Williams
Cc: axboe, Keith Busch, ross.zwisler, linux-nvdimm, Christoph Hellwig, linux-kernel

On Tue, Sep 29, 2015 at 08:41:31PM -0400, Dan Williams wrote:
> Allow pmem, and other synchronous/bio-based block drivers, to fall back
> on a per-cpu reference count managed by the core for tracking queue
> live/dead state.
[...]
> @@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
>  	 * Drain all requests queued before DYING marking. Set DEAD flag to
>  	 * prevent that q->request_fn() gets invoked after draining finished.
>  	 */
> -	if (q->mq_ops) {
> -		blk_mq_freeze_queue(q);
> -		spin_lock_irq(lock);
> -	} else {
> -		spin_lock_irq(lock);
> +	blk_freeze_queue(q);
> +	spin_lock_irq(lock);
> +	if (!q->mq_ops)
>  		__blk_drain_queue(q, true);
> -	}

__blk_drain_queue really ought to be moved into blk_freeze_queue so it
has equivalent functionality for mq vs !mq.  But maybe that can be a
separate patch.
* Re: [PATCH 1/2] block: generic request_queue reference counting
From: Dan Williams @ 2015-10-05 23:44 UTC
To: Christoph Hellwig
Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org, linux-kernel

On Sat, Oct 3, 2015 at 11:40 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Tue, Sep 29, 2015 at 08:41:31PM -0400, Dan Williams wrote:
[...]
>> @@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q)
>>  	 * Drain all requests queued before DYING marking. Set DEAD flag to
>>  	 * prevent that q->request_fn() gets invoked after draining finished.
>>  	 */
>> -	if (q->mq_ops) {
>> -		blk_mq_freeze_queue(q);
>> -		spin_lock_irq(lock);
>> -	} else {
>> -		spin_lock_irq(lock);
>> +	blk_freeze_queue(q);
>> +	spin_lock_irq(lock);
>> +	if (!q->mq_ops)
>>  		__blk_drain_queue(q, true);
>> -	}
>
> __blk_drain_queue really ought to be moved into blk_freeze_queue so it
> has equivalent functionality for mq vs !mq.  But maybe that can be a
> separate patch.

It's not clear why the cleanup path is gating __blk_drain_queue() on
->mq_ops.  __blk_drain_queue() already gets called from
blkcg_init_queue().  Also, it seems blk_get_flush_queue() is already
handling the mq_ops vs !mq_ops case, but I'd have to take a closer
look.  Definitely seems like follow-on patch material...
* Re: [PATCH 1/2] block: generic request_queue reference counting 2015-09-30 0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams 2015-10-04 6:40 ` Christoph Hellwig @ 2015-10-04 7:52 ` Ming Lei 2015-10-05 23:23 ` Dan Williams 1 sibling, 1 reply; 13+ messages in thread From: Ming Lei @ 2015-10-04 7:52 UTC (permalink / raw) To: Dan Williams Cc: Jens Axboe, Keith Busch, ross.zwisler, linux-nvdimm, Christoph Hellwig, Linux Kernel Mailing List On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote: > Allow pmem, and other synchronous/bio-based block drivers, to fallback Just a bit curious, why not extend it for all(both synchronous and asynchrounous) bio-based drivers? As you mentioned in introductory message, all bio based drivers may have this kind of problem. One idea I thought of is to hold the usage counter in bio life time, instead of request's life time like in blk-mq. > on a per-cpu reference count managed by the core for tracking queue > live/dead state. > > The existing per-cpu reference count for the blk_mq case is promoted to > be used in all block i/o scenarios. This involves initializing it by > default, waiting for it to drop to zero at exit, and holding a live > reference over the invocation of q->make_request_fn() in It isn't enough for asynchrounous bio drivers. > generic_make_request(). The blk_mq code continues to take its own > reference per blk_mq request and retains the ability to freeze the > queue, but the check that the queue is frozen is moved to > generic_make_request(). > > This fixes crash signatures like the following: > > BUG: unable to handle kernel paging request at ffff880140000000 > [..] > Call Trace: > [<ffffffff8145e8bf>] ? copy_user_handle_tail+0x5f/0x70 > [<ffffffffa004e1e0>] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem] > [<ffffffffa004e331>] pmem_make_request+0xd1/0x200 [nd_pmem] > [<ffffffff811c3162>] ? 
mempool_alloc+0x72/0x1a0 > [<ffffffff8141f8b6>] generic_make_request+0xd6/0x110 > [<ffffffff8141f966>] submit_bio+0x76/0x170 > [<ffffffff81286dff>] submit_bh_wbc+0x12f/0x160 > [<ffffffff81286e62>] submit_bh+0x12/0x20 > [<ffffffff813395bd>] jbd2_write_superblock+0x8d/0x170 > [<ffffffff8133974d>] jbd2_mark_journal_empty+0x5d/0x90 > [<ffffffff813399cb>] jbd2_journal_destroy+0x24b/0x270 > [<ffffffff810bc4ca>] ? put_pwq_unlocked+0x2a/0x30 > [<ffffffff810bc6f5>] ? destroy_workqueue+0x225/0x250 > [<ffffffff81303494>] ext4_put_super+0x64/0x360 > [<ffffffff8124ab1a>] generic_shutdown_super+0x6a/0xf0 > > Cc: Jens Axboe <axboe@kernel.dk> > Cc: Keith Busch <keith.busch@intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Suggested-by: Christoph Hellwig <hch@lst.de> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > block/blk-core.c | 71 +++++++++++++++++++++++++++++++++++++------ > block/blk-mq-sysfs.c | 6 ---- > block/blk-mq.c | 80 ++++++++++++++---------------------------------- > block/blk-sysfs.c | 3 +- > block/blk.h | 14 ++++++++ > include/linux/blk-mq.h | 1 - > include/linux/blkdev.h | 2 + > 7 files changed, 102 insertions(+), 75 deletions(-) > > diff --git a/block/blk-core.c b/block/blk-core.c > index 2eb722d48773..6062550baaef 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -554,19 +554,17 @@ void blk_cleanup_queue(struct request_queue *q) > * Drain all requests queued before DYING marking. Set DEAD flag to > * prevent that q->request_fn() gets invoked after draining finished. 
> */ > - if (q->mq_ops) { > - blk_mq_freeze_queue(q); > - spin_lock_irq(lock); > - } else { > - spin_lock_irq(lock); > + blk_freeze_queue(q); > + spin_lock_irq(lock); > + if (!q->mq_ops) > __blk_drain_queue(q, true); > - } > queue_flag_set(QUEUE_FLAG_DEAD, q); > spin_unlock_irq(lock); > > /* @q won't process any more request, flush async actions */ > del_timer_sync(&q->backing_dev_info.laptop_mode_wb_timer); > blk_sync_queue(q); > + percpu_ref_exit(&q->q_usage_counter); > > if (q->mq_ops) > blk_mq_free_queue(q); > @@ -629,6 +627,40 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask) > } > EXPORT_SYMBOL(blk_alloc_queue); > > +int blk_queue_enter(struct request_queue *q, gfp_t gfp) > +{ > + while (true) { > + int ret; > + > + if (percpu_ref_tryget_live(&q->q_usage_counter)) > + return 0; > + > + if (!(gfp & __GFP_WAIT)) > + return -EBUSY; > + > + ret = wait_event_interruptible(q->mq_freeze_wq, > + !atomic_read(&q->mq_freeze_depth) || > + blk_queue_dying(q)); > + if (blk_queue_dying(q)) > + return -ENODEV; > + if (ret) > + return ret; > + } > +} > + > +void blk_queue_exit(struct request_queue *q) > +{ > + percpu_ref_put(&q->q_usage_counter); > +} > + > +static void blk_queue_usage_counter_release(struct percpu_ref *ref) > +{ > + struct request_queue *q = > + container_of(ref, struct request_queue, q_usage_counter); > + > + wake_up_all(&q->mq_freeze_wq); > +} > + > struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) > { > struct request_queue *q; > @@ -690,11 +722,22 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) > > init_waitqueue_head(&q->mq_freeze_wq); > > - if (blkcg_init_queue(q)) > + /* > + * Init percpu_ref in atomic mode so that it's faster to shutdown. > + * See blk_register_queue() for details. 
> + */ > + if (percpu_ref_init(&q->q_usage_counter, > + blk_queue_usage_counter_release, > + PERCPU_REF_INIT_ATOMIC, GFP_KERNEL)) > goto fail_bdi; > > + if (blkcg_init_queue(q)) > + goto fail_ref; > + > return q; > > +fail_ref: > + percpu_ref_exit(&q->q_usage_counter); > fail_bdi: > bdi_destroy(&q->backing_dev_info); > fail_split: > @@ -1966,9 +2009,19 @@ void generic_make_request(struct bio *bio) > do { > struct request_queue *q = bdev_get_queue(bio->bi_bdev); > > - q->make_request_fn(q, bio); > + if (likely(blk_queue_enter(q, __GFP_WAIT) == 0)) { > + > + q->make_request_fn(q, bio); > + > + blk_queue_exit(q); > > - bio = bio_list_pop(current->bio_list); > + bio = bio_list_pop(current->bio_list); > + } else { > + struct bio *bio_next = bio_list_pop(current->bio_list); > + > + bio_io_error(bio); > + bio = bio_next; > + } > } while (bio); > current->bio_list = NULL; /* deactivate */ > } > diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c > index 279c5d674edf..731b6eecce82 100644 > --- a/block/blk-mq-sysfs.c > +++ b/block/blk-mq-sysfs.c > @@ -412,12 +412,6 @@ static void blk_mq_sysfs_init(struct request_queue *q) > kobject_init(&ctx->kobj, &blk_mq_ctx_ktype); > } > > -/* see blk_register_queue() */ > -void blk_mq_finish_init(struct request_queue *q) > -{ > - percpu_ref_switch_to_percpu(&q->mq_usage_counter); > -} > - > int blk_mq_register_disk(struct gendisk *disk) > { > struct device *dev = disk_to_dev(disk); > diff --git a/block/blk-mq.c b/block/blk-mq.c > index f2d67b4047a0..6d91894cf85e 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -77,47 +77,13 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx, > clear_bit(CTX_TO_BIT(hctx, ctx), &bm->word); > } > > -static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp) > -{ > - while (true) { > - int ret; > - > - if (percpu_ref_tryget_live(&q->mq_usage_counter)) > - return 0; > - > - if (!(gfp & __GFP_WAIT)) > - return -EBUSY; > - > - ret = wait_event_interruptible(q->mq_freeze_wq, 
> - !atomic_read(&q->mq_freeze_depth) || > - blk_queue_dying(q)); > - if (blk_queue_dying(q)) > - return -ENODEV; > - if (ret) > - return ret; > - } > -} > - > -static void blk_mq_queue_exit(struct request_queue *q) > -{ > - percpu_ref_put(&q->mq_usage_counter); > -} > - > -static void blk_mq_usage_counter_release(struct percpu_ref *ref) > -{ > - struct request_queue *q = > - container_of(ref, struct request_queue, mq_usage_counter); > - > - wake_up_all(&q->mq_freeze_wq); > -} > - > void blk_mq_freeze_queue_start(struct request_queue *q) > { > int freeze_depth; > > freeze_depth = atomic_inc_return(&q->mq_freeze_depth); > if (freeze_depth == 1) { > - percpu_ref_kill(&q->mq_usage_counter); > + percpu_ref_kill(&q->q_usage_counter); > blk_mq_run_hw_queues(q, false); > } > } > @@ -125,18 +91,34 @@ EXPORT_SYMBOL_GPL(blk_mq_freeze_queue_start); > > static void blk_mq_freeze_queue_wait(struct request_queue *q) > { > - wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter)); > + wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter)); > } > > /* > * Guarantee no request is in use, so we can change any data structure of > * the queue afterward. > */ > -void blk_mq_freeze_queue(struct request_queue *q) > +void blk_freeze_queue(struct request_queue *q) > { > + /* > + * In the !blk_mq case we are only calling this to kill the > + * q_usage_counter, otherwise this increases the freeze depth > + * and waits for it to return to zero. For this reason there is > + * no blk_unfreeze_queue(), and blk_freeze_queue() is not > + * exported to drivers as the only user for unfreeze is blk_mq. 
> + */ > blk_mq_freeze_queue_start(q); > blk_mq_freeze_queue_wait(q); > } > + > +void blk_mq_freeze_queue(struct request_queue *q) > +{ > + /* > + * ...just an alias to keep freeze and unfreeze actions balanced > + * in the blk_mq_* namespace > + */ > + blk_freeze_queue(q); > +} > EXPORT_SYMBOL_GPL(blk_mq_freeze_queue); > > void blk_mq_unfreeze_queue(struct request_queue *q) > @@ -146,7 +128,7 @@ void blk_mq_unfreeze_queue(struct request_queue *q) > freeze_depth = atomic_dec_return(&q->mq_freeze_depth); > WARN_ON_ONCE(freeze_depth < 0); > if (!freeze_depth) { > - percpu_ref_reinit(&q->mq_usage_counter); > + percpu_ref_reinit(&q->q_usage_counter); > wake_up_all(&q->mq_freeze_wq); > } > } > @@ -255,7 +237,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp, > struct blk_mq_alloc_data alloc_data; > int ret; > > - ret = blk_mq_queue_enter(q, gfp); > + ret = blk_queue_enter(q, gfp); > if (ret) > return ERR_PTR(ret); > > @@ -278,7 +260,7 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp, > } > blk_mq_put_ctx(ctx); > if (!rq) { > - blk_mq_queue_exit(q); > + blk_queue_exit(q); > return ERR_PTR(-EWOULDBLOCK); > } > return rq; > @@ -297,7 +279,7 @@ static void __blk_mq_free_request(struct blk_mq_hw_ctx *hctx, > > clear_bit(REQ_ATOM_STARTED, &rq->atomic_flags); > blk_mq_put_tag(hctx, tag, &ctx->last_tag); > - blk_mq_queue_exit(q); > + blk_queue_exit(q); > } > > void blk_mq_free_hctx_request(struct blk_mq_hw_ctx *hctx, struct request *rq) > @@ -1184,11 +1166,7 @@ static struct request *blk_mq_map_request(struct request_queue *q, > int rw = bio_data_dir(bio); > struct blk_mq_alloc_data alloc_data; > > - if (unlikely(blk_mq_queue_enter(q, GFP_KERNEL))) { > - bio_io_error(bio); > - return NULL; > - } > - > + blk_queue_enter_live(q); > ctx = blk_mq_get_ctx(q); > hctx = q->mq_ops->map_queue(q, ctx->cpu); > > @@ -1979,14 +1957,6 @@ struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, > 
hctxs[i]->queue_num = i; > } > > - /* > - * Init percpu_ref in atomic mode so that it's faster to shutdown. > - * See blk_register_queue() for details. > - */ > - if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release, > - PERCPU_REF_INIT_ATOMIC, GFP_KERNEL)) > - goto err_hctxs; > - > setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q); > blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ); > > @@ -2062,8 +2032,6 @@ void blk_mq_free_queue(struct request_queue *q) > blk_mq_exit_hw_queues(q, set, set->nr_hw_queues); > blk_mq_free_hw_queues(q, set); > > - percpu_ref_exit(&q->mq_usage_counter); > - > kfree(q->mq_map); > > q->mq_map = NULL; > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index 3e44a9da2a13..61fc2633bbea 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -599,9 +599,8 @@ int blk_register_queue(struct gendisk *disk) > */ > if (!blk_queue_init_done(q)) { > queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q); > + percpu_ref_switch_to_percpu(&q->q_usage_counter); > blk_queue_bypass_end(q); > - if (q->mq_ops) > - blk_mq_finish_init(q); > } > > ret = blk_trace_init_sysfs(dev); > diff --git a/block/blk.h b/block/blk.h > index 98614ad37c81..5b2cd393afbe 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -72,6 +72,20 @@ void blk_dequeue_request(struct request *rq); > void __blk_queue_free_tags(struct request_queue *q); > bool __blk_end_bidi_request(struct request *rq, int error, > unsigned int nr_bytes, unsigned int bidi_bytes); > +int blk_queue_enter(struct request_queue *q, gfp_t gfp); > +void blk_queue_exit(struct request_queue *q); > +void blk_freeze_queue(struct request_queue *q); > + > +static inline void blk_queue_enter_live(struct request_queue *q) > +{ > + /* > + * Given that running in generic_make_request() context > + * guarantees that a live reference against q_usage_counter has > + * been established, further references under that same context > + * need not check that the queue has been frozen 
(marked dead). > + */ > + percpu_ref_get(&q->q_usage_counter); > +} > > void blk_rq_timed_out_timer(unsigned long data); > unsigned long blk_rq_timeout(unsigned long timeout); > diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h > index 37d1602c4f7a..25ce763fbb81 100644 > --- a/include/linux/blk-mq.h > +++ b/include/linux/blk-mq.h > @@ -167,7 +167,6 @@ enum { > struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *); > struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set, > struct request_queue *q); > -void blk_mq_finish_init(struct request_queue *q); > int blk_mq_register_disk(struct gendisk *); > void blk_mq_unregister_disk(struct gendisk *); > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h > index 99da9ebc7377..0a0def66c61e 100644 > --- a/include/linux/blkdev.h > +++ b/include/linux/blkdev.h > @@ -450,7 +450,7 @@ struct request_queue { > #endif > struct rcu_head rcu_head; > wait_queue_head_t mq_freeze_wq; > - struct percpu_ref mq_usage_counter; > + struct percpu_ref q_usage_counter; > struct list_head all_q_node; > > struct blk_mq_tag_set *tag_set; > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Ming Lei ^ permalink raw reply [flat|nested] 13+ messages in thread
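[Editor's note: the queue reference counting introduced by this patch can be modeled outside the kernel. The sketch below uses plain C11 atomics in place of the kernel's percpu_ref, and invented names (`queue_model`, `queue_enter`, `queue_kill`); it only illustrates the blk_queue_enter()/blk_queue_exit() semantics the patch describes, not the actual implementation.]

```c
/* Userspace model of the patch's queue reference counting: plain C11
 * atomics stand in for q_usage_counter, and "dying" stands in for
 * percpu_ref_kill() having run. Illustrative only -- not kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <errno.h>

struct queue_model {
    atomic_long usage;   /* models q->q_usage_counter */
    atomic_bool dying;   /* models the killed/frozen state */
};

/* models blk_queue_enter(): refuse new entries once the queue is dying */
static int queue_enter(struct queue_model *q)
{
    if (atomic_load(&q->dying))
        return -ENODEV;
    atomic_fetch_add(&q->usage, 1);
    /* re-check: a kill may have raced with the increment */
    if (atomic_load(&q->dying)) {
        atomic_fetch_sub(&q->usage, 1);
        return -ENODEV;
    }
    return 0;
}

/* models blk_queue_exit() */
static void queue_exit(struct queue_model *q)
{
    atomic_fetch_sub(&q->usage, 1);
}

/* models blk_freeze_queue() for the !blk_mq case: stop new entries;
 * teardown would then wait for the usage count to reach zero */
static void queue_kill(struct queue_model *q)
{
    atomic_store(&q->dying, true);
}

static bool queue_idle(struct queue_model *q)
{
    return atomic_load(&q->usage) == 0;
}
```

In this model, as in the patch, a driver unbind first marks the queue dying so new I/O fails with -ENODEV, then waits until in-flight users have dropped their references before freeing anything.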
* Re: [PATCH 1/2] block: generic request_queue reference counting 2015-10-04 7:52 ` Ming Lei @ 2015-10-05 23:23 ` Dan Williams 2015-10-06 10:46 ` Ming Lei 0 siblings, 1 reply; 13+ messages in thread From: Dan Williams @ 2015-10-05 23:23 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org, Christoph Hellwig, Linux Kernel Mailing List On Sun, Oct 4, 2015 at 12:52 AM, Ming Lei <tom.leiming@gmail.com> wrote: > On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote: >> Allow pmem, and other synchronous/bio-based block drivers, to fallback > > Just a bit curious, why not extend it for all (both synchronous and > asynchronous) bio-based drivers? As you mentioned in the introductory > message, all bio-based drivers may have this kind of problem. > > One idea I thought of is to hold the usage counter for the bio's lifetime, > instead of the request's lifetime as in blk-mq. > >> on a per-cpu reference count managed by the core for tracking queue >> live/dead state. >> >> The existing per-cpu reference count for the blk_mq case is promoted to >> be used in all block i/o scenarios. This involves initializing it by >> default, waiting for it to drop to zero at exit, and holding a live >> reference over the invocation of q->make_request_fn() in > > It isn't enough for asynchronous bio drivers. True, but I think that support is a follow-on extension of the mechanism. It seems to me not as straightforward as holding a per-request reference or a reference over the submission path. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 1/2] block: generic request_queue reference counting 2015-10-05 23:23 ` Dan Williams @ 2015-10-06 10:46 ` Ming Lei 2015-10-06 16:04 ` Dan Williams 0 siblings, 1 reply; 13+ messages in thread From: Ming Lei @ 2015-10-06 10:46 UTC (permalink / raw) To: Dan Williams Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org, Christoph Hellwig, Linux Kernel Mailing List On Tue, Oct 6, 2015 at 7:23 AM, Dan Williams <dan.j.williams@intel.com> wrote: > On Sun, Oct 4, 2015 at 12:52 AM, Ming Lei <tom.leiming@gmail.com> wrote: >> On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote: >>> Allow pmem, and other synchronous/bio-based block drivers, to fallback >> >> Just a bit curious, why not extend it for all (both synchronous and >> asynchronous) bio-based drivers? As you mentioned in the introductory >> message, all bio-based drivers may have this kind of problem. >> >> One idea I thought of is to hold the usage counter for the bio's lifetime, >> instead of the request's lifetime as in blk-mq. >> >>> on a per-cpu reference count managed by the core for tracking queue >>> live/dead state. >>> >>> The existing per-cpu reference count for the blk_mq case is promoted to >>> be used in all block i/o scenarios. This involves initializing it by >>> default, waiting for it to drop to zero at exit, and holding a live >>> reference over the invocation of q->make_request_fn() in >> >> It isn't enough for asynchronous bio drivers. > > True, but I think that support is a follow-on extension of the > mechanism. It seems to me not as straightforward as holding a > per-request reference or a reference over the submission path. Given that it is a general problem for all bio-based drivers, it seems better to figure out one general solution instead of a partial fix only for synchronous bio-based drivers. 
IMO, it should work by holding the usage counter for the bio's lifetime: for example, getting it before calling q->make_request_fn(q, bio) and putting it after the bio's completion (in bio_endio()). Even though there would still be a small window between all bios completing and all requests completing, that should be easy to address for both !mq_ops and mq_ops. thanks, Ming Lei ^ permalink raw reply [flat|nested] 13+ messages in thread
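[Editor's note: the distinction being debated here — a reference held only across submission versus one held for the whole bio lifetime — can be made concrete with a toy model. All names below (`bio_model`, `submit`, `endio`) are invented for illustration; this is not kernel code.]

```c
/* Toy model of the two reference scopes under discussion. A
 * synchronous driver completes the bio inside ->make_request_fn(),
 * so a reference held only across submission covers the whole I/O;
 * an asynchronous driver completes the bio later, so only a
 * reference dropped in bio_endio() keeps the queue pinned until
 * completion. Illustrative only. */
#include <assert.h>

struct bio_model { int queue_ref_held; int completed; };

static int queue_refs;  /* models the queue usage counter */

/* models taking a queue reference before ->make_request_fn() */
static void submit(struct bio_model *bio)
{
    queue_refs++;
    bio->queue_ref_held = 1;
}

/* submission-scope put: models the patch as posted, where the
 * reference is dropped when generic_make_request() returns */
static void submission_done(struct bio_model *bio)
{
    queue_refs--;
    bio->queue_ref_held = 0;
}

/* bio-scope put: models Ming Lei's proposal, where the reference
 * is dropped at bio completion time instead */
static void endio(struct bio_model *bio)
{
    bio->completed = 1;
    if (bio->queue_ref_held) {
        queue_refs--;
        bio->queue_ref_held = 0;
    }
}
```

The gap Ming Lei points at is visible in the model: with a submission-scope put, an asynchronous driver can be left with a completed submission but an uncompleted bio while the queue counter already reads zero, so teardown could proceed under in-flight I/O.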
* Re: [PATCH 1/2] block: generic request_queue reference counting 2015-10-06 10:46 ` Ming Lei @ 2015-10-06 16:04 ` Dan Williams 0 siblings, 0 replies; 13+ messages in thread From: Dan Williams @ 2015-10-06 16:04 UTC (permalink / raw) To: Ming Lei Cc: Jens Axboe, Keith Busch, Ross Zwisler, linux-nvdimm@lists.01.org, Christoph Hellwig, Linux Kernel Mailing List On Tue, Oct 6, 2015 at 3:46 AM, Ming Lei <tom.leiming@gmail.com> wrote: > On Tue, Oct 6, 2015 at 7:23 AM, Dan Williams <dan.j.williams@intel.com> wrote: >> On Sun, Oct 4, 2015 at 12:52 AM, Ming Lei <tom.leiming@gmail.com> wrote: >>> On Wed, Sep 30, 2015 at 8:41 AM, Dan Williams <dan.j.williams@intel.com> wrote: >>>> Allow pmem, and other synchronous/bio-based block drivers, to fallback >>> >>> Just a bit curious, why not extend it for all(both synchronous and >>> asynchrounous) bio-based drivers? As you mentioned in introductory >>> message, all bio based drivers may have this kind of problem. >>> >>> One idea I thought of is to hold the usage counter in bio life time, >>> instead of request's life time like in blk-mq. >>> >>>> on a per-cpu reference count managed by the core for tracking queue >>>> live/dead state. >>>> >>>> The existing per-cpu reference count for the blk_mq case is promoted to >>>> be used in all block i/o scenarios. This involves initializing it by >>>> default, waiting for it to drop to zero at exit, and holding a live >>>> reference over the invocation of q->make_request_fn() in >>> >>> It isn't enough for asynchrounous bio drivers. >> >> True, but I think that support is a follow on extension of the >> mechanism. It seems to me not as straightforward as holding a >> per-request reference or a reference over the submission path. > > Given it is a general problem for all bio based drivers, it seems > better to figure out one general solution instead of the partial fix > only for synchrounous bio based drivers. 
> > IMO, it should work by holding the usage counter in BIO's lifetime, > for example, getting it before calling q->make_request_fn(q, bio), and > putting it after bio's completion(in bio_endio()). Even though there is > still a small window between all bio's completion and all request's > completion, and it should have been addressed easily for both > !mq_ops and mq_ops. I'm not convinced it is that simple, i.e. that all bio_endio() invocations are guaranteed to have a 1:1 matching invocation through generic_make_request(). If you want to help with that audit I'd be appreciative and open to putting together a follow-on patch. In the meantime this starting point already makes things better for pmem, nd_blk, brd, etc... ^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings 2015-09-30 0:41 [PATCH 0/2] block drivers + dax vs driver unbind Dan Williams 2015-09-30 0:41 ` [PATCH 1/2] block: generic request_queue reference counting Dan Williams @ 2015-09-30 0:41 ` Dan Williams 2015-09-30 23:35 ` Dave Chinner 1 sibling, 1 reply; 13+ messages in thread From: Dan Williams @ 2015-09-30 0:41 UTC (permalink / raw) To: axboe Cc: Boaz Harrosh, linux-nvdimm, Dave Chinner, linux-kernel, ross.zwisler, Christoph Hellwig The DAX implementation needs to protect new calls to ->direct_access() and usage of its return value against unbind of the underlying block device. Use blk_queue_enter()/blk_queue_exit() to either prevent blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the request_queue is being torn down. Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Boaz Harrosh <boaz@plexistor.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- block/blk.h | 2 - fs/dax.c | 130 +++++++++++++++++++++++++++++++----------------- include/linux/blkdev.h | 2 + 3 files changed, 85 insertions(+), 49 deletions(-) diff --git a/block/blk.h b/block/blk.h index 5b2cd393afbe..0f8de0dda768 100644 --- a/block/blk.h +++ b/block/blk.h @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq); void __blk_queue_free_tags(struct request_queue *q); bool __blk_end_bidi_request(struct request *rq, int error, unsigned int nr_bytes, unsigned int bidi_bytes); -int blk_queue_enter(struct request_queue *q, gfp_t gfp); -void blk_queue_exit(struct request_queue *q); void blk_freeze_queue(struct request_queue *q); static inline void blk_queue_enter_live(struct request_queue *q) diff --git a/fs/dax.c b/fs/dax.c index bcfb14bfc1e4..7ce002bb60d0 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size) } 
EXPORT_SYMBOL_GPL(dax_clear_blocks); -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr, - unsigned blkbits) +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits, + unsigned long *pfn, long *len) { - unsigned long pfn; + long rc; + void __pmem *addr; + struct block_device *bdev = bh->b_bdev; + struct request_queue *q = bdev->bd_queue; sector_t sector = bh->b_blocknr << (blkbits - 9); - return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size); + + if (blk_queue_enter(q, GFP_NOWAIT) != 0) + return (void __pmem *) ERR_PTR(-EIO); + rc = bdev_direct_access(bdev, sector, &addr, pfn, bh->b_size); + if (len) + *len = rc; + if (rc < 0) { + blk_queue_exit(q); + return (void __pmem *) ERR_PTR(rc); + } + return addr; +} + +static void __pmem *dax_map_bh(const struct buffer_head *bh, unsigned blkbits) +{ + unsigned long pfn; + + return __dax_map_bh(bh, blkbits, &pfn, NULL); +} + +static void dax_unmap_bh(const struct buffer_head *bh, void __pmem *addr) +{ + struct block_device *bdev = bh->b_bdev; + struct request_queue *q = bdev->bd_queue; + + if (IS_ERR(addr)) + return; + blk_queue_exit(q); } /* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */ @@ -104,15 +134,16 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, loff_t start, loff_t end, get_block_t get_block, struct buffer_head *bh) { - ssize_t retval = 0; - loff_t pos = start; - loff_t max = start; - loff_t bh_max = start; - void __pmem *addr; + loff_t pos = start, max = start, bh_max = start; + int rw = iov_iter_rw(iter), rc; + long map_len = 0; + unsigned long pfn; + void __pmem *addr = NULL; + void __pmem *kmap = (void __pmem *) ERR_PTR(-EIO); bool hole = false; bool need_wmb = false; - if (iov_iter_rw(iter) != WRITE) + if (rw == READ) end = min(end, i_size_read(inode)); while (pos < end) { @@ -127,9 +158,8 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, if (pos == bh_max) { bh->b_size = PAGE_ALIGN(end - 
pos); bh->b_state = 0; - retval = get_block(inode, block, bh, - iov_iter_rw(iter) == WRITE); - if (retval) + rc = get_block(inode, block, bh, rw == WRITE); + if (rc) break; if (!buffer_size_valid(bh)) bh->b_size = 1 << blkbits; @@ -141,21 +171,25 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, bh->b_size -= done; } - hole = iov_iter_rw(iter) != WRITE && !buffer_written(bh); + hole = rw == READ && !buffer_written(bh); if (hole) { addr = NULL; size = bh->b_size - first; } else { - retval = dax_get_addr(bh, &addr, blkbits); - if (retval < 0) + dax_unmap_bh(bh, kmap); + kmap = __dax_map_bh(bh, blkbits, &pfn, &map_len); + if (IS_ERR(kmap)) { + rc = PTR_ERR(kmap); break; + } + addr = kmap; if (buffer_unwritten(bh) || buffer_new(bh)) { - dax_new_buf(addr, retval, first, pos, - end); + dax_new_buf(addr, map_len, first, pos, + end); need_wmb = true; } addr += first; - size = retval - first; + size = map_len - first; } max = min(pos + size, end); } @@ -178,8 +212,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter, if (need_wmb) wmb_pmem(); + dax_unmap_bh(bh, kmap); - return (pos == start) ? retval : pos - start; + return (pos == start) ? 
rc : pos - start; } /** @@ -274,21 +309,22 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh, void __pmem *vfrom; void *vto; - if (dax_get_addr(bh, &vfrom, blkbits) < 0) - return -EIO; + vfrom = dax_map_bh(bh, blkbits); + if (IS_ERR(vfrom)) + return PTR_ERR(vfrom); vto = kmap_atomic(to); copy_user_page(vto, (void __force *)vfrom, vaddr, to); kunmap_atomic(vto); + dax_unmap_bh(bh, vfrom); return 0; } static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, struct vm_area_struct *vma, struct vm_fault *vmf) { - sector_t sector = bh->b_blocknr << (inode->i_blkbits - 9); unsigned long vaddr = (unsigned long)vmf->virtual_address; - void __pmem *addr; unsigned long pfn; + void __pmem *addr; pgoff_t size; int error; @@ -305,11 +341,9 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, goto out; } - error = bdev_direct_access(bh->b_bdev, sector, &addr, &pfn, bh->b_size); - if (error < 0) - goto out; - if (error < PAGE_SIZE) { - error = -EIO; + addr = __dax_map_bh(bh, inode->i_blkbits, &pfn, NULL); + if (IS_ERR(addr)) { + error = PTR_ERR(addr); goto out; } @@ -317,6 +351,7 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh, clear_pmem(addr, PAGE_SIZE); wmb_pmem(); } + dax_unmap_bh(bh, addr); error = vm_insert_mixed(vma, vaddr, pfn); @@ -528,11 +563,8 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, unsigned blkbits = inode->i_blkbits; unsigned long pmd_addr = address & PMD_MASK; bool write = flags & FAULT_FLAG_WRITE; - long length; - void __pmem *kaddr; pgoff_t size, pgoff; - sector_t block, sector; - unsigned long pfn; + sector_t block; int result = 0; /* Fall back to PTEs if we're going to COW */ @@ -557,8 +589,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, bh.b_size = PMD_SIZE; i_mmap_lock_write(mapping); - length = get_block(inode, block, &bh, write); - if (length) + if (get_block(inode, block, &bh, write) != 0) return 
VM_FAULT_SIGBUS; /* @@ -569,17 +600,17 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) goto fallback; - sector = bh.b_blocknr << (blkbits - 9); - if (buffer_unwritten(&bh) || buffer_new(&bh)) { int i; + long length; + unsigned long pfn; + void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length); - length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, - bh.b_size); - if (length < 0) { + if (IS_ERR(kaddr)) { result = VM_FAULT_SIGBUS; goto out; } + if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) goto fallback; @@ -589,6 +620,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, count_vm_event(PGMAJFAULT); mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT); result |= VM_FAULT_MAJOR; + dax_unmap_bh(&bh, kaddr); } /* @@ -635,12 +667,15 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, result = VM_FAULT_NOPAGE; spin_unlock(ptl); } else { - length = bdev_direct_access(bh.b_bdev, sector, &kaddr, &pfn, - bh.b_size); - if (length < 0) { + long length; + unsigned long pfn; + void __pmem *kaddr = __dax_map_bh(&bh, blkbits, &pfn, &length); + + if (IS_ERR(kaddr)) { result = VM_FAULT_SIGBUS; goto out; } + dax_unmap_bh(&bh, kaddr); if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) goto fallback; @@ -746,12 +781,13 @@ int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length, if (err < 0) return err; if (buffer_written(&bh)) { - void __pmem *addr; - err = dax_get_addr(&bh, &addr, inode->i_blkbits); - if (err < 0) - return err; + void __pmem *addr = dax_map_bh(&bh, inode->i_blkbits); + + if (IS_ERR(addr)) + return PTR_ERR(addr); clear_pmem(addr + offset, length); wmb_pmem(); + dax_unmap_bh(&bh, addr); } return 0; diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 0a0def66c61e..8127fff4d0f5 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -786,6 +786,8 @@ extern int scsi_cmd_ioctl(struct 
request_queue *, struct gendisk *, fmode_t, extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, struct scsi_ioctl_command __user *); +extern int blk_queue_enter(struct request_queue *q, gfp_t gfp); +extern void blk_queue_exit(struct request_queue *q); extern void blk_start_queue(struct request_queue *q); extern void blk_stop_queue(struct request_queue *q); extern void blk_sync_queue(struct request_queue *q); ^ permalink raw reply related [flat|nested] 13+ messages in thread
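[Editor's note: the calling convention this patch gives DAX — map either pins the queue and returns an address or returns an ERR_PTR-encoded errno, and unmap is safe on either result — can be sketched in userspace. The `ERR_PTR`/`IS_ERR`/`PTR_ERR` helpers below mimic the kernel's; `dax_map`/`dax_unmap` and the rest are invented stand-ins, not the patch's actual functions.]

```c
/* Userspace sketch of the map/unmap pairing patch 2 establishes:
 * mapping takes a queue reference (or fails if the queue is dying),
 * and unmapping drops it, with error pointers handled uniformly.
 * Illustrative only. */
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

static void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static int IS_ERR(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static int queue_refs;          /* models q_usage_counter */
static int queue_dying;         /* models blk_cleanup_queue() running */
static char pmem_page[4096];    /* models the ->direct_access() target */

/* models __dax_map_bh(): enter the queue, then return the address */
static void *dax_map(void)
{
    if (queue_dying)
        return ERR_PTR(-EIO);   /* blk_queue_enter() failed */
    queue_refs++;
    return pmem_page;
}

/* models dax_unmap_bh(): a no-op on error pointers, a reference put
 * otherwise, so every map call can be paired unconditionally */
static void dax_unmap(void *addr)
{
    if (IS_ERR(addr))
        return;
    queue_refs--;
}
```

The point of the convention is that every address obtained from the map call is covered by a queue reference for exactly as long as the caller uses it, which is what keeps ->direct_access() results valid across a concurrent driver unbind.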
* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings 2015-09-30 0:41 ` [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings Dan Williams @ 2015-09-30 23:35 ` Dave Chinner 2015-10-01 0:09 ` Dan Williams 2015-10-06 1:57 ` Dan Williams 0 siblings, 2 replies; 13+ messages in thread From: Dave Chinner @ 2015-09-30 23:35 UTC (permalink / raw) To: Dan Williams Cc: axboe, Boaz Harrosh, linux-nvdimm, linux-kernel, ross.zwisler, Christoph Hellwig On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote: > The DAX implementation needs to protect new calls to ->direct_access() > and usage of its return value against unbind of the underlying block > device. Use blk_queue_enter()/blk_queue_exit() to either prevent > blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the > request_queue is being torn down. > > Cc: Jens Axboe <axboe@kernel.dk> > Cc: Christoph Hellwig <hch@lst.de> > Cc: Boaz Harrosh <boaz@plexistor.com> > Cc: Dave Chinner <david@fromorbit.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Signed-off-by: Dan Williams <dan.j.williams@intel.com> > --- > block/blk.h | 2 - > fs/dax.c | 130 +++++++++++++++++++++++++++++++----------------- > include/linux/blkdev.h | 2 + > 3 files changed, 85 insertions(+), 49 deletions(-) > > diff --git a/block/blk.h b/block/blk.h > index 5b2cd393afbe..0f8de0dda768 100644 > --- a/block/blk.h > +++ b/block/blk.h > @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq); > void __blk_queue_free_tags(struct request_queue *q); > bool __blk_end_bidi_request(struct request *rq, int error, > unsigned int nr_bytes, unsigned int bidi_bytes); > -int blk_queue_enter(struct request_queue *q, gfp_t gfp); > -void blk_queue_exit(struct request_queue *q); > void blk_freeze_queue(struct request_queue *q); > > static inline void blk_queue_enter_live(struct request_queue *q) > diff --git a/fs/dax.c b/fs/dax.c > index bcfb14bfc1e4..7ce002bb60d0 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -63,12 
+63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size) > } > EXPORT_SYMBOL_GPL(dax_clear_blocks); > > -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr, > - unsigned blkbits) > +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits, > + unsigned long *pfn, long *len) Please don't use bufferheads for this. Please pass an inode, the block and length to map, similar to dax_clear_blocks(). Why? Because dax_clear_blocks() needs to do this "mapping" too, and it is called from contexts where there are no bufferheads. There's a good chance we'll need more mapping contexts like this in future, so let's not propagate bufferheads deeper into this code than we absolutely need to. We should be trying to limit/remove bufferheads in the DAX code, not propagating them deeper into the code... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings 2015-09-30 23:35 ` Dave Chinner @ 2015-10-01 0:09 ` Dan Williams 2015-10-06 1:57 ` Dan Williams 1 sibling, 0 replies; 13+ messages in thread From: Dan Williams @ 2015-10-01 0:09 UTC (permalink / raw) To: Dave Chinner Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm@lists.01.org, linux-kernel, Ross Zwisler, Christoph Hellwig On Wed, Sep 30, 2015 at 4:35 PM, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote: >> The DAX implementation needs to protect new calls to ->direct_access() >> and usage of its return value against unbind of the underlying block >> device. Use blk_queue_enter()/blk_queue_exit() to either prevent >> blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the >> request_queue is being torn down. >> >> Cc: Jens Axboe <axboe@kernel.dk> >> Cc: Christoph Hellwig <hch@lst.de> >> Cc: Boaz Harrosh <boaz@plexistor.com> >> Cc: Dave Chinner <david@fromorbit.com> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com> >> --- >> block/blk.h | 2 - >> fs/dax.c | 130 +++++++++++++++++++++++++++++++----------------- >> include/linux/blkdev.h | 2 + >> 3 files changed, 85 insertions(+), 49 deletions(-) >> >> diff --git a/block/blk.h b/block/blk.h >> index 5b2cd393afbe..0f8de0dda768 100644 >> --- a/block/blk.h >> +++ b/block/blk.h >> @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq); >> void __blk_queue_free_tags(struct request_queue *q); >> bool __blk_end_bidi_request(struct request *rq, int error, >> unsigned int nr_bytes, unsigned int bidi_bytes); >> -int blk_queue_enter(struct request_queue *q, gfp_t gfp); >> -void blk_queue_exit(struct request_queue *q); >> void blk_freeze_queue(struct request_queue *q); >> >> static inline void blk_queue_enter_live(struct request_queue *q) >> diff --git a/fs/dax.c b/fs/dax.c >> index bcfb14bfc1e4..7ce002bb60d0 100644 >> --- a/fs/dax.c 
>> +++ b/fs/dax.c >> @@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size) >> } >> EXPORT_SYMBOL_GPL(dax_clear_blocks); >> >> -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr, >> - unsigned blkbits) >> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits, >> + unsigned long *pfn, long *len) > > Please don't use bufferheads for this. Please pass an inode, the > block and length to map, similar to dax_clear_blocks(). > > Why? Because dax_clear_blocks() needs to do this "mapping" too, > and it is called from contexts where there are no bufferheads. > There's a good chance we'll need more mapping contexts like this in > future, so lets not propagate bufferheads deeper into this code > than we absilutely need to. > > We should be trying to limit/remove bufferheads in the DAX code, not > propagating them deeper into the code... > Sounds good, will do. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings 2015-09-30 23:35 ` Dave Chinner 2015-10-01 0:09 ` Dan Williams @ 2015-10-06 1:57 ` Dan Williams 2015-10-06 21:15 ` Dave Chinner 1 sibling, 1 reply; 13+ messages in thread From: Dan Williams @ 2015-10-06 1:57 UTC (permalink / raw) To: Dave Chinner Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm@lists.01.org, linux-kernel, Ross Zwisler, Christoph Hellwig On Wed, Sep 30, 2015 at 4:35 PM, Dave Chinner <david@fromorbit.com> wrote: > On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote: >> The DAX implementation needs to protect new calls to ->direct_access() >> and usage of its return value against unbind of the underlying block >> device. Use blk_queue_enter()/blk_queue_exit() to either prevent >> blk_cleanup_queue() from proceeding, or fail the dax_map_bh() if the >> request_queue is being torn down. >> >> Cc: Jens Axboe <axboe@kernel.dk> >> Cc: Christoph Hellwig <hch@lst.de> >> Cc: Boaz Harrosh <boaz@plexistor.com> >> Cc: Dave Chinner <david@fromorbit.com> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com> >> --- >> block/blk.h | 2 - >> fs/dax.c | 130 +++++++++++++++++++++++++++++++----------------- >> include/linux/blkdev.h | 2 + >> 3 files changed, 85 insertions(+), 49 deletions(-) >> >> diff --git a/block/blk.h b/block/blk.h >> index 5b2cd393afbe..0f8de0dda768 100644 >> --- a/block/blk.h >> +++ b/block/blk.h >> @@ -72,8 +72,6 @@ void blk_dequeue_request(struct request *rq); >> void __blk_queue_free_tags(struct request_queue *q); >> bool __blk_end_bidi_request(struct request *rq, int error, >> unsigned int nr_bytes, unsigned int bidi_bytes); >> -int blk_queue_enter(struct request_queue *q, gfp_t gfp); >> -void blk_queue_exit(struct request_queue *q); >> void blk_freeze_queue(struct request_queue *q); >> >> static inline void blk_queue_enter_live(struct request_queue *q) >> diff --git a/fs/dax.c b/fs/dax.c >> index 
bcfb14bfc1e4..7ce002bb60d0 100644 >> --- a/fs/dax.c >> +++ b/fs/dax.c >> @@ -63,12 +63,42 @@ int dax_clear_blocks(struct inode *inode, sector_t block, long size) >> } >> EXPORT_SYMBOL_GPL(dax_clear_blocks); >> >> -static long dax_get_addr(struct buffer_head *bh, void __pmem **addr, >> - unsigned blkbits) >> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits, >> + unsigned long *pfn, long *len) > > Please don't use bufferheads for this. Please pass an inode, the > block and length to map, similar to dax_clear_blocks(). > > Why? Because dax_clear_blocks() needs to do this "mapping" too, > and it is called from contexts where there are no bufferheads. > There's a good chance we'll need more mapping contexts like this in > future, so lets not propagate bufferheads deeper into this code > than we absilutely need to. > > We should be trying to limit/remove bufferheads in the DAX code, not > propagating them deeper into the code... So I gave this a try but ran into the road block that get_block() is performing the inode to bdev conversion and that is filesystem specific. However, I'll at least not pass the bh into this map routine and will call it dax_map_atomic() which is more accurate. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH 2/2] block, dax: fix lifetime of in-kernel dax mappings 2015-10-06 1:57 ` Dan Williams @ 2015-10-06 21:15 ` Dave Chinner 0 siblings, 0 replies; 13+ messages in thread From: Dave Chinner @ 2015-10-06 21:15 UTC (permalink / raw) To: Dan Williams Cc: Jens Axboe, Boaz Harrosh, linux-nvdimm@lists.01.org, linux-kernel, Ross Zwisler, Christoph Hellwig On Mon, Oct 05, 2015 at 06:57:19PM -0700, Dan Williams wrote: > On Wed, Sep 30, 2015 at 4:35 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Sep 29, 2015 at 08:41:36PM -0400, Dan Williams wrote: > >> +static void __pmem *__dax_map_bh(const struct buffer_head *bh, unsigned blkbits, > >> + unsigned long *pfn, long *len) > > > > Please don't use bufferheads for this. Please pass an inode, the > > block and length to map, similar to dax_clear_blocks(). > > > > Why? Because dax_clear_blocks() needs to do this "mapping" too, > > and it is called from contexts where there are no bufferheads. > > There's a good chance we'll need more mapping contexts like this in > > future, so lets not propagate bufferheads deeper into this code > > than we absilutely need to. > > > > We should be trying to limit/remove bufferheads in the DAX code, not > > propagating them deeper into the code... > > So I gave this a try but ran into the road block that get_block() is > performing the inode to bdev conversion and that is filesystem > specific. However, I'll at least not pass the bh into this map > routine and will call it dax_map_atomic() which is more accurate. Right, we still need the bh for the get_block call - that much is unavoidable right now, but I'm angling towards converting XFS to use a struct iomap and ->map_blocks method similar to the exportfs calls for PNFS in future. Having a dedicated structure and set of methods for obtaining and manipulating extent mappings on inodes will make all these bufferhead issues go away... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
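[Editor's note: Dave's closing remark describes an extent-mapping interface — a "struct iomap" plus a ->map_blocks-style method — without bufferheads. The sketch below only illustrates that shape; every field and function name here is invented, and the iomap infrastructure that later landed upstream differs in detail.]

```c
/* Rough sketch of the interface shape Dave describes: a mapping
 * request goes in as (offset, length) and comes back as an extent
 * descriptor, with no buffer_head involved. Illustrative only. */
#include <assert.h>

enum iomap_type { IOMAP_HOLE, IOMAP_MAPPED, IOMAP_UNWRITTEN };

struct iomap_model {
    long long offset;   /* file offset the extent starts at */
    long long length;   /* bytes covered by this extent */
    long long blkno;    /* device block, if mapped */
    enum iomap_type type;
};

/* A toy ->map_blocks: pretend the file is one 1 MiB mapped extent
 * starting at device block 100, with a hole after it. */
static int demo_map_blocks(long long offset, long long length,
                           struct iomap_model *map)
{
    const long long extent_end = 1 << 20;

    if (offset < 0 || length <= 0)
        return -1;
    map->offset = offset;
    if (offset < extent_end) {
        map->type = IOMAP_MAPPED;
        map->blkno = 100 + (offset >> 9);
        map->length = extent_end - offset < length ?
                      extent_end - offset : length;
    } else {
        map->type = IOMAP_HOLE;
        map->blkno = -1;
        map->length = length;
    }
    return 0;
}
```

With an interface of this shape, the DAX mapping helpers in patch 2 could take an (inode, offset, length) triple and receive an extent back, which is exactly the bufferhead-free contract Dave is asking for.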