All of lore.kernel.org
 help / color / mirror / Atom feed
From: Christoph Hellwig <hch@infradead.org>
To: Ethan Zhao <haifeng.zhao@intel.com>
Cc: axboe@kernel.dk, bhelgaas@google.com,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, mcgrof@kernel.org,
	ShanshanX.Zhang@intel.com, pei.p.jia@intel.com,
	Ming Lei <ming.lei@redhat.com>
Subject: Re: [PATCH] Revert "block: revert back to synchronous request_queue removal"
Date: Tue, 8 Sep 2020 15:21:28 +0100	[thread overview]
Message-ID: <20200908142128.GA3463@infradead.org> (raw)
In-Reply-To: <20200908075047.5140-1-haifeng.zhao@intel.com>

On Tue, Sep 08, 2020 at 03:50:48AM -0400, Ethan Zhao wrote:
> From: Ethan Zhao <Haifeng.Zhao@intel.com>
> 
> 'commit e8c7d14ac6c3 ("block: revert back to synchronous request_queue
> removal")' introduced panic issue to NVMe hotplug as following(hit
> after just 2 times NVMe SSD hotplug under stable 5.9-RC2):

I'm pretty sure Ming has already fixed this by not calling put_device
from the rcu callbackm which is the real issue here.

And even if that wasn't the case we generally try to fix bugs instead of
going to a blind revert without much analysis.

> 
> BUG: sleeping function called from invalid context at block/genhd.c:1563
> in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: swapper/30
> INFO: lockdep is turned off.
> CPU: 30 PID: 0 Comm: swapper/30 Tainted: G S      W         5.9.0-RC2 #3
> Hardware name: Intel Corporation xxxxxxxx
> Call Trace:
> <IRQ>
> dump_stack+0x9d/0xe0
> ___might_sleep.cold.79+0x17f/0x1af
> disk_release+0x26/0x200
> device_release+0x6d/0x1c0
> kobject_put+0x14d/0x430
> hd_struct_free+0xfb/0x260
> percpu_ref_switch_to_atomic_rcu+0x3d1/0x580
> ? rcu_nocb_unlock_irqrestore+0xb6/0xf0
> ? trace_hardirqs_on+0x20/0x1b5
> ? rcu_do_batch+0x3ff/0xb50
> rcu_do_batch+0x47c/0xb50
> ? rcu_accelerate_cbs+0xa9/0x740
> ? rcu_spawn_one_nocb_kthread+0x3d0/0x3d0
> rcu_core+0x945/0xd90
> ? __do_softirq+0x182/0xac0
> __do_softirq+0x1ca/0xac0
> asm_call_on_stack+0xf/0x20
> </IRQ>
> do_softirq_own_stack+0x7f/0x90
> irq_exit_rcu+0x1e3/0x230
> sysvec_apic_timer_interrupt+0x48/0xb0
> asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0010:cpuidle_enter_state+0x116/0xe90
> Code: 00 31 ff e8 ac c8 a4 fe 80 7c 24 10 00 74 12 9c 58 f6 c4 02
>  0f 85 7e 08 00 00 31 ff e8 43 ca be fe e8 ae a3 d5 fe fb 45 85 ed
>  <0f> 88 4e 05 00 00 4d 63 f5 49 83 fe 09 0f 87 29 0b 00 00 4b 8d 04
> RSP: 0018:ff110001040dfd78 EFLAGS: 00000206
> RAX: 0000000000000007 RBX: ffd1fff7b1a01e00 RCX: 000000000000001f
> RDX: 0000000000000000 RSI: 0000000040000000 RDI: ffffffffb7c5c0f2
> RBP: ffffffffb9a416a0 R08: 0000000000000000 R09: 0000000000000000
> R10: ff110001040d0007 R11: ffe21c002081a000 R12: 0000000000000003
> R13: 0000000000000003 R14: 0000000000000138
> ... ...
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> PGD 0
> Oops: 0010 [#1] SMP KASAN NOPTI
> CPU: 30 PID: 500 Comm: irq/124-pciehp Tainted: G S  W  5.9.0-RC2 #3
> Hardware name: Intel Corporation xxxxxxxx
> RIP: 0010:0x0
> Code: Bad RIP value.
> RSP: 0018:ff110007d5ba75e8 EFLAGS: 00010096
> RAX: 0000000000000000 RBX: ff110001040d0000 RCX: ff110007d5ba7668
> RDX: 0000000000000009 RSI: ff110001040d0000 RDI: ff110008119f59c0
> RBP: ff110008119f59c0 R08: fffffbfff73f7b4d R09: fffffbfff73f7b4d
> R10: ffffffffb9fbda67 R11: fffffbfff73f7b4c R12: 0000000000000000
> R13: ff110007d5ba7668 R14: ff110001040d0000 R15: ff110008119f59c0
> FS:  0000000000000000(0000) GS:ff11000811800000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffffffffffffffd6 CR3: 00000007cea16002 CR4: 0000000000771ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> ttwu_do_activate+0xd3/0x160
> try_to_wake_up+0x652/0x1850
> ? migrate_swap_stop+0xad0/0xad0
> ? lock_contended+0xd70/0xd70
> ? rcu_read_unlock+0x50/0x50
> ? rcu_do_batch+0x3ff/0xb50
> swake_up_locked+0x85/0x1c0
> complete+0x4d/0x70
> rcu_do_batch+0x47c/0xb50
> ? rcu_spawn_one_nocb_kthread+0x3d0/0x3d0
> rcu_core+0x945/0xd90
> ? __do_softirq+0x182/0xac0
> __do_softirq+0x1ca/0xac0
> irq_exit_rcu+0x1e3/0x230
> sysvec_apic_timer_interrupt+0x48/0xb0
> asm_sysvec_apic_timer_interrupt+0x12/0x20
> RIP: 0010:_raw_spin_unlock_irqrestore+0x40/0x50
> Code: e8 35 ad 36 fe 48 89 ef e8 cd ce 37 fe f6 c7 02 75 11 53 9d
>  e8 91 1f 5c fe 65 ff 0d ba af c2 47 5b 5d c3 e8 d2 22 5c fe 53 9d
>  <eb> ed 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 53
> RSP: 0018:ff110007d5ba79d0 EFLAGS: 00000293
> RAX: 0000000000000007 RBX: 0000000000000293 RCX: ffffffffb67710d4
> RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffffb83f41ce
> RBP: ffffffffbb77e740 R08: 0000000000000000 R09: 0000000000000000
> R10: ffffffffbb77e743 R11: fffffbfff76efce8 R12: 000000000000198e
> R13: ff11001031d7a0b0 R14: ffffffffbb77e740 R15: ffffffffbb77e788
> ? do_raw_spin_unlock+0x54/0x230
> ? _raw_spin_unlock_irqrestore+0x3e/0x50
> dma_debug_device_change+0x150/0x5e0
> notifier_call_chain+0x90/0x160
> __blocking_notifier_call_chain+0x6d/0xa0
> device_release_driver_internal+0x37d/0x490
> pci_stop_bus_device+0x123/0x190
> pci_stop_and_remove_bus_device+0xe/0x20
> pciehp_unconfigure_device+0x17e/0x330
> ? pciehp_configure_device+0x3e0/0x3e0
> ? trace_hardirqs_on+0x20/0x1b5
> pciehp_disable_slot+0x101/0x360
> ? pme_is_native.cold.2+0x29/0x29
> pciehp_handle_presence_or_link_change+0x1ac/0xee0
> ? pciehp_handle_disable_request+0x110/0x110
> pciehp_ist.cold.11+0x39/0x54
> ? pciehp_set_indicators+0x190/0x190
> ? alloc_desc+0x510/0xa30
> ? irq_set_affinity_notifier+0x380/0x380
> ? pciehp_set_indicators+0x190/0x190
> ? irq_thread+0x137/0x420
> irq_thread_fn+0x86/0x150
> irq_thread+0x21f/0x420
> ? irq_forced_thread_fn+0x170/0x170
> ? irq_thread_check_affinity+0x210/0x210
> ? __kthread_parkme+0x52/0x1a0
> ? lockdep_hardirqs_on_prepare+0x33e/0x4e0
> ? _raw_spin_unlock_irqrestore+0x3e/0x50
> ? trace_hardirqs_on+0x20/0x1b5
> ? wake_threads_waitq+0x40/0x40
> ? __kthread_parkme+0xd1/0x1a0
> ? irq_thread_check_affinity+0x210/0x210
> kthread+0x36a/0x430
> ? kthread_create_worker_on_cpu+0xc0/0xc0
> ret_from_fork+0x1f/0x30
> ... ...
> CR2: 0000000000000000
> ---[ end trace cedc4047ef91d2ec ]---
> 
> Seems scheduling happened within hardware interrupt context, after
> reverted this patch, stable 5.9-RC4 build was tested with more than 20
> times NVMe SSD hotplug, no panic found.
> 
> This reverts commit e8c7d14ac6c37c173ec606907d38802b00302988.
> 
> Tested-by: Shanshan Zhang <ShanshanX.Zhang@intel.com>
> Signed-off-by: Ethan Zhao <Haifeng.Zhao@intel.com>
> ---
>  block/blk-core.c       |  8 --------
>  block/blk-sysfs.c      | 43 +++++++++++++++++++++---------------------
>  block/genhd.c          | 17 -----------------
>  include/linux/blkdev.h |  2 ++
>  4 files changed, 23 insertions(+), 47 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 10c08ac50697..1b18a0ef5db1 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -325,9 +325,6 @@ EXPORT_SYMBOL_GPL(blk_clear_pm_only);
>   *
>   * Decrements the refcount of the request_queue kobject. When this reaches 0
>   * we'll have blk_release_queue() called.
> - *
> - * Context: Any context, but the last reference must not be dropped from
> - *          atomic context.
>   */
>  void blk_put_queue(struct request_queue *q)
>  {
> @@ -360,14 +357,9 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
>   *
>   * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
>   * put it.  All future requests will be failed immediately with -ENODEV.
> - *
> - * Context: can sleep
>   */
>  void blk_cleanup_queue(struct request_queue *q)
>  {
> -	/* cannot be called from atomic context */
> -	might_sleep();
> -
>  	WARN_ON_ONCE(blk_queue_registered(q));
>  
>  	/* mark @q DYING, no new request or merges will be allowed afterwards */
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 7dda709f3ccb..eb347cbe0f93 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -901,32 +901,22 @@ static void blk_exit_queue(struct request_queue *q)
>  	bdi_put(q->backing_dev_info);
>  }
>  
> +
>  /**
> - * blk_release_queue - releases all allocated resources of the request_queue
> - * @kobj: pointer to a kobject, whose container is a request_queue
> - *
> - * This function releases all allocated resources of the request queue.
> - *
> - * The struct request_queue refcount is incremented with blk_get_queue() and
> - * decremented with blk_put_queue(). Once the refcount reaches 0 this function
> - * is called.
> - *
> - * For drivers that have a request_queue on a gendisk and added with
> - * __device_add_disk() the refcount to request_queue will reach 0 with
> - * the last put_disk() called by the driver. For drivers which don't use
> - * __device_add_disk() this happens with blk_cleanup_queue().
> + * __blk_release_queue - release a request queue
> + * @work: pointer to the release_work member of the request queue to be released
>   *
> - * Drivers exist which depend on the release of the request_queue to be
> - * synchronous, it should not be deferred.
> - *
> - * Context: can sleep
> + * Description:
> + *     This function is called when a block device is being unregistered. The
> + *     process of releasing a request queue starts with blk_cleanup_queue, which
> + *     set the appropriate flags and then calls blk_put_queue, that decrements
> + *     the reference counter of the request queue. Once the reference counter
> + *     of the request queue reaches zero, blk_release_queue is called to release
> + *     all allocated resources of the request queue.
>   */
> -static void blk_release_queue(struct kobject *kobj)
> +static void __blk_release_queue(struct work_struct *work)
>  {
> -	struct request_queue *q =
> -		container_of(kobj, struct request_queue, kobj);
> -
> -	might_sleep();
> +	struct request_queue *q = container_of(work, typeof(*q), release_work);
>  
>  	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
>  		blk_stat_remove_callback(q, q->poll_cb);
> @@ -958,6 +948,15 @@ static void blk_release_queue(struct kobject *kobj)
>  	call_rcu(&q->rcu_head, blk_free_queue_rcu);
>  }
>  
> +static void blk_release_queue(struct kobject *kobj)
> +{
> +	struct request_queue *q =
> +		container_of(kobj, struct request_queue, kobj);
> +
> +	INIT_WORK(&q->release_work, __blk_release_queue);
> +	schedule_work(&q->release_work);
> +}
> +
>  static const struct sysfs_ops queue_sysfs_ops = {
>  	.show	= queue_attr_show,
>  	.store	= queue_attr_store,
> diff --git a/block/genhd.c b/block/genhd.c
> index 99c64641c314..7e2edf388c8a 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -887,19 +887,12 @@ static void invalidate_partition(struct gendisk *disk, int partno)
>   * The final removal of the struct gendisk happens when its refcount reaches 0
>   * with put_disk(), which should be called after del_gendisk(), if
>   * __device_add_disk() was used.
> - *
> - * Drivers exist which depend on the release of the gendisk to be synchronous,
> - * it should not be deferred.
> - *
> - * Context: can sleep
>   */
>  void del_gendisk(struct gendisk *disk)
>  {
>  	struct disk_part_iter piter;
>  	struct hd_struct *part;
>  
> -	might_sleep();
> -
>  	blk_integrity_del(disk);
>  	disk_del_events(disk);
>  
> @@ -1553,15 +1546,11 @@ int disk_expand_part_tbl(struct gendisk *disk, int partno)
>   * drivers we also call blk_put_queue() for them, and we expect the
>   * request_queue refcount to reach 0 at this point, and so the request_queue
>   * will also be freed prior to the disk.
> - *
> - * Context: can sleep
>   */
>  static void disk_release(struct device *dev)
>  {
>  	struct gendisk *disk = dev_to_disk(dev);
>  
> -	might_sleep();
> -
>  	blk_free_devt(dev->devt);
>  	disk_release_events(disk);
>  	kfree(disk->random);
> @@ -1806,9 +1795,6 @@ EXPORT_SYMBOL(get_disk_and_module);
>   *
>   * This decrements the refcount for the struct gendisk. When this reaches 0
>   * we'll have disk_release() called.
> - *
> - * Context: Any context, but the last reference must not be dropped from
> - *          atomic context.
>   */
>  void put_disk(struct gendisk *disk)
>  {
> @@ -1823,9 +1809,6 @@ EXPORT_SYMBOL(put_disk);
>   *
>   * This is a counterpart of get_disk_and_module() and thus also of
>   * get_gendisk().
> - *
> - * Context: Any context, but the last reference must not be dropped from
> - *          atomic context.
>   */
>  void put_disk_and_module(struct gendisk *disk)
>  {
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index bb5636cc17b9..59fe9de342e0 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -583,6 +583,8 @@ struct request_queue {
>  
>  	size_t			cmd_size;
>  
> +	struct work_struct	release_work;
> +
>  #define BLK_MAX_WRITE_HINTS	5
>  	u64			write_hints[BLK_MAX_WRITE_HINTS];
>  };
> -- 
> 2.18.4
> 
---end quoted text---

  reply	other threads:[~2020-09-08 20:12 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-08  7:50 [PATCH] Revert "block: revert back to synchronous request_queue removal" Ethan Zhao
2020-09-08 14:21 ` Christoph Hellwig [this message]
2020-09-09  2:11   ` Zhao, Haifeng
2020-09-09  2:39     ` Ming Lei
2020-09-09  2:41       ` Zhao, Haifeng
2020-09-08 17:56 ` Bart Van Assche

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200908142128.GA3463@infradead.org \
    --to=hch@infradead.org \
    --cc=ShanshanX.Zhang@intel.com \
    --cc=axboe@kernel.dk \
    --cc=bhelgaas@google.com \
    --cc=haifeng.zhao@intel.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=ming.lei@redhat.com \
    --cc=pei.p.jia@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.