From: Coly Li <colyli@suse.de>
To: Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org, linux-bcache@vger.kernel.org,
	linux-kernel@vger.kernel.org, Hannes Reinecke <hare@suse.com>,
	Xiao Ni <xni@redhat.com>, Bart Van Assche <bvanassche@acm.org>,
	Christoph Hellwig <hch@lst.de>,
	Enzo Matsumiya <ematsumiya@suse.com>,
	Jens Axboe <axboe@kernel.dk>, Evan Green <evgreen@chromium.org>
Subject: Re: [PATCH] block: tolerate 0 byte discard_granularity in __blkdev_issue_discard()
Date: Wed, 5 Aug 2020 10:52:19 +0800	[thread overview]
Message-ID: <44503c40-5ead-3d07-ee81-150cbc8586e3@suse.de> (raw)
In-Reply-To: <20200805024613.GB1981569@T590>

On 2020/8/5 10:46, Ming Lei wrote:
> On Wed, Aug 05, 2020 at 09:54:00AM +0800, Coly Li wrote:
>> On 2020/8/5 07:58, Ming Lei wrote:
>>> On Tue, Aug 04, 2020 at 10:23:32PM +0800, Coly Li wrote:
>>>> When some buggy driver doesn't set its queue->limits.discard_granularity
>>>> (e.g. current loop device driver), discard at LBA 0 on such device will
>>>> trigger a kernel BUG() panic from block/blk-mq.c:563.
>>>>
>>>> [  955.565006][   C39] ------------[ cut here ]------------
>>>> [  955.559660][   C39] invalid opcode: 0000 [#1] SMP NOPTI
>>>> [  955.622171][   C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G            E     5.8.0-default+ #40
>>>> [  955.622171][   C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
>>>> [  955.622175][   C39] RIP: 0010:blk_mq_end_request+0x107/0x110
>>>> [  955.622177][   C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb <0f> 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
>>>> [  955.622179][   C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
>>>> [  955.749277][   C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
>>>> [  955.749278][   C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
>>>> [  955.749279][   C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
>>>> [  955.749279][   C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
>>>> [  955.749280][   C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
>>>> [  955.749281][   C39] FS:  0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
>>>> [  955.749282][   C39] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  955.749282][   C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
>>>> [  955.749283][   C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [  955.749284][   C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [  955.749284][   C39] PKRU: 55555554
>>>> [  955.749285][   C39] Call Trace:
>>>> [  955.749290][   C39]  blk_done_softirq+0x99/0xc0
>>>> [  957.550669][   C39]  __do_softirq+0xd3/0x45f
>>>> [  957.550677][   C39]  ? smpboot_thread_fn+0x2f/0x1e0
>>>> [  957.550679][   C39]  ? smpboot_thread_fn+0x74/0x1e0
>>>> [  957.550680][   C39]  ? smpboot_thread_fn+0x14e/0x1e0
>>>> [  957.550684][   C39]  run_ksoftirqd+0x30/0x60
>>>> [  957.550687][   C39]  smpboot_thread_fn+0x149/0x1e0
>>>> [  957.886225][   C39]  ? sort_range+0x20/0x20
>>>> [  957.886226][   C39]  kthread+0x137/0x160
>>>> [  957.886228][   C39]  ? kthread_park+0x90/0x90
>>>> [  957.886231][   C39]  ret_from_fork+0x22/0x30
>>>> [  959.117120][   C39] ---[ end trace 3dacdac97e2ed164 ]---
>>>>
>>>> This is the procedure to reproduce the panic,
>>>>   # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
>>>>   # losetup -f /dev/nvme0n1 --direct-io=on
>>>>   # blkdiscard /dev/loop0 -o 0 -l 0x200
>>>>
>>>> This is how the BUG() panic is triggered via __blkdev_issue_discard():
>>>> - For a NVMe SSD backing loop device, the driver does not initialize
>>>>   its queue->limits.discard_granularity and leaves it to 0.
>>>> - When a discard is issued at LBA 0 of the loop device,
>>>>   __blkdev_issue_discard() is called before the loop device driver
>>>>   code.
>>>> - Inside __blkdev_issue_discard(), when calculating value of
>>>>   granularity_aligned_lba by
>>>> 	granularity_aligned_lba = round_up(sector_mapped,
>>>> 			q->limits.discard_granularity >> SECTOR_SHIFT);
>>>>   because sector_mapped is 0 (at LBA 0 and no partition offset), and
>>>>   q->limits.discard_granularity is 0 (by the buggy loop driver), the
>>>>   calculated granularity_aligned_lba is 0.
>>>> - The inline function bio_aligned_discard_max_sectors() is defined as
>>>> 	return round_down(UINT_MAX, q->limits.discard_granularity) >>
>>>> 			SECTOR_SHIFT;
>>>>    when q->limits.discard_granularity is 0 from the loop device
>>>>    driver, the above calculation returns 0.
>>>> - Now granularity_aligned_lba and sector_mapped are both 0, so
>>>>   req_sects is calculated by the following lines in
>>>>   __blkdev_issue_discard(),
>>>> 	if (granularity_aligned_lba == sector_mapped)
>>>> 		req_sects = min_t(sector_t, nr_sects,
>>>> 				  bio_aligned_discard_max_sectors(q));
>>>>   because bio_aligned_discard_max_sectors(q) returns 0, req_sects is
>>>>   calculated as 0.
>>>> - Now a discard bio is mistakenly initialized as a 0 byte bio by,
>>>> 	bio->bi_iter.bi_size = req_sects << 9;
>>>>   and sent to loop device driver.
>>>> - This discard request is handled by the loop device driver through
>>>>   the following code path,
>>>>     loop_handle_cmd => do_req_filebacked => lo_fallocate =>
>>>>     file->f_op->fallocate => blkdev_fallocate => blkdev_issue_zeroout =>
>>>>     __blkdev_issue_write_zeroes
>>>> - Inside __blkdev_issue_write_zeroes(), a 0-byte discard bio is
>>>>   composed and sent to the backing device of the loop device.
>>>> - In the I/O completion code path, in my case it is,
>>>>     blk_done_softirq => nrq->q->mq_ops->complete => nvme_pci_complete_rq
>>>>     => nvme_complete_rq => blk_mq_end_request
>>>>   inside blk_mq_end_request(), blk_update_request() is called;
>>>>   because req->bio is NULL from the previous step,
>>>>   blk_update_request() returns false, and then the BUG() panic in
>>>>   blk_mq_end_request() is triggered.
>>>>
>>>> Although the above panic can be fixed in the loop device driver, the
>>>> generic __blkdev_issue_discard() should also be fixed to tolerate an
>>>> incorrect 0 value of queue->limits.discard_granularity, in case some
>>>> other buggy driver makes such a mistake again.
>>>>
>>>> This patch checks whether q->limits.discard_granularity is 0 in
>>>> __blkdev_issue_discard() and bio_aligned_discard_max_sectors(). If it
>>>> is 0 due to some buggy driver, a warning is printed and a local
>>>> variable discard_granularity is set to queue_logical_block_size(q).
>>>> This local variable is then used in the round_up() and round_down()
>>>> calculations, so req_sects won't be 0 and no empty discard request is
>>>> generated.
>>>>
>>>> Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
>>>> Fixes: c52abf563049 ("loop: Better discard support for block devices")
>>>> Reported-by: Ming Lei <ming.lei@redhat.com>
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> Cc: Hannes Reinecke <hare@suse.com>
>>>> Cc: Ming Lei <ming.lei@redhat.com>
>>>> Cc: Xiao Ni <xni@redhat.com>
>>>> Cc: Bart Van Assche <bvanassche@acm.org>
>>>> Cc: Christoph Hellwig <hch@lst.de>
>>>> Cc: Enzo Matsumiya <ematsumiya@suse.com>
>>>> Cc: Jens Axboe <axboe@kernel.dk>
>>>> Cc: Evan Green <evgreen@chromium.org>
>>>> ---
>>>>  block/blk-lib.c | 8 +++++++-
>>>>  block/blk.h     | 9 +++++++--
>>>>  2 files changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>>>> index 019e09bb9c0e..3017e4cba923 100644
>>>> --- a/block/blk-lib.c
>>>> +++ b/block/blk-lib.c
>>>> @@ -30,6 +30,7 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>>>  	struct bio *bio = *biop;
>>>>  	unsigned int op;
>>>>  	sector_t bs_mask, part_offset = 0;
>>>> +	sector_t discard_granularity;
>>>>  
>>>>  	if (!q)
>>>>  		return -ENXIO;
>>>> @@ -54,6 +55,11 @@ int __blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>>>  	if (!nr_sects)
>>>>  		return -EINVAL;
>>>>  
>>>> +	discard_granularity = q->limits.discard_granularity;
>>>> +	/* In case some buggy driver does not set limits.discard_granularity */
>>>> +	if (WARN_ON_ONCE(discard_granularity == 0))
>>>> +		discard_granularity = queue_logical_block_size(q);
>>>
>>> This code path is not supposed to run when q->limits.discard_granularity
>>> is zero, and it looks fine to just warn and return -EINVAL in this case;
>>> see Documentation/ABI/testing/sysfs-block:
>>>
>>> 	What:       /sys/block/<disk>/queue/discard_granularity
>>> 	Date:       May 2011
>>> 	Contact:    Martin K. Petersen <martin.petersen@oracle.com>
>>> 	Description:
>>> 	        Devices that support discard functionality may
>>> 	        internally allocate space using units that are bigger
>>> 	        than the logical block size. The discard_granularity
>>> 	        parameter indicates the size of the internal allocation
>>> 	        unit in bytes if reported by the device. Otherwise the
>>> 	        discard_granularity will be set to match the device's
>>> 	        physical block size. A discard_granularity of 0 means
>>> 	        that the device does not support discard functionality.
>>>
>>> What we need to fix is the loop driver: if it claims to support
>>> discard, q->limits.discard_granularity has to be a valid value.
>>
>> Yes, your suggestion is much simpler, let me do it :-)
> 
> That is exactly what the following patch does:
> 
> https://lore.kernel.org/linux-block/6f642b8a-648e-8b59-067f-6c9f4cc32fa6@suse.de/T/#m82a878277ae7ed6b7a595820112fd13beaa24c99

I meant your suggestion for __blkdev_issue_discard(), in case some other
buggy driver makes a similar mistake in the future. The v2 patch is now
much simpler; please review it, and thank you in advance :-)

Coly Li


Thread overview: 13+ messages
2020-08-04 14:23 [PATCH] block: tolerate 0 byte discard_granularity in __blkdev_issue_discard() Coly Li
2020-08-04 14:31 ` Johannes Thumshirn
2020-08-04 14:34   ` Coly Li
2020-08-04 14:37     ` Johannes Thumshirn
2020-08-04 14:39       ` Johannes Thumshirn
2020-08-04 14:45         ` Coly Li
2020-08-04 14:54           ` Johannes Thumshirn
2020-08-04 14:43       ` Coly Li
2020-08-04 23:58 ` Ming Lei
2020-08-05  0:28   ` Martin K. Petersen
2020-08-05  1:54   ` Coly Li
2020-08-05  2:46     ` Ming Lei
2020-08-05  2:52       ` Coly Li [this message]
