From: Yi Zhang <yi.zhang@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>, Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org, linux-nvme@lists.infradead.org
Subject: Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
Date: Mon, 9 Apr 2018 17:05:40 +0800	[thread overview]
Message-ID: <f34c2a61-fac9-71cb-0663-5876196f3ef4@redhat.com> (raw)
In-Reply-To: <9eb1d6ba-3994-596f-1b90-38a9b879f416@redhat.com>



On 04/09/2018 04:54 PM, Yi Zhang wrote:
>
>
> On 04/09/2018 04:31 PM, Sagi Grimberg wrote:
>>
>>>> My device exposes nr_hw_queues which is not higher than
>>>> num_online_cpus, so I want to connect all hctxs in the hope that
>>>> they will be used.
>>>
>>> The issue is that CPU online & offline can happen at any time, and
>>> after blk-mq removed its CPU hotplug handler, there is no way to
>>> remap queues when the CPU topology changes.
>>>
>>> For example:
>>>
>>> 1) after nr_hw_queues is set to num_online_cpus() and the hw queues
>>> are initialized, some CPUs become offline and the issue reported by
>>> Zhang Yi is triggered; in this case we should fail the allocation,
>>> since a 1:1 mapping doesn't need to use this inactive hw queue.
>>
>> Normal cpu offlining is fine, as the hctxs are already connected. When
>> we reset the controller and re-establish the queues, the issue triggers
>> because we call blk_mq_alloc_request_hctx.
>>
>> The question is, for this particular issue, given that the request
>> execution is guaranteed to run from an online cpu, will the below work?
> Hi Sagi
> Sorry for the late response; the below patch works, here is the full log:

And this issue cannot be reproduced on 4.15, so I did bisect testing today
and found it was introduced by the commit below:
bf9ae8c blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init()

>
> [  117.370832] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
> [  117.427385] nvme nvme0: creating 40 I/O queues.
> [  117.736806] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
> [  122.531891] smpboot: CPU 1 is now offline
> [  122.573007] IRQ 37: no longer affine to CPU2
> [  122.577775] IRQ 54: no longer affine to CPU2
> [  122.582532] IRQ 70: no longer affine to CPU2
> [  122.587300] IRQ 98: no longer affine to CPU2
> [  122.592069] IRQ 140: no longer affine to CPU2
> [  122.596930] IRQ 141: no longer affine to CPU2
> [  122.603166] smpboot: CPU 2 is now offline
> [  122.840577] smpboot: CPU 3 is now offline
> [  125.204901] print_req_error: operation not supported error, dev nvme0n1, sector 143212504
> [  125.204907] print_req_error: operation not supported error, dev nvme0n1, sector 481004984
> [  125.204922] print_req_error: operation not supported error, dev nvme0n1, sector 436594584
> [  125.204924] print_req_error: operation not supported error, dev nvme0n1, sector 461363784
> [  125.204945] print_req_error: operation not supported error, dev nvme0n1, sector 308124792
> [  125.204957] print_req_error: operation not supported error, dev nvme0n1, sector 513395784
> [  125.204959] print_req_error: operation not supported error, dev nvme0n1, sector 432260176
> [  125.204961] print_req_error: operation not supported error, dev nvme0n1, sector 251704096
> [  125.204963] print_req_error: operation not supported error, dev nvme0n1, sector 234819336
> [  125.204966] print_req_error: operation not supported error, dev nvme0n1, sector 181874128
> [  125.938858] nvme nvme0: Reconnecting in 10 seconds...
> [  125.938862] Buffer I/O error on dev nvme0n1, logical block 367355, lost async page write
> [  125.942587] Buffer I/O error on dev nvme0n1, logical block 586, lost async page write
> [  125.942589] Buffer I/O error on dev nvme0n1, logical block 375453, lost async page write
> [  125.942591] Buffer I/O error on dev nvme0n1, logical block 587, lost async page write
> [  125.942592] Buffer I/O error on dev nvme0n1, logical block 588, lost async page write
> [  125.942593] Buffer I/O error on dev nvme0n1, logical block 375454, lost async page write
> [  125.942594] Buffer I/O error on dev nvme0n1, logical block 589, lost async page write
> [  125.942595] Buffer I/O error on dev nvme0n1, logical block 590, lost async page write
> [  125.942596] Buffer I/O error on dev nvme0n1, logical block 591, lost async page write
> [  125.942597] Buffer I/O error on dev nvme0n1, logical block 592, lost async page write
> [  130.205584] print_req_error: 537000 callbacks suppressed
> [  130.205586] print_req_error: I/O error, dev nvme0n1, sector 471135288
> [  130.218763] print_req_error: I/O error, dev nvme0n1, sector 471137240
> [  130.225985] print_req_error: I/O error, dev nvme0n1, sector 471138328
> [  130.233206] print_req_error: I/O error, dev nvme0n1, sector 471140096
> [  130.240433] print_req_error: I/O error, dev nvme0n1, sector 471140184
> [  130.247659] print_req_error: I/O error, dev nvme0n1, sector 471140960
> [  130.254874] print_req_error: I/O error, dev nvme0n1, sector 471141864
> [  130.262095] print_req_error: I/O error, dev nvme0n1, sector 471143296
> [  130.269317] print_req_error: I/O error, dev nvme0n1, sector 471143776
> [  130.276537] print_req_error: I/O error, dev nvme0n1, sector 471144224
> [  132.954315] nvme nvme0: Identify namespace failed
> [  132.959698] buffer_io_error: 3801549 callbacks suppressed
> [  132.959699] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.974669] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.983078] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.991476] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.999859] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.008217] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.016575] Dev nvme0n1: unable to read RDB block 0
> [  133.022423] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.030800] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.039151]  nvme0n1: unable to read partition table
> [  133.050221] Buffer I/O error on dev nvme0n1, logical block 65535984, async page read
> [  133.060154] Buffer I/O error on dev nvme0n1, logical block 65535984, async page read
> [  136.334516] nvme nvme0: creating 37 I/O queues.
> [  136.636012] no online cpu for hctx 1
> [  136.640448] no online cpu for hctx 2
> [  136.644832] no online cpu for hctx 3
> [  136.650432] nvme nvme0: Successfully reconnected (1 attempts)
> [  184.894584] x86: Booting SMP configuration:
> [  184.899694] smpboot: Booting Node 1 Processor 1 APIC 0x20
> [  184.913923] smpboot: Booting Node 0 Processor 2 APIC 0x2
> [  184.929556] smpboot: Booting Node 1 Processor 3 APIC 0x22
>
> And here is the debug output:
> [root@rdma-virt-01 linux (test)]$ gdb vmlinux
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show 
> copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /home/test/linux/vmlinux...done.
> (gdb) l *(blk_mq_get_request+0x23e)
> 0xffffffff8136bf9e is in blk_mq_get_request (block/blk-mq.c:327).
> 322        rq->rl = NULL;
> 323        set_start_time_ns(rq);
> 324        rq->io_start_time_ns = 0;
> 325    #endif
> 326
> 327        data->ctx->rq_dispatched[op_is_sync(op)]++;
> 328        return rq;
> 329    }
> 330
> 331    static struct request *blk_mq_get_request(struct request_queue *q,
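
For reference, the fault decodes consistently with the allocation path that
Sagi's diff below touches. A simplified sketch of the relevant lines of
blk_mq_alloc_request_hctx() as of 4.16 (reconstructed from the diff context,
not a verbatim copy of block/blk-mq.c):

	/*
	 * When every CPU in hctx->cpumask is offline, cpumask_first_and()
	 * finds no set bit and returns nr_cpu_ids.
	 */
	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	/* a per-cpu ctx lookup with cpu == nr_cpu_ids yields a bogus pointer */
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
	/*
	 * ...which blk_mq_get_request() then dereferences at blk-mq.c:327,
	 *     data->ctx->rq_dispatched[op_is_sync(op)]++;
	 * hence the BUG at blk_mq_get_request+0x23e decoded above.
	 */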
>
>
>
>> -- 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 75336848f7a7..81ced3096433 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>                 return ERR_PTR(-EXDEV);
>>         }
>>         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>> +       if (cpu >= nr_cpu_ids) {
>> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
>> +       }
>>         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>
>>         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
>> -- 
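
(Note the shape of the fallback above: when cpumask_first_and() finds no
online CPU in the hctx's mask, the patch warns (these are the "no online
cpu for hctx 1/2/3" lines in the log) and falls back to cpumask_first(),
i.e. a possibly-offline CPU. That is presumably safe only because blk-mq
allocates a software ctx for every possible CPU, so the ctx pointer stays
valid even when the CPU is down; whether dispatching through it is
acceptable is exactly the question raised above.)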
>>
>>> 2) when nr_hw_queues is set to num_online_cpus(), there may be far
>>> fewer online CPUs at initialization time, so the hw queue count can
>>> end up much smaller, and performance suffers even if some CPUs come
>>> online later.
>>
>> That is correct; when the controller is reset, though, more queues
>> will be added to the system. I agree it would be good if we could
>> change this dynamically.
>>
>>> So the current policy is to map all possible CPUs for handling CPU
>>> hotplug, and if you want a 1:1 mapping between hw queues and online
>>> CPUs, nr_hw_queues can be set to num_possible_cpus.
>>
>> Having nr_hw_queues == num_possible_cpus cannot work, as it requires
>> establishing an RDMA queue-pair, with a set of HW resources both on
>> the host side _and_ on the controller side, that would sit idle.
>>
>>> Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
>>> num_possible_cpus() to pci_alloc_irq_vectors).
>>
>> Yes, I am aware of this patch, however I'm not sure it'll be a good
>> idea for nvmf, as it takes resources from both the host and the
>> target for cpus that may never come online...
>>
>>> It will waste some memory, just like a percpu variable, but it
>>> simplifies the queue mapping logic a lot, and it can support both
>>> hard and soft CPU online/offline without a CPU hotplug handler,
>>> which could otherwise cause very complicated queue dependency issues.
>>
>> Yes, but those memory resources become an issue when they take HW
>> (RDMA) resources on the local device and on the target device.
>>
>>>> I agree we don't want to connect an hctx which doesn't have an
>>>> online cpu (that's redundant), but this is not the case here.
>>>
>>> OK, I will explain below, and it can be fixed by the following patch 
>>> too:
>>>
>>> https://marc.info/?l=linux-block&m=152318093725257&w=2
>>>
>>
>> I agree this patch is good!
>>
>>>>>>> Or I may understand you wrong, :-)
>>>>>>
>>>>>> In the report we connected 40 hctxs (which was exactly the number of
>>>>>> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
>>>>>> I'm not sure why some hctxs are left without any online cpus.
>>>>>
>>>>> That is possible after the following two commits:
>>>>>
>>>>> 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>>>> 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each 
>>>>> possisble CPU)
>>>>>
>>>>> And this can be triggered even without putting down any CPUs.
>>>>>
>>>>> The blk-mq CPU hotplug handler was removed in 4b855ad37194, and we
>>>>> can't remap queues any more when the CPU topology changes, so the
>>>>> static & fixed mapping has to be set up from the beginning.
>>>>>
>>>>> Then if there are too few online CPUs compared with the number of
>>>>> hw queues, some hctxs can end up mapped to only offline CPUs. For
>>>>> example, if one device has 4 hw queues but there are only 2 online
>>>>> CPUs and 6 offline CPUs, at most 2 hw queues are assigned online
>>>>> CPUs, and the other two are left with only offline CPUs.
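
(To make the arithmetic concrete, a minimal sketch, assuming a simple
modulo-style spread, map[cpu] = cpu % nr_queues, for illustration; the
real blk_mq_map_queues() groups CPUs a bit differently, but the effect
is the same:

	4 hw queues, possible CPUs cpu0..cpu7, online CPUs: cpu0, cpu1

	hctx0 <- cpu0, cpu4	(has an online CPU)
	hctx1 <- cpu1, cpu5	(has an online CPU)
	hctx2 <- cpu2, cpu6	(offline CPUs only)
	hctx3 <- cpu3, cpu7	(offline CPUs only)

Any request allocated against hctx2 or hctx3 then hits the path decoded
earlier.)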
>>>>
>>>> That is fine, but in the example I gave below, nr_hw_queues ==
>>>> num_online_cpus, and yet because of the mapping we still have
>>>> unmapped hctxs.
>>>
>>> For FC's case, there may be some hctxs not 'mapped', which is caused
>>> by blk_mq_map_queues(), but that should be considered a bug.
>>>
>>> So the patch(blk-mq: don't keep offline CPUs mapped to hctx 0)[1] is
>>> fixing the issue:
>>>
>>> [1] https://marc.info/?l=linux-block&m=152318093725257&w=2
>>>
>>> Once this patch is in, any hctx should be mapped by at least one CPU.
>>
>> I think this will solve the problem Yi is hitting.
>>
>>> Then later, the patch (blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
>>> extends the mapping concept; maybe it should be renamed to
>>> blk_mq_hw_queue_active(), which I will do in V2.
>>>
>>> [2] https://marc.info/?l=linux-block&m=152318099625268&w=2
>>
>> This is also a good patch.
>>
>> ...
>>
>>>> But when we reset the controller, we call blk_mq_update_nr_hw_queues()
>>>> with the current nr_hw_queues, which never exceeds num_online_cpus.
>>>> This, in turn, remaps the mq_map, which results in unmapped queues
>>>> because of the mapping function, not because we have more hctxs than
>>>> online cpus...
>>>
>>> As I mentioned, num_online_cpus() isn't a stable value; it can
>>> change at any time.
>>
>> Correct, but I'm afraid num_possible_cpus might not work either.
>>
>>> After the patch (blk-mq: don't keep offline CPUs mapped to hctx 0)
>>> is in, there won't be any unmapped queues any more.
>>
>> Yes.
>>
>>>> An easy fix is to allocate num_present_cpus queues and only connect
>>>> the online ones, but as you said, we have unused resources this way.
>>>
>>> Yeah, it should be num_possible_cpus queues, because physical CPU
>>> hotplug needs to be supported for KVM or S390, or even some X86_64
>>> systems.
>>
>> num_present_cpus is a waste of resources (as I said, both on the host
>> and on the target), but num_possible_cpus is even worse, as it covers
>> all cpus that _can_ be populated.
>>
>>>> We also have an issue with blk_mq_rdma_map_queues on the only
>>>> device that supports it, because it doesn't use managed affinity
>>>> (the code was reverted) and can have its irq affinity redirected in
>>>> case of cpu offlining...
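
(For readers following along: blk_mq_rdma_map_queues() builds the
cpu-to-queue map from the RDMA device's per-vector affinity masks. A rough
sketch of its shape around this time, simplified from block/blk-mq-rdma.c
and not verbatim:

	int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
				   struct ib_device *dev, int first_vec)
	{
		const struct cpumask *mask;
		unsigned int queue, cpu;

		for (queue = 0; queue < set->nr_hw_queues; queue++) {
			/* affinity of the completion vector backing this queue */
			mask = ib_get_vector_affinity(dev, first_vec + queue);
			if (!mask)
				goto fallback;
			for_each_cpu(cpu, mask)
				set->mq_map[cpu] = queue;
		}
		return 0;

	fallback:
		return blk_mq_map_queues(set);
	}

Without managed affinity nothing keeps those vector masks stable, so the
resulting map can go stale after irq affinity changes or CPU offlining, as
described above.)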
>>>
>>> That can be one corner case; looks like I have to reconsider the
>>> patch (blk-mq: remove code for dealing with remapping queue)[3],
>>> which may cause a regression for this RDMA case, but I guess CPU
>>> hotplug may break this case easily anyway.
>>>
>>> [3] https://marc.info/?l=linux-block&m=152318100625284&w=2
>>>
>>> Also, this case will make blk-mq's queue mapping much more
>>> complicated; could you provide a link explaining the reason for
>>> reverting managed affinity on this device?
>>
>> The problem was that users reported a regression because
>> /proc/irq/$IRQ/smp_affinity is now immutable. It looks like netdev
>> users change it on a regular basis (and also rely on the irq balancer
>> at times) while nvme users (and other HBAs) do not care about it.
>>
>> Thread starts here:
>> https://www.spinics.net/lists/netdev/msg464301.html
>>
>>> Recently we fixed quite a few issues with managed affinity; maybe
>>> the original issue with RDMA affinity has been addressed already.
>>
>> That is not specific to RDMA affinity; it's because RDMA devices are
>> also network devices, and people want to apply their irq affinity
>> scripts to them like they're used to doing with other devices.
>>
>>>> The goal here, I think, should be to allocate just enough queues
>>>> (not more than the number of online cpus), spread them 1:1 across
>>>> online cpus, and also make sure to allocate completion vectors that
>>>> align with online cpus. I just need to figure out how to do that...
>>>
>>> I think we have to support CPU hotplug, so your goal may be hard to
>>> reach if you don't want to waste memory resources.
>>
>> Well, not so much if I make blk_mq_rdma_map_queues do the right thing?
>>
>> As I said, for the first go, I'd like to fix the mapping for the
>> simple case where we map the queues with some cpus offlined. Having
>> queues added dynamically is a different story, and I agree it would
>> require more work.
>>

Thread overview: 36+ messages

     [not found] <1912441239.17517737.1522396297270.JavaMail.zimbra@redhat.com>
2018-03-30  9:32 ` BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7 Yi Zhang
2018-04-04 13:22   ` Sagi Grimberg
2018-04-05 16:35     ` Yi Zhang
2018-04-08 10:36       ` Sagi Grimberg
2018-04-08 10:44         ` Ming Lei
2018-04-08 10:48           ` Ming Lei
2018-04-08 10:58             ` Sagi Grimberg
2018-04-08 11:04               ` Ming Lei
2018-04-08 11:53                 ` Sagi Grimberg
2018-04-08 12:57                   ` Ming Lei
2018-04-08 13:35                     ` Sagi Grimberg
2018-04-09  2:47                       ` Ming Lei
2018-04-09  8:31                         ` Sagi Grimberg
2018-04-09  8:54                           ` Yi Zhang
2018-04-09  9:05                             ` Yi Zhang [this message]
2018-04-09  9:13                             ` Sagi Grimberg
2018-04-09 12:15                           ` Ming Lei
2018-04-11 13:24                             ` Sagi Grimberg
