On Jul 26 12:09, Klaus Jensen wrote:
> On Jul 26 11:19, Klaus Jensen wrote:
> > On Jul 26 15:55, Jinhao Fan wrote:
> > > at 3:41 PM, Klaus Jensen wrote:
> > > 
> > > > On Jul 26 15:35, Jinhao Fan wrote:
> > > >> at 4:55 AM, Klaus Jensen wrote:
> > > >> 
> > > >>> We have a regression following this patch that we need to address.
> > > >>> 
> > > >>> With this patch, issuing a reset on the device (`nvme reset /dev/nvme0`
> > > >>> will do the trick) causes QEMU to hog my host cpu at 100%.
> > > >>> 
> > > >>> I'm still not sure what causes this. The trace output is a bit
> > > >>> inconclusive still.
> > > >>> 
> > > >>> I'll keep looking into it.
> > > >> 
> > > >> I cannot reproduce this bug. I just started the VM and used `nvme reset
> > > >> /dev/nvme0`. Did you do anything before the reset?
> > > > 
> > > > Interesting and thanks for checking! Looks like a kernel issue then!
> > > > 
> > > > I remember that I'm using a dev branch (nvme-v5.20) of the kernel and
> > > > reverting to a stock OS kernel did not produce the bug.
> > > 
> > > I’m using 5.19-rc4 which I pulled from linux-next on Jul 1. It works ok on
> > > my machine.
> > 
> > Interesting. I can reproduce on 5.19-rc4 from the torvalds tree. Can you
> > drop your qemu command line here?
> > 
> > This is mine.
> > 
> >   /home/kbj/work/src/qemu/build/x86_64-softmmu/qemu-system-x86_64 \
> >     -nodefaults \
> >     -display "none" \
> >     -machine "q35,accel=kvm,kernel-irqchip=split" \
> >     -cpu "host" \
> >     -smp "4" \
> >     -m "8G" \
> >     -device "intel-iommu" \
> >     -netdev "user,id=net0,hostfwd=tcp::2222-:22" \
> >     -device "virtio-net-pci,netdev=net0" \
> >     -device "virtio-rng-pci" \
> >     -drive "id=boot,file=/home/kbj/work/vol/machines/img/nvme.qcow2,format=qcow2,if=virtio,discard=unmap,media=disk,read-only=no" \
> >     -device "pcie-root-port,id=pcie_root_port1,chassis=1,slot=0" \
> >     -device "nvme,id=nvme0,serial=deadbeef,bus=pcie_root_port1,mdts=7" \
> >     -drive "id=null,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
> >     -device "nvme-ns,id=nvm-1,drive=nvm-1,bus=nvme0,nsid=1,drive=null,logical_block_size=4096,physical_block_size=4096" \
> >     -pidfile "/home/kbj/work/vol/machines/run/null/pidfile" \
> >     -kernel "/home/kbj/work/src/kernel/linux/arch/x86_64/boot/bzImage" \
> >     -append "root=/dev/vda1 console=ttyS0,115200 audit=0 intel_iommu=on" \
> >     -virtfs "local,path=/home/kbj/work/src/kernel/linux,security_model=none,readonly=on,mount_tag=kernel_dir" \
> >     -serial "mon:stdio" \
> >     -d "guest_errors" \
> >     -D "/home/kbj/work/vol/machines/log/null/qemu.log" \
> >     -trace "pci_nvme*"
> 
> Alright. It was *some* config issue with my kernel. Reverted to a
> defconfig + requirements and the issue went away.
> 
> And it went away because I didn't include iommu support in that kernel
> (and it's not enabled by default on the stock OS kernel).
> 
> I'll try to track down what happened, but it doesn't look like qemu is
> at fault here.

OK. So. I can continue to reproduce this if the machine has a virtual
intel iommu enabled, and it only happens when this commit is applied. I
even backported this patch (and the shadow doorbell patch) to v7.0 and
v6.2 (i.e. no SRIOV or CC logic changes that could be buggy) and it still
exhibits this behavior.

Sometimes QEMU coredumps on poweroff and I managed to grab one:

  Program terminated with signal SIGSEGV, Segmentation fault.
  #0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
  5720        NvmeCQueue *cq = n->cq[sq->cqid];
  [Current thread is 1 (Thread 0x7f7363553cc0 (LWP 2554896))]
  (gdb) bt
  #0  nvme_process_sq (opaque=0x556329708110) at ../hw/nvme/ctrl.c:5720
  #1  0x0000556326e82e28 in nvme_sq_notifier (e=0x556329708148) at ../hw/nvme/ctrl.c:3993
  #2  0x000055632738396a in aio_dispatch_handler (ctx=0x5563291c3160, node=0x55632a228b60) at ../util/aio-posix.c:329
  #3  0x0000556327383b22 in aio_dispatch_handlers (ctx=0x5563291c3160) at ../util/aio-posix.c:372
  #4  0x0000556327383b78 in aio_dispatch (ctx=0x5563291c3160) at ../util/aio-posix.c:382
  #5  0x000055632739d748 in aio_ctx_dispatch (source=0x5563291c3160, callback=0x0, user_data=0x0) at ../util/async.c:311
  #6  0x00007f7369398163 in g_main_context_dispatch () at /usr/lib64/libglib-2.0.so.0
  #7  0x00005563273af279 in glib_pollfds_poll () at ../util/main-loop.c:232
  #8  0x00005563273af2f6 in os_host_main_loop_wait (timeout=0x1dbe22c0) at ../util/main-loop.c:255
  #9  0x00005563273af404 in main_loop_wait (nonblocking=0x0) at ../util/main-loop.c:531
  #10 0x00005563270714d9 in qemu_main_loop () at ../softmmu/runstate.c:726
  #11 0x0000556326c7ea46 in main (argc=0x2e, argv=0x7ffc6977f198, envp=0x7ffc6977f310) at ../softmmu/main.c:50

At this point, there should not be any CQ/SQs (I detached the device from
the kernel driver, which deletes all queues, and bound it to vfio-pci
instead), but somehow a stale notifier is called on poweroff and the queue
is bogus, causing the segfault.

  (gdb) p cq->cqid
  $2 = 0x7880

My guess would be that we are not cleaning up the notifier properly.
Currently we do this:

  if (cq->ioeventfd_enabled) {
      memory_region_del_eventfd(&n->iomem,
                                0x1000 + offset, 4, false, 0, &cq->notifier);
      event_notifier_cleanup(&cq->notifier);
  }

Any ioeventfd experts who have some insight into what we are doing wrong
here? Something we need to flush? I tried with a test_and_clear on the
eventfd, but that didn't do the trick.

I think we'd need to revert this until we can track down what is going
wrong.
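For reference, the kind of change I had in mind (just a sketch on my
part; I have not verified that it actually prevents the stale dispatch,
and it assumes the handler was registered on the main loop with
event_notifier_set_handler()) would be to unregister the handler before
tearing the notifier down:

  if (cq->ioeventfd_enabled) {
      memory_region_del_eventfd(&n->iomem,
                                0x1000 + offset, 4, false, 0, &cq->notifier);

      /* sketch: detach the handler from the main loop so a dispatch that
       * is already pending cannot call back into this (soon to be freed)
       * queue */
      event_notifier_set_handler(&cq->notifier, NULL);

      event_notifier_cleanup(&cq->notifier);
  }

The SQ side would presumably need the same treatment, given that the
backtrace above goes through nvme_sq_notifier(). But this is speculation
until we understand why the notifier fires at all after the queues have
been deleted, so reverting still looks like the safest option for now.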