From: Jinhao Fan <fanjinhao21s@ict.ac.cn>
To: Klaus Jensen <its@irrelevant.dk>
Cc: Keith Busch <kbusch@kernel.org>, qemu-devel@nongnu.org
Subject: Re: [PATCH v4] hw/nvme: Use ioeventfd to handle doorbell updates
Date: Thu, 7 Jul 2022 16:50:39 +0800
Message-ID: <F8469361-6748-4213-BE46-B68AC8078CC4@ict.ac.cn>
In-Reply-To: <YsZ0eZ7xJtbxgLS8@apples>

at 1:51 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> On Jul  6 19:34, Jinhao Fan wrote:
>> at 2:43 AM, Keith Busch <kbusch@kernel.org> wrote:
>> 
>>> On Tue, Jul 05, 2022 at 07:11:36PM +0200, Klaus Jensen wrote:
>>>> On Jul  5 22:24, Jinhao Fan wrote:
>>>>> @@ -1374,7 +1374,14 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>>>>> 
>>>>>    QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>>>>>    QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>>>>> -    timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>>>> +
>>>>> +    if (req->sq->ioeventfd_enabled) {
>>>>> +        /* Post CQE directly since we are in main loop thread */
>>>>> +        nvme_post_cqes(cq);
>>>>> +    } else {
>>>>> +        /* Schedule the timer to post CQE later since we are in vcpu thread */
>>>>> +        timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>>>> +    }
>>>> 
>>>> Actually, we are only in the vcpu thread if we come here from
>>>> nvme_process_db, which in very rare circumstances may enqueue the
>>>> completion of an AER due to an invalid doorbell write.
>>>> 
>>>> In general, nvme_enqueue_req_completion is only ever called from the
>>>> main iothread, which actually causes me to wonder why we defer this
>>>> work in the first place. It does have the benefit that we queue up
>>>> several completions before posting them in one go and raising the
>>>> interrupt. But I wonder if that could be handled better.
>>> 
>>> I think the timer is used because of the cq_full condition. We need to restart
>>> completions when it becomes not full, which requires a doorbell write. Having
>>> everyone from the main iothread use the same timer as the doorbell handler just
>>> ensures single-threaded list access.
>> 
>> Could we let nvme_process_aers register another timer/BH to trigger
>> nvme_enqueue_req_completion in the iothread? In this way we won’t need the
>> timer_mod in nvme_enqueue_req_completion.
> 
> Yes, we could have process_aers in a timer, which would probably be
> preferable in order to limit the amount of work the mmio handler is
> doing in that rare case. However, it's such a rare case (only misbehaving
> drivers) that it's probably not worth optimizing for.

I think putting nvme_process_aers in a timer can help make sure
nvme_enqueue_req_completion is only called in an iothread context. Currently
it can be called either in the iothread or in a vcpu thread, so
nvme_enqueue_req_completion has to infer its context, which my patch does
using the cq->ioeventfd_enabled flag. This approach is probably error-prone.
Honestly, I am not sure whether cq->ioeventfd_enabled is enough to guarantee
we are in the iothread.
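
To make this concrete, I imagine something along these lines (an untested
sketch; the aer_timer field and the callback name are made up here):

static void nvme_aer_timer_cb(void *opaque)
{
    NvmeCtrl *n = opaque;

    /* This callback always runs in the main loop thread, so any
     * completions enqueued from here cannot race with vcpu threads. */
    nvme_process_aers(n);
}

/* at controller init (hypothetical field n->aer_timer): */
n->aer_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_aer_timer_cb, n);

/* in nvme_process_db(), instead of calling nvme_process_aers(n)
 * directly from the vcpu thread: */
timer_mod(n->aer_timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL));

With that, nvme_enqueue_req_completion would never be entered from a vcpu
thread, and the ioeventfd_enabled check could go away.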

>> We can also avoid some potential concurrency problems because the CQ is only
>> modified in the iothread.
> 
> There are currently no concurrency problems because of the Big QEMU
> Lock. When the mmio handler is running, the vcpu holds the BQL (and
> whenever the main iothread is running, it is holding the BQL).

Will this still hold when we move on to iothreads?

> 
>> BTW, is there any reason that we must use timers (not BHs) here? Also, why
>> do we choose to delay for 500ns?
> 
> No particular reason. I do not see any reason why this could not be bottom
> halves. This will likely change into BHs when we add iothread support
> anyway.

I will try BHs when I start the iothread work.
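
As a first cut, something like this might do (untested sketch; cq->bh is a
hypothetical field next to the existing cq->timer):

/* at CQ creation, where cq->timer is set up today: */
cq->bh = qemu_bh_new(nvme_post_cqes, cq);

/* in nvme_enqueue_req_completion(), replacing the timer_mod() call: */
qemu_bh_schedule(cq->bh);

A BH would still batch whatever has been queued so far, but would run on the
next main loop iteration instead of after an arbitrary 500ns delay.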




Thread overview: 24+ messages
2022-07-05 14:24 [PATCH v4] hw/nvme: Use ioeventfd to handle doorbell updates Jinhao Fan
2022-07-05 17:11 ` Klaus Jensen
2022-07-05 18:43   ` Keith Busch
2022-07-06 11:34     ` Jinhao Fan
2022-07-07  5:51       ` Klaus Jensen
2022-07-07  8:50         ` Jinhao Fan [this message]
2022-07-06 10:57   ` Jinhao Fan
2022-07-09  3:06 ` Jinhao Fan
2022-07-12 12:23   ` Klaus Jensen
2022-07-14  5:35     ` Klaus Jensen
2022-07-25 20:55 ` Klaus Jensen
2022-07-26  7:35   ` Jinhao Fan
2022-07-26  7:41     ` Klaus Jensen
2022-07-26  7:55       ` Jinhao Fan
2022-07-26  9:19         ` Klaus Jensen
2022-07-26 10:09           ` Klaus Jensen
2022-07-26 11:24             ` Klaus Jensen
2022-07-26 11:32               ` Klaus Jensen
2022-07-26 12:08                 ` Klaus Jensen
2022-07-27  7:06                   ` Klaus Jensen
2022-07-27  8:16                     ` Jinhao Fan
2022-07-27  8:21                       ` Klaus Jensen
2022-07-27 15:09                         ` Jinhao Fan
2022-07-26 12:35 ` Stefan Hajnoczi
