From: Shivaprasad G Bhat <sbhat@linux.ibm.com>
To: David Gibson <david@gibson.dropbear.id.au>
Cc: ehabkost@redhat.com, mst@redhat.com, aneesh.kumar@linux.ibm.com,
	bharata@linux.vnet.ibm.com, linux-nvdimm@lists.01.org,
	groug@kaod.org, kvm-ppc@vger.kernel.org, qemu-devel@nongnu.org,
	shivaprasadbhat@gmail.com, qemu-ppc@nongnu.org,
	imammedo@redhat.com, sbhat@linux.vnet.ibm.com,
	xiaoguangrong.eric@gmail.com
Subject: Re: [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall
Date: Mon, 29 Mar 2021 14:53:47 +0530	[thread overview]
Message-ID: <13744465-ca7a-0aaf-5abb-43a70a39c167@linux.ibm.com> (raw)
In-Reply-To: <YFqs8M1dHAFhdCL6@yekko.fritz.box>



On 3/24/21 8:37 AM, David Gibson wrote:
> On Tue, Mar 23, 2021 at 09:47:38AM -0400, Shivaprasad G Bhat wrote:
>> machine vmstate.
>>
>> Signed-off-by: Shivaprasad G Bhat<sbhat@linux.ibm.com>
> An overal question: surely the same issue must arise on x86 with
> file-backed NVDIMMs.  How do they handle this case?

This was discussed in the other threads.

....

>>   };
>> @@ -2997,6 +3000,9 @@ static void spapr_machine_init(MachineState *machine)
>>       }
>>   
>>       qemu_cond_init(&spapr->fwnmi_machine_check_interlock_cond);
>> +    qemu_mutex_init(&spapr->spapr_nvdimm_flush_states_lock);
> Do you actually need an extra mutex, or can you rely on the BQL?

I verified that the BQL is held at all the places where it matters in
the context of this patch.

It is safe to get rid of this extra mutex.
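
For illustration, a minimal sketch of how the completion callback could
look once the mutex is dropped, assuming it always runs in the main loop
with the BQL held (the hcall_ret and node field names are only
illustrative here; qemu_mutex_iothread_locked() is the existing
main-loop helper):

    /* Runs in the main loop, so the BQL already serializes access to
     * the flush state lists; no private mutex needed. */
    static void spapr_nvdimm_flush_completion_cb(void *opaque, int hcall_ret)
    {
        SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
        SpaprNVDIMMDeviceFlushState *state = opaque;

        assert(qemu_mutex_iothread_locked());   /* BQL held */

        QLIST_REMOVE(state, node);              /* off pending_flush_states */
        state->hcall_ret = hcall_ret;
        QLIST_INSERT_HEAD(&spapr->completed_flush_states, state, node);
    }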

...

>
>> +{
>> +     SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +
>> +     return (!QLIST_EMPTY(&spapr->pending_flush_states) ||
>> +             !QLIST_EMPTY(&spapr->completed_flush_states));
>> +}
>> +
>> +static int spapr_nvdimm_pre_save(void *opaque)
>> +{
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +
>> +    while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
>> +        aio_poll(qemu_get_aio_context(), true);
> Hmm... how long could waiting for all the pending flushes to complete
> take?  This could add substanially to the guest's migration downtime,
> couldn't it?


The time taken depends on the number of dirtied pages and the disk I/O
write speed. The number of dirty pages on the host is configurable with
the tunables vm.dirty_background_ratio (10% by default on Fedora 32 and
Ubuntu 20.04) and vm.dirty_ratio (20%) of host memory, and/or
vm.dirty_expire_centisecs (30 seconds), so the host itself flushes the
mmapped file on its own from time to time. For guests using the nvdimms
with a filesystem, flushes are issued frequently and the number of dirty
pages at any point is likely to be small. Pmem applications can also use
the nvdimms without a filesystem; for such guests, the chance that a
flush request arrives right at migration time is small, or at best
random. In either case the host will already have flushed its page cache
whenever vm.dirty_background_ratio was crossed or
vm.dirty_expire_centisecs expired. So the worst case stands at the disk
I/O latency for writing the pages dirtied during the last
vm.dirty_expire_centisecs on the host, or the latency for writing
vm.dirty_background_ratio (10%) of host RAM. If you want me to calibrate
any particular size or scenario and get the numbers, please let me know.
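
As a purely illustrative worst case (hypothetical numbers, not a
measurement): with 256 GiB of host RAM and vm.dirty_background_ratio at
its 10% default, up to ~25 GiB could be dirty at once; at a sustained
write bandwidth of ~1 GiB/s that is on the order of 25 seconds, whereas
a guest that flushes regularly through a filesystem would normally only
leave the last vm.dirty_expire_centisecs (30 seconds) worth of writes
to drain.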

...
>> +
>> +/*
>> + * Acquire a unique token and reserve it for the new flush state.
>> + */
>> +static SpaprNVDIMMDeviceFlushState *spapr_nvdimm_init_new_flush_state(void)
>> +{
>> +    Error *err = NULL;
>> +    uint64_t token;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
>> +    SpaprNVDIMMDeviceFlushState *tmp, *next, *state;
>> +
>> +    state = g_malloc0(sizeof(*state));
>> +
>> +    qemu_mutex_lock(&spapr->spapr_nvdimm_flush_states_lock);
>> +retry:
>> +    if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
> Using getrandom seems like overkill, why not just use a counter?

I didn't want a rogue guest to be able to abuse this: with a guess-able
counter it could consume the return value itself and make the real
driver's subsequent request fail. Also, carrying a global counter over
to the destination across guest migration is another thing to ponder.

Let me know if you want me to reconsider and use a counter.
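
For reference, a sketch of how the collision handling around the random
token fits together (the retry label, the getrandom call and the list
names are from the quoted hunk; the zero-token and duplicate checks are
spelled out here for illustration, and the snippet assumes it runs under
the BQL once the extra mutex is dropped):

    retry:
        if (qemu_guest_getrandom(&token, sizeof(token), &err) < 0) {
            error_report_err(err);
            g_free(state);
            return NULL;            /* caller turns this into an hcall error */
        }

        if (!token) {
            goto retry;             /* zero means "new request" in the hcall ABI */
        }

        /* Pick another token if this one is already in use. */
        QLIST_FOREACH_SAFE(tmp, &spapr->pending_flush_states, node, next) {
            if (tmp->continue_token == token) {
                goto retry;
            }
        }
        QLIST_FOREACH_SAFE(tmp, &spapr->completed_flush_states, node, next) {
            if (tmp->continue_token == token) {
                goto retry;
            }
        }

        state->continue_token = token;
        QLIST_INSERT_HEAD(&spapr->pending_flush_states, state, node);

        return state;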

...

>> mm_flush_states_lock);
>> +
>> +    return state;
>> +}
>> +
>> +/*
>> + * spapr_nvdimm_finish_flushes
>> + *      Waits for all pending flush requests to complete
>> + *      their execution and free the states
>> + */
>> +void spapr_nvdimm_finish_flushes(void)
>> +{
>> +    SpaprNVDIMMDeviceFlushState *state, *next;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> The caller has natural access to the machine, so pass it in rather
> than using the global.

okay
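
Something along these lines (sketch only, assuming the callers in this
series already have the machine state at hand; the node link name is
illustrative):

    /* Take the machine state from the caller instead of the global. */
    void spapr_nvdimm_finish_flushes(SpaprMachineState *spapr)
    {
        SpaprNVDIMMDeviceFlushState *state, *next;

        /* Wait for every in-flight flush worker to finish... */
        while (!QLIST_EMPTY(&spapr->pending_flush_states)) {
            aio_poll(qemu_get_aio_context(), true);
        }

        /* ...then release the completed states nobody will query anymore. */
        QLIST_FOREACH_SAFE(state, &spapr->completed_flush_states, node, next) {
            QLIST_REMOVE(state, node);
            g_free(state);
        }
    }

spapr_nvdimm_get_flush_status() below can take the same treatment.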

...

>> +
>> +/*
>> + * spapr_nvdimm_get_hcall_status
>> + *      Fetches the status of the hcall worker and returns H_BUSY
>> + *      if the worker is still running.
>> + */
>> +static int spapr_nvdimm_get_flush_status(uint64_t token)
>> +{
>> +    int ret = H_LONG_BUSY_ORDER_10_MSEC;
>> +    SpaprMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
> The callers have natural access to spapr, so pass it in rather than
> using the global.

Okay

...

>> +
>> +/*
>> + * H_SCM_FLUSH
>> + * Input: drc_index, continue-token
>> + * Out: continue-token
>> + * Return Value: H_SUCCESS, H_Parameter, H_P2, H_BUSY
>> + *
>> + * Given a DRC Index Flush the data to backend NVDIMM device.
>> + * The hcall returns H_BUSY when the flush takes longer time and the hcall
> It returns one of the H_LONG_BUSY values, not actual H_BUSY, doesn't
> it?

Yes. I thought it was okay to call it just H_BUSY in a generic way.
Will fix it.


>> + * needs to be issued multiple times in order to be completely serviced.
>> +        }
>> +
>> +        return ret;
>> +    }
>> +
>> +    dimm = PC_DIMM(drc->dev);
>> +    backend = MEMORY_BACKEND(dimm->hostmem);
>> +
>> +    state = spapr_nvdimm_init_new_flush_state();
>> +    if (!state) {
>> +        return H_P2;
> AFAICT the only way init_new_flush_state() fails is a failure in the
> RNG, which definitely isn't a parameter problem.

Will change it to H_HARDWARE.


>> +    }
>> +
>> +    state->backend_fd = memory_region_get_fd(&backend->mr);
> Is this guaranteed to return a usable fd in all configurations?

Right, for memory-backend-ram this won't work. I think we should also
not set hcall-flush-required for memory-backend-ram. Will fix this.
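
A sketch of that fix, under the assumption that the property is emitted
from the device-tree population code (fdt, child_offset and nvdimm being
the locals there) and that the property name stays as introduced in this
series; memory_region_get_fd() returns -1 when there is no file-backed
RAM block behind the region:

    /* Only advertise the flush hcall for file-backed DIMMs; there is
     * nothing to fsync for memory-backend-ram. */
    PCDIMMDevice *dimm = PC_DIMM(nvdimm);
    HostMemoryBackend *backend = MEMORY_BACKEND(dimm->hostmem);

    if (memory_region_get_fd(&backend->mr) >= 0) {
        _FDT(fdt_setprop(fdt, child_offset, "ibm,hcall-flush-required",
                         NULL, 0));
    }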

>> +    thread_pool_submit_aio(pool, flush_worker_cb, state,
>> +                           spapr_nvdimm_flush_completion_cb, state);
>> +
>> +    ret = spapr_nvdimm_get_flush_status(state->continue_token);
>> +    if (H_IS_LONG_BUSY(ret)) {
>> +        args[0] = state->continue_token;
>> +    }
>> +
>> +    return ret;
> I believe you can rearrange this so the get_flush_status / check /
> return is shared between the args[0] == 0 and args[0] == token paths.
okay.
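
Roughly like this (sketch only, reusing the names from the quoted hunk
and assuming continue_token was read from args[1] at the top of
h_scm_flush()):

    if (continue_token == 0) {
        dimm = PC_DIMM(drc->dev);
        backend = MEMORY_BACKEND(dimm->hostmem);

        state = spapr_nvdimm_init_new_flush_state();
        if (!state) {
            return H_HARDWARE;
        }

        state->backend_fd = memory_region_get_fd(&backend->mr);
        thread_pool_submit_aio(pool, flush_worker_cb, state,
                               spapr_nvdimm_flush_completion_cb, state);
        continue_token = state->continue_token;
    }

    /* Single exit path shared by the new-request and continue-token cases. */
    ret = spapr_nvdimm_get_flush_status(continue_token);
    if (H_IS_LONG_BUSY(ret)) {
        args[0] = continue_token;
    }

    return ret;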

Thanks,

Shiva




Thread overview: 13+ messages
2021-03-23 13:47 [PATCH v3 0/3] spapr: nvdimm: Enable sync-dax property for nvdimm Shivaprasad G Bhat
2021-03-23 13:47 ` [PATCH v3 1/3] spapr: nvdimm: Forward declare and move the definitions Shivaprasad G Bhat
2021-03-24  2:30   ` David Gibson
2021-03-23 13:47 ` [PATCH v3 2/3] spapr: nvdimm: Implement H_SCM_FLUSH hcall Shivaprasad G Bhat
2021-03-24  3:07   ` David Gibson
2021-03-24  4:04     ` Aneesh Kumar K.V
2021-03-25  1:51       ` David Gibson
2021-03-26 13:45         ` Shivaprasad G Bhat
2021-03-29  9:23     ` Shivaprasad G Bhat [this message]
2021-03-30 23:57       ` David Gibson
2021-03-23 13:47 ` [PATCH v3 3/3] spapr: nvdimm: Enable sync-dax device property for nvdimm Shivaprasad G Bhat
2021-03-24  3:09   ` David Gibson
2021-03-24  4:09     ` Aneesh Kumar K.V
