All of lore.kernel.org
 help / color / mirror / Atom feed
From: Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: David Hildenbrand <david-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Cc: kwolf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	jack-AlSwsSmVLrQ@public.gmane.org,
	xiaoguangrong eric
	<xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>,
	kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	riel-ebMLmSuQjDVBDgjK7y7TUQ@public.gmane.org,
	qemu-devel-qX2TKyscuCcdnm+yROfE0A@public.gmane.org,
	linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org,
	mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	ross zwisler
	<ross.zwisler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Luiz Capitulino
	<lcapitulino-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org,
	pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	Stefan Hajnoczi
	<stefanha-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	niteshnarayanlal-PkbjNfxxIARBDgjK7y7TUQ@public.gmane.org,
	imammedo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org,
	nilal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org
Subject: Re: [Qemu-devel] [RFC v3] qemu: Add virtio pmem device
Date: Fri, 20 Jul 2018 09:02:46 -0400 (EDT)	[thread overview]
Message-ID: <1935251498.52851607.1532091766703.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <b6ef19f3-7f16-5427-bfed-f352a76e48b7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>


> >>>>  /*
> >>>>   * virtio-balloon-pci: This extends VirtioPCIProxy.
> >>>>   */
> >>>> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> >>>> new file mode 100644
> >>>> index 0000000000..08c96d7e80
> >>>> --- /dev/null
> >>>> +++ b/hw/virtio/virtio-pmem.c
> >>>> @@ -0,0 +1,241 @@
> >>>> +/*
> >>>> + * Virtio pmem device
> >>>> + *
> >>>> + * Copyright (C) 2018 Red Hat, Inc.
> >>>> + * Copyright (C) 2018 Pankaj Gupta <pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> >>>> + *
> >>>> + * This work is licensed under the terms of the GNU GPL, version 2.
> >>>> + * See the COPYING file in the top-level directory.
> >>>> + *
> >>>> + */
> >>>> +
> >>>> +#include "qemu/osdep.h"
> >>>> +#include "qapi/error.h"
> >>>> +#include "qemu-common.h"
> >>>> +#include "qemu/error-report.h"
> >>>> +#include "hw/virtio/virtio-access.h"
> >>>> +#include "hw/virtio/virtio-pmem.h"
> >>>> +#include "hw/mem/memory-device.h"
> >>>> +#include "block/aio.h"
> >>>> +#include "block/thread-pool.h"
> >>>> +
> >>>> +typedef struct VirtIOPMEMresp {
> >>>> +    int ret;
> >>>> +} VirtIOPMEMResp;
> >>>> +
> >>>> +typedef struct VirtIODeviceRequest {
> >>>> +    VirtQueueElement elem;
> >>>> +    int fd;
> >>>> +    VirtIOPMEM *pmem;
> >>>> +    VirtIOPMEMResp resp;
> >>>> +} VirtIODeviceRequest;
> >>>> +
> >>>> +static int worker_cb(void *opaque)
> >>>> +{
> >>>> +    VirtIODeviceRequest *req = opaque;
> >>>> +    int err = 0;
> >>>> +
> >>>> +    /* flush raw backing image */
> >>>> +    err = fsync(req->fd);
> >>>> +    if (err != 0) {
> >>>> +        err = errno;
> >>>> +    }
> >>>> +    req->resp.ret = err;
> >>>
> >>> Host question: are you returning the guest errno code to the host?
> >>
> >> No. I am returning error code from the host in-case of host fsync
> >> failure, otherwise returning zero.
> > 
> > I think that's what Luiz meant.  errno constants are not portable
> > between operating systems and architectures.  Therefore they cannot be
> > used in external interfaces in software that expects to communicate with
> > other systems.
> > 
> > It will be necessary to define specific constants for virtio-pmem
> > instead of passing errno from the host to guest.
> > 
> 
> In general, I wonder if we should report errors at all or rather *kill*
> the guest. That might sound harsh, but think about the following scenario:
> 
> fsync() fails due to some block that cannot e.g. be written (e.g.
> network connection failed). What happens if our guest tries to
> read/write that mmaped block? (e.g. network connection failed).
> 
> I assume we'll get a signal an get killed? So we are trying to optimize
> one special case (fsync()) although every read/write is prone to kill
> the guest. And as soon as the guest will try to access the block that
> made fsync fail, we will crash the guest either way.
> 
> I assume the main problem is that we are trying to take a file (with all
> the errors that can happen during read/write/fsync) and make it look
> like memory (dax). On ordinary block access, we can forward errors, but
> not if it's memory (maybe using MCE, but it's complicated and
> architecture specific).

There are two points which you highlighted:

1] Memory hardware errors:
These type of errors will be notified by MCA. If mce is non-recoverable, KVM gets 
SIG_BUS when hardware detects such error and injects mce in guest vCPU. If guest 
does not recoverable it can decide to kill the user-space process. 

Default option for mce is '1':
1: panic or SIGBUS on uncorrected errors, log corrected errors

2] read/write/fsync failure because of (network connection failure):
I assume you are talking about something like NFS mount where read/write/fsync
responsibility is taken care by NFS. This scenario can happen for any application
accessing a network filesystem and return appropriate error or wait. Until 'fsync' 
is not performed there is no guarantee ram data is backed. I think its
the responsibility of application to perform fsync after write operation or 
a transaction.
 
> 
> So I wonder if we should rather assume that our backend file is placed
> on some stable storage that cannot easily fail.
> 
> (we might have the same problem with NVDIMM right now, at least the
> memory reading/writing part)

NVDIMM NFIT handles this handler and checks if any SPA falls in the range
of mce:address. It creates a list of bad blocks(corresponding to nd_region) and handle 
in function 'pmem_do_bvec' used by 'pmem_mem_request' & 'pmem_read_write'.

void nfit_mce_register(void)
{
        mce_register_decode_chain(&nfit_mce_dec);
}

In 'fake DAX', we bypass NFIT ACPI and using virtio & nvdimm_bus way of registering
memory region. By default it should kill the userspace process or at worst cause guest reboot.
I am thinking how we can integrate the NFIT bad block handling with mce handler approach
for fake DAX. I think we can do this. But I want inputs from NVDIMM guys?

Thanks,
Pankaj

> 
> It's complicated and I am not a block level expert :)

WARNING: multiple messages have this Message-ID (diff)
From: Pankaj Gupta <pagupta@redhat.com>
To: David Hildenbrand <david@redhat.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>,
	Luiz Capitulino <lcapitulino@redhat.com>,
	kwolf@redhat.com, dan j williams <dan.j.williams@intel.com>,
	jack@suse.cz, xiaoguangrong eric <xiaoguangrong.eric@gmail.com>,
	kvm@vger.kernel.org, riel@surriel.com, linux-nvdimm@ml01.01.org,
	ross zwisler <ross.zwisler@intel.com>,
	linux-kernel@vger.kernel.org, qemu-devel@nongnu.org,
	hch@infradead.org, imammedo@redhat.com, mst@redhat.com,
	niteshnarayanlal@hotmail.com, pbonzini@redhat.com,
	nilal@redhat.com
Subject: Re: [Qemu-devel] [RFC v3] qemu: Add virtio pmem device
Date: Fri, 20 Jul 2018 09:02:46 -0400 (EDT)	[thread overview]
Message-ID: <1935251498.52851607.1532091766703.JavaMail.zimbra@redhat.com> (raw)
In-Reply-To: <b6ef19f3-7f16-5427-bfed-f352a76e48b7@redhat.com>


> >>>>  /*
> >>>>   * virtio-balloon-pci: This extends VirtioPCIProxy.
> >>>>   */
> >>>> diff --git a/hw/virtio/virtio-pmem.c b/hw/virtio/virtio-pmem.c
> >>>> new file mode 100644
> >>>> index 0000000000..08c96d7e80
> >>>> --- /dev/null
> >>>> +++ b/hw/virtio/virtio-pmem.c
> >>>> @@ -0,0 +1,241 @@
> >>>> +/*
> >>>> + * Virtio pmem device
> >>>> + *
> >>>> + * Copyright (C) 2018 Red Hat, Inc.
> >>>> + * Copyright (C) 2018 Pankaj Gupta <pagupta@redhat.com>
> >>>> + *
> >>>> + * This work is licensed under the terms of the GNU GPL, version 2.
> >>>> + * See the COPYING file in the top-level directory.
> >>>> + *
> >>>> + */
> >>>> +
> >>>> +#include "qemu/osdep.h"
> >>>> +#include "qapi/error.h"
> >>>> +#include "qemu-common.h"
> >>>> +#include "qemu/error-report.h"
> >>>> +#include "hw/virtio/virtio-access.h"
> >>>> +#include "hw/virtio/virtio-pmem.h"
> >>>> +#include "hw/mem/memory-device.h"
> >>>> +#include "block/aio.h"
> >>>> +#include "block/thread-pool.h"
> >>>> +
> >>>> +typedef struct VirtIOPMEMresp {
> >>>> +    int ret;
> >>>> +} VirtIOPMEMResp;
> >>>> +
> >>>> +typedef struct VirtIODeviceRequest {
> >>>> +    VirtQueueElement elem;
> >>>> +    int fd;
> >>>> +    VirtIOPMEM *pmem;
> >>>> +    VirtIOPMEMResp resp;
> >>>> +} VirtIODeviceRequest;
> >>>> +
> >>>> +static int worker_cb(void *opaque)
> >>>> +{
> >>>> +    VirtIODeviceRequest *req = opaque;
> >>>> +    int err = 0;
> >>>> +
> >>>> +    /* flush raw backing image */
> >>>> +    err = fsync(req->fd);
> >>>> +    if (err != 0) {
> >>>> +        err = errno;
> >>>> +    }
> >>>> +    req->resp.ret = err;
> >>>
> >>> Host question: are you returning the guest errno code to the host?
> >>
> >> No. I am returning error code from the host in-case of host fsync
> >> failure, otherwise returning zero.
> > 
> > I think that's what Luiz meant.  errno constants are not portable
> > between operating systems and architectures.  Therefore they cannot be
> > used in external interfaces in software that expects to communicate with
> > other systems.
> > 
> > It will be necessary to define specific constants for virtio-pmem
> > instead of passing errno from the host to guest.
> > 
> 
> In general, I wonder if we should report errors at all or rather *kill*
> the guest. That might sound harsh, but think about the following scenario:
> 
> fsync() fails due to some block that cannot e.g. be written (e.g.
> network connection failed). What happens if our guest tries to
> read/write that mmaped block? (e.g. network connection failed).
> 
> I assume we'll get a signal an get killed? So we are trying to optimize
> one special case (fsync()) although every read/write is prone to kill
> the guest. And as soon as the guest will try to access the block that
> made fsync fail, we will crash the guest either way.
> 
> I assume the main problem is that we are trying to take a file (with all
> the errors that can happen during read/write/fsync) and make it look
> like memory (dax). On ordinary block access, we can forward errors, but
> not if it's memory (maybe using MCE, but it's complicated and
> architecture specific).

There are two points which you highlighted:

1] Memory hardware errors:
These type of errors will be notified by MCA. If mce is non-recoverable, KVM gets 
SIG_BUS when hardware detects such error and injects mce in guest vCPU. If guest 
does not recoverable it can decide to kill the user-space process. 

Default option for mce is '1':
1: panic or SIGBUS on uncorrected errors, log corrected errors

2] read/write/fsync failure because of (network connection failure):
I assume you are talking about something like NFS mount where read/write/fsync
responsibility is taken care by NFS. This scenario can happen for any application
accessing a network filesystem and return appropriate error or wait. Until 'fsync' 
is not performed there is no guarantee ram data is backed. I think its
the responsibility of application to perform fsync after write operation or 
a transaction.
 
> 
> So I wonder if we should rather assume that our backend file is placed
> on some stable storage that cannot easily fail.
> 
> (we might have the same problem with NVDIMM right now, at least the
> memory reading/writing part)

NVDIMM NFIT handles this handler and checks if any SPA falls in the range
of mce:address. It creates a list of bad blocks(corresponding to nd_region) and handle 
in function 'pmem_do_bvec' used by 'pmem_mem_request' & 'pmem_read_write'.

void nfit_mce_register(void)
{
        mce_register_decode_chain(&nfit_mce_dec);
}

In 'fake DAX', we bypass NFIT ACPI and using virtio & nvdimm_bus way of registering
memory region. By default it should kill the userspace process or at worst cause guest reboot.
I am thinking how we can integrate the NFIT bad block handling with mce handler approach
for fake DAX. I think we can do this. But I want inputs from NVDIMM guys?

Thanks,
Pankaj

> 
> It's complicated and I am not a block level expert :)

  parent reply	other threads:[~2018-07-20 13:02 UTC|newest]

Thread overview: 60+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-07-13  7:52 [RFC v3 0/2] kvm "fake DAX" device flushing Pankaj Gupta
2018-07-13  7:52 ` [Qemu-devel] " Pankaj Gupta
2018-07-13  7:52 ` Pankaj Gupta
2018-07-13  7:52 ` [RFC v3 1/2] libnvdimm: Add flush callback for virtio pmem Pankaj Gupta
2018-07-13  7:52   ` [Qemu-devel] " Pankaj Gupta
2018-07-13 20:35   ` Luiz Capitulino
2018-07-13 20:35     ` [Qemu-devel] " Luiz Capitulino
2018-07-16  8:13     ` Pankaj Gupta
2018-07-16  8:13       ` [Qemu-devel] " Pankaj Gupta
     [not found] ` <20180713075232.9575-1-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-13  7:52   ` [RFC v3 2/2] virtio-pmem: Add virtio pmem driver Pankaj Gupta
2018-07-13  7:52     ` [Qemu-devel] " Pankaj Gupta
2018-07-13  7:52     ` Pankaj Gupta
     [not found]     ` <20180713075232.9575-3-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-13 20:38       ` Luiz Capitulino
2018-07-13 20:38         ` [Qemu-devel] " Luiz Capitulino
2018-07-13 20:38         ` Luiz Capitulino
2018-07-16 11:46         ` [Qemu-devel] " Pankaj Gupta
2018-07-16 11:46           ` Pankaj Gupta
     [not found]           ` <633297685.51039804.1531741590092.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-16 14:03             ` Luiz Capitulino
2018-07-16 14:03               ` Luiz Capitulino
2018-07-16 15:11               ` Pankaj Gupta
2018-07-16 15:11                 ` Pankaj Gupta
2018-07-17 13:11     ` Stefan Hajnoczi
2018-07-17 13:11       ` [Qemu-devel] " Stefan Hajnoczi
     [not found]       ` <20180717131156.GA13498-lxVrvc10SDRcolVlb+j0YCZi+YwRKgec@public.gmane.org>
2018-07-18  7:05         ` Pankaj Gupta
2018-07-18  7:05           ` [Qemu-devel] " Pankaj Gupta
2018-07-18  7:05           ` Pankaj Gupta
2018-08-28 12:13   ` [RFC v3 0/2] kvm "fake DAX" device flushing David Hildenbrand
2018-08-28 12:13     ` [Qemu-devel] " David Hildenbrand
2018-08-28 12:13     ` David Hildenbrand
     [not found]     ` <1328e543-0276-8f33-1744-8baa053023c4-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-08-28 12:39       ` [Qemu-devel] " Pankaj Gupta
2018-08-28 12:39         ` Pankaj Gupta
2018-07-13  7:52 ` [RFC v3] qemu: Add virtio pmem device Pankaj Gupta
2018-07-13  7:52   ` [Qemu-devel] " Pankaj Gupta
     [not found]   ` <20180713075232.9575-4-pagupta-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-18 12:55     ` Luiz Capitulino
2018-07-18 12:55       ` [Qemu-devel] " Luiz Capitulino
2018-07-18 12:55       ` Luiz Capitulino
2018-07-19  5:48       ` [Qemu-devel] " Pankaj Gupta
2018-07-19 12:16         ` Stefan Hajnoczi
     [not found]           ` <20180719121635.GA28107-lxVrvc10SDRcolVlb+j0YCZi+YwRKgec@public.gmane.org>
2018-07-19 12:48             ` Luiz Capitulino
2018-07-19 12:48               ` Luiz Capitulino
2018-07-19 12:57               ` Luiz Capitulino
2018-07-20 13:04               ` Pankaj Gupta
2018-07-20 13:04                 ` Pankaj Gupta
2018-07-19 13:58             ` David Hildenbrand
2018-07-19 13:58               ` David Hildenbrand
     [not found]               ` <b6ef19f3-7f16-5427-bfed-f352a76e48b7-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-19 15:48                 ` Luiz Capitulino
2018-07-19 15:48                   ` Luiz Capitulino
2018-07-20 13:02                 ` Pankaj Gupta [this message]
2018-07-20 13:02                   ` Pankaj Gupta
     [not found]         ` <367397176.52317488.1531979293251.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-19 12:39           ` Luiz Capitulino
2018-07-19 12:39             ` Luiz Capitulino
2018-07-24 16:13     ` Eric Blake
2018-07-24 16:13       ` [Qemu-devel] " Eric Blake
2018-07-24 16:13       ` Eric Blake
     [not found]       ` <783786ae-2e85-2376-448c-1e362c3d4d48-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-25  5:01         ` [Qemu-devel] " Pankaj Gupta
2018-07-25  5:01           ` Pankaj Gupta
     [not found]           ` <399916154.53931292.1532494899706.JavaMail.zimbra-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-25 12:19             ` Eric Blake
2018-07-25 12:19               ` Eric Blake
     [not found]               ` <d3fe397a-024d-faf7-8854-bb8e9ea17f53-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-07-25 12:47                 ` Pankaj Gupta
2018-07-25 12:47                   ` Pankaj Gupta

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1935251498.52851607.1532091766703.JavaMail.zimbra@redhat.com \
    --to=pagupta-h+wxahxf7alqt0dzr+alfa@public.gmane.org \
    --cc=david-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=hch-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org \
    --cc=imammedo-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=jack-AlSwsSmVLrQ@public.gmane.org \
    --cc=kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=kwolf-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=lcapitulino-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=linux-nvdimm-y27Ovi1pjclAfugRpC6u6w@public.gmane.org \
    --cc=mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=nilal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=niteshnarayanlal-PkbjNfxxIARBDgjK7y7TUQ@public.gmane.org \
    --cc=pbonzini-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=qemu-devel-qX2TKyscuCcdnm+yROfE0A@public.gmane.org \
    --cc=riel-ebMLmSuQjDVBDgjK7y7TUQ@public.gmane.org \
    --cc=ross.zwisler-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org \
    --cc=stefanha-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org \
    --cc=xiaoguangrong.eric-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.