linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Kenneth Lee <liguozhu@hisilicon.com>
To: Jerome Glisse <jglisse@redhat.com>
Cc: Kenneth Lee <nek.in.cn@gmail.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	"David S . Miller" <davem@davemloft.net>,
	Joerg Roedel <joro@8bytes.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	Hao Fang <fanghao11@huawei.com>,
	Zhou Wang <wangzhou1@hisilicon.com>,
	Zaibo Xu <xuzaibo@huawei.com>,
	Philippe Ombredanne <pombredanne@nexb.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Thomas Gleixner <tglx@linutronix.de>, <linux-doc@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-crypto@vger.kernel.org>,
	<iommu@lists.linux-foundation.org>, <kvm@vger.kernel.org>,
	<linux-accelerators@lists.ozlabs.org>,
	Lu Baolu <baolu.lu@linux.intel.com>,
	Sanjay Kumar <sanjay.k.kumar@intel.com>, <linuxarm@huawei.com>
Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
Date: Fri, 21 Sep 2018 18:03:14 +0800	[thread overview]
Message-ID: <20180921100314.GH207969@Turing-Arch-b> (raw)
In-Reply-To: <20180917014244.GA27596@redhat.com>

On Sun, Sep 16, 2018 at 09:42:44PM -0400, Jerome Glisse wrote:
> Received: from POPSCN.huawei.com [10.3.17.45] by Turing-Arch-b with POP3
>  (fetchmail-6.3.26) for <kenny@localhost> (single-drop); Mon, 17 Sep 2018
>  09:45:02 +0800 (CST)
> Received: from DGGEMM406-HUB.china.huawei.com (10.3.20.214) by
>  dggeml421-hub.china.huawei.com (10.1.199.38) with Microsoft SMTP Server
>  (TLS) id 14.3.399.0; Mon, 17 Sep 2018 09:43:07 +0800
> Received: from dggwg01-in.huawei.com (172.30.65.32) by
>  DGGEMM406-HUB.china.huawei.com (10.3.20.214) with Microsoft SMTP Server id
>  14.3.399.0; Mon, 17 Sep 2018 09:43:00 +0800
> Received: from mx1.redhat.com (unknown [209.132.183.28])	by Forcepoint
>  Email with ESMTPS id A15E04AB7D1C3;	Mon, 17 Sep 2018 09:42:56 +0800 (CST)
> Received: from smtp.corp.redhat.com
>  (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.26])	(using
>  TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client
>  certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id
>  EC621308212D;	Mon, 17 Sep 2018 01:42:52 +0000 (UTC)
> Received: from redhat.com (ovpn-121-3.rdu2.redhat.com [10.10.121.3])	by
>  smtp.corp.redhat.com (Postfix) with ESMTPS id 8874530912F4;	Mon, 17 Sep
>  2018 01:42:46 +0000 (UTC)
> Date: Sun, 16 Sep 2018 21:42:44 -0400
> From: Jerome Glisse <jglisse@redhat.com>
> To: Kenneth Lee <nek.in.cn@gmail.com>
> CC: Jonathan Corbet <corbet@lwn.net>, Herbert Xu
>  <herbert@gondor.apana.org.au>, "David S . Miller" <davem@davemloft.net>,
>  Joerg Roedel <joro@8bytes.org>, Alex Williamson
>  <alex.williamson@redhat.com>, Kenneth Lee <liguozhu@hisilicon.com>, Hao
>  Fang <fanghao11@huawei.com>, Zhou Wang <wangzhou1@hisilicon.com>, Zaibo Xu
>  <xuzaibo@huawei.com>, Philippe Ombredanne <pombredanne@nexb.com>, Greg
>  Kroah-Hartman <gregkh@linuxfoundation.org>, Thomas Gleixner
>  <tglx@linutronix.de>, linux-doc@vger.kernel.org,
>  linux-kernel@vger.kernel.org, linux-crypto@vger.kernel.org,
>  iommu@lists.linux-foundation.org, kvm@vger.kernel.org,
>  linux-accelerators@lists.ozlabs.org, Lu Baolu <baolu.lu@linux.intel.com>,
>  Sanjay Kumar <sanjay.k.kumar@intel.com>, linuxarm@huawei.com
> Subject: Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive
> Message-ID: <20180917014244.GA27596@redhat.com>
> References: <20180903005204.26041-1-nek.in.cn@gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
> Content-Disposition: inline
> Content-Transfer-Encoding: 8bit
> In-Reply-To: <20180903005204.26041-1-nek.in.cn@gmail.com>
> User-Agent: Mutt/1.10.1 (2018-07-13)
> X-Scanned-By: MIMEDefang 2.84 on 10.5.11.26
> X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
>  (mx1.redhat.com [10.5.110.42]); Mon, 17 Sep 2018 01:42:53 +0000 (UTC)
> Return-Path: jglisse@redhat.com
> X-MS-Exchange-Organization-AuthSource: DGGEMM406-HUB.china.huawei.com
> X-MS-Exchange-Organization-AuthAs: Anonymous
> MIME-Version: 1.0
> 
> So i want to summarize issues i have as this threads have dig deep into
> details. For this i would like to differentiate two cases first the easy
> one when relying on SVA/SVM. Then the second one when there is no SVA/SVM.
> In both cases your objectives as i understand them:
> 
> [R1]- expose a common user space API that make it easy to share boiler
>       plate code accross many devices (discovering devices, opening
>       device, creating context, creating command queue ...).
> [R2]- try to share the device as much as possible up to device limits
>       (number of independant queues the device has)
> [R3]- minimize syscall by allowing user space to directly schedule on the
>       device queue without a round trip to the kernel
> 
> I don't think i missed any.
> 
> 
> (1) Device with SVA/SVM
> 
> For that case it is easy, you do not need to be in VFIO or part of any
> thing specific in the kernel. There is no security risk (modulo bug in
> the SVA/SVM silicon). Fork/exec is properly handle and binding a process
> to a device is just couple dozen lines of code.
> 
> 
> (2) Device does not have SVA/SVM (or it is disabled)
> 
> You want to still allow device to be part of your framework. However
> here i see fundamentals securities issues and you move the burden of
> being careful to user space which i think is a bad idea. We should
> never trus the userspace from kernel space.
> 
> To keep the same API for the user space code you want a 1:1 mapping
> between device physical address and process virtual address (ie if
> device access device physical address A it is accessing the same
> memory as what is backing the virtual address A in the process.
> 
> Security issues are on two things:
> [I1]- fork/exec, a process who opened any such device and created an
>       active queue can transfer without its knowledge control of its
>       commands queue through COW. The parent map some anonymous region
>       to the device as a command queue buffer but because of COW the
>       parent can be the first to copy on write and thus the child can
>       inherit the original pages that are mapped to the hardware.
>       Here parent lose control and child gain it.
> 

Hi, Jerome, 

I reconsider your logic. I think the problem can be solved. Let us separate the
SVA/SVM feature into two: fault-from-device and device-va-awareness. A device
with iommu can support only device-va-awareness or both.

VFIO works on top of iommu, so it will support at least device-va-awareness. For
the COW problem, it can be taken as a mmu synchronization issue. If the mmu page
table is changed, it should be synchronize to iommu (via iommu_notifier). In the
case that the device support fault-from-device, it will work fine. In the case
that it supports only device-va-awareness, we can prefault (handle_mm_fault)
also via iommu_notifier and reset to iommu page table.

So this can be considered as a bug of VFIO, cannot it?

> [I2]- Because of [R3] you want to allow userspace to schedule commands
>       on the device without doing an ioctl and thus here user space
>       can schedule any commands to the device with any address. What
>       happens if that address have not been mapped by the user space
>       is undefined and in fact can not be defined as what each IOMMU
>       does on invalid address access is different from IOMMU to IOMMU.
> 
>       In case of a bad IOMMU, or simply an IOMMU improperly setup by
>       the kernel, this can potentialy allow user space to DMA anywhere.
> 
> [I3]- By relying on GUP in VFIO you are not abiding by the implicit
>       contract (at least i hope it is implicit) that you should not
>       try to map to the device any file backed vma (private or share).
> 
>       The VFIO code never check the vma controlling the addresses that
>       are provided to VFIO_IOMMU_MAP_DMA ioctl. Which means that the
>       user space can provide file backed range.
> 
>       I am guessing that the VFIO code never had any issues because its
>       number one user is QEMU and QEMU never does that (and that's good
>       as no one should ever do that).
> 
>       So if process does that you are opening your self to serious file
>       system corruption (depending on file system this can lead to total
>       data loss for the filesystem).
> 
>       Issue is that once you GUP you never abide to file system flushing
>       which write protect the page before writing to the disk. So
>       because the page is still map with write permission to the device
>       (assuming VFIO_IOMMU_MAP_DMA was a write map) then the device can
>       write to the page while it is in the middle of being written back
>       to disk. Consult your nearest file system specialist to ask him
>       how bad that can be.

In the case, we cannot do anything if the device do not support
fault-from-device. But we can reject write map with file-backed mapping.

It seems both issues can be solved under VFIO framework:) (But of cause, I don't
mean it has to)

> 
> [I4]- Design issue, mdev design As Far As I Understand It is about
>       sharing a single device to multiple clients (most obvious case
>       here is again QEMU guest). But you are going against that model,
>       in fact AFAIUI you are doing the exect opposite. When there is
>       no SVA/SVM you want only one mdev device that can not be share.
> 
>       So this is counter intuitive to the mdev existing design. It is
>       not about sharing device among multiple users but about giving
>       exclusive access to the device to one user.
> 
> 
> 
> All the reasons above is why i believe a different model would serve
> you and your user better. Below is a design that avoids all of the
> above issues and still delivers all of your objectives with the
> exceptions of the third one [R3] when there is no SVA/SVM.
> 
> 
> Create a subsystem (very much boiler plate code) which allow device to
> register themself against (very much like what you do in your current
> patchset but outside of VFIO).
> 
> That subsystem will create a device file for each registered system and
> expose a common API (ie set of ioctl) for each of those device files.
> 
> When user space create a queue (through an ioctl after opening the device
> file) the kernel can return -EBUSY if all the device queue are in use,
> or create a device queue and return a flag like SYNC_ONLY for device that
> do not have SVA/SVM.
> 
> For device with SVA/SVM at the time the process create a queue you bind
> the process PASID to the device queue. From there on the userspace can
> schedule commands and use the device without going to kernel space.
> 
> For device without SVA/SVM you create a fake queue that is just pure
> memory is not related to the device. From there on the userspace must
> call an ioctl every time it wants the device to consume its queue
> (hence why the SYNC_ONLY flag for synchronous operation only). The
> kernel portion read the fake queue expose to user space and copy
> commands into the real hardware queue but first it properly map any
> of the process memory needed for those commands to the device and
> adjust the device physical address with the one it gets from dma_map
> API.
> 
> With that model it is "easy" to listen to mmu_notifier and to abide by
> them to avoid issues [I1], [I3] and [I4]. You obviously avoid the [I2]
> issue by only mapping a fake device queue to userspace.
> 
> So yes with that models it means that every device that wish to support
> the non SVA/SVM case will have to do extra work (ie emulate its command
> queue in software in the kernel). But by doing so, you support an
> unlimited number of process on your device (ie all the process can share
> one single hardware command queues or multiple hardware queues).
> 
> The big advantages i see here is that the process do not have to worry
> about doing something wrong. You are protecting yourself and your user
> from stupid mistakes.
> 
> 
> I hope this is useful to you.
> 
> Cheers,
> Jérôme

Cheers
-- 
			-Kenneth(Hisilicon)


  parent reply	other threads:[~2018-09-21 10:05 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-03  0:51 [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Kenneth Lee
2018-09-03  0:51 ` [PATCH 1/7] vfio/sdmdev: Add documents for WarpDrive framework Kenneth Lee
2018-09-06 18:36   ` Randy Dunlap
2018-09-07  2:21     ` Kenneth Lee
2018-09-03  0:51 ` [PATCH 2/7] iommu: Add share domain interface in iommu for sdmdev Kenneth Lee
2018-09-03  0:52 ` [PATCH 3/7] vfio: add sdmdev support Kenneth Lee
2018-09-03  2:11   ` Randy Dunlap
2018-09-06  8:08     ` Kenneth Lee
2018-09-03  2:55   ` Lu Baolu
2018-09-06  9:01     ` Kenneth Lee
2018-09-04 15:31   ` [RFC PATCH] vfio: vfio_sdmdev_groups[] can be static kbuild test robot
2018-09-04 15:32   ` [PATCH 3/7] vfio: add sdmdev support kbuild test robot
2018-09-04 15:32   ` kbuild test robot
2018-09-05  7:27   ` Dan Carpenter
2018-09-03  0:52 ` [PATCH 4/7] crypto: add hisilicon Queue Manager driver Kenneth Lee
2018-09-03  2:15   ` Randy Dunlap
2018-09-06  9:08     ` Kenneth Lee
2018-09-03  0:52 ` [PATCH 5/7] crypto: Add Hisilicon Zip driver Kenneth Lee
2018-09-03  0:52 ` [PATCH 6/7] crypto: add sdmdev support to Hisilicon QM Kenneth Lee
2018-09-03  2:19   ` Randy Dunlap
2018-09-06  9:09     ` Kenneth Lee
2018-09-03  0:52 ` [PATCH 7/7] vfio/sdmdev: add user sample Kenneth Lee
2018-09-03  2:25   ` Randy Dunlap
2018-09-06  9:10     ` Kenneth Lee
2018-09-03  2:32 ` [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive Lu Baolu
2018-09-06  9:11   ` Kenneth Lee
2018-09-04 15:00 ` Jerome Glisse
2018-09-04 16:15   ` Alex Williamson
2018-09-06  9:45     ` Kenneth Lee
2018-09-06 13:31       ` Jerome Glisse
2018-09-07  4:01         ` Kenneth Lee
2018-09-07 16:53           ` Jerome Glisse
2018-09-07 17:55             ` Jean-Philippe Brucker
2018-09-07 18:04               ` Jerome Glisse
2018-09-10  3:28             ` Kenneth Lee
2018-09-10 14:54               ` Jerome Glisse
2018-09-11  2:42                 ` Kenneth Lee
2018-09-11  3:33                   ` Jerome Glisse
2018-09-11  6:40                     ` Kenneth Lee
2018-09-11 13:40                       ` Jerome Glisse
2018-09-13  8:32                         ` Kenneth Lee
2018-09-13 14:51                           ` Jerome Glisse
2018-09-14  3:12                             ` Kenneth Lee
2018-09-14 14:05                               ` Jerome Glisse
2018-09-14  6:50                             ` Tian, Kevin
2018-09-14 13:05                               ` Kenneth Lee
2018-09-14 14:13                               ` Jerome Glisse
2018-09-17  1:42 ` Jerome Glisse
2018-09-17  8:39   ` Kenneth Lee
2018-09-17 12:37     ` Jerome Glisse
2018-09-18  6:00       ` Kenneth Lee
2018-09-18 13:03         ` Jerome Glisse
2018-09-20  5:55           ` Kenneth Lee
2018-09-20 14:23             ` Jerome Glisse
2018-09-21 10:05               ` Kenneth Lee
2018-09-21 10:03   ` Kenneth Lee [this message]
2018-09-21 14:52     ` Jerome Glisse
2018-09-25  5:55       ` Kenneth Lee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180921100314.GH207969@Turing-Arch-b \
    --to=liguozhu@hisilicon.com \
    --cc=alex.williamson@redhat.com \
    --cc=baolu.lu@linux.intel.com \
    --cc=corbet@lwn.net \
    --cc=davem@davemloft.net \
    --cc=fanghao11@huawei.com \
    --cc=gregkh@linuxfoundation.org \
    --cc=herbert@gondor.apana.org.au \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jglisse@redhat.com \
    --cc=joro@8bytes.org \
    --cc=kvm@vger.kernel.org \
    --cc=linux-accelerators@lists.ozlabs.org \
    --cc=linux-crypto@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=nek.in.cn@gmail.com \
    --cc=pombredanne@nexb.com \
    --cc=sanjay.k.kumar@intel.com \
    --cc=tglx@linutronix.de \
    --cc=wangzhou1@hisilicon.com \
    --cc=xuzaibo@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).