From: Michael Roth <mdroth@linux.vnet.ibm.com>
To: Alexey Kardashevskiy <aik@ozlabs.ru>, Ram Pai <linuxram@us.ibm.com>
Cc: mpe@ellerman.id.au, linuxppc-dev@lists.ozlabs.org,
	benh@kernel.crashing.org, david@gibson.dropbear.id.au,
	paulus@ozlabs.org, hch@lst.de, andmike@us.ibm.com,
	sukadev@linux.vnet.ibm.com, mst@redhat.com, ram.n.pai@gmail.com,
	cai@lca.pw, tglx@linutronix.de, bauerman@linux.ibm.com,
	linux-kernel@vger.kernel.org, leonardo@linux.ibm.com
Subject: Re: [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor.
Date: Thu, 12 Dec 2019 18:22:18 -0600	[thread overview]
Message-ID: <157619653837.3810.9657617422595030033@sif> (raw)
In-Reply-To: <ad63a352-bdec-08a8-2fd0-f64b9579da6c@ozlabs.ru>

Quoting Alexey Kardashevskiy (2019-12-11 16:47:30)
> 
> 
> On 12/12/2019 07:31, Michael Roth wrote:
> > Quoting Alexey Kardashevskiy (2019-12-11 02:15:44)
> >>
> >>
> >> On 11/12/2019 02:35, Ram Pai wrote:
> >>> On Tue, Dec 10, 2019 at 04:32:10PM +1100, Alexey Kardashevskiy wrote:
> >>>>
> >>>>
> >>>> On 10/12/2019 16:12, Ram Pai wrote:
> >>>>> On Tue, Dec 10, 2019 at 02:07:36PM +1100, Alexey Kardashevskiy wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 07/12/2019 12:12, Ram Pai wrote:
> >>>>>>> The H_PUT_TCE_INDIRECT hcall uses a page filled with TCE entries as
> >>>>>>> one of its parameters.  On secure VMs, the hypervisor cannot access
> >>>>>>> the contents of this page since it gets encrypted.  Hence, share the
> >>>>>>> page with the hypervisor, and unshare it when done.
> >>>>>>
> >>>>>>
> >>>>>> I thought the idea was to use H_PUT_TCE and avoid sharing any extra
> >>>>>> pages. There is a small problem: when DDW is enabled,
> >>>>>> FW_FEATURE_MULTITCE is ignored (easy to fix); I also noticed complaints
> >>>>>> about the performance on Slack, but this is caused by the initial
> >>>>>> cleanup of the default TCE window (which we do not use anyway), and to
> >>>>>> mitigate this we can simply reduce its size by adding
> >>>>>
> >>>>> Something that takes hardly any time with H_PUT_TCE_INDIRECT takes
> >>>>> 13 seconds per device with the H_PUT_TCE approach during boot. This is
> >>>>> with a 30GB guest; with a larger guest, the time will deteriorate further.
> >>>>
> >>>>
> >>>> No it will not, I checked. The time is the same for 2GB and 32GB guests;
> >>>> the delay is caused by clearing the small DMA window, which is small in
> >>>> the space it maps (1GB) but quite huge in TCEs as it uses 4K pages. For
> >>>> the DDW window + emulated devices, the IOMMU page size will be 2M/16M/1G
> >>>> (depending on the system), so the number of TCEs is much smaller.
> >>>
> >>> I can't reproduce your results.  What changes did you make to get them?
> >>
> >>
> >> Reproduce what? I passed "-m 2G" and "-m 32G" and got the same time: 13s
> >> spent clearing the default window, while the huge window took a fraction
> >> of a second to create and map.
> > 
> > Is this if we disable FW_FEATURE_MULTITCE in the guest and force the use
> > of H_PUT_TCE everywhere?
> 
> 
> Yes. Well, for the DDW case FW_FEATURE_MULTITCE is ignored but even when
> fixed (I have it in my local branch), this does not make a difference.
> 
> 
> > 
> > In theory couldn't we leave FW_FEATURE_MULTITCE in place so that
> > iommu_table_clear() can still use H_STUFF_TCE (which I guess is basically
> > instant),
> 
> PAPR/LoPAPR "conveniently" do not describe what hcall-multi-tce does
> exactly. But I am pretty sure the idea is that either both H_STUFF_TCE
> and H_PUT_TCE_INDIRECT are present or neither.

That was my interpretation (or maybe I just went by what your
implementation did :), but just because they are available doesn't mean
the guest has to use them. I agree it's ugly to condition it on
is_secure_guest(), but to me that seems better than sharing memory
unnecessarily, or potentially leaving stale mappings in the default IOMMU.

Not sure if there are other alternatives though.
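
To make the trade-off concrete, the share-around-the-hcall pattern being
debated looks roughly like the following. This is a minimal sketch only,
assuming the uv_share_page()/uv_unshare_page() ultravisor helpers and
plpar_tce_put_indirect(); it is not the actual patch:

#include <linux/pfn.h>           /* PHYS_PFN() */
#include <asm/plpar_wrappers.h>  /* plpar_tce_put_indirect() */
#include <asm/svm.h>             /* is_secure_guest() */
#include <asm/ultravisor.h>      /* uv_share_page()/uv_unshare_page() */

/* Sketch, not the actual patch: expose the TCE page to the hypervisor
 * only for the duration of the H_PUT_TCE_INDIRECT hcall, then revoke
 * access again. */
static long put_tces_shared(u64 liobn, u64 ioba, __be64 *tcep, u64 npages)
{
	u64 pfn = PHYS_PFN(__pa(tcep));
	long rc;

	if (is_secure_guest())
		uv_share_page(pfn, 1);	/* 1 page: the per-cpu TCE page */

	rc = plpar_tce_put_indirect(liobn, ioba, (u64)__pa(tcep), npages);

	if (is_secure_guest())
		uv_unshare_page(pfn, 1);

	return rc;
}

The gap between the two ultravisor calls is exactly the exposure being
argued about below.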

> 
> 
> > and then force H_PUT_TCE for new mappings via something like:
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > index 6ba081dd61c9..85d092baf17d 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -194,6 +194,6 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> >         unsigned long flags;
> >  
> > -       if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE)) {
> > +       if ((npages == 1) || !firmware_has_feature(FW_FEATURE_MULTITCE) || is_secure_guest()) {
> 
> 
> Nobody (including myself) seems to like the idea of having
> is_secure_guest() all over the place.
> 
> And with KVM acceleration enabled, it is pretty fast anyway. Right now we
> do not have H_PUT_TCE in KVM/UV for secure guests, but we will have to
> fix this for secure PCI passthrough anyway.
> 
> 
> >                 return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> >                                            direction, attrs);
> >         }
> > 
> > That seems like it would avoid the extra 13s.
> 
> Or move iommu_table_clear() around, which imho is just the right thing to do.
> 
> 
> > If we take the additional step of only mapping the SWIOTLB range in
> > enable_ddw() for is_secure_guest(), that might further improve things
> > (though the bigger motivation there is the extra isolation it would
> > grant us for stuff behind the IOMMU, since it apparently doesn't affect
> > boot time all that much).
> 
> 
> Sure, we just need to confirm how many of these swiotlb banks we are
> going to have (just one or many and at what location). Thanks,
> 
> 
> 
> > 
> >>
> >>
> >>>>>>
> >>>>>> -global
> >>>>>> spapr-pci-host-bridge.dma_win_size=0x4000000
> >>>>>
> >>>>> This option speeds it up tremendously.  But then, should this option be
> >>>>> enabled in QEMU by default?  Only for secure VMs?  For both?
> >>>>
> >>>>
> >>>> As discussed on Slack, by default we do not need to clear the entire TCE
> >>>> table, and we only have to map the swiotlb buffer using the small window.
> >>>> It is a guest kernel change only. Thanks,
> >>>
> >>> Can you tell me what code you are talking about here?  Where is the TCE
> >>> table getting cleared, and what code needs to be changed to not clear it?
> >>
> >>
> >> pci_dma_bus_setup_pSeriesLP()
> >>         iommu_init_table()
> >>                 iommu_table_clear()
> >>                         for () tbl->it_ops->get()
> >>
> >> We do not really need to clear it there; we only need it for VFIO with
> >> IOMMU SPAPR TCE v1, which reuses these tables, but there are
> >> iommu_take_ownership/iommu_release_ownership to clear them. I'll
> >> send a patch for this.
> > 
> > 
> >>
> >>
> >>> Is the code in tce_buildmulti_pSeriesLP() the one that does the clearing
> >>> as well?
> >>
> >>
> >> This one does not need to clear TCEs, as it creates a window of known
> >> size and maps it all.
> >>
> >> Well, actually, it only maps actual guest RAM; if there are gaps in RAM,
> >> then the TCEs for the gaps will have whatever the hypervisor had there
> >> (which is zeroes; qemu/kvm clears it anyway).
> >>
> >>
> >>> But before I close, you have not told me clearly what the problem is
> >>> with 'share the page, make the H_PUT_TCE_INDIRECT hcall, unshare the page'.
> >>
> >> Between the share and unshare you have a (tiny) window of opportunity to
> >> attack the guest. No, I do not know exactly how.
> >>
> >> For example, the hypervisor does a lot of PHB+PCI hotplug/unplug with
> >> 64-bit devices; each time this will create a huge window, which will
> >> share/unshare the same page.  No, I do not know exactly how this can
> >> be exploited either; we cannot rely on what you or I know today. My
> >> point is that we should not be sharing pages at all unless we really,
> >> really have to, and this does not seem to be the case.
> >>
> >> But since this seems to be an acceptable compromise anyway,
> >>
> >> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>
> >>
> >>
> >>
> >>
> >>> Remember, this is the same page that is earmarked for doing
> >>> H_PUT_TCE_INDIRECT, not by my patch; it is already earmarked by the
> >>> existing code. So it is not some random buffer that is picked. Second,
> >>> this page is temporarily shared and unshared; it does not stay shared
> >>> for life.  It does not slow the boot, and it does not need any
> >>> special command-line options for qemu.
> >>> Shared-pages technology was put in place exactly for the purpose of
> >>> sharing data with the hypervisor.  We are using this technology exactly
> >>> for that purpose.  And finally, I agreed with your concern about
> >>> shared pages staying around; hence I addressed that concern by
> >>> unsharing the page.  At this point, I fail to understand your concern.
> >>
> >>
> >>
> >>
> >> -- 
> >> Alexey
> 
> -- 
> Alexey
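
For reference, the H_STUFF_TCE path discussed above replaces one
H_PUT_TCE hcall per entry with a single batched hcall. Below is a
simplified sketch modeled on tce_freemulti_pSeriesLP(), not the exact
upstream code, assuming 4K IOMMU pages (hence the << 12):

#include <linux/printk.h>        /* pr_err() */
#include <asm/iommu.h>           /* struct iommu_table */
#include <asm/plpar_wrappers.h>  /* plpar_tce_stuff() */

/* Clear npages consecutive TCEs with one H_STUFF_TCE hcall
 * (TCE value 0 marks the entries invalid). */
static void clear_tces_multi(struct iommu_table *tbl, long tcenum, long npages)
{
	long rc;

	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
	if (rc)
		pr_err("H_STUFF_TCE failed, rc=%ld\n", rc);
}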
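
The QEMU-side mitigation quoted above would be applied along these lines
(an illustrative invocation; only the -global property itself comes from
this thread):

  qemu-system-ppc64 -machine pseries \
      -global spapr-pci-host-bridge.dma_win_size=0x4000000 \
      ...

This shrinks the default 32-bit DMA window so that its initial clearing
via H_PUT_TCE stays cheap even without hcall-multi-tce.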

Thread overview: 21 messages
2019-12-07  1:12 [PATCH v5 0/2] Enable IOMMU support for pseries Secure VMs Ram Pai
2019-12-07  1:12 ` [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor Ram Pai
2019-12-07  1:12   ` [PATCH v5 2/2] powerpc/pseries/iommu: Use dma_iommu_ops for Secure VM Ram Pai
2019-12-10  3:08     ` Alexey Kardashevskiy
2019-12-10 22:09     ` Thiago Jung Bauermann
2019-12-11  1:43     ` Michael Roth
2019-12-11  8:36       ` Alexey Kardashevskiy
2019-12-11 18:07         ` Michael Roth
2019-12-11 18:20           ` Christoph Hellwig
2019-12-12  6:45       ` Ram Pai
2019-12-13  0:19         ` Michael Roth
2019-12-10  3:07   ` [PATCH v5 1/2] powerpc/pseries/iommu: Share the per-cpu TCE page with the hypervisor Alexey Kardashevskiy
2019-12-10  5:12     ` Ram Pai
2019-12-10  5:32       ` Alexey Kardashevskiy
2019-12-10 15:35         ` Ram Pai
2019-12-11  8:15           ` Alexey Kardashevskiy
2019-12-11 20:31             ` Michael Roth
2019-12-11 22:47               ` Alexey Kardashevskiy
2019-12-12  2:39                 ` Alexey Kardashevskiy
2019-12-13  0:22                 ` Michael Roth [this message]
2019-12-12  4:11             ` Ram Pai
