[RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory
@ 2019-09-05 10:35 Thomas Hellström (VMware)
  2019-09-05 10:35 ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Thomas Hellström (VMware)
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 10:35 UTC (permalink / raw)
  To: linux-kernel, x86, pv-drivers
  Cc: Thomas Hellström, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Christoph Hellwig, Christian König,
	Marek Szyprowski, Tom Lendacky

With SEV and sometimes with SME encryption, The dma api coherent memory is
typically unencrypted, meaning the linear kernel map has the encryption
bit cleared. However, default page protection returned from vm_get_page_prot()
has the encryption bit set. So to compute the correct page protection we need
to clear the encryption bit.

Also, in order for the encryption bit setting to survive across do_mmap() and
mprotect_fixup(), We need to make pgprot_modify() aware of it and not touch it.

(Note that the encryption status is not logically encoded in the pfn but in
the page protection even if an address line in the physical address is used).

The patchset has seen some sanity testing by exporting dma_pgprot() and
using it in the vmwgfx mmap handler with SEV enabled.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 10:35 [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Thomas Hellström (VMware)
@ 2019-09-05 10:35 ` Thomas Hellström (VMware)
  2019-09-05 14:15   ` Dave Hansen
  2019-09-05 10:35 ` [RFC PATCH 2/2] dma-mapping: Fix dma_pgprot() for unencrypted coherent pages Thomas Hellström (VMware)
  2019-09-05 11:23 ` [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Christoph Hellwig
  2 siblings, 1 reply; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 10:35 UTC (permalink / raw)
  To: linux-kernel, x86, pv-drivers
  Cc: Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christoph Hellwig, Christian König, Marek Szyprowski,
	Tom Lendacky

From: Thomas Hellstrom <thellstrom@vmware.com>

When SEV or SME is enabled and active, vm_get_page_prot() typically
returns with the encryption bit set. This means that users of
pgprot_modify(, vm_get_page_prot()) (mprotect_fixup, do_mmap) typically
unintentionally sets encrypted page protection even on mmap'd coherent
memory where the mmap callback has cleared the bit. Fix this by not
allowing pgprot_modify() to change the encryption bit, similar to
how it's done for PAT bits.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 arch/x86/include/asm/pgtable.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 0bc530c4eb13..8e507169fd90 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -624,12 +624,16 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return __pmd(val);
 }
 
-/* mprotect needs to preserve PAT bits when updating vm_page_prot */
+/*
+ * mprotect needs to preserve PAT and encryption bits when updating
+ * vm_page_prot
+ */
 #define pgprot_modify pgprot_modify
 static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
 {
-	pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
-	pgprotval_t addbits = pgprot_val(newprot);
+	pgprotval_t preservebits = pgprot_val(oldprot) &
+		(_PAGE_CHG_MASK | sme_me_mask);
+	pgprotval_t addbits = pgprot_val(newprot) & ~sme_me_mask;
 	return __pgprot(preservebits | addbits);
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 2/2] dma-mapping: Fix dma_pgprot() for unencrypted coherent pages
  2019-09-05 10:35 [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Thomas Hellström (VMware)
  2019-09-05 10:35 ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Thomas Hellström (VMware)
@ 2019-09-05 10:35 ` Thomas Hellström (VMware)
  2019-09-05 11:23 ` [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Christoph Hellwig
  2 siblings, 0 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 10:35 UTC (permalink / raw)
  To: linux-kernel, x86, pv-drivers
  Cc: Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christoph Hellwig, Christian König, Marek Szyprowski,
	Tom Lendacky

From: Thomas Hellstrom <thellstrom@vmware.com>

When force_dma_unencrypted() returns true, the linear kernel map of the
coherent pages have had the encryption bit explicitly cleared and the
page content is unencrypted. Make sure that any additional PTEs we set
up to these pages also have the encryption bit cleared by having
dma_pgprot() return a protection with the encryption bit cleared in this
case.

Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Christian König <christian.koenig@amd.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
---
 kernel/dma/mapping.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b0038ca3aa92..2b499dcae74f 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -157,6 +157,8 @@ EXPORT_SYMBOL(dma_get_sgtable_attrs);
  */
 pgprot_t dma_pgprot(struct device *dev, pgprot_t prot, unsigned long attrs)
 {
+	if (force_dma_unencrypted(dev))
+		prot = pgprot_decrypted(prot);
 	if (dev_is_dma_coherent(dev) ||
 	    (IS_ENABLED(CONFIG_DMA_NONCOHERENT_CACHE_SYNC) &&
              (attrs & DMA_ATTR_NON_CONSISTENT)))
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory
  2019-09-05 10:35 [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Thomas Hellström (VMware)
  2019-09-05 10:35 ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Thomas Hellström (VMware)
  2019-09-05 10:35 ` [RFC PATCH 2/2] dma-mapping: Fix dma_pgprot() for unencrypted coherent pages Thomas Hellström (VMware)
@ 2019-09-05 11:23 ` Christoph Hellwig
  2019-09-10  6:11   ` Christoph Hellwig
  2 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2019-09-05 11:23 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, x86, pv-drivers, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Christoph Hellwig, Christian König,
	Marek Szyprowski, Tom Lendacky

This looks fine from the DMA POV.  I'll let the x86 guys comment on the
rest.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 10:35 ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Thomas Hellström (VMware)
@ 2019-09-05 14:15   ` Dave Hansen
  2019-09-05 15:21     ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2019-09-05 14:15 UTC (permalink / raw)
  To: Thomas Hellström (VMware), linux-kernel, x86, pv-drivers
  Cc: Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christoph Hellwig, Christian König, Marek Szyprowski,
	Tom Lendacky

Hi Thomas,

Thanks for the second batch of patches!  These look much improved on all
fronts.

On 9/5/19 3:35 AM, Thomas Hellström (VMware) wrote:
> -/* mprotect needs to preserve PAT bits when updating vm_page_prot */
> +/*
> + * mprotect needs to preserve PAT and encryption bits when updating
> + * vm_page_prot
> + */
>  #define pgprot_modify pgprot_modify
>  static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
>  {
> -	pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
> -	pgprotval_t addbits = pgprot_val(newprot);
> +	pgprotval_t preservebits = pgprot_val(oldprot) &
> +		(_PAGE_CHG_MASK | sme_me_mask);
> +	pgprotval_t addbits = pgprot_val(newprot) & ~sme_me_mask;
>  	return __pgprot(preservebits | addbits);
>  }

_PAGE_CHG_MASK is claiming similar functionality about preserving bits
when changing PTEs:

> /*
>  * Set of bits not changed in pte_modify.  The pte's
>  * protection key is treated like _PAGE_RW, for
>  * instance, and is *not* included in this mask since
>  * pte_modify() does modify it.
>  */
> #define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |         \
>                          _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
>                          _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)

This makes me wonder if we should be including sme_me_mask in
_PAGE_CHG_MASK (logically).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 14:15   ` Dave Hansen
@ 2019-09-05 15:21     ` Thomas Hellström (VMware)
  2019-09-05 15:24       ` Christoph Hellwig
  2019-09-05 15:59       ` Dave Hansen
  0 siblings, 2 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 15:21 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, pv-drivers
  Cc: Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christoph Hellwig, Christian König, Marek Szyprowski,
	Tom Lendacky

On 9/5/19 4:15 PM, Dave Hansen wrote:
> Hi Thomas,
>
> Thanks for the second batch of patches!  These look much improved on all
> fronts.

Yes, although the TTM functionality isn't in yet. Hopefully we won't 
have to bother you with those though, since this assumes TTM will be 
using the dma API.

> On 9/5/19 3:35 AM, Thomas Hellström (VMware) wrote:
>> -/* mprotect needs to preserve PAT bits when updating vm_page_prot */
>> +/*
>> + * mprotect needs to preserve PAT and encryption bits when updating
>> + * vm_page_prot
>> + */
>>   #define pgprot_modify pgprot_modify
>>   static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
>>   {
>> -	pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
>> -	pgprotval_t addbits = pgprot_val(newprot);
>> +	pgprotval_t preservebits = pgprot_val(oldprot) &
>> +		(_PAGE_CHG_MASK | sme_me_mask);
>> +	pgprotval_t addbits = pgprot_val(newprot) & ~sme_me_mask;
>>   	return __pgprot(preservebits | addbits);
>>   }
> _PAGE_CHG_MASK is claiming similar functionality about preserving bits
> when changing PTEs:
>
>> /*
>>   * Set of bits not changed in pte_modify.  The pte's
>>   * protection key is treated like _PAGE_RW, for
>>   * instance, and is *not* included in this mask since
>>   * pte_modify() does modify it.
>>   */
>> #define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |         \
>>                           _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \
>>                           _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> This makes me wonder if we should be including sme_me_mask in
> _PAGE_CHG_MASK (logically).

I was thinking the same. But what confuses me is that addbits isn't 
masked with ~_PAGE_CHG_MASK, which is needed for sme_me_mask, since the 
problem otherwise is typically that the encryption bit is incorrectly 
set in addbits. I wonder whether it's an optimization or intentional.

/Thomas




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 15:21     ` Thomas Hellström (VMware)
@ 2019-09-05 15:24       ` Christoph Hellwig
  2019-09-05 16:40         ` Thomas Hellström (VMware)
                           ` (2 more replies)
  2019-09-05 15:59       ` Dave Hansen
  1 sibling, 3 replies; 28+ messages in thread
From: Christoph Hellwig @ 2019-09-05 15:24 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Dave Hansen, linux-kernel, x86, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Christoph Hellwig,
	Christian König, Marek Szyprowski, Tom Lendacky

On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
> On 9/5/19 4:15 PM, Dave Hansen wrote:
> > Hi Thomas,
> > 
> > Thanks for the second batch of patches!  These look much improved on all
> > fronts.
> 
> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
> bother you with those though, since this assumes TTM will be using the dma
> API.

Please take a look at dma_mmap_prepare and dma_mmap_fault in this
branch:

	http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements

they should allow to fault dma api pages in the page fault handler.  But
this is totally hot off the press and not actually tested for the last
few patches.  Note that I've also included your two patches from this
series to handle SEV.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 15:21     ` Thomas Hellström (VMware)
  2019-09-05 15:24       ` Christoph Hellwig
@ 2019-09-05 15:59       ` Dave Hansen
  2019-09-05 16:29         ` Thomas Hellström (VMware)
  1 sibling, 1 reply; 28+ messages in thread
From: Dave Hansen @ 2019-09-05 15:59 UTC (permalink / raw)
  To: Thomas Hellström (VMware), linux-kernel, x86, pv-drivers
  Cc: Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christoph Hellwig, Christian König, Marek Szyprowski,
	Tom Lendacky, Shutemov, Kirill

On 9/5/19 8:21 AM, Thomas Hellström (VMware) wrote:
>>>   #define pgprot_modify pgprot_modify
>>>   static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t
>>> newprot)
>>>   {
>>> -    pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
>>> -    pgprotval_t addbits = pgprot_val(newprot);
>>> +    pgprotval_t preservebits = pgprot_val(oldprot) &
>>> +        (_PAGE_CHG_MASK | sme_me_mask);
>>> +    pgprotval_t addbits = pgprot_val(newprot) & ~sme_me_mask;
>>>       return __pgprot(preservebits | addbits);
>>>   }
>> _PAGE_CHG_MASK is claiming similar functionality about preserving bits
>> when changing PTEs:
...
>>> #define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT
>>> |         \
>>>                           _PAGE_SPECIAL | _PAGE_ACCESSED |
>>> _PAGE_DIRTY | \
>>>                           _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
>> This makes me wonder if we should be including sme_me_mask in
>> _PAGE_CHG_MASK (logically).
> 
> I was thinking the same. But what confuses me is that addbits isn't
> masked with ~_PAGE_CHG_MASK, which is needed for sme_me_mask, since the
> problem otherwise is typically that the encryption bit is incorrectly
> set in addbits. I wonder whether it's an optimization or intentional.

I think there's a built-in assumption that 'newprot' won't have any of
the _PAGE_CHG_MASK bits set.  That makes sense because there are no
protection bits in the mask.  But, the code certainly doesn't enforce that.

Are you seeing 'sme_me_mask' bits set in 'newprot'?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 15:59       ` Dave Hansen
@ 2019-09-05 16:29         ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 16:29 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, x86, pv-drivers
  Cc: Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christoph Hellwig, Christian König, Marek Szyprowski,
	Tom Lendacky, Shutemov, Kirill

On 9/5/19 5:59 PM, Dave Hansen wrote:
> On 9/5/19 8:21 AM, Thomas Hellström (VMware) wrote:
>>>>    #define pgprot_modify pgprot_modify
>>>>    static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t
>>>> newprot)
>>>>    {
>>>> -    pgprotval_t preservebits = pgprot_val(oldprot) & _PAGE_CHG_MASK;
>>>> -    pgprotval_t addbits = pgprot_val(newprot);
>>>> +    pgprotval_t preservebits = pgprot_val(oldprot) &
>>>> +        (_PAGE_CHG_MASK | sme_me_mask);
>>>> +    pgprotval_t addbits = pgprot_val(newprot) & ~sme_me_mask;
>>>>        return __pgprot(preservebits | addbits);
>>>>    }
>>> _PAGE_CHG_MASK is claiming similar functionality about preserving bits
>>> when changing PTEs:
> ...
>>>> #define _PAGE_CHG_MASK  (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT
>>>> |         \
>>>>                            _PAGE_SPECIAL | _PAGE_ACCESSED |
>>>> _PAGE_DIRTY | \
>>>>                            _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
>>> This makes me wonder if we should be including sme_me_mask in
>>> _PAGE_CHG_MASK (logically).
>> I was thinking the same. But what confuses me is that addbits isn't
>> masked with ~_PAGE_CHG_MASK, which is needed for sme_me_mask, since the
>> problem otherwise is typically that the encryption bit is incorrectly
>> set in addbits. I wonder whether it's an optimization or intentional.
> I think there's a built-in assumption that 'newprot' won't have any of
> the _PAGE_CHG_MASK bits set.  That makes sense because there are no
> protection bits in the mask.  But, the code certainly doesn't enforce that.
>
> Are you seeing 'sme_me_mask' bits set in 'newprot'?

Yes. AFAIK it's only one bit, and typically always set.

/Thomas



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 15:24       ` Christoph Hellwig
@ 2019-09-05 16:40         ` Thomas Hellström (VMware)
  2019-09-05 17:05         ` dma_mmap_fault discussion Thomas Hellström (VMware)
  2019-09-10 16:11         ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Andy Lutomirski
  2 siblings, 0 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 16:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dave Hansen, linux-kernel, x86, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky

On 9/5/19 5:24 PM, Christoph Hellwig wrote:
> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>> Hi Thomas,
>>>
>>> Thanks for the second batch of patches!  These look much improved on all
>>> fronts.
>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
>> bother you with those though, since this assumes TTM will be using the dma
>> API.
> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> branch:
>
> 	http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>
> they should allow to fault dma api pages in the page fault handler.  But
> this is totally hot off the press and not actually tested for the last
> few patches.  Note that I've also included your two patches from this
> series to handle SEV.

Thanks, Christoph.

I'll get to this tomorrow.

/Thomas



^ permalink raw reply	[flat|nested] 28+ messages in thread

* dma_mmap_fault discussion
  2019-09-05 15:24       ` Christoph Hellwig
  2019-09-05 16:40         ` Thomas Hellström (VMware)
@ 2019-09-05 17:05         ` Thomas Hellström (VMware)
  2019-09-06  6:32           ` Christoph Hellwig
  2019-09-10 16:11         ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Andy Lutomirski
  2 siblings, 1 reply; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-05 17:05 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, pv-drivers, Thomas Hellstrom, Dave Hansen,
	Christian König

Hi, Christoph,

On 9/5/19 5:24 PM, Christoph Hellwig wrote:
> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>> Hi Thomas,
>>>
>>> Thanks for the second batch of patches!  These look much improved on all
>>> fronts.
>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
>> bother you with those though, since this assumes TTM will be using the dma
>> API.
> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> branch:
>
> 	http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>
> they should allow to fault dma api pages in the page fault handler.  But
> this is totally hot off the press and not actually tested for the last
> few patches.  Note that I've also included your two patches from this
> series to handle SEV.

I took a quick look at the interfaces and have two questions:

1) dma_mmap_prepare(), would it be possible to drop the references to 
the actual coherent region? The thing is that TTM doesn't know at mmap 
time what memory will be backing the region. It can be VRAM, coherent 
memory or system memory. Also see below:

2) @cpu_addr and @dma_addr are the values pointing at the beginning of 
an allocated chunk, right? The reason I'm asking is that TTM's coherent 
memory pool is sub-allocating from larger chunks into smaller PAGE_SIZE 
chunks, which means that a TTM buffer object may be randomly split 
across larger chunks, which means we have to store these values for each 
PAGE_SiZE chunk.

Thanks,
Thomas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dma_mmap_fault discussion
  2019-09-05 17:05         ` dma_mmap_fault discussion Thomas Hellström (VMware)
@ 2019-09-06  6:32           ` Christoph Hellwig
  2019-09-06  7:10             ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2019-09-06  6:32 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Christoph Hellwig, linux-kernel, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Christian König

On Thu, Sep 05, 2019 at 07:05:43PM +0200, Thomas Hellström (VMware) wrote:
> I took a quick look at the interfaces and have two questions:
> 
> 1) dma_mmap_prepare(), would it be possible to drop the references to the
> actual coherent region? The thing is that TTM doesn't know at mmap time what
> memory will be backing the region. It can be VRAM, coherent memory or system

I guess we can shift the argument checking into the fault handler.

> 2) @cpu_addr and @dma_addr are the values pointing at the beginning of an
> allocated chunk, right?

Yes.

> The reason I'm asking is that TTM's coherent memory
> pool is sub-allocating from larger chunks into smaller PAGE_SIZE chunks,
> which means that a TTM buffer object may be randomly split across larger
> chunks, which means we have to store these values for each PAGE_SiZE chunk.

For implementations that remap non-contigous regions using vmap we need the
start cpu address passed, as that is used to find the vm_struct structure
for it.  That being said I don't see a problem with you keeping track
of the original start and offset into in your suballocation helpers.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dma_mmap_fault discussion
  2019-09-06  6:32           ` Christoph Hellwig
@ 2019-09-06  7:10             ` Thomas Hellström (VMware)
  2019-09-06  7:20               ` Christoph Hellwig
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-06  7:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, pv-drivers, Thomas Hellstrom, Dave Hansen,
	Christian König

On 9/6/19 8:32 AM, Christoph Hellwig wrote:
> On Thu, Sep 05, 2019 at 07:05:43PM +0200, Thomas Hellström (VMware) wrote:
>> I took a quick look at the interfaces and have two questions:
>>
>> 1) dma_mmap_prepare(), would it be possible to drop the references to the
>> actual coherent region? The thing is that TTM doesn't know at mmap time what
>> memory will be backing the region. It can be VRAM, coherent memory or system
> I guess we can shift the argument checking into the fault handler.
>
>> 2) @cpu_addr and @dma_addr are the values pointing at the beginning of an
>> allocated chunk, right?
> Yes.
>
>> The reason I'm asking is that TTM's coherent memory
>> pool is sub-allocating from larger chunks into smaller PAGE_SIZE chunks,
>> which means that a TTM buffer object may be randomly split across larger
>> chunks, which means we have to store these values for each PAGE_SiZE chunk.
> For implementations that remap non-contigous regions using vmap we need the
> start cpu address passed, as that is used to find the vm_struct structure
> for it.  That being said I don't see a problem with you keeping track
> of the original start and offset into in your suballocation helpers.

It's definitely possible. I was just wondering whether it was necessary, 
but it seems like it.

Thanks,

Thomas



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dma_mmap_fault discussion
  2019-09-06  7:10             ` Thomas Hellström (VMware)
@ 2019-09-06  7:20               ` Christoph Hellwig
  2019-09-10  8:37                 ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2019-09-06  7:20 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Christoph Hellwig, linux-kernel, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Christian König

On Fri, Sep 06, 2019 at 09:10:08AM +0200, Thomas Hellström (VMware) wrote:
> It's definitely possible. I was just wondering whether it was necessary, but
> it seems like it.

Yepp.

I've pushed a new version out (even hotter off the press) that doesn't
require the region for dma_mmap_prepare, and also uses PAGE_SIZE units
for the offset / length in dma_mmap_fault, which simplifies a few
things.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory
  2019-09-05 11:23 ` [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Christoph Hellwig
@ 2019-09-10  6:11   ` Christoph Hellwig
  2019-09-10  6:25     ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2019-09-10  6:11 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, x86, pv-drivers, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Christoph Hellwig, Christian König,
	Marek Szyprowski, Tom Lendacky

On Thu, Sep 05, 2019 at 04:23:11AM -0700, Christoph Hellwig wrote:
> This looks fine from the DMA POV.  I'll let the x86 guys comment on the
> rest.

Do we want to pick this series up for 5.4?  Should I queue it up in
the dma-mapping tree?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory
  2019-09-10  6:11   ` Christoph Hellwig
@ 2019-09-10  6:25     ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-10  6:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, x86, pv-drivers, Dave Hansen, Andy Lutomirski,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Christian König, Marek Szyprowski,
	Tom Lendacky

On 9/10/19 8:11 AM, Christoph Hellwig wrote:
> On Thu, Sep 05, 2019 at 04:23:11AM -0700, Christoph Hellwig wrote:
>> This looks fine from the DMA POV.  I'll let the x86 guys comment on the
>> rest.
> Do we want to pick this series up for 5.4?  Should I queue it up in
> the dma-mapping tree?

Hi, Christoph

I think the DMA change is pretty uncontroversial.

There are still some questions about the x86 change: After digging a bit 
deeper into the mm code I think Dave is correct about that we should 
include the sme_me_mask in _PAGE_CHG_MASK.

I'll respin that patch and then I guess we need an ack from the x86 people.

Thanks,
Thomas



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: dma_mmap_fault discussion
  2019-09-06  7:20               ` Christoph Hellwig
@ 2019-09-10  8:37                 ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-10  8:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, pv-drivers, Thomas Hellstrom, Dave Hansen,
	Christian König

On 9/6/19 9:20 AM, Christoph Hellwig wrote:
> On Fri, Sep 06, 2019 at 09:10:08AM +0200, Thomas Hellström (VMware) wrote:
>> It's definitely possible. I was just wondering whether it was necessary, but
>> it seems like it.
> Yepp.
>
> I've pushed a new version out (even hotter off the press) that doesn't
> require the region for dma_mmap_prepare, and also uses PAGE_SIZE units
> for the offset / length in dma_mmap_fault, which simplifies a few
> things.

Hi, Christoph,

I've been looking into this a bit, and while it's possible to make it 
work for fault, (and for single page kmaps of course, since we have the 
kernel virtual address already), I run into problems with vmaps, since 
we basically need to vmap a set of consecutive coherent allocations 
obtained from the coherent pool.

Also, at mmap time, we need in theory to call dma_mmap_prepare() and 
save the page_prot for all attrs we might be using at fault time...

I did some thinking into whether we could perhaps cover all or at least 
the vast majority of architectures and awkward devices more generally 
with a  page prot and a set of pfns. So I looked at how 
remap_pfn_range() and io_remap_pfn_range() was implemented across 
architectures, and it turns out that all archs implementing a special 
io_remap_pfn_range() (sparc and mips) end up calling remap_pfn_range(), 
and that should mean that any arch that's currently capable of doing 
fault() should in principle be capable of using vmf_insert_pfn_prot() 
with a suitable pfn that in most (if not all) cases should be obtainable 
from the kernel virtual address.

So do you think a way forward could perhaps be to have a 
dma_common_get_pfn_sgtable() and add a generic vmap_pfn_prot()?

Thanks,

Thomas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-05 15:24       ` Christoph Hellwig
  2019-09-05 16:40         ` Thomas Hellström (VMware)
  2019-09-05 17:05         ` dma_mmap_fault discussion Thomas Hellström (VMware)
@ 2019-09-10 16:11         ` Andy Lutomirski
  2019-09-10 19:26           ` Thomas Hellström (VMware)
  2 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2019-09-10 16:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc:  Thomas Hellström (VMware) ,
	Dave Hansen, linux-kernel, x86, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky



> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@infradead.org> wrote:
> 
>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>> Hi Thomas,
>>> 
>>> Thanks for the second batch of patches!  These look much improved on all
>>> fronts.
>> 
>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
>> bother you with those though, since this assumes TTM will be using the dma
>> API.
> 
> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> branch:
> 
>    http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
> 
> they should allow to fault dma api pages in the page fault handler.  But
> this is totally hot off the press and not actually tested for the last
> few patches.  Note that I've also included your two patches from this
> series to handle SEV.

I read that patch, and it seems like you’ve built in the assumption that all pages in the mapping use identical protection or, if not, that the same fake vma hack that TTM already has is used to fudge around it.  Could it be reworked slightly to avoid this?

I wonder if it’s a mistake to put the encryption bits in vm_page_prot at all.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-10 16:11         ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Andy Lutomirski
@ 2019-09-10 19:26           ` Thomas Hellström (VMware)
  2019-09-11  4:18             ` Andy Lutomirski
  2019-09-11  9:08             ` Koenig, Christian
  0 siblings, 2 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-10 19:26 UTC (permalink / raw)
  To: Andy Lutomirski, Christoph Hellwig
  Cc: Dave Hansen, linux-kernel, x86, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky

On 9/10/19 6:11 PM, Andy Lutomirski wrote:
>
>> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>
>>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
>>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>>> Hi Thomas,
>>>>
>>>> Thanks for the second batch of patches!  These look much improved on all
>>>> fronts.
>>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
>>> bother you with those though, since this assumes TTM will be using the dma
>>> API.
>> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
>> branch:
>>
>>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>>
>> they should allow to fault dma api pages in the page fault handler.  But
>> this is totally hot off the press and not actually tested for the last
>> few patches.  Note that I've also included your two patches from this
>> series to handle SEV.
> I read that patch, and it seems like you’ve built in the assumption that all pages in the mapping use identical protection or, if not, that the same fake vma hack that TTM already has is used to fudge around it.  Could it be reworked slightly to avoid this?
>
> I wonder if it’s a mistake to put the encryption bits in vm_page_prot at all.

 From my POW, the encryption bits behave quite similar in behaviour to 
the caching mode bits, and they're also in vm_page_prot. They're the 
reason TTM needs to modify the page protection in the fault handler in 
the first place.

The problem seen in TTM is that we want to be able to change the 
vm_page_prot from the fault handler, but it's problematic since we have 
the mmap_sem typically only in read mode. Hence the fake vma hack. From 
what I can tell it's reasonably well-behaved, since pte_modify() skips 
the bits TTM updates, so mprotect() and mremap() works OK. I think 
split_huge_pmd may run into trouble, but we don't support it (yet) with 
TTM.

We could probably get away with a WRITE_ONCE() update of the 
vm_page_prot before taking the page table lock since

a) We're locking out all other writers.
b) We can't race with another fault to the same vma since we hold an 
address space lock ("buffer object reservation")
c) When we need to update there are no valid page table entries in the 
vma, since it only happens directly after mmap(), or after an 
unmap_mapping_range() with the same address space lock. When another 
reader (for example split_huge_pmd()) sees a valid page table entry, it 
also sees the new page protection and things are fine.

But that would really be a special case. To solve this properly we'd 
probably need an additional lock to protect the vm_flags and 
vm_page_prot, taken after mmap_sem and i_mmap_lock.

/Thomas

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-10 19:26           ` Thomas Hellström (VMware)
@ 2019-09-11  4:18             ` Andy Lutomirski
  2019-09-11  7:49               ` Thomas Hellström (VMware)
  2019-09-11  9:08             ` Koenig, Christian
  1 sibling, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2019-09-11  4:18 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Christoph Hellwig, Dave Hansen, LKML, X86 ML, pv-drivers,
	Thomas Hellstrom, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky

On Tue, Sep 10, 2019 at 12:26 PM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
> >
> >> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@infradead.org> wrote:
> >>
> >>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
> >>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
> >>>> Hi Thomas,
> >>>>
> >>>> Thanks for the second batch of patches!  These look much improved on all
> >>>> fronts.
> >>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
> >>> bother you with those though, since this assumes TTM will be using the dma
> >>> API.
> >> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
> >> branch:
> >>
> >>     http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
> >>
> >> they should allow to fault dma api pages in the page fault handler.  But
> >> this is totally hot off the press and not actually tested for the last
> >> few patches.  Note that I've also included your two patches from this
> >> series to handle SEV.
> > I read that patch, and it seems like you’ve built in the assumption that all pages in the mapping use identical protection or, if not, that the same fake vma hack that TTM already has is used to fudge around it.  Could it be reworked slightly to avoid this?
> >
> > I wonder if it’s a mistake to put the encryption bits in vm_page_prot at all.
>
>  From my POW, the encryption bits behave quite similar in behaviour to
> the caching mode bits, and they're also in vm_page_prot. They're the
> reason TTM needs to modify the page protection in the fault handler in
> the first place.
>
> The problem seen in TTM is that we want to be able to change the
> vm_page_prot from the fault handler, but it's problematic since we have
> the mmap_sem typically only in read mode. Hence the fake vma hack. From
> what I can tell it's reasonably well-behaved, since pte_modify() skips
> the bits TTM updates, so mprotect() and mremap() works OK. I think
> split_huge_pmd may run into trouble, but we don't support it (yet) with
> TTM.

One thing I'm confused about: does TTM move individual pages between
main memory and device memory or does it move whole mappings?  If it
moves individual pages, then a single mapping could have PTEs from
dma_alloc_coherent() space and from PCI space.  How can this work with
vm_page_prot?  I guess you might get lucky and have both have the same
protection bits, but that seems like an unfortunate thing to rely on.

As a for-real example, take a look at arch/x86/entry/vdso/vma.c.  The
"vvar" VMA contains multiple pages that are backed by different types
of memory.  There's a page of ordinary kernel memory.  Historically
there was a page of genuine MMIO memory, but that's gone now.  If the
system is a SEV guest with pvclock enabled, then there's a page of
decrypted memory.   They all share a VMA, they're instantiated in
.fault, and there is no hackery involved.  Look at vvar_fault().

So, Christoph, can't you restructure your code a bit to compute the
caching and encryption state per page instead of once in
dma_mmap_prepare() and insert the pages accordingly?  You might need
to revert 6d958546ff611c9ae09b181e628c1c5d5da5ebda depending on
exactly how you structure it.

>
> We could probably get away with a WRITE_ONCE() update of the
> vm_page_prot before taking the page table lock since
>
> a) We're locking out all other writers.
> b) We can't race with another fault to the same vma since we hold an
> address space lock ("buffer object reservation")
> c) When we need to update there are no valid page table entries in the
> vma, since it only happens directly after mmap(), or after an
> unmap_mapping_range() with the same address space lock. When another
> reader (for example split_huge_pmd()) sees a valid page table entry, it
> also sees the new page protection and things are fine.
>
> But that would really be a special case. To solve this properly we'd
> probably need an additional lock to protect the vm_flags and
> vm_page_prot, taken after mmap_sem and i_mmap_lock.
>

This is all horrible IMO.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11  4:18             ` Andy Lutomirski
@ 2019-09-11  7:49               ` Thomas Hellström (VMware)
  2019-09-11 18:03                 ` Andy Lutomirski
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-11  7:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Dave Hansen, LKML, X86 ML, pv-drivers,
	Thomas Hellstrom, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky

Hi, Andy.

On 9/11/19 6:18 AM, Andy Lutomirski wrote:
> On Tue, Sep 10, 2019 at 12:26 PM Thomas Hellström (VMware)
> <thomas_os@shipmail.org> wrote:
>> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
>>>> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@infradead.org> wrote:
>>>>
>>>>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) wrote:
>>>>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>>>>> Hi Thomas,
>>>>>>
>>>>>> Thanks for the second batch of patches!  These look much improved on all
>>>>>> fronts.
>>>>> Yes, although the TTM functionality isn't in yet. Hopefully we won't have to
>>>>> bother you with those though, since this assumes TTM will be using the dma
>>>>> API.
>>>> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
>>>> branch:
>>>>
>>>>      http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>>>>
>>>> they should allow to fault dma api pages in the page fault handler.  But
>>>> this is totally hot off the press and not actually tested for the last
>>>> few patches.  Note that I've also included your two patches from this
>>>> series to handle SEV.
>>> I read that patch, and it seems like you’ve built in the assumption that all pages in the mapping use identical protection or, if not, that the same fake vma hack that TTM already has is used to fudge around it.  Could it be reworked slightly to avoid this?
>>>
>>> I wonder if it’s a mistake to put the encryption bits in vm_page_prot at all.
>>   From my POW, the encryption bits behave quite similar in behaviour to
>> the caching mode bits, and they're also in vm_page_prot. They're the
>> reason TTM needs to modify the page protection in the fault handler in
>> the first place.
>>
>> The problem seen in TTM is that we want to be able to change the
>> vm_page_prot from the fault handler, but it's problematic since we have
>> the mmap_sem typically only in read mode. Hence the fake vma hack. From
>> what I can tell it's reasonably well-behaved, since pte_modify() skips
>> the bits TTM updates, so mprotect() and mremap() works OK. I think
>> split_huge_pmd may run into trouble, but we don't support it (yet) with
>> TTM.
> One thing I'm confused about: does TTM move individual pages between
> main memory and device memory or does it move whole mappings?  If it
> moves individual pages, then a single mapping could have PTEs from
> dma_alloc_coherent() space and from PCI space.  How can this work with
> vm_page_prot?  I guess you might get lucky and have both have the same
> protection bits, but that seems like an unfortunate thing to rely on.

With TTM, a single vma is completely backed with memory of the same 
type, so all PTEs have the same protection, and mimics that of the 
linear kernel map (if applicable). But the protection is not 
determinable at mmap time, and we may switch protection and backing 
memory at certain points in time where all PTEs are first killed.

>
> As a for-real example, take a look at arch/x86/entry/vdso/vma.c.  The
> "vvar" VMA contains multiple pages that are backed by different types
> of memory.  There's a page of ordinary kernel memory.  Historically
> there was a page of genuine MMIO memory, but that's gone now.  If the
> system is a SEV guest with pvclock enabled, then there's a page of
> decrypted memory.   They all share a VMA, they're instantiated in
> .fault, and there is no hackery involved.  Look at vvar_fault().

So this is conceptually identical to TTM. The difference is that it uses 
vmf_insert_pfn_prot() instead of vmf_insert_mixed() with the vma hack. 
Had there been a vmf_insert_mixed_prot(), the hack in TTM wouldn't be 
needed. I guess providing a vmf_insert_mixed_prot() is a to-do for me to 
pick up.

Having said that, the code you point out is as fragile and suffers from 
the same shortcomings as TTM since
a) Running things like split_huge_pmd() that takes the vm_page_prot and 
applies it to new PTEs will make things break, (although probably never 
applicable in this case).
b) Running mprotect() on that VMA will only work if sme_me_mask is part 
of _PAGE_CHG_MASK (which is addressed in a reliable way in my recent 
patchset),  otherwise, the encryption setting will be overwritten.


>> We could probably get away with a WRITE_ONCE() update of the
>> vm_page_prot before taking the page table lock since
>>
>> a) We're locking out all other writers.
>> b) We can't race with another fault to the same vma since we hold an
>> address space lock ("buffer object reservation")
>> c) When we need to update there are no valid page table entries in the
>> vma, since it only happens directly after mmap(), or after an
>> unmap_mapping_range() with the same address space lock. When another
>> reader (for example split_huge_pmd()) sees a valid page table entry, it
>> also sees the new page protection and things are fine.
>>
>> But that would really be a special case. To solve this properly we'd
>> probably need an additional lock to protect the vm_flags and
>> vm_page_prot, taken after mmap_sem and i_mmap_lock.
>>
> This is all horrible IMO.

I'd prefer to call it fragile and potentially troublesome to maintain.

That distinction is important because if it ever comes to a choice 
between adding a new lock to protect vm_page_prot (and consequently slow 
down the whole vm system) and using the WRITE_ONCE solution in TTM, we 
should know what the implications are. As it turns out previous choices 
in this area actually seem to have opted for the lockless WRITE_ONCE / 
READ_ONCE / ptl solution. See __split_huge_pmd_locked() and 
vma_set_page_prot().

Thanks,
Thomas



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-10 19:26           ` Thomas Hellström (VMware)
  2019-09-11  4:18             ` Andy Lutomirski
@ 2019-09-11  9:08             ` Koenig, Christian
  2019-09-11 10:10               ` TTM huge page-faults WAS: " Thomas Hellström (VMware)
  1 sibling, 1 reply; 28+ messages in thread
From: Koenig, Christian @ 2019-09-11  9:08 UTC (permalink / raw)
  To: Thomas Hellström (VMware), Andy Lutomirski, Christoph Hellwig
  Cc: Dave Hansen, linux-kernel, x86, pv-drivers, Thomas Hellstrom,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Marek Szyprowski,
	Lendacky, Thomas

Am 10.09.19 um 21:26 schrieb Thomas Hellström (VMware):
> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
>>
>>> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@infradead.org> 
>>> wrote:
>>>
>>>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware) 
>>>> wrote:
>>>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>>>> Hi Thomas,
>>>>>
>>>>> Thanks for the second batch of patches!  These look much improved 
>>>>> on all
>>>>> fronts.
>>>> Yes, although the TTM functionality isn't in yet. Hopefully we 
>>>> won't have to
>>>> bother you with those though, since this assumes TTM will be using 
>>>> the dma
>>>> API.
>>> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
>>> branch:
>>>
>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>>>
>>> they should allow to fault dma api pages in the page fault handler.  
>>> But
>>> this is totally hot off the press and not actually tested for the last
>>> few patches.  Note that I've also included your two patches from this
>>> series to handle SEV.
>> I read that patch, and it seems like you’ve built in the assumption 
>> that all pages in the mapping use identical protection or, if not, 
>> that the same fake vma hack that TTM already has is used to fudge 
>> around it.  Could it be reworked slightly to avoid this?
>>
>> I wonder if it’s a mistake to put the encryption bits in vm_page_prot 
>> at all.
>
> From my POW, the encryption bits behave quite similar in behaviour to 
> the caching mode bits, and they're also in vm_page_prot. They're the 
> reason TTM needs to modify the page protection in the fault handler in 
> the first place.
>
> The problem seen in TTM is that we want to be able to change the 
> vm_page_prot from the fault handler, but it's problematic since we 
> have the mmap_sem typically only in read mode. Hence the fake vma 
> hack. From what I can tell it's reasonably well-behaved, since 
> pte_modify() skips the bits TTM updates, so mprotect() and mremap() 
> works OK. I think split_huge_pmd may run into trouble, but we don't 
> support it (yet) with TTM.

Ah! I actually ran into this while implementing huge page support for 
TTM and never figured out why that doesn't work. Dropped CPU huge page 
support because of this.

>
> We could probably get away with a WRITE_ONCE() update of the 
> vm_page_prot before taking the page table lock since
>
> a) We're locking out all other writers.
> b) We can't race with another fault to the same vma since we hold an 
> address space lock ("buffer object reservation")
> c) When we need to update there are no valid page table entries in the 
> vma, since it only happens directly after mmap(), or after an 
> unmap_mapping_range() with the same address space lock. When another 
> reader (for example split_huge_pmd()) sees a valid page table entry, 
> it also sees the new page protection and things are fine.

Yeah, that's exactly why I always wondered why we need this hack with 
the vma copy on the stack.

>
> But that would really be a special case. To solve this properly we'd 
> probably need an additional lock to protect the vm_flags and 
> vm_page_prot, taken after mmap_sem and i_mmap_lock.

Well we already have a special lock for this: The reservation object. So 
memory barriers etc should be in place and I also think we can just 
update the vm_page_prot on the fly.

Christian.

>
> /Thomas
>
>
>
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* TTM huge page-faults WAS: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11  9:08             ` Koenig, Christian
@ 2019-09-11 10:10               ` Thomas Hellström (VMware)
  2019-09-11 14:06                 ` Koenig, Christian
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-11 10:10 UTC (permalink / raw)
  To: Koenig, Christian; +Cc: linux-kernel, pv-drivers, Thomas Hellstrom, dri-devel

removing people that are probably not interested from CC
adding dri-devel

On 9/11/19 11:08 AM, Koenig, Christian wrote:
> Am 10.09.19 um 21:26 schrieb Thomas Hellström (VMware):
>> On 9/10/19 6:11 PM, Andy Lutomirski wrote:
>>>> On Sep 5, 2019, at 8:24 AM, Christoph Hellwig <hch@infradead.org>
>>>> wrote:
>>>>
>>>>> On Thu, Sep 05, 2019 at 05:21:24PM +0200, Thomas Hellström (VMware)
>>>>> wrote:
>>>>>> On 9/5/19 4:15 PM, Dave Hansen wrote:
>>>>>> Hi Thomas,
>>>>>>
>>>>>> Thanks for the second batch of patches!  These look much improved
>>>>>> on all
>>>>>> fronts.
>>>>> Yes, although the TTM functionality isn't in yet. Hopefully we
>>>>> won't have to
>>>>> bother you with those though, since this assumes TTM will be using
>>>>> the dma
>>>>> API.
>>>> Please take a look at dma_mmap_prepare and dma_mmap_fault in this
>>>> branch:
>>>>
>>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/dma-mmap-improvements
>>>>
>>>> they should allow to fault dma api pages in the page fault handler.
>>>> But
>>>> this is totally hot off the press and not actually tested for the last
>>>> few patches.  Note that I've also included your two patches from this
>>>> series to handle SEV.
>>> I read that patch, and it seems like you’ve built in the assumption
>>> that all pages in the mapping use identical protection or, if not,
>>> that the same fake vma hack that TTM already has is used to fudge
>>> around it.  Could it be reworked slightly to avoid this?
>>>
>>> I wonder if it’s a mistake to put the encryption bits in vm_page_prot
>>> at all.
>>  From my POW, the encryption bits behave quite similar in behaviour to
>> the caching mode bits, and they're also in vm_page_prot. They're the
>> reason TTM needs to modify the page protection in the fault handler in
>> the first place.
>>
>> The problem seen in TTM is that we want to be able to change the
>> vm_page_prot from the fault handler, but it's problematic since we
>> have the mmap_sem typically only in read mode. Hence the fake vma
>> hack. From what I can tell it's reasonably well-behaved, since
>> pte_modify() skips the bits TTM updates, so mprotect() and mremap()
>> works OK. I think split_huge_pmd may run into trouble, but we don't
>> support it (yet) with TTM.
> Ah! I actually ran into this while implementing huge page support for
> TTM and never figured out why that doesn't work. Dropped CPU huge page
> support because of this.

By incident, I got slightly sidetracked the other day and started 
looking at this as well. Got to the point where I figured out all the 
hairy alignment issues and actually got huge_fault() calls, but never 
implemented the handler. I think that's definitely something worth 
having. Not sure it will work for IO memory, though, (split_huge_pmd 
will just skip non-page-backed memory) but if we only support VM_SHARED 
(non COW) vmas there's no reason to split the huge pmds anyway. 
Definitely something we should have IMO.

>> We could probably get away with a WRITE_ONCE() update of the
>> vm_page_prot before taking the page table lock since
>>
>> a) We're locking out all other writers.
>> b) We can't race with another fault to the same vma since we hold an
>> address space lock ("buffer object reservation")
>> c) When we need to update there are no valid page table entries in the
>> vma, since it only happens directly after mmap(), or after an
>> unmap_mapping_range() with the same address space lock. When another
>> reader (for example split_huge_pmd()) sees a valid page table entry,
>> it also sees the new page protection and things are fine.
> Yeah, that's exactly why I always wondered why we need this hack with
> the vma copy on the stack.
>
>> But that would really be a special case. To solve this properly we'd
>> probably need an additional lock to protect the vm_flags and
>> vm_page_prot, taken after mmap_sem and i_mmap_lock.
> Well we already have a special lock for this: The reservation object. So
> memory barriers etc should be in place and I also think we can just
> update the vm_page_prot on the fly.

I agree. This is needed for huge pages. We should make this change, and 
perhaps add the justification above as a comment.

/Thomas

> Christian.
>
>> /Thomas
>>
>>
>>
>>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: TTM huge page-faults WAS: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11 10:10               ` TTM huge page-faults WAS: " Thomas Hellström (VMware)
@ 2019-09-11 14:06                 ` Koenig, Christian
  2019-09-11 15:08                   ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 28+ messages in thread
From: Koenig, Christian @ 2019-09-11 14:06 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, pv-drivers, Thomas Hellstrom, dri-devel

Am 11.09.19 um 12:10 schrieb Thomas Hellström (VMware):
[SNIP]
>>> The problem seen in TTM is that we want to be able to change the
>>> vm_page_prot from the fault handler, but it's problematic since we
>>> have the mmap_sem typically only in read mode. Hence the fake vma
>>> hack. From what I can tell it's reasonably well-behaved, since
>>> pte_modify() skips the bits TTM updates, so mprotect() and mremap()
>>> works OK. I think split_huge_pmd may run into trouble, but we don't
>>> support it (yet) with TTM.
>> Ah! I actually ran into this while implementing huge page support for
>> TTM and never figured out why that doesn't work. Dropped CPU huge page
>> support because of this.
>
> By incident, I got slightly sidetracked the other day and started 
> looking at this as well. Got to the point where I figured out all the 
> hairy alignment issues and actually got huge_fault() calls, but never 
> implemented the handler. I think that's definitely something worth 
> having. Not sure it will work for IO memory, though, (split_huge_pmd 
> will just skip non-page-backed memory) but if we only support 
> VM_SHARED (non COW) vmas there's no reason to split the huge pmds 
> anyway. Definitely something we should have IMO.

Well our primary use case would be IO memory, cause system memory is 
only optionally allocate as huge page but we nearly always allocate VRAM 
in chunks of at least 2MB because we otherwise get a huge performance 
penalty.

>>> We could probably get away with a WRITE_ONCE() update of the
>>> vm_page_prot before taking the page table lock since
>>>
>>> a) We're locking out all other writers.
>>> b) We can't race with another fault to the same vma since we hold an
>>> address space lock ("buffer object reservation")
>>> c) When we need to update there are no valid page table entries in the
>>> vma, since it only happens directly after mmap(), or after an
>>> unmap_mapping_range() with the same address space lock. When another
>>> reader (for example split_huge_pmd()) sees a valid page table entry,
>>> it also sees the new page protection and things are fine.
>> Yeah, that's exactly why I always wondered why we need this hack with
>> the vma copy on the stack.
>>
>>> But that would really be a special case. To solve this properly we'd
>>> probably need an additional lock to protect the vm_flags and
>>> vm_page_prot, taken after mmap_sem and i_mmap_lock.
>> Well we already have a special lock for this: The reservation object. So
>> memory barriers etc should be in place and I also think we can just
>> update the vm_page_prot on the fly.
>
> I agree. This is needed for huge pages. We should make this change, 
> and perhaps add the justification above as a comment.

Alternatively we could introduce a new VM_* flag telling users of 
vm_page_prot to just let the pages table entries be filled by faults again.

Christian.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: TTM huge page-faults WAS: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11 14:06                 ` Koenig, Christian
@ 2019-09-11 15:08                   ` Thomas Hellström (VMware)
  2019-09-24 12:03                     ` Koenig, Christian
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-11 15:08 UTC (permalink / raw)
  To: Koenig, Christian; +Cc: linux-kernel, pv-drivers, Thomas Hellstrom, dri-devel

On 9/11/19 4:06 PM, Koenig, Christian wrote:
> Am 11.09.19 um 12:10 schrieb Thomas Hellström (VMware):
> [SNIP]
>>>> The problem seen in TTM is that we want to be able to change the
>>>> vm_page_prot from the fault handler, but it's problematic since we
>>>> have the mmap_sem typically only in read mode. Hence the fake vma
>>>> hack. From what I can tell it's reasonably well-behaved, since
>>>> pte_modify() skips the bits TTM updates, so mprotect() and mremap()
>>>> works OK. I think split_huge_pmd may run into trouble, but we don't
>>>> support it (yet) with TTM.
>>> Ah! I actually ran into this while implementing huge page support for
>>> TTM and never figured out why that doesn't work. Dropped CPU huge page
>>> support because of this.
>> By incident, I got slightly sidetracked the other day and started
>> looking at this as well. Got to the point where I figured out all the
>> hairy alignment issues and actually got huge_fault() calls, but never
>> implemented the handler. I think that's definitely something worth
>> having. Not sure it will work for IO memory, though, (split_huge_pmd
>> will just skip non-page-backed memory) but if we only support
>> VM_SHARED (non COW) vmas there's no reason to split the huge pmds
>> anyway. Definitely something we should have IMO.
> Well our primary use case would be IO memory, cause system memory is
> only optionally allocate as huge page but we nearly always allocate VRAM
> in chunks of at least 2MB because we otherwise get a huge performance
> penalty.

But that system memory option is on by default, right? In any case, a 
request for a huge_fault
would probably need to check that there is actually an underlying 
huge_page and otherwise fallback to ordinary faults.

Another requirement would be for IO memory allocations to be 
PMD_PAGE_SIZE aligned in the mappable aperture, to avoid fallbacks to 
ordinary faults. Probably increasing fragmentation somewhat. (Seems like 
pmd entries can only point to PMD_PAGE_SIZE aligned physical addresses) 
Would that work for you?

>>>> We could probably get away with a WRITE_ONCE() update of the
>>>> vm_page_prot before taking the page table lock since
>>>>
>>>> a) We're locking out all other writers.
>>>> b) We can't race with another fault to the same vma since we hold an
>>>> address space lock ("buffer object reservation")
>>>> c) When we need to update there are no valid page table entries in the
>>>> vma, since it only happens directly after mmap(), or after an
>>>> unmap_mapping_range() with the same address space lock. When another
>>>> reader (for example split_huge_pmd()) sees a valid page table entry,
>>>> it also sees the new page protection and things are fine.
>>> Yeah, that's exactly why I always wondered why we need this hack with
>>> the vma copy on the stack.
>>>
>>>> But that would really be a special case. To solve this properly we'd
>>>> probably need an additional lock to protect the vm_flags and
>>>> vm_page_prot, taken after mmap_sem and i_mmap_lock.
>>> Well we already have a special lock for this: The reservation object. So
>>> memory barriers etc should be in place and I also think we can just
>>> update the vm_page_prot on the fly.
>> I agree. This is needed for huge pages. We should make this change,
>> and perhaps add the justification above as a comment.
> Alternatively we could introduce a new VM_* flag telling users of
> vm_page_prot to just let the pages table entries be filled by faults again

An interesting idea, although we'd lose things like dirty-tracking bits.

/Thomas


>
> Christian.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11  7:49               ` Thomas Hellström (VMware)
@ 2019-09-11 18:03                 ` Andy Lutomirski
  2019-09-12  8:29                   ` Thomas Hellström (VMware)
  0 siblings, 1 reply; 28+ messages in thread
From: Andy Lutomirski @ 2019-09-11 18:03 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: Andy Lutomirski, Christoph Hellwig, Dave Hansen, LKML, X86 ML,
	pv-drivers, Thomas Hellstrom, Dave Hansen, Peter Zijlstra,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky

On Wed, Sep 11, 2019 at 12:49 AM Thomas Hellström (VMware)
<thomas_os@shipmail.org> wrote:
>
> Hi, Andy.
>
> On 9/11/19 6:18 AM, Andy Lutomirski wrote:

> >
> > As a for-real example, take a look at arch/x86/entry/vdso/vma.c.  The
> > "vvar" VMA contains multiple pages that are backed by different types
> > of memory.  There's a page of ordinary kernel memory.  Historically
> > there was a page of genuine MMIO memory, but that's gone now.  If the
> > system is a SEV guest with pvclock enabled, then there's a page of
> > decrypted memory.   They all share a VMA, they're instantiated in
> > .fault, and there is no hackery involved.  Look at vvar_fault().
>
> So this is conceptually identical to TTM. The difference is that it uses
> vmf_insert_pfn_prot() instead of vmf_insert_mixed() with the vma hack.
> Had there been a vmf_insert_mixed_prot(), the hack in TTM wouldn't be
> needed. I guess providing a vmf_insert_mixed_prot() is a to-do for me to
> pick up.

:)

>
> Having said that, the code you point out is as fragile and suffers from
> the same shortcomings as TTM since
> a) Running things like split_huge_pmd() that takes the vm_page_prot and
> applies it to new PTEs will make things break, (although probably never
> applicable in this case).

Hmm.  There are no vvar huge pages, so this is okay.

I wonder how hard it would be to change the huge page splitting code
to copy the encryption and cacheability bits from the source entry
instead of getting them from vm_page_prot, at least in the cases
relevant to VM_MIXEDMAP and VM_PFNMAP.

> b) Running mprotect() on that VMA will only work if sme_me_mask is part
> of _PAGE_CHG_MASK (which is addressed in a reliable way in my recent
> patchset),  otherwise, the encryption setting will be overwritten.
>

Indeed.  Thanks for the fix!

>
> >> We could probably get away with a WRITE_ONCE() update of the
> >> vm_page_prot before taking the page table lock since
> >>
> >> a) We're locking out all other writers.
> >> b) We can't race with another fault to the same vma since we hold an
> >> address space lock ("buffer object reservation")
> >> c) When we need to update there are no valid page table entries in the
> >> vma, since it only happens directly after mmap(), or after an
> >> unmap_mapping_range() with the same address space lock. When another
> >> reader (for example split_huge_pmd()) sees a valid page table entry, it
> >> also sees the new page protection and things are fine.
> >>
> >> But that would really be a special case. To solve this properly we'd
> >> probably need an additional lock to protect the vm_flags and
> >> vm_page_prot, taken after mmap_sem and i_mmap_lock.
> >>
> > This is all horrible IMO.
>
> I'd prefer to call it fragile and potentially troublesome to maintain.

Fair enough.

>
> That distinction is important because if it ever comes to a choice
> between adding a new lock to protect vm_page_prot (and consequently slow
> down the whole vm system) and using the WRITE_ONCE solution in TTM, we
> should know what the implications are. As it turns out previous choices
> in this area actually seem to have opted for the lockless WRITE_ONCE /
> READ_ONCE / ptl solution. See __split_huge_pmd_locked() and
> vma_set_page_prot().

I think it would be even better if the whole thing could work without
ever writing to vm_page_prot.  This would be a requirement for vvar in
the unlikely event that the vvar vma ever supported splittable huge
pages.  Fortunately, that seems unlikely :)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11 18:03                 ` Andy Lutomirski
@ 2019-09-12  8:29                   ` Thomas Hellström (VMware)
  0 siblings, 0 replies; 28+ messages in thread
From: Thomas Hellström (VMware) @ 2019-09-12  8:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Christoph Hellwig, Dave Hansen, LKML, X86 ML, pv-drivers,
	Thomas Hellstrom, Dave Hansen, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	Christian König, Marek Szyprowski, Tom Lendacky

On 9/11/19 8:03 PM, Andy Lutomirski wrote:
>
>> That distinction is important because if it ever comes to a choice
>> between adding a new lock to protect vm_page_prot (and consequently slow
>> down the whole vm system) and using the WRITE_ONCE solution in TTM, we
>> should know what the implications are. As it turns out previous choices
>> in this area actually seem to have opted for the lockless WRITE_ONCE /
>> READ_ONCE / ptl solution. See __split_huge_pmd_locked() and
>> vma_set_page_prot().
> I think it would be even better if the whole thing could work without
> ever writing to vm_page_prot.  This would be a requirement for vvar in
> the unlikely event that the vvar vma ever supported splittable huge
> pages.  Fortunately, that seems unlikely :)

Yeah, for TTM the situation is different since we want huge vm pages  at 
some point.

But I re-read __split_huge_pmd_locked() and it actually looks like 
vm_page_prot is only accessed for anonymous vmas. For other vmas, it 
appears it just simply zaps the PMD, relying on re-faulting the page 
table enries if necessary (as also suggested by Christian in another 
thread).

So perhaps we should be good never writing to vm_page_prot.

/Thomas



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: TTM huge page-faults WAS: Re: [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit
  2019-09-11 15:08                   ` Thomas Hellström (VMware)
@ 2019-09-24 12:03                     ` Koenig, Christian
  0 siblings, 0 replies; 28+ messages in thread
From: Koenig, Christian @ 2019-09-24 12:03 UTC (permalink / raw)
  To: Thomas Hellström (VMware)
  Cc: linux-kernel, pv-drivers, Thomas Hellstrom, dri-devel

Am 11.09.19 um 17:08 schrieb Thomas Hellström (VMware):
> On 9/11/19 4:06 PM, Koenig, Christian wrote:
>> Am 11.09.19 um 12:10 schrieb Thomas Hellström (VMware):
>> [SNIP]
>>>>> The problem seen in TTM is that we want to be able to change the
>>>>> vm_page_prot from the fault handler, but it's problematic since we
>>>>> have the mmap_sem typically only in read mode. Hence the fake vma
>>>>> hack. From what I can tell it's reasonably well-behaved, since
>>>>> pte_modify() skips the bits TTM updates, so mprotect() and mremap()
>>>>> works OK. I think split_huge_pmd may run into trouble, but we don't
>>>>> support it (yet) with TTM.
>>>> Ah! I actually ran into this while implementing huge page support for
>>>> TTM and never figured out why that doesn't work. Dropped CPU huge page
>>>> support because of this.
>>> By incident, I got slightly sidetracked the other day and started
>>> looking at this as well. Got to the point where I figured out all the
>>> hairy alignment issues and actually got huge_fault() calls, but never
>>> implemented the handler. I think that's definitely something worth
>>> having. Not sure it will work for IO memory, though, (split_huge_pmd
>>> will just skip non-page-backed memory) but if we only support
>>> VM_SHARED (non COW) vmas there's no reason to split the huge pmds
>>> anyway. Definitely something we should have IMO.
>> Well our primary use case would be IO memory, cause system memory is
>> only optionally allocate as huge page but we nearly always allocate VRAM
>> in chunks of at least 2MB because we otherwise get a huge performance
>> penalty.
>
> But that system memory option is on by default, right? In any case, a 
> request for a huge_fault
> would probably need to check that there is actually an underlying 
> huge_page and otherwise fallback to ordinary faults.
>
> Another requirement would be for IO memory allocations to be 
> PMD_PAGE_SIZE aligned in the mappable aperture, to avoid fallbacks to 
> ordinary faults. Probably increasing fragmentation somewhat. (Seems 
> like pmd entries can only point to PMD_PAGE_SIZE aligned physical 
> addresses) Would that work for you?

Yeah, we do it this way anyway.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2019-09-24 12:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-05 10:35 [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Thomas Hellström (VMware)
2019-09-05 10:35 ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Thomas Hellström (VMware)
2019-09-05 14:15   ` Dave Hansen
2019-09-05 15:21     ` Thomas Hellström (VMware)
2019-09-05 15:24       ` Christoph Hellwig
2019-09-05 16:40         ` Thomas Hellström (VMware)
2019-09-05 17:05         ` dma_mmap_fault discussion Thomas Hellström (VMware)
2019-09-06  6:32           ` Christoph Hellwig
2019-09-06  7:10             ` Thomas Hellström (VMware)
2019-09-06  7:20               ` Christoph Hellwig
2019-09-10  8:37                 ` Thomas Hellström (VMware)
2019-09-10 16:11         ` [RFC PATCH 1/2] x86: Don't let pgprot_modify() change the page encryption bit Andy Lutomirski
2019-09-10 19:26           ` Thomas Hellström (VMware)
2019-09-11  4:18             ` Andy Lutomirski
2019-09-11  7:49               ` Thomas Hellström (VMware)
2019-09-11 18:03                 ` Andy Lutomirski
2019-09-12  8:29                   ` Thomas Hellström (VMware)
2019-09-11  9:08             ` Koenig, Christian
2019-09-11 10:10               ` TTM huge page-faults WAS: " Thomas Hellström (VMware)
2019-09-11 14:06                 ` Koenig, Christian
2019-09-11 15:08                   ` Thomas Hellström (VMware)
2019-09-24 12:03                     ` Koenig, Christian
2019-09-05 15:59       ` Dave Hansen
2019-09-05 16:29         ` Thomas Hellström (VMware)
2019-09-05 10:35 ` [RFC PATCH 2/2] dma-mapping: Fix dma_pgprot() for unencrypted coherent pages Thomas Hellström (VMware)
2019-09-05 11:23 ` [RFC PATCH 0/2] Fix SEV user-space mapping of unencrypted coherent memory Christoph Hellwig
2019-09-10  6:11   ` Christoph Hellwig
2019-09-10  6:25     ` Thomas Hellström (VMware)

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).