linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Some MMU notifier cleanups for Intel/AMD IOMMU
@ 2019-01-30  5:57 Peter Xu
  2019-01-30  5:57 ` [PATCH 1/2] iommu/vt-d: Remove change_pte notifier Peter Xu
  2019-01-30  5:57 ` [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier Peter Xu
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Xu @ 2019-01-30  5:57 UTC (permalink / raw)
  To: iommu, linux-kernel; +Cc: joro, Jason Wang, peterx, Jerome Glisse

Recently when I'm reading the mmu notifiers I noticed that both
Intel/AMD IOMMU drivers seem to have redundancies in using the MMU
notifiers.  It can also be seen as a follow up of commit 8301da53fbc1
("iommu/amd: Remove change_pte mmu_notifier call-back", 2014-07-30).

I don't have hardwares to test them, but they compile well.

Please have a look, thanks.

Peter Xu (2):
  iommu/vt-d: Remove change_pte notifier
  iommu/amd: Remove clear_flush_young notifier

 drivers/iommu/amd_iommu_v2.c | 24 ------------------------
 drivers/iommu/intel-svm.c    |  9 ---------
 2 files changed, 33 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] iommu/vt-d: Remove change_pte notifier
  2019-01-30  5:57 [PATCH 0/2] Some MMU notifier cleanups for Intel/AMD IOMMU Peter Xu
@ 2019-01-30  5:57 ` Peter Xu
  2019-01-30 16:32   ` Joerg Roedel
  2019-01-30  5:57 ` [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier Peter Xu
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Xu @ 2019-01-30  5:57 UTC (permalink / raw)
  To: iommu, linux-kernel; +Cc: joro, Jason Wang, peterx, Jerome Glisse

The change_pte() interface is tailored for PFN updates, while the
other notifier invalidate_range() should be enough for Intel IOMMU
cache flushing.  Actually we've done similar thing for AMD IOMMU
already in 8301da53fbc1 ("iommu/amd: Remove change_pte mmu_notifier
call-back", 2014-07-30) but the Intel IOMMU driver still have it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/iommu/intel-svm.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index a2a2aa4439aa..e9fd3ca057ac 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -180,14 +180,6 @@ static void intel_flush_svm_range(struct intel_svm *svm, unsigned long address,
 	rcu_read_unlock();
 }
 
-static void intel_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
-			     unsigned long address, pte_t pte)
-{
-	struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
-
-	intel_flush_svm_range(svm, address, 1, 1, 0);
-}
-
 /* Pages have been freed at this point */
 static void intel_invalidate_range(struct mmu_notifier *mn,
 				   struct mm_struct *mm,
@@ -227,7 +219,6 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 static const struct mmu_notifier_ops intel_mmuops = {
 	.release = intel_mm_release,
-	.change_pte = intel_change_pte,
 	.invalidate_range = intel_invalidate_range,
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-01-30  5:57 [PATCH 0/2] Some MMU notifier cleanups for Intel/AMD IOMMU Peter Xu
  2019-01-30  5:57 ` [PATCH 1/2] iommu/vt-d: Remove change_pte notifier Peter Xu
@ 2019-01-30  5:57 ` Peter Xu
  2019-01-30 12:27   ` Jean-Philippe Brucker
  2019-01-30 16:31   ` Joerg Roedel
  1 sibling, 2 replies; 11+ messages in thread
From: Peter Xu @ 2019-01-30  5:57 UTC (permalink / raw)
  To: iommu, linux-kernel; +Cc: joro, Jason Wang, peterx, Jerome Glisse

AMD IOMMU driver is using the clear_flush_young() to do cache flushing
but that's actually already covered by invalidate_range().  Remove the
extra notifier and the chunks.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c | 24 ------------------------
 1 file changed, 24 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 23dae9348ace..5d7ef750e4a0 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -370,29 +370,6 @@ static struct pasid_state *mn_to_state(struct mmu_notifier *mn)
 	return container_of(mn, struct pasid_state, mn);
 }
 
-static void __mn_flush_page(struct mmu_notifier *mn,
-			    unsigned long address)
-{
-	struct pasid_state *pasid_state;
-	struct device_state *dev_state;
-
-	pasid_state = mn_to_state(mn);
-	dev_state   = pasid_state->device_state;
-
-	amd_iommu_flush_page(dev_state->domain, pasid_state->pasid, address);
-}
-
-static int mn_clear_flush_young(struct mmu_notifier *mn,
-				struct mm_struct *mm,
-				unsigned long start,
-				unsigned long end)
-{
-	for (; start < end; start += PAGE_SIZE)
-		__mn_flush_page(mn, start);
-
-	return 0;
-}
-
 static void mn_invalidate_range(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end)
@@ -430,7 +407,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 static const struct mmu_notifier_ops iommu_mn = {
 	.release		= mn_release,
-	.clear_flush_young      = mn_clear_flush_young,
 	.invalidate_range       = mn_invalidate_range,
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-01-30  5:57 ` [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier Peter Xu
@ 2019-01-30 12:27   ` Jean-Philippe Brucker
  2019-01-31  7:59     ` Peter Xu
  2019-01-30 16:31   ` Joerg Roedel
  1 sibling, 1 reply; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-30 12:27 UTC (permalink / raw)
  To: Peter Xu, iommu, linux-kernel; +Cc: Jerome Glisse, Jason Wang

Hi Peter,

On 30/01/2019 05:57, Peter Xu wrote:
> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> but that's actually already covered by invalidate_range().  Remove the
> extra notifier and the chunks.

Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
I understood correctly, when doing reclaim the kernel clears the young
bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
new accesses from devices will go through the IOMMU, set the young bit
again and the kernel can accurately track page use. I didn't see
invalidate_range() being called by rmap or vmscan in this case, but
might just be missing it.

Two MMU notifiers are used for the young bit, clear_flush_young() and
clear_young(). I believe the former should invalidate ATCs so that DMA
accesses participate in PTE aging. Otherwise the kernel can't know that
the device is still using entries marked 'old'. The latter,
clear_young() is exempted from invalidating the secondary TLBs by
mmu_notifier.h and the IOMMU driver doesn't need to implement it.

Thanks,
Jean

> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  drivers/iommu/amd_iommu_v2.c | 24 ------------------------
>  1 file changed, 24 deletions(-)
> 
> diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
> index 23dae9348ace..5d7ef750e4a0 100644
> --- a/drivers/iommu/amd_iommu_v2.c
> +++ b/drivers/iommu/amd_iommu_v2.c
> @@ -370,29 +370,6 @@ static struct pasid_state *mn_to_state(struct mmu_notifier *mn)
>  	return container_of(mn, struct pasid_state, mn);
>  }
>  
> -static void __mn_flush_page(struct mmu_notifier *mn,
> -			    unsigned long address)
> -{
> -	struct pasid_state *pasid_state;
> -	struct device_state *dev_state;
> -
> -	pasid_state = mn_to_state(mn);
> -	dev_state   = pasid_state->device_state;
> -
> -	amd_iommu_flush_page(dev_state->domain, pasid_state->pasid, address);
> -}
> -
> -static int mn_clear_flush_young(struct mmu_notifier *mn,
> -				struct mm_struct *mm,
> -				unsigned long start,
> -				unsigned long end)
> -{
> -	for (; start < end; start += PAGE_SIZE)
> -		__mn_flush_page(mn, start);
> -
> -	return 0;
> -}
> -
>  static void mn_invalidate_range(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
>  				unsigned long start, unsigned long end)
> @@ -430,7 +407,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  
>  static const struct mmu_notifier_ops iommu_mn = {
>  	.release		= mn_release,
> -	.clear_flush_young      = mn_clear_flush_young,
>  	.invalidate_range       = mn_invalidate_range,
>  };
>  
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-01-30  5:57 ` [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier Peter Xu
  2019-01-30 12:27   ` Jean-Philippe Brucker
@ 2019-01-30 16:31   ` Joerg Roedel
  1 sibling, 0 replies; 11+ messages in thread
From: Joerg Roedel @ 2019-01-30 16:31 UTC (permalink / raw)
  To: Peter Xu; +Cc: iommu, linux-kernel, Jason Wang, Jerome Glisse

On Wed, Jan 30, 2019 at 01:57:58PM +0800, Peter Xu wrote:
> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> but that's actually already covered by invalidate_range().  Remove the
> extra notifier and the chunks.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Applied to x86/amd, thanks.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] iommu/vt-d: Remove change_pte notifier
  2019-01-30  5:57 ` [PATCH 1/2] iommu/vt-d: Remove change_pte notifier Peter Xu
@ 2019-01-30 16:32   ` Joerg Roedel
  0 siblings, 0 replies; 11+ messages in thread
From: Joerg Roedel @ 2019-01-30 16:32 UTC (permalink / raw)
  To: Peter Xu; +Cc: iommu, linux-kernel, Jason Wang, Jerome Glisse

On Wed, Jan 30, 2019 at 01:57:57PM +0800, Peter Xu wrote:
> The change_pte() interface is tailored for PFN updates, while the
> other notifier invalidate_range() should be enough for Intel IOMMU
> cache flushing.  Actually we've done similar thing for AMD IOMMU
> already in 8301da53fbc1 ("iommu/amd: Remove change_pte mmu_notifier
> call-back", 2014-07-30) but the Intel IOMMU driver still have it.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

Applied to x86/vt-d, thanks.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-01-30 12:27   ` Jean-Philippe Brucker
@ 2019-01-31  7:59     ` Peter Xu
  2019-01-31 12:25       ` Jean-Philippe Brucker
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2019-01-31  7:59 UTC (permalink / raw)
  To: Jean-Philippe Brucker; +Cc: iommu, linux-kernel, Jerome Glisse, Jason Wang

On Wed, Jan 30, 2019 at 12:27:40PM +0000, Jean-Philippe Brucker wrote:
> Hi Peter,

Hi, Jean,

> 
> On 30/01/2019 05:57, Peter Xu wrote:
> > AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> > but that's actually already covered by invalidate_range().  Remove the
> > extra notifier and the chunks.
> 
> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> I understood correctly, when doing reclaim the kernel clears the young
> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> new accesses from devices will go through the IOMMU, set the young bit
> again and the kernel can accurately track page use. I didn't see
> invalidate_range() being called by rmap or vmscan in this case, but
> might just be missing it.
> 
> Two MMU notifiers are used for the young bit, clear_flush_young() and
> clear_young(). I believe the former should invalidate ATCs so that DMA
> accesses participate in PTE aging. Otherwise the kernel can't know that
> the device is still using entries marked 'old'. The latter,
> clear_young() is exempted from invalidating the secondary TLBs by
> mmu_notifier.h and the IOMMU driver doesn't need to implement it.

Yes I agree most of your analysis, but IMHO the problem is that the
AMD IOMMU is not really participating in the PTE aging after all.
Please have a look at mn_clear_flush_young() below at [1] - it's
always returning zero, no matter whether the page has been accessed by
device or not.  A real user of it could be someone like KVM (please
see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
shadow PTEs and do test-and-clear on that, then return the result to
the core mm.  That's why I think currently the AMD driver was only
trying to use that as a way to do an extra flush however IMHO it's
redundant.

Thanks,

> > -static int mn_clear_flush_young(struct mmu_notifier *mn,
> > -				struct mm_struct *mm,
> > -				unsigned long start,
> > -				unsigned long end)
> > -{
> > -	for (; start < end; start += PAGE_SIZE)
> > -		__mn_flush_page(mn, start);
> > -
> > -	return 0;

[1]

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-01-31  7:59     ` Peter Xu
@ 2019-01-31 12:25       ` Jean-Philippe Brucker
  2019-02-01  3:51         ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-01-31 12:25 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jerome Glisse, iommu, linux-kernel, Jason Wang

On 31/01/2019 07:59, Peter Xu wrote:
> On Wed, Jan 30, 2019 at 12:27:40PM +0000, Jean-Philippe Brucker wrote:
>> Hi Peter,
> 
> Hi, Jean,
> 
>>
>> On 30/01/2019 05:57, Peter Xu wrote:
>>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
>>> but that's actually already covered by invalidate_range().  Remove the
>>> extra notifier and the chunks.
>>
>> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
>> I understood correctly, when doing reclaim the kernel clears the young
>> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
>> new accesses from devices will go through the IOMMU, set the young bit
>> again and the kernel can accurately track page use. I didn't see
>> invalidate_range() being called by rmap or vmscan in this case, but
>> might just be missing it.
>>
>> Two MMU notifiers are used for the young bit, clear_flush_young() and
>> clear_young(). I believe the former should invalidate ATCs so that DMA
>> accesses participate in PTE aging. Otherwise the kernel can't know that
>> the device is still using entries marked 'old'. The latter,
>> clear_young() is exempted from invalidating the secondary TLBs by
>> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
> 
> Yes I agree most of your analysis, but IMHO the problem is that the
> AMD IOMMU is not really participating in the PTE aging after all.
> Please have a look at mn_clear_flush_young() below at [1] - it's
> always returning zero, no matter whether the page has been accessed by
> device or not.  A real user of it could be someone like KVM (please
> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
> shadow PTEs and do test-and-clear on that, then return the result to
> the core mm.  That's why I think currently the AMD driver was only
> trying to use that as a way to do an extra flush however IMHO it's
> redundant.

Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
flush, since the level-1 page table is shared with the CPU. But removing
the flush gets you in the following situation:

  (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
young and the entry is cached in the ATC.

  (2) The CPU does ptep_clear_flush_young_notify(), clears young,
notices that the page is being used.

  (3) Device accesses $addr again. Given that we didn't invalidate the
ATC in (2) it accesses the page directly, without going through the IOMMU.

  (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
have the young bit, which means the page isn't being used and can be
reclaimed.

That's not accurate since the page is being used by the device. At step
(2) we should invalidate the ATC, so that (3) fetches the PTE again and
marks it young.

I can agree that the clear_flush_young() notifier is too brutal for this
purpose, since we send spurious ATC invalidation even when the PTE
wasn't young (and ATC inv is expensive). There should be another MMU
notifier "flush_young()" that is called by
ptep_clear_flush_young_notify() only when the page was actually young.
But for the moment it's the best we have to avoid the situation above.

I don't know enough about mm to understand exactly how the
page_referenced() information is used, but I believe the problem is only
about accuracy and not correctness. So applying this patch might not
break anything (after all, intel-svm.c never implemented the notifier)
but I think I'll keep the notifier in my SVA rework [1].

Thanks,
Jean

[1] https://www.spinics.net/lists/iommu/msg30081.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-01-31 12:25       ` Jean-Philippe Brucker
@ 2019-02-01  3:51         ` Peter Xu
  2019-02-01 11:46           ` Jean-Philippe Brucker
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2019-02-01  3:51 UTC (permalink / raw)
  To: Jean-Philippe Brucker; +Cc: Jerome Glisse, iommu, linux-kernel, Jason Wang

On Thu, Jan 31, 2019 at 12:25:40PM +0000, Jean-Philippe Brucker wrote:
> On 31/01/2019 07:59, Peter Xu wrote:
> > On Wed, Jan 30, 2019 at 12:27:40PM +0000, Jean-Philippe Brucker wrote:
> >> Hi Peter,
> > 
> > Hi, Jean,
> > 
> >>
> >> On 30/01/2019 05:57, Peter Xu wrote:
> >>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> >>> but that's actually already covered by invalidate_range().  Remove the
> >>> extra notifier and the chunks.
> >>
> >> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> >> I understood correctly, when doing reclaim the kernel clears the young
> >> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> >> new accesses from devices will go through the IOMMU, set the young bit
> >> again and the kernel can accurately track page use. I didn't see
> >> invalidate_range() being called by rmap or vmscan in this case, but
> >> might just be missing it.
> >>
> >> Two MMU notifiers are used for the young bit, clear_flush_young() and
> >> clear_young(). I believe the former should invalidate ATCs so that DMA
> >> accesses participate in PTE aging. Otherwise the kernel can't know that
> >> the device is still using entries marked 'old'. The latter,
> >> clear_young() is exempted from invalidating the secondary TLBs by
> >> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
> > 
> > Yes I agree most of your analysis, but IMHO the problem is that the
> > AMD IOMMU is not really participating in the PTE aging after all.
> > Please have a look at mn_clear_flush_young() below at [1] - it's
> > always returning zero, no matter whether the page has been accessed by
> > device or not.  A real user of it could be someone like KVM (please
> > see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
> > shadow PTEs and do test-and-clear on that, then return the result to
> > the core mm.  That's why I think currently the AMD driver was only
> > trying to use that as a way to do an extra flush however IMHO it's
> > redundant.
> 
> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
> flush, since the level-1 page table is shared with the CPU. But removing
> the flush gets you in the following situation:
> 
>   (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
> young and the entry is cached in the ATC.
> 
>   (2) The CPU does ptep_clear_flush_young_notify(), clears young,
> notices that the page is being used.
> 
>   (3) Device accesses $addr again. Given that we didn't invalidate the
> ATC in (2) it accesses the page directly, without going through the IOMMU.
> 
>   (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
> have the young bit, which means the page isn't being used and can be
> reclaimed.
> 
> That's not accurate since the page is being used by the device. At step
> (2) we should invalidate the ATC, so that (3) fetches the PTE again and
> marks it young.
> 
> I can agree that the clear_flush_young() notifier is too brutal for this
> purpose, since we send spurious ATC invalidation even when the PTE
> wasn't young (and ATC inv is expensive). There should be another MMU
> notifier "flush_young()" that is called by
> ptep_clear_flush_young_notify() only when the page was actually young.
> But for the moment it's the best we have to avoid the situation above.
> 
> I don't know enough about mm to understand exactly how the
> page_referenced() information is used, but I believe the problem is only
> about accuracy and not correctness. So applying this patch might not
> break anything (after all, intel-svm.c never implemented the notifier)
> but I think I'll keep the notifier in my SVA rework [1].

I see your point.  I'm not an expert of mm either, but I'd say AFAIU
this happens in CPU TLB too.  Please have a look at
ptep_clear_flush_young():

int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * On x86 CPUs, clearing the accessed bit without a TLB flush
	 * doesn't cause data corruption. [ It could cause incorrect
	 * page aging and the (mistaken) reclaim of hot pages, but the
	 * chance of that should be relatively low. ]
	 *
	 * So as a performance optimization don't flush the TLB when
	 * clearing the accessed bit, it will eventually be flushed by
	 * a context switch or a VM operation anyway. [ In the rare
	 * event of it not getting flushed for a long time the delay
	 * shouldn't really matter because there's no real memory
	 * pressure for swapout to react to. ]
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}

So maybe it is a tradeoff between performance and accuracy.  IMHO the
IOMMU cache flushing might affect the performance even more than CPU
TLB flushing if the invalidation command takes a long time to run
(e.g., amd_iommu_flush_page is far slower than a TLB flush
instruction, locks to take, queue commands, explicit wait for the
invalidation to happen, etc.) so it can potentially even drag down the
mm young bit access as a whole not to mention the future cache misses
from the device side.

Even if you really want to make the young bit accurate for the SVA
work, IMHO you may still want to implement the lightweight version of
clear_young() too otherwise it might be inaccurate again in idle page
tracking.

Another thing to mention is that disregarding the discussion about
young bit - IMO you should probably don't need the change_pte() which
I'm more confident with.

Thanks,

> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/iommu/msg30081.html

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-02-01  3:51         ` Peter Xu
@ 2019-02-01 11:46           ` Jean-Philippe Brucker
  2019-02-02  7:08             ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Jean-Philippe Brucker @ 2019-02-01 11:46 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jerome Glisse, iommu, linux-kernel, Jason Wang

On 01/02/2019 03:51, Peter Xu wrote:
> On Thu, Jan 31, 2019 at 12:25:40PM +0000, Jean-Philippe Brucker wrote:
>> On 31/01/2019 07:59, Peter Xu wrote:
>>> On Wed, Jan 30, 2019 at 12:27:40PM +0000, Jean-Philippe Brucker wrote:
>>>> Hi Peter,
>>>
>>> Hi, Jean,
>>>
>>>>
>>>> On 30/01/2019 05:57, Peter Xu wrote:
>>>>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
>>>>> but that's actually already covered by invalidate_range().  Remove the
>>>>> extra notifier and the chunks.
>>>>
>>>> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
>>>> I understood correctly, when doing reclaim the kernel clears the young
>>>> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
>>>> new accesses from devices will go through the IOMMU, set the young bit
>>>> again and the kernel can accurately track page use. I didn't see
>>>> invalidate_range() being called by rmap or vmscan in this case, but
>>>> might just be missing it.
>>>>
>>>> Two MMU notifiers are used for the young bit, clear_flush_young() and
>>>> clear_young(). I believe the former should invalidate ATCs so that DMA
>>>> accesses participate in PTE aging. Otherwise the kernel can't know that
>>>> the device is still using entries marked 'old'. The latter,
>>>> clear_young() is exempted from invalidating the secondary TLBs by
>>>> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
>>>
>>> Yes I agree most of your analysis, but IMHO the problem is that the
>>> AMD IOMMU is not really participating in the PTE aging after all.
>>> Please have a look at mn_clear_flush_young() below at [1] - it's
>>> always returning zero, no matter whether the page has been accessed by
>>> device or not.  A real user of it could be someone like KVM (please
>>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
>>> shadow PTEs and do test-and-clear on that, then return the result to
>>> the core mm.  That's why I think currently the AMD driver was only
>>> trying to use that as a way to do an extra flush however IMHO it's
>>> redundant.
>>
>> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
>> flush, since the level-1 page table is shared with the CPU. But removing
>> the flush gets you in the following situation:
>>
>>   (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
>> young and the entry is cached in the ATC.
>>
>>   (2) The CPU does ptep_clear_flush_young_notify(), clears young,
>> notices that the page is being used.
>>
>>   (3) Device accesses $addr again. Given that we didn't invalidate the
>> ATC in (2) it accesses the page directly, without going through the IOMMU.
>>
>>   (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
>> have the young bit, which means the page isn't being used and can be
>> reclaimed.
>>
>> That's not accurate since the page is being used by the device. At step
>> (2) we should invalidate the ATC, so that (3) fetches the PTE again and
>> marks it young.
>>
>> I can agree that the clear_flush_young() notifier is too brutal for this
>> purpose, since we send spurious ATC invalidation even when the PTE
>> wasn't young (and ATC inv is expensive). There should be another MMU
>> notifier "flush_young()" that is called by
>> ptep_clear_flush_young_notify() only when the page was actually young.
>> But for the moment it's the best we have to avoid the situation above.
>>
>> I don't know enough about mm to understand exactly how the
>> page_referenced() information is used, but I believe the problem is only
>> about accuracy and not correctness. So applying this patch might not
>> break anything (after all, intel-svm.c never implemented the notifier)
>> but I think I'll keep the notifier in my SVA rework [1].
> 
> I see your point.  I'm not an expert of mm either, but I'd say AFAIU
> this happens in CPU TLB too.  Please have a look at
> ptep_clear_flush_young():
> 
> int ptep_clear_flush_young(struct vm_area_struct *vma,
> 			   unsigned long address, pte_t *ptep)
> {
> 	/*
> 	 * On x86 CPUs, clearing the accessed bit without a TLB flush
> 	 * doesn't cause data corruption. [ It could cause incorrect
> 	 * page aging and the (mistaken) reclaim of hot pages, but the
> 	 * chance of that should be relatively low. ]
> 	 *
> 	 * So as a performance optimization don't flush the TLB when
> 	 * clearing the accessed bit, it will eventually be flushed by
> 	 * a context switch or a VM operation anyway. [ In the rare
> 	 * event of it not getting flushed for a long time the delay
> 	 * shouldn't really matter because there's no real memory
> 	 * pressure for swapout to react to. ]
> 	 */
> 	return ptep_test_and_clear_young(vma, address, ptep);
> }

Aha I see. The arm64 version of ptep_clear_flush_young() does invalidate
the TLB if the PTE was young (perhaps because we don't invalidate the
TLB on context switch). For SVA I would have liked to simply invalidate
the ATC whenever the CPU invalidates its TLB, but that falls apart if
archs are doing different things...

> So maybe it is a tradeoff between performance and accuracy.  IMHO the
> IOMMU cache flushing might affect the performance even more than CPU
> TLB flushing if the invalidation command takes a long time to run
> (e.g., amd_iommu_flush_page is far slower than a TLB flush
> instruction, locks to take, queue commands, explicit wait for the
> invalidation to happen, etc.) so it can potentially even drag down the
> mm young bit access as a whole not to mention the future cache misses
> from the device side.
> 
> Even if you really want to make the young bit accurate for the SVA
> work, IMHO you may still want to implement the lightweight version of
> clear_young() too otherwise it might be inaccurate again in idle page
> tracking.

Yes there is another conversation about this on the new idle_pages
proposal [1], which would never send any ATC invalidation. I'm not sure
what we should do about it - making clear_young() flush the IOTLB seems
way too expensive there as well, we'll probably want something more
selective.

> Another thing to mention is that disregarding the discussion about
> young bit - IMO you should probably don't need the change_pte() which
> I'm more confident with.

I haven't dug too much into change_pte() yet, but I'll keep that in
mind, thanks!

Jean

[1] https://patchwork.kernel.org/cover/10743133/#22446425

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier
  2019-02-01 11:46           ` Jean-Philippe Brucker
@ 2019-02-02  7:08             ` Peter Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2019-02-02  7:08 UTC (permalink / raw)
  To: Jean-Philippe Brucker; +Cc: Jerome Glisse, iommu, linux-kernel, Jason Wang

On Fri, Feb 01, 2019 at 11:46:00AM +0000, Jean-Philippe Brucker wrote:
> On 01/02/2019 03:51, Peter Xu wrote:
> > On Thu, Jan 31, 2019 at 12:25:40PM +0000, Jean-Philippe Brucker wrote:
> >> On 31/01/2019 07:59, Peter Xu wrote:
> >>> On Wed, Jan 30, 2019 at 12:27:40PM +0000, Jean-Philippe Brucker wrote:
> >>>> Hi Peter,
> >>>
> >>> Hi, Jean,
> >>>
> >>>>
> >>>> On 30/01/2019 05:57, Peter Xu wrote:
> >>>>> AMD IOMMU driver is using the clear_flush_young() to do cache flushing
> >>>>> but that's actually already covered by invalidate_range().  Remove the
> >>>>> extra notifier and the chunks.
> >>>>
> >>>> Aren't uses of clear_flush_young() and invalidate_range() orthogonal? If
> >>>> I understood correctly, when doing reclaim the kernel clears the young
> >>>> bit from the PTE. This requires flushing secondary TLBs (ATCs), so that
> >>>> new accesses from devices will go through the IOMMU, set the young bit
> >>>> again and the kernel can accurately track page use. I didn't see
> >>>> invalidate_range() being called by rmap or vmscan in this case, but
> >>>> might just be missing it.
> >>>>
> >>>> Two MMU notifiers are used for the young bit, clear_flush_young() and
> >>>> clear_young(). I believe the former should invalidate ATCs so that DMA
> >>>> accesses participate in PTE aging. Otherwise the kernel can't know that
> >>>> the device is still using entries marked 'old'. The latter,
> >>>> clear_young() is exempted from invalidating the secondary TLBs by
> >>>> mmu_notifier.h and the IOMMU driver doesn't need to implement it.
> >>>
> >>> Yes I agree most of your analysis, but IMHO the problem is that the
> >>> AMD IOMMU is not really participating in the PTE aging after all.
> >>> Please have a look at mn_clear_flush_young() below at [1] - it's
> >>> always returning zero, no matter whether the page has been accessed by
> >>> device or not.  A real user of it could be someone like KVM (please
> >>> see kvm_mmu_notifier_clear_flush_young) where we'll try to lookup the
> >>> shadow PTEs and do test-and-clear on that, then return the result to
> >>> the core mm.  That's why I think currently the AMD driver was only
> >>> trying to use that as a way to do an extra flush however IMHO it's
> >>> redundant.
> >>
> >> Yes, in IOMMU drivers clear_flush_young() doesn't do the clear, only the
> >> flush, since the level-1 page table is shared with the CPU. But removing
> >> the flush gets you in the following situation:
> >>
> >>   (1) Devices wants to access $addr, sends ATS query, the IOMMU sets PTE
> >> young and the entry is cached in the ATC.
> >>
> >>   (2) The CPU does ptep_clear_flush_young_notify(), clears young,
> >> notices that the page is being used.
> >>
> >>   (3) Device accesses $addr again. Given that we didn't invalidate the
> >> ATC in (2) it accesses the page directly, without going through the IOMMU.
> >>
> >>   (4) CPU does ptep_clear_flush_young_notify() again, the PTE doesn't
> >> have the young bit, which means the page isn't being used and can be
> >> reclaimed.
> >>
> >> That's not accurate since the page is being used by the device. At step
> >> (2) we should invalidate the ATC, so that (3) fetches the PTE again and
> >> marks it young.
> >>
> >> I can agree that the clear_flush_young() notifier is too brutal for this
> >> purpose, since we send spurious ATC invalidation even when the PTE
> >> wasn't young (and ATC inv is expensive). There should be another MMU
> >> notifier "flush_young()" that is called by
> >> ptep_clear_flush_young_notify() only when the page was actually young.
> >> But for the moment it's the best we have to avoid the situation above.
> >>
> >> I don't know enough about mm to understand exactly how the
> >> page_referenced() information is used, but I believe the problem is only
> >> about accuracy and not correctness. So applying this patch might not
> >> break anything (after all, intel-svm.c never implemented the notifier)
> >> but I think I'll keep the notifier in my SVA rework [1].
> > 
> > I see your point.  I'm not an expert of mm either, but I'd say AFAIU
> > this happens in CPU TLB too.  Please have a look at
> > ptep_clear_flush_young():
> > 
> > int ptep_clear_flush_young(struct vm_area_struct *vma,
> > 			   unsigned long address, pte_t *ptep)
> > {
> > 	/*
> > 	 * On x86 CPUs, clearing the accessed bit without a TLB flush
> > 	 * doesn't cause data corruption. [ It could cause incorrect
> > 	 * page aging and the (mistaken) reclaim of hot pages, but the
> > 	 * chance of that should be relatively low. ]
> > 	 *
> > 	 * So as a performance optimization don't flush the TLB when
> > 	 * clearing the accessed bit, it will eventually be flushed by
> > 	 * a context switch or a VM operation anyway. [ In the rare
> > 	 * event of it not getting flushed for a long time the delay
> > 	 * shouldn't really matter because there's no real memory
> > 	 * pressure for swapout to react to. ]
> > 	 */
> > 	return ptep_test_and_clear_young(vma, address, ptep);
> > }
> 
> Aha I see. The arm64 version of ptep_clear_flush_young() does invalidate
> the TLB if the PTE was young (perhaps because we don't invalidate the
> TLB on context switch). For SVA I would have liked to simply invalidate
> the ATC whenever the CPU invalidates its TLB, but that falls apart if
> archs are doing different things...
> 
> > So maybe it is a tradeoff between performance and accuracy.  IMHO the
> > IOMMU cache flushing might affect the performance even more than CPU
> > TLB flushing if the invalidation command takes a long time to run
> > (e.g., amd_iommu_flush_page is far slower than a TLB flush
> > instruction, locks to take, queue commands, explicit wait for the
> > invalidation to happen, etc.) so it can potentially even drag down the
> > mm young bit access as a whole not to mention the future cache misses
> > from the device side.
> > 
> > Even if you really want to make the young bit accurate for the SVA
> > work, IMHO you may still want to implement the lightweight version of
> > clear_young() too otherwise it might be inaccurate again in idle page
> > tracking.
> 
> Yes there is another conversation about this on the new idle_pages
> proposal [1], which would never send any ATC invalidation. I'm not sure
> what we should do about it - making clear_young() flush the IOTLB seems
> way too expensive there as well, we'll probably want something more
> selective.
> 
> > Another thing to mention is that disregarding the discussion about
> > young bit - IMO you should probably don't need the change_pte() which
> > I'm more confident with.
> 
> I haven't dug too much into change_pte() yet, but I'll keep that in
> mind, thanks!

Glad to have the discussion with you!  Note that besides the one that
was attached along with the same series, there's another patch to
remove the last (besides KVM) change_pte() usage here in Power:

  https://lkml.org/lkml/2019/1/31/259

Andrea has a longer reply there which might be far better than my
commit messages.  Please feel free to have a look too.

Thanks,

> 
> Jean
> 
> [1] https://patchwork.kernel.org/cover/10743133/#22446425

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2019-02-02  7:08 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-30  5:57 [PATCH 0/2] Some MMU notifier cleanups for Intel/AMD IOMMU Peter Xu
2019-01-30  5:57 ` [PATCH 1/2] iommu/vt-d: Remove change_pte notifier Peter Xu
2019-01-30 16:32   ` Joerg Roedel
2019-01-30  5:57 ` [PATCH 2/2] iommu/amd: Remove clear_flush_young notifier Peter Xu
2019-01-30 12:27   ` Jean-Philippe Brucker
2019-01-31  7:59     ` Peter Xu
2019-01-31 12:25       ` Jean-Philippe Brucker
2019-02-01  3:51         ` Peter Xu
2019-02-01 11:46           ` Jean-Philippe Brucker
2019-02-02  7:08             ` Peter Xu
2019-01-30 16:31   ` Joerg Roedel

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).