* [PATCH 00/13] mmu_notifier kill invalidate_page callback
@ 2017-08-29 23:54 ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Kirill A . Shutemov, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Joerg Roedel, Dan Williams,
	Sudeep Dutt, Ashutosh Dixit, Dimitri Sivanich, Jack Steiner,
	Paolo Bonzini, Radim Krčmář,
	linuxppc-dev, dri-devel, amd-gfx, linux-rdma, iommu, xen-devel,
	kvm

(Sorry for cross-posting to so many lists and for the big Cc)

Please help with testing!

The invalidate_page callback suffered from 2 pitfalls. First, it used to
run after the page table lock was released, so a new page might already
have been set up for the virtual address before the call to
invalidate_page().

This was fixed, in a somewhat awkward way, by commit
c7ab0d2fdc840266b39db94538f74207ec2afbf6 which moved the callback under the
page table lock. That in turn broke several existing users of the
mmu_notifier API that assumed they could sleep inside this callback.

The second pitfall was that invalidate_page was the only callback that did
not take an address range to invalidate; it was given a single address and
a page. Many callback implementers assumed the page could never be a THP
and thus failed to invalidate the appropriate range for THP pages.

By killing this callback we unify the mmu_notifier callback API to always
take a virtual address range as input.

There are now 2 clear APIs (I am not mentioning the seldom-used young-ness
API), both sketched below:
  - invalidate_range_start()/end() callbacks (which allow you to sleep)
  - invalidate_range(), where you can not sleep, but which happens right
    after the page table update, under the page table lock


Note that a lot of existing users look broken with respect to range_start/
range_end. Many users only implement the range_start() callback, but there
is nothing preventing them from undoing what was invalidated in their
range_start() callback after it returns and before the CPU page table
update actually takes place.

The code pattern used in kvm or umem odp is an example of how to properly
avoid such a race. In a nutshell, use some kind of sequence number and an
active range invalidation counter to block anything that might undo what
the range_start() callback did.
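
A minimal sketch of that pattern (purely illustrative, loosely modeled on
what kvm does; every name below is made up):

#include <linux/errno.h>
#include <linux/mmu_notifier.h>
#include <linux/spinlock.h>

struct my_mirror {
	struct mmu_notifier	mn;
	spinlock_t		lock;
	unsigned long		invalidate_seq;		/* bumped by every range_end() */
	int			active_invalidations;	/* start() without matching end() */
};

static void my_mn_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			      unsigned long start, unsigned long end)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	spin_lock(&m->lock);
	m->active_invalidations++;
	spin_unlock(&m->lock);

	/* ... tear down device mappings covering [start, end[ ... */
}

static void my_mn_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			    unsigned long start, unsigned long end)
{
	struct my_mirror *m = container_of(mn, struct my_mirror, mn);

	spin_lock(&m->lock);
	m->invalidate_seq++;
	m->active_invalidations--;
	spin_unlock(&m->lock);
}

/*
 * Mirror/fault path: snapshot invalidate_seq before looking the pages up,
 * then only install the device mapping if no invalidation is running and
 * none completed since the snapshot. Otherwise the caller must retry.
 */
static int my_try_install_mapping(struct my_mirror *m, unsigned long seq)
{
	int ret = -EAGAIN;

	spin_lock(&m->lock);
	if (!m->active_invalidations && m->invalidate_seq == seq) {
		/* install the device mapping here, before dropping the lock */
		ret = 0;
	}
	spin_unlock(&m->lock);

	return ret;
}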

If you do not care about staying fully in sync with the CPU page table (ie
you can live with the CPU page table pointing to a new, different page for
a given virtual address), then you can take a reference on the pages inside
the range_start() callback and drop it in range_end(), or once your driver
is done with those pages.
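
Sketch of that simpler alternative (again purely illustrative, made-up
names; a real driver tracks many pages, a single one here keeps it short
and locking is omitted for brevity):

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

struct my_pin_mirror {
	struct mmu_notifier	mn;
	struct page		*page;	/* page the device has mapped */
	unsigned long		addr;	/* virtual address it mirrors */
	bool			pinned;
};

static void my_pin_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			       unsigned long start, unsigned long end)
{
	struct my_pin_mirror *m = container_of(mn, struct my_pin_mirror, mn);

	/*
	 * Pin the page so it stays valid for the device even after the CPU
	 * page table is switched to a new page for this address.
	 */
	if (m->addr >= start && m->addr < end && !m->pinned) {
		get_page(m->page);
		m->pinned = true;
	}
}

static void my_pin_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			     unsigned long start, unsigned long end)
{
	struct my_pin_mirror *m = container_of(mn, struct my_pin_mirror, mn);

	/* Drop the reference here, or later when the device is done. */
	if (m->pinned) {
		put_page(m->page);
		m->pinned = false;
	}
}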

The last alternative is to use invalidate_range(), if you can do the
invalidation without sleeping, as the invalidate_range() callback happens
under the CPU page table spinlock, right after the page table is updated.


Note this is barely tested. I intend to do more testing over the next few
days, but I do not have access to all the hardware that makes use of the
mmu_notifier API.


The first 2 patches convert the existing calls of
mmu_notifier_invalidate_page() to mmu_notifier_invalidate_range() and
bracket those calls with calls to mmu_notifier_invalidate_range_start()/end().

The next 10 patches remove the existing invalidate_page() callbacks, as they
can no longer be invoked.

Finally the last patch removes it completely so it can RIP.

Jérôme Glisse (13):
  dax: update to new mmu_notifier semantic
  mm/rmap: update to new mmu_notifier semantic
  powerpc/powernv: update to new mmu_notifier semantic
  drm/amdgpu: update to new mmu_notifier semantic
  IB/umem: update to new mmu_notifier semantic
  IB/hfi1: update to new mmu_notifier semantic
  iommu/amd: update to new mmu_notifier semantic
  iommu/intel: update to new mmu_notifier semantic
  misc/mic/scif: update to new mmu_notifier semantic
  sgi-gru: update to new mmu_notifier semantic
  xen/gntdev: update to new mmu_notifier semantic
  KVM: update to new mmu_notifier semantic
  mm/mmu_notifier: kill invalidate_page

Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>

Cc: linuxppc-dev@lists.ozlabs.org
Cc: dri-devel@lists.freedesktop.org
Cc: amd-gfx@lists.freedesktop.org
Cc: linux-rdma@vger.kernel.org
Cc: iommu@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Cc: kvm@vger.kernel.org


 arch/powerpc/platforms/powernv/npu-dma.c | 10 --------
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c   | 31 ----------------------
 drivers/infiniband/core/umem_odp.c       | 19 --------------
 drivers/infiniband/hw/hfi1/mmu_rb.c      |  9 -------
 drivers/iommu/amd_iommu_v2.c             |  8 ------
 drivers/iommu/intel-svm.c                |  9 -------
 drivers/misc/mic/scif/scif_dma.c         | 11 --------
 drivers/misc/sgi-gru/grutlbpurge.c       | 12 ---------
 drivers/xen/gntdev.c                     |  8 ------
 fs/dax.c                                 | 19 ++++++++------
 include/linux/mm.h                       |  1 +
 include/linux/mmu_notifier.h             | 25 ------------------
 mm/memory.c                              | 26 +++++++++++++++----
 mm/mmu_notifier.c                        | 14 ----------
 mm/rmap.c                                | 44 +++++++++++++++++++++++++++++---
 virt/kvm/kvm_main.c                      | 42 ------------------------------
 16 files changed, 74 insertions(+), 214 deletions(-)

-- 
2.13.5


* [PATCH 01/13] dax: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski, Andrea Arcangeli,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

Replace all mmu_notifier_invalidate_page() calls by
mmu_notifier_invalidate_range() and make sure they are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Note that because we can not presume the pmd value or pte value we have to
assume the worst and unconditionally report an invalidation as happening.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 fs/dax.c           | 19 +++++++++++--------
 include/linux/mm.h |  1 +
 mm/memory.c        | 26 +++++++++++++++++++++-----
 3 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 865d42c63e23..ab925dc6647a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -646,11 +646,10 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 	pte_t pte, *ptep = NULL;
 	pmd_t *pmdp = NULL;
 	spinlock_t *ptl;
-	bool changed;
 
 	i_mmap_lock_read(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
-		unsigned long address;
+		unsigned long address, start, end;
 
 		cond_resched();
 
@@ -658,8 +657,13 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			continue;
 
 		address = pgoff_address(index, vma);
-		changed = false;
-		if (follow_pte_pmd(vma->vm_mm, address, &ptep, &pmdp, &ptl))
+
+		/*
+		 * Note because we provide start/end to follow_pte_pmd it will
+		 * call mmu_notifier_invalidate_range_start() on our behalf
+		 * before taking any lock.
+		 */
+		if (follow_pte_pmd(vma->vm_mm, address, &start, &end, &ptep, &pmdp, &ptl))
 			continue;
 
 		if (pmdp) {
@@ -676,7 +680,7 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pmd = pmd_wrprotect(pmd);
 			pmd = pmd_mkclean(pmd);
 			set_pmd_at(vma->vm_mm, address, pmdp, pmd);
-			changed = true;
+			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pmd:
 			spin_unlock(ptl);
 #endif
@@ -691,13 +695,12 @@ static void dax_mapping_entry_mkclean(struct address_space *mapping,
 			pte = pte_wrprotect(pte);
 			pte = pte_mkclean(pte);
 			set_pte_at(vma->vm_mm, address, ptep, pte);
-			changed = true;
+			mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 unlock_pte:
 			pte_unmap_unlock(ptep, ptl);
 		}
 
-		if (changed)
-			mmu_notifier_invalidate_page(vma->vm_mm, address);
+		mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
 	}
 	i_mmap_unlock_read(mapping);
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 46b9ac5e8569..c1f6c95f3496 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1260,6 +1260,7 @@ int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+			     unsigned long *start, unsigned long *end,
 			     pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
 int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	unsigned long *pfn);
diff --git a/mm/memory.c b/mm/memory.c
index fe2fba27ded2..56e48e4593cb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4008,7 +4008,8 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
 static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
-		pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+			    unsigned long *start, unsigned long *end,
+			    pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4035,17 +4036,29 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		if (!pmdpp)
 			goto out;
 
+		if (start && end) {
+			*start = address & PMD_MASK;
+			*end = *start + PMD_SIZE;
+			mmu_notifier_invalidate_range_start(mm, *start, *end);
+		}
 		*ptlp = pmd_lock(mm, pmd);
 		if (pmd_huge(*pmd)) {
 			*pmdpp = pmd;
 			return 0;
 		}
 		spin_unlock(*ptlp);
+		if (start && end)
+			mmu_notifier_invalidate_range_end(mm, *start, *end);
 	}
 
 	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
 		goto out;
 
+	if (start && end) {
+		*start = address & PAGE_MASK;
+		*end = *start + PAGE_SIZE;
+		mmu_notifier_invalidate_range_start(mm, *start, *end);
+	}
 	ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
 	if (!pte_present(*ptep))
 		goto unlock;
@@ -4053,6 +4066,8 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 	return 0;
 unlock:
 	pte_unmap_unlock(ptep, *ptlp);
+	if (start && end)
+		mmu_notifier_invalidate_range_end(mm, *start, *end);
 out:
 	return -EINVAL;
 }
@@ -4064,20 +4079,21 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, ptepp, NULL,
-					   ptlp)));
+			   !(res = __follow_pte_pmd(mm, address, NULL, NULL,
+						    ptepp, NULL, ptlp)));
 	return res;
 }
 
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+			     unsigned long *start, unsigned long *end,
 			     pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
 {
 	int res;
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, ptepp, pmdpp,
-					   ptlp)));
+			   !(res = __follow_pte_pmd(mm, address, start, end,
+						    ptepp, pmdpp, ptlp)));
 	return res;
 }
 EXPORT_SYMBOL(follow_pte_pmd);
-- 
2.13.5


* [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski, Andrea Arcangeli,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

Replace all mmu_notifier_invalidate_page() calls by
mmu_notifier_invalidate_range() and make sure they are bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Note that because we can not presume the pmd value or pte value we have to
assume the worst and unconditionally report an invalidation as happening.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 41 insertions(+), 3 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index c8993c63eb25..da97ed525088 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		.address = address,
 		.flags = PVMW_SYNC,
 	};
+	unsigned long start = address, end;
 	int *cleaned = arg;
 
+	/*
+	 * We have to assume the worst case, ie pmd, for invalidation. Note that
+	 * the page can not be freed from this function.
+	 */
+	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+
 	while (page_vma_mapped_walk(&pvmw)) {
+		unsigned long cstart, cend;
 		int ret = 0;
-		address = pvmw.address;
+
+		cstart = address = pvmw.address;
 		if (pvmw.pte) {
 			pte_t entry;
 			pte_t *pte = pvmw.pte;
@@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			entry = pte_wrprotect(entry);
 			entry = pte_mkclean(entry);
 			set_pte_at(vma->vm_mm, address, pte, entry);
+			cend = cstart + PAGE_SIZE;
 			ret = 1;
 		} else {
 #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
@@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 			entry = pmd_wrprotect(entry);
 			entry = pmd_mkclean(entry);
 			set_pmd_at(vma->vm_mm, address, pmd, entry);
+			cstart &= PMD_MASK;
+			cend = cstart + PMD_SIZE;
 			ret = 1;
 #else
 			/* unexpected pmd-mapped page? */
@@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (ret) {
-			mmu_notifier_invalidate_page(vma->vm_mm, address);
+			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
 			(*cleaned)++;
 		}
 	}
 
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
+
 	return true;
 }
 
@@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	pte_t pteval;
 	struct page *subpage;
 	bool ret = true;
+	unsigned long start = address, end;
 	enum ttu_flags flags = (enum ttu_flags)arg;
 
 	/* munlock has nothing to gain from examining un-locked vmas */
@@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				flags & TTU_MIGRATION, page);
 	}
 
+	/*
+	 * We have to assume the worst case, ie pmd, for invalidation. Note that
+	 * the page can not be freed in this function as the caller of try_to_unmap()
+	 * must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
+
 	while (page_vma_mapped_walk(&pvmw)) {
 		/*
 		 * If the page is mlock()d, we cannot swap it out.
@@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				set_huge_swap_pte_at(mm, address,
 						     pvmw.pte, pteval,
 						     vma_mmu_pagesize(vma));
+				mmu_notifier_invalidate_range(mm, address,
+					address + vma_mmu_pagesize(vma));
 			} else {
 				dec_mm_counter(mm, mm_counter(page));
 				set_pte_at(mm, address, pvmw.pte, pteval);
@@ -1435,6 +1461,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
 		} else if (PageAnon(page)) {
 			swp_entry_t entry = { .val = page_private(subpage) };
 			pte_t swp_pte;
@@ -1445,6 +1473,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
 				WARN_ON_ONCE(1);
 				ret = false;
+				/* We have to invalidate as we cleared the pte */
+				mmu_notifier_invalidate_range(mm, address,
+							address + PAGE_SIZE);
 				page_vma_mapped_walk_done(&pvmw);
 				break;
 			}
@@ -1453,6 +1484,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (!PageSwapBacked(page)) {
 				if (!PageDirty(page)) {
 					dec_mm_counter(mm, MM_ANONPAGES);
+					/* Invalidate as we cleared the pte */
+					mmu_notifier_invalidate_range(mm,
+						address, address + PAGE_SIZE);
 					goto discard;
 				}
 
@@ -1485,13 +1519,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 			if (pte_soft_dirty(pteval))
 				swp_pte = pte_swp_mksoft_dirty(swp_pte);
 			set_pte_at(mm, address, pvmw.pte, swp_pte);
+			mmu_notifier_invalidate_range(mm, address,
+						      address + PAGE_SIZE);
 		} else
 			dec_mm_counter(mm, mm_counter_file(page));
 discard:
 		page_remove_rmap(subpage, PageHuge(page));
 		put_page(page);
-		mmu_notifier_invalidate_page(mm, address);
 	}
+
+	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
+
 	return ret;
 }
 
-- 
2.13.5


* [PATCH 03/13] powerpc/powernv: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, linuxppc-dev, Alistair Popple,
	Michael Ellerman, Kirill A . Shutemov, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and are thus bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 arch/powerpc/platforms/powernv/npu-dma.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/npu-dma.c b/arch/powerpc/platforms/powernv/npu-dma.c
index b5d960d6db3d..4c7b8591f737 100644
--- a/arch/powerpc/platforms/powernv/npu-dma.c
+++ b/arch/powerpc/platforms/powernv/npu-dma.c
@@ -614,15 +614,6 @@ static void pnv_npu2_mn_change_pte(struct mmu_notifier *mn,
 	mmio_invalidate(npu_context, 1, address, true);
 }
 
-static void pnv_npu2_mn_invalidate_page(struct mmu_notifier *mn,
-					struct mm_struct *mm,
-					unsigned long address)
-{
-	struct npu_context *npu_context = mn_to_npu_context(mn);
-
-	mmio_invalidate(npu_context, 1, address, true);
-}
-
 static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long start, unsigned long end)
@@ -640,7 +631,6 @@ static void pnv_npu2_mn_invalidate_range(struct mmu_notifier *mn,
 static const struct mmu_notifier_ops nv_nmmu_notifier_ops = {
 	.release = pnv_npu2_mn_release,
 	.change_pte = pnv_npu2_mn_change_pte,
-	.invalidate_page = pnv_npu2_mn_invalidate_page,
 	.invalidate_range = pnv_npu2_mn_invalidate_range,
 };
 
-- 
2.13.5


* [PATCH 04/13] drm/amdgpu: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, amd-gfx, Felix Kuehling,
	Christian König, Alex Deucher, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range() and are thus bracketed by calls to
mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 -------------------------------
 1 file changed, 31 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 6558a3ed57a7..e1cde6b80027 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -147,36 +147,6 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
 }
 
 /**
- * amdgpu_mn_invalidate_page - callback to notify about mm change
- *
- * @mn: our notifier
- * @mn: the mm this callback is about
- * @address: address of invalidate page
- *
- * Invalidation of a single page. Blocks for all BOs mapping it
- * and unmap them by move them into system domain again.
- */
-static void amdgpu_mn_invalidate_page(struct mmu_notifier *mn,
-				      struct mm_struct *mm,
-				      unsigned long address)
-{
-	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
-	struct interval_tree_node *it;
-
-	mutex_lock(&rmn->lock);
-
-	it = interval_tree_iter_first(&rmn->objects, address, address);
-	if (it) {
-		struct amdgpu_mn_node *node;
-
-		node = container_of(it, struct amdgpu_mn_node, it);
-		amdgpu_mn_invalidate_node(node, address, address);
-	}
-
-	mutex_unlock(&rmn->lock);
-}
-
-/**
  * amdgpu_mn_invalidate_range_start - callback to notify about mm change
  *
  * @mn: our notifier
@@ -215,7 +185,6 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops amdgpu_mn_ops = {
 	.release = amdgpu_mn_release,
-	.invalidate_page = amdgpu_mn_invalidate_page,
 	.invalidate_range_start = amdgpu_mn_invalidate_range_start,
 };
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 04/13] drm/amdgpu: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, amd-gfx, Felix Kuehling,
	Christian König, Alex Deucher, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

Call to mmu_notifier_invalidate_page() are replaced by call to
mmu_notifier_invalidate_range() and thus call are bracketed by
call to mmu_notifier_invalidate_range_start()/end()

Remove now useless invalidate_page callback.

Signed-off-by: JA(C)rA'me Glisse <jglisse@redhat.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian KA?nig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 -------------------------------
 1 file changed, 31 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 6558a3ed57a7..e1cde6b80027 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -147,36 +147,6 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
 }
 
 /**
- * amdgpu_mn_invalidate_page - callback to notify about mm change
- *
- * @mn: our notifier
- * @mn: the mm this callback is about
- * @address: address of invalidate page
- *
- * Invalidation of a single page. Blocks for all BOs mapping it
- * and unmap them by move them into system domain again.
- */
-static void amdgpu_mn_invalidate_page(struct mmu_notifier *mn,
-				      struct mm_struct *mm,
-				      unsigned long address)
-{
-	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
-	struct interval_tree_node *it;
-
-	mutex_lock(&rmn->lock);
-
-	it = interval_tree_iter_first(&rmn->objects, address, address);
-	if (it) {
-		struct amdgpu_mn_node *node;
-
-		node = container_of(it, struct amdgpu_mn_node, it);
-		amdgpu_mn_invalidate_node(node, address, address);
-	}
-
-	mutex_unlock(&rmn->lock);
-}
-
-/**
  * amdgpu_mn_invalidate_range_start - callback to notify about mm change
  *
  * @mn: our notifier
@@ -215,7 +185,6 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops amdgpu_mn_ops = {
 	.release = amdgpu_mn_release,
-	.invalidate_page = amdgpu_mn_invalidate_page,
 	.invalidate_range_start = amdgpu_mn_invalidate_range_start,
 };
 
-- 
2.13.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 04/13] drm/amdgpu: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, amd-gfx, Felix Kuehling,
	Christian König, Alex Deucher, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

Call to mmu_notifier_invalidate_page() are replaced by call to
mmu_notifier_invalidate_range() and thus call are bracketed by
call to mmu_notifier_invalidate_range_start()/end()

Remove now useless invalidate_page callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 -------------------------------
 1 file changed, 31 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 6558a3ed57a7..e1cde6b80027 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -147,36 +147,6 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
 }
 
 /**
- * amdgpu_mn_invalidate_page - callback to notify about mm change
- *
- * @mn: our notifier
- * @mn: the mm this callback is about
- * @address: address of invalidate page
- *
- * Invalidation of a single page. Blocks for all BOs mapping it
- * and unmap them by move them into system domain again.
- */
-static void amdgpu_mn_invalidate_page(struct mmu_notifier *mn,
-				      struct mm_struct *mm,
-				      unsigned long address)
-{
-	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
-	struct interval_tree_node *it;
-
-	mutex_lock(&rmn->lock);
-
-	it = interval_tree_iter_first(&rmn->objects, address, address);
-	if (it) {
-		struct amdgpu_mn_node *node;
-
-		node = container_of(it, struct amdgpu_mn_node, it);
-		amdgpu_mn_invalidate_node(node, address, address);
-	}
-
-	mutex_unlock(&rmn->lock);
-}
-
-/**
  * amdgpu_mn_invalidate_range_start - callback to notify about mm change
  *
  * @mn: our notifier
@@ -215,7 +185,6 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops amdgpu_mn_ops = {
 	.release = amdgpu_mn_release,
-	.invalidate_page = amdgpu_mn_invalidate_page,
 	.invalidate_range_start = amdgpu_mn_invalidate_range_start,
 };
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 05/13] IB/umem: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
  (?)
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Leon Romanovsky, linux-rdma,
	Artemy Kovalyov, Doug Ledford, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.
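
The surviving range path already covers the single-page case; roughly
(a simplified sketch of the assumed shape of
ib_umem_notifier_invalidate_range_start(), not a verbatim copy):

	down_read(&context->umem_rwsem);
	rbt_ib_umem_for_each_in_range(&context->umem_tree, start, end,
				      invalidate_range_start_trampoline,
				      NULL);
	up_read(&context->umem_rwsem);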

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: linux-rdma@vger.kernel.org
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/core/umem_odp.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 8c4ec564e495..55e8f5ed8b3c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,24 +166,6 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long address)
-{
-	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
-
-	if (!context->invalidate_range)
-		return;
-
-	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, address,
-				      address + PAGE_SIZE,
-				      invalidate_page_trampoline, NULL);
-	up_read(&context->umem_rwsem);
-	ib_ucontext_notifier_end_account(context);
-}
-
 static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 					     u64 end, void *cookie)
 {
@@ -237,7 +219,6 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops ib_umem_notifiers = {
 	.release                    = ib_umem_notifier_release,
-	.invalidate_page            = ib_umem_notifier_invalidate_page,
 	.invalidate_range_start     = ib_umem_notifier_invalidate_range_start,
 	.invalidate_range_end       = ib_umem_notifier_invalidate_range_end,
 };
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 05/13] IB/umem: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Leon Romanovsky, linux-rdma,
	Artemy Kovalyov, Doug Ledford, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: linux-rdma@vger.kernel.org
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/core/umem_odp.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 8c4ec564e495..55e8f5ed8b3c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,24 +166,6 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long address)
-{
-	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
-
-	if (!context->invalidate_range)
-		return;
-
-	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, address,
-				      address + PAGE_SIZE,
-				      invalidate_page_trampoline, NULL);
-	up_read(&context->umem_rwsem);
-	ib_ucontext_notifier_end_account(context);
-}
-
 static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 					     u64 end, void *cookie)
 {
@@ -237,7 +219,6 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops ib_umem_notifiers = {
 	.release                    = ib_umem_notifier_release,
-	.invalidate_page            = ib_umem_notifier_invalidate_page,
 	.invalidate_range_start     = ib_umem_notifier_invalidate_range_start,
 	.invalidate_range_end       = ib_umem_notifier_invalidate_range_end,
 };
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 05/13] IB/umem: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Leon Romanovsky, linux-rdma,
	Artemy Kovalyov, Doug Ledford, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: linux-rdma@vger.kernel.org
Cc: Artemy Kovalyov <artemyko@mellanox.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/core/umem_odp.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 8c4ec564e495..55e8f5ed8b3c 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -166,24 +166,6 @@ static int invalidate_page_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_page(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long address)
-{
-	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
-
-	if (!context->invalidate_range)
-		return;
-
-	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, address,
-				      address + PAGE_SIZE,
-				      invalidate_page_trampoline, NULL);
-	up_read(&context->umem_rwsem);
-	ib_ucontext_notifier_end_account(context);
-}
-
 static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 					     u64 end, void *cookie)
 {
@@ -237,7 +219,6 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops ib_umem_notifiers = {
 	.release                    = ib_umem_notifier_release,
-	.invalidate_page            = ib_umem_notifier_invalidate_page,
 	.invalidate_range_start     = ib_umem_notifier_invalidate_range_start,
 	.invalidate_range_end       = ib_umem_notifier_invalidate_range_end,
 };
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
  (?)
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, linux-rdma, Dean Luick, Ira Weiny,
	Doug Ledford, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.
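
The wrapper being removed was only a thin shim; a single page is now
just a PAGE_SIZE range through the remaining handler. A sketch of the
surviving wrapper, assuming mmu_notifier_mem_invalidate() keeps the
signature visible in the diff below:

	static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
						    struct mm_struct *mm,
						    unsigned long start,
						    unsigned long end)
	{
		mmu_notifier_mem_invalidate(mn, mm, start, end);
	}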

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-rdma@vger.kernel.org
Cc: Dean Luick <dean.luick@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index ccbf52c8ff6f..e4b56a0dd6d0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,8 +67,6 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
-				     unsigned long);
 static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    struct mm_struct *,
 					    unsigned long, unsigned long);
@@ -82,7 +80,6 @@ static void do_remove(struct mmu_rb_handler *handler,
 static void handle_remove(struct work_struct *work);
 
 static const struct mmu_notifier_ops mn_opts = {
-	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
 };
 
@@ -285,12 +282,6 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static inline void mmu_notifier_page(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long addr)
-{
-	mmu_notifier_mem_invalidate(mn, mm, addr, addr + PAGE_SIZE);
-}
-
 static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
 					    struct mm_struct *mm,
 					    unsigned long start,
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, linux-rdma, Dean Luick, Ira Weiny,
	Doug Ledford, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-rdma@vger.kernel.org
Cc: Dean Luick <dean.luick@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index ccbf52c8ff6f..e4b56a0dd6d0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,8 +67,6 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
-				     unsigned long);
 static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    struct mm_struct *,
 					    unsigned long, unsigned long);
@@ -82,7 +80,6 @@ static void do_remove(struct mmu_rb_handler *handler,
 static void handle_remove(struct work_struct *work);
 
 static const struct mmu_notifier_ops mn_opts = {
-	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
 };
 
@@ -285,12 +282,6 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static inline void mmu_notifier_page(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long addr)
-{
-	mmu_notifier_mem_invalidate(mn, mm, addr, addr + PAGE_SIZE);
-}
-
 static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
 					    struct mm_struct *mm,
 					    unsigned long start,
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, linux-rdma, Dean Luick, Ira Weiny,
	Doug Ledford, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-rdma@vger.kernel.org
Cc: Dean Luick <dean.luick@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index ccbf52c8ff6f..e4b56a0dd6d0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,8 +67,6 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
-				     unsigned long);
 static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    struct mm_struct *,
 					    unsigned long, unsigned long);
@@ -82,7 +80,6 @@ static void do_remove(struct mmu_rb_handler *handler,
 static void handle_remove(struct work_struct *work);
 
 static const struct mmu_notifier_ops mn_opts = {
-	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
 };
 
@@ -285,12 +282,6 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static inline void mmu_notifier_page(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long addr)
-{
-	mmu_notifier_mem_invalidate(mn, mm, addr, addr + PAGE_SIZE);
-}
-
 static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
 					    struct mm_struct *mm,
 					    unsigned long start,
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 07/13] iommu/amd: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
  (?)
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Suravee Suthikulpanit, iommu,
	Joerg Roedel, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 6629c472eafd..dccf5b76eff2 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -391,13 +391,6 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 	return 0;
 }
 
-static void mn_invalidate_page(struct mmu_notifier *mn,
-			       struct mm_struct *mm,
-			       unsigned long address)
-{
-	__mn_flush_page(mn, address);
-}
-
 static void mn_invalidate_range(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end)
@@ -436,7 +429,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static const struct mmu_notifier_ops iommu_mn = {
 	.release		= mn_release,
 	.clear_flush_young      = mn_clear_flush_young,
-	.invalidate_page        = mn_invalidate_page,
 	.invalidate_range       = mn_invalidate_range,
 };
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 07/13] iommu/amd: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Suravee Suthikulpanit, iommu,
	Joerg Roedel, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 6629c472eafd..dccf5b76eff2 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -391,13 +391,6 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 	return 0;
 }
 
-static void mn_invalidate_page(struct mmu_notifier *mn,
-			       struct mm_struct *mm,
-			       unsigned long address)
-{
-	__mn_flush_page(mn, address);
-}
-
 static void mn_invalidate_range(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end)
@@ -436,7 +429,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static const struct mmu_notifier_ops iommu_mn = {
 	.release		= mn_release,
 	.clear_flush_young      = mn_clear_flush_young,
-	.invalidate_page        = mn_invalidate_page,
 	.invalidate_range       = mn_invalidate_range,
 };
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 07/13] iommu/amd: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg
  Cc: Andrea Arcangeli, Joerg Roedel,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jérôme Glisse, Andrew Morton, Linus Torvalds,
	Kirill A . Shutemov

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/iommu/amd_iommu_v2.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/iommu/amd_iommu_v2.c b/drivers/iommu/amd_iommu_v2.c
index 6629c472eafd..dccf5b76eff2 100644
--- a/drivers/iommu/amd_iommu_v2.c
+++ b/drivers/iommu/amd_iommu_v2.c
@@ -391,13 +391,6 @@ static int mn_clear_flush_young(struct mmu_notifier *mn,
 	return 0;
 }
 
-static void mn_invalidate_page(struct mmu_notifier *mn,
-			       struct mm_struct *mm,
-			       unsigned long address)
-{
-	__mn_flush_page(mn, address);
-}
-
 static void mn_invalidate_range(struct mmu_notifier *mn,
 				struct mm_struct *mm,
 				unsigned long start, unsigned long end)
@@ -436,7 +429,6 @@ static void mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static const struct mmu_notifier_ops iommu_mn = {
 	.release		= mn_release,
 	.clear_flush_young      = mn_clear_flush_young,
-	.invalidate_page        = mn_invalidate_page,
 	.invalidate_range       = mn_invalidate_range,
 };
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 08/13] iommu/intel: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, David Woodhouse, iommu, Joerg Roedel,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/iommu/intel-svm.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index f167c0d84ebf..f620dccec8ee 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -223,14 +223,6 @@ static void intel_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
 	intel_flush_svm_range(svm, address, 1, 1, 0);
 }
 
-static void intel_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				  unsigned long address)
-{
-	struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
-
-	intel_flush_svm_range(svm, address, 1, 1, 0);
-}
-
 /* Pages have been freed at this point */
 static void intel_invalidate_range(struct mmu_notifier *mn,
 				   struct mm_struct *mm,
@@ -285,7 +277,6 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static const struct mmu_notifier_ops intel_mmuops = {
 	.release = intel_mm_release,
 	.change_pte = intel_change_pte,
-	.invalidate_page = intel_invalidate_page,
 	.invalidate_range = intel_invalidate_range,
 };
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 08/13] iommu/intel: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, David Woodhouse, iommu, Joerg Roedel,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/iommu/intel-svm.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/iommu/intel-svm.c b/drivers/iommu/intel-svm.c
index f167c0d84ebf..f620dccec8ee 100644
--- a/drivers/iommu/intel-svm.c
+++ b/drivers/iommu/intel-svm.c
@@ -223,14 +223,6 @@ static void intel_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
 	intel_flush_svm_range(svm, address, 1, 1, 0);
 }
 
-static void intel_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				  unsigned long address)
-{
-	struct intel_svm *svm = container_of(mn, struct intel_svm, notifier);
-
-	intel_flush_svm_range(svm, address, 1, 1, 0);
-}
-
 /* Pages have been freed at this point */
 static void intel_invalidate_range(struct mmu_notifier *mn,
 				   struct mm_struct *mm,
@@ -285,7 +277,6 @@ static void intel_mm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static const struct mmu_notifier_ops intel_mmuops = {
 	.release = intel_mm_release,
 	.change_pte = intel_change_pte,
-	.invalidate_page = intel_invalidate_page,
 	.invalidate_range = intel_invalidate_range,
 };
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 09/13] misc/mic/scif: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Sudeep Dutt, Ashutosh Dixit,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.
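
Here a single-page invalidation simply becomes a range in the surviving
range_start callback; a plausible sketch (assumed, based on the
scif_rma_destroy_tcw() signature visible in the diff below, not a
verbatim copy of the driver):

	static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
							     struct mm_struct *mm,
							     unsigned long start,
							     unsigned long end)
	{
		struct scif_mmu_notif *mmn;

		mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
		scif_rma_destroy_tcw(mmn, start, end - start);
	}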

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/misc/mic/scif/scif_dma.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 64d5760d069a..63d6246d6dff 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,16 +200,6 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
-					      struct mm_struct *mm,
-					      unsigned long address)
-{
-	struct scif_mmu_notif	*mmn;
-
-	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
-	scif_rma_destroy_tcw(mmn, address, PAGE_SIZE);
-}
-
 static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
@@ -235,7 +225,6 @@ static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 static const struct mmu_notifier_ops scif_mmu_notifier_ops = {
 	.release = scif_mmu_notifier_release,
 	.clear_flush_young = NULL,
-	.invalidate_page = scif_mmu_notifier_invalidate_page,
 	.invalidate_range_start = scif_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end = scif_mmu_notifier_invalidate_range_end};
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 09/13] misc/mic/scif: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Sudeep Dutt, Ashutosh Dixit,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/misc/mic/scif/scif_dma.c | 11 -----------
 1 file changed, 11 deletions(-)

diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 64d5760d069a..63d6246d6dff 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,16 +200,6 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
-					      struct mm_struct *mm,
-					      unsigned long address)
-{
-	struct scif_mmu_notif	*mmn;
-
-	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
-	scif_rma_destroy_tcw(mmn, address, PAGE_SIZE);
-}
-
 static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
@@ -235,7 +225,6 @@ static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
 static const struct mmu_notifier_ops scif_mmu_notifier_ops = {
 	.release = scif_mmu_notifier_release,
 	.clear_flush_young = NULL,
-	.invalidate_page = scif_mmu_notifier_invalidate_page,
 	.invalidate_range_start = scif_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end = scif_mmu_notifier_invalidate_range_end};
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 10/13] sgi-gru: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Dimitri Sivanich, Jack Steiner,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/misc/sgi-gru/grutlbpurge.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e936d43895d2..9918eda0e05f 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -247,17 +247,6 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
 }
 
-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
-{
-	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-						 ms_notifier);
-
-	STAT(mmu_invalidate_page);
-	gru_flush_tlb_range(gms, address, PAGE_SIZE);
-	gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
-}
-
 static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
@@ -269,7 +258,6 @@ static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 
 static const struct mmu_notifier_ops gru_mmuops = {
-	.invalidate_page	= gru_invalidate_page,
 	.invalidate_range_start	= gru_invalidate_range_start,
 	.invalidate_range_end	= gru_invalidate_range_end,
 	.release		= gru_release,
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 10/13] sgi-gru: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Dimitri Sivanich, Jack Steiner,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Jack Steiner <steiner@sgi.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/misc/sgi-gru/grutlbpurge.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index e936d43895d2..9918eda0e05f 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -247,17 +247,6 @@ static void gru_invalidate_range_end(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx\n", gms, start, end);
 }
 
-static void gru_invalidate_page(struct mmu_notifier *mn, struct mm_struct *mm,
-				unsigned long address)
-{
-	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
-						 ms_notifier);
-
-	STAT(mmu_invalidate_page);
-	gru_flush_tlb_range(gms, address, PAGE_SIZE);
-	gru_dbg(grudev, "gms %p, address 0x%lx\n", gms, address);
-}
-
 static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
@@ -269,7 +258,6 @@ static void gru_release(struct mmu_notifier *mn, struct mm_struct *mm)
 
 
 static const struct mmu_notifier_ops gru_mmuops = {
-	.invalidate_page	= gru_invalidate_page,
 	.invalidate_range_start	= gru_invalidate_range_start,
 	.invalidate_range_end	= gru_invalidate_range_end,
 	.release		= gru_release,
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 11/13] xen/gntdev: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Konrad Rzeszutek Wilk,
	Roger Pau Monné,
	Boris Ostrovsky, xen-devel, Kirill A . Shutemov, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/xen/gntdev.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index f3bf8f4e2d6c..82360594fa8e 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -484,13 +484,6 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 	mutex_unlock(&priv->lock);
 }
 
-static void mn_invl_page(struct mmu_notifier *mn,
-			 struct mm_struct *mm,
-			 unsigned long address)
-{
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
-}
-
 static void mn_release(struct mmu_notifier *mn,
 		       struct mm_struct *mm)
 {
@@ -522,7 +515,6 @@ static void mn_release(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops gntdev_mmu_ops = {
 	.release                = mn_release,
-	.invalidate_page        = mn_invl_page,
 	.invalidate_range_start = mn_invl_range_start,
 };
 
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 11/13] xen/gntdev: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Konrad Rzeszutek Wilk,
	Roger Pau Monné,
	Boris Ostrovsky, xen-devel, Kirill A . Shutemov, Andrew Morton,
	Linus Torvalds, Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/xen/gntdev.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index f3bf8f4e2d6c..82360594fa8e 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -484,13 +484,6 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 	mutex_unlock(&priv->lock);
 }
 
-static void mn_invl_page(struct mmu_notifier *mn,
-			 struct mm_struct *mm,
-			 unsigned long address)
-{
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
-}
-
 static void mn_release(struct mmu_notifier *mn,
 		       struct mm_struct *mm)
 {
@@ -522,7 +515,6 @@ static void mn_release(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops gntdev_mmu_ops = {
 	.release                = mn_release,
-	.invalidate_page        = mn_invl_page,
 	.invalidate_range_start = mn_invl_range_start,
 };
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 11/13] xen/gntdev: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
                   ` (12 preceding siblings ...)
  (?)
@ 2017-08-29 23:54 ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrea Arcangeli, Jérôme Glisse, xen-devel,
	Boris Ostrovsky, Linus Torvalds, Andrew Morton,
	Kirill A . Shutemov, Roger Pau Monné

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/xen/gntdev.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index f3bf8f4e2d6c..82360594fa8e 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -484,13 +484,6 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 	mutex_unlock(&priv->lock);
 }
 
-static void mn_invl_page(struct mmu_notifier *mn,
-			 struct mm_struct *mm,
-			 unsigned long address)
-{
-	mn_invl_range_start(mn, mm, address, address + PAGE_SIZE);
-}
-
 static void mn_release(struct mmu_notifier *mn,
 		       struct mm_struct *mm)
 {
@@ -522,7 +515,6 @@ static void mn_release(struct mmu_notifier *mn,
 
 static const struct mmu_notifier_ops gntdev_mmu_ops = {
 	.release                = mn_release,
-	.invalidate_page        = mn_invl_page,
 	.invalidate_range_start = mn_invl_range_start,
 };
 
-- 
2.13.5



^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 12/13] KVM: update to new mmu_notifier semantic
  2017-08-29 23:54 ` Jérôme Glisse
  (?)
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Paolo Bonzini,
	Radim Krčmář,
	kvm, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 virt/kvm/kvm_main.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 15252d723b54..4d81f6ded88e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -322,47 +322,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long address)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int need_tlb_flush, idx;
-
-	/*
-	 * When ->invalidate_page runs, the linux pte has been zapped
-	 * already but the page is still allocated until
-	 * ->invalidate_page returns. So if we increase the sequence
-	 * here the kvm page fault will notice if the spte can't be
-	 * established because the page is going to be freed. If
-	 * instead the kvm page fault establishes the spte before
-	 * ->invalidate_page runs, kvm_unmap_hva will release it
-	 * before returning.
-	 *
-	 * The sequence increase only need to be seen at spin_unlock
-	 * time, and not at spin_lock time.
-	 *
-	 * Increasing the sequence after the spin_unlock would be
-	 * unsafe because the kvm page fault could then establish the
-	 * pte after kvm_unmap_hva returned, without noticing the page
-	 * is going to be freed.
-	 */
-	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
-
-	kvm->mmu_notifier_seq++;
-	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush)
-		kvm_flush_remote_tlbs(kvm);
-
-	spin_unlock(&kvm->mmu_lock);
-
-	kvm_arch_mmu_notifier_invalidate_page(kvm, address);
-
-	srcu_read_unlock(&kvm->srcu, idx);
-}
-
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -510,7 +469,6 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-	.invalidate_page	= kvm_mmu_notifier_invalidate_page,
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-- 
2.13.5
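
The comment removed above explains why the sequence number had to be
bumped before dropping mmu_lock. With the range API the same ordering is
provided by the start/end pair; in rough outline (a simplified sketch,
not the exact kvm_main.c code):

	/* invalidate_range_start: note an invalidation is in progress,
	 * then unmap and flush the whole range */
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_count++;
	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
	if (need_tlb_flush || kvm->tlbs_dirty)
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);

	/* invalidate_range_end: bump the sequence so concurrent page
	 * faults notice, then allow new sptes to be established again */
	spin_lock(&kvm->mmu_lock);
	kvm->mmu_notifier_seq++;
	kvm->mmu_notifier_count--;
	spin_unlock(&kvm->mmu_lock);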

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 12/13] KVM: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Paolo Bonzini,
	Radim Krčmář,
	kvm, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 virt/kvm/kvm_main.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 15252d723b54..4d81f6ded88e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -322,47 +322,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long address)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int need_tlb_flush, idx;
-
-	/*
-	 * When ->invalidate_page runs, the linux pte has been zapped
-	 * already but the page is still allocated until
-	 * ->invalidate_page returns. So if we increase the sequence
-	 * here the kvm page fault will notice if the spte can't be
-	 * established because the page is going to be freed. If
-	 * instead the kvm page fault establishes the spte before
-	 * ->invalidate_page runs, kvm_unmap_hva will release it
-	 * before returning.
-	 *
-	 * The sequence increase only need to be seen at spin_unlock
-	 * time, and not at spin_lock time.
-	 *
-	 * Increasing the sequence after the spin_unlock would be
-	 * unsafe because the kvm page fault could then establish the
-	 * pte after kvm_unmap_hva returned, without noticing the page
-	 * is going to be freed.
-	 */
-	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
-
-	kvm->mmu_notifier_seq++;
-	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush)
-		kvm_flush_remote_tlbs(kvm);
-
-	spin_unlock(&kvm->mmu_lock);
-
-	kvm_arch_mmu_notifier_invalidate_page(kvm, address);
-
-	srcu_read_unlock(&kvm->srcu, idx);
-}
-
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -510,7 +469,6 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-	.invalidate_page	= kvm_mmu_notifier_invalidate_page,
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 12/13] KVM: update to new mmu_notifier semantic
@ 2017-08-29 23:54   ` Jérôme Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Paolo Bonzini,
	Radim Krčmář,
	kvm, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are thus bracketed by
calls to mmu_notifier_invalidate_range_start()/end().

Remove the now useless invalidate_page() callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 virt/kvm/kvm_main.c | 42 ------------------------------------------
 1 file changed, 42 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 15252d723b54..4d81f6ded88e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -322,47 +322,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
 	return container_of(mn, struct kvm, mmu_notifier);
 }
 
-static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
-					     struct mm_struct *mm,
-					     unsigned long address)
-{
-	struct kvm *kvm = mmu_notifier_to_kvm(mn);
-	int need_tlb_flush, idx;
-
-	/*
-	 * When ->invalidate_page runs, the linux pte has been zapped
-	 * already but the page is still allocated until
-	 * ->invalidate_page returns. So if we increase the sequence
-	 * here the kvm page fault will notice if the spte can't be
-	 * established because the page is going to be freed. If
-	 * instead the kvm page fault establishes the spte before
-	 * ->invalidate_page runs, kvm_unmap_hva will release it
-	 * before returning.
-	 *
-	 * The sequence increase only need to be seen at spin_unlock
-	 * time, and not at spin_lock time.
-	 *
-	 * Increasing the sequence after the spin_unlock would be
-	 * unsafe because the kvm page fault could then establish the
-	 * pte after kvm_unmap_hva returned, without noticing the page
-	 * is going to be freed.
-	 */
-	idx = srcu_read_lock(&kvm->srcu);
-	spin_lock(&kvm->mmu_lock);
-
-	kvm->mmu_notifier_seq++;
-	need_tlb_flush = kvm_unmap_hva(kvm, address) | kvm->tlbs_dirty;
-	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush)
-		kvm_flush_remote_tlbs(kvm);
-
-	spin_unlock(&kvm->mmu_lock);
-
-	kvm_arch_mmu_notifier_invalidate_page(kvm, address);
-
-	srcu_read_unlock(&kvm->srcu, idx);
-}
-
 static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 					struct mm_struct *mm,
 					unsigned long address,
@@ -510,7 +469,6 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
-	.invalidate_page	= kvm_mmu_notifier_invalidate_page,
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 13/13] mm/mmu_notifier: kill invalidate_page
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-08-29 23:54   ` Jérôme Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jérôme Glisse @ 2017-08-29 23:54 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Jérôme Glisse, Linus Torvalds, Bernhard Held,
	Adam Borowski, Andrea Arcangeli, Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

The invalidate_page callback suffered from 2 pitfalls. First, it used to
happen after the page table lock was released, so a new page might have
been set up before the call to invalidate_page() happened.

This was fixed, in a roundabout way, by c7ab0d2fdc840266b39db94538f74207ec2afbf6
which moved the callback under the page table lock, but that also broke
several existing users of the mmu_notifier API that assumed they could
sleep inside this callback.

The second pitfall was that invalidate_page was the only callback not
taking an address range for the invalidation; it was given a single
address and a page instead. Many of the callback implementers assumed
this could never be a THP and thus failed to invalidate the appropriate
range for THP.

By killing this callback we unify the mmu_notifier callback API to always
take a virtual address range as input.

Finally this also simplifies the end user's life as there are now 2
clear choices (a small illustrative sketch of the first one follows
below):
  - invalidate_range_start()/end() callbacks (which allow you to sleep)
  - invalidate_range() where you cannot sleep but which happens right
    after the page table update, under the page table lock
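
A minimal sketch of a notifier user relying on the sleepable pair might
look like this (illustrative only; the my_* names are made up, only the
mmu_notifier_ops fields come from include/linux/mmu_notifier.h):

#include <linux/mmu_notifier.h>

static void my_range_start(struct mmu_notifier *mn, struct mm_struct *mm,
			   unsigned long start, unsigned long end)
{
	/* Allowed to sleep: stop device access to [start, end) here. */
}

static void my_range_end(struct mmu_notifier *mn, struct mm_struct *mm,
			 unsigned long start, unsigned long end)
{
	/* CPU page table update is done; the device may fault pages back in. */
}

static const struct mmu_notifier_ops my_mn_ops = {
	.invalidate_range_start	= my_range_start,
	.invalidate_range_end	= my_range_end,
};

static struct mmu_notifier my_mn = { .ops = &my_mn_ops };

/* Hooked into a process with mmu_notifier_register(&my_mn, current->mm). */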

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Bernhard Held <berny156@gmx.de>
Cc: Adam Borowski <kilobyte@angband.pl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: axie <axie@amd.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/mmu_notifier.h | 25 -------------------------
 mm/mmu_notifier.c            | 14 --------------
 2 files changed, 39 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index c91b3bcd158f..7b2e31b1745a 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -95,17 +95,6 @@ struct mmu_notifier_ops {
 			   pte_t pte);
 
 	/*
-	 * Before this is invoked any secondary MMU is still ok to
-	 * read/write to the page previously pointed to by the Linux
-	 * pte because the page hasn't been freed yet and it won't be
-	 * freed until this returns. If required set_page_dirty has to
-	 * be called internally to this method.
-	 */
-	void (*invalidate_page)(struct mmu_notifier *mn,
-				struct mm_struct *mm,
-				unsigned long address);
-
-	/*
 	 * invalidate_range_start() and invalidate_range_end() must be
 	 * paired and are called only when the mmap_sem and/or the
 	 * locks protecting the reverse maps are held. If the subsystem
@@ -220,8 +209,6 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
@@ -268,13 +255,6 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 		__mmu_notifier_change_pte(mm, address, pte);
 }
 
-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
-{
-	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_page(mm, address);
-}
-
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
@@ -442,11 +422,6 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 {
 }
 
-static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
-{
-}
-
 static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 54ca54562928..314285284e6e 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,20 +174,6 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_page(struct mm_struct *mm,
-					  unsigned long address)
-{
-	struct mmu_notifier *mn;
-	int id;
-
-	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_page)
-			mn->ops->invalidate_page(mn, mm, address);
-	}
-	srcu_read_unlock(&srcu, id);
-}
-
 void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
-- 
2.13.5

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
@ 2017-08-30  0:11     ` Linus Torvalds
  0 siblings, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2017-08-30  0:11 UTC (permalink / raw)
  To: Jérôme Glisse, Bernhard Held, Adam Borowski
  Cc: Linux Kernel Mailing List, linux-mm, Kirill A . Shutemov,
	Andrew Morton, Andrea Arcangeli, Joerg Roedel, Dan Williams,
	Sudeep Dutt, Ashutosh Dixit, Dimitri Sivanich, Jack Steiner,
	Paolo Bonzini, Radim Krčmář,
	ppc-dev, DRI, amd-gfx, linux-rdma, open list:AMD IOMMU (AMD-VI),
	xen-devel, KVM list

On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glisse <jglisse@redhat.com> wrote:
>
> Note this is barely tested. I intend to do more testing of next few days
> but i do not have access to all hardware that make use of the mmu_notifier
> API.

Thanks for doing this.

> First 2 patches convert existing call of mmu_notifier_invalidate_page()
> to mmu_notifier_invalidate_range() and bracket those call with call to
> mmu_notifier_invalidate_range_start()/end().

Ok, those two patches are a bit more complex than I was hoping for,
but not *too* bad.

And the final end result certainly looks nice:

>  16 files changed, 74 insertions(+), 214 deletions(-)

Yeah, removing all those invalidate_page() notifiers certainly makes
for a nice patch.

And I actually think you missed some more lines that can now be
removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
needed either, so you can remove all of those too (most of them are
empty inline functions, but x86 has one that actually does something).

So there's an added 30 or so dead lines that should be removed in the
kvm patch, I think.

But from a _very_ quick read-through this looks fine. But it obviously
needs testing.

People - *especially* the people who saw issues under KVM - can you
try out Jérôme's patch-series? I added some people to the cc; the full
series is on lkml. Jérôme - do you have a git branch for people to
test that they could easily pull and try out?

                    Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
@ 2017-08-30  0:56       ` Jerome Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-30  0:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Bernhard Held, Adam Borowski, Linux Kernel Mailing List,
	linux-mm, Kirill A . Shutemov, Andrew Morton, Andrea Arcangeli,
	Joerg Roedel, Dan Williams, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Jack Steiner, Paolo Bonzini,
	Radim Krčmář,
	ppc-dev, DRI, amd-gfx, linux-rdma, open list:AMD IOMMU (AMD-VI),
	xen-devel, KVM list

On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
> On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glisse <jglisse@redhat.com> wrote:
> >
> > Note this is barely tested. I intend to do more testing of next few days
> > but i do not have access to all hardware that make use of the mmu_notifier
> > API.
> 
> Thanks for doing this.
> 
> > First 2 patches convert existing call of mmu_notifier_invalidate_page()
> > to mmu_notifier_invalidate_range() and bracket those call with call to
> > mmu_notifier_invalidate_range_start()/end().
> 
> Ok, those two patches are a bit more complex than I was hoping for,
> but not *too* bad.
> 
> And the final end result certainly looks nice:
> 
> >  16 files changed, 74 insertions(+), 214 deletions(-)
> 
> Yeah, removing all those invalidate_page() notifiers certainly makes
> for a nice patch.
> 
> And I actually think you missed some more lines that can now be
> removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
> needed either, so you can remove all of those too (most of them are
> empty inline functions, but x86 has one that actually does something.
> 
> So there's an added 30 or so dead lines that should be removed in the
> kvm patch, I think.

Yes, I missed that. I will wait for people to test and for the results of
my own tests before reposting if need be, otherwise I will post it as a
separate patch.

> 
> But from a _very_ quick read-through this looks fine. But it obviously
> needs testing.
> 
> People - *especially* the people who saw issues under KVM - can you
> try out Jérôme's patch-series? I aded some people to the cc, the full
> series is on lkml. Jérôme - do you have a git branch for people to
> test that they could easily pull and try out?

https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
git://people.freedesktop.org/~glisse/linux

(Sorry if that tree is a bit big, it has a lot of dead things; I need
 to push a clean and slim one)

Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-29 23:54   ` Jérôme Glisse
@ 2017-08-30  2:46     ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-30  2:46 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
	Dan Williams, Ross Zwisler, Linus Torvalds, Bernhard Held,
	Adam Borowski, Andrea Arcangeli, Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

Jérôme Glisse <jglisse@redhat.com> wrote:

> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> end.
> 
> Note that because we can not presume the pmd value or pte value we have to
> assume the worse and unconditionaly report an invalidation as happening.
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Bernhard Held <berny156@gmx.de>
> Cc: Adam Borowski <kilobyte@angband.pl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Radim Krčmář <rkrcmar@redhat.com>
> Cc: Wanpeng Li <kernellwp@gmail.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Takashi Iwai <tiwai@suse.de>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Cc: Mike Galbraith <efault@gmx.de>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: axie <axie@amd.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c8993c63eb25..da97ed525088 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> 		.address = address,
> 		.flags = PVMW_SYNC,
> 	};
> +	unsigned long start = address, end;
> 	int *cleaned = arg;
> 
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free from this function.
> +	 */
> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
> 	while (page_vma_mapped_walk(&pvmw)) {
> +		unsigned long cstart, cend;
> 		int ret = 0;
> -		address = pvmw.address;
> +
> +		cstart = address = pvmw.address;
> 		if (pvmw.pte) {
> 			pte_t entry;
> 			pte_t *pte = pvmw.pte;
> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> 			entry = pte_wrprotect(entry);
> 			entry = pte_mkclean(entry);
> 			set_pte_at(vma->vm_mm, address, pte, entry);
> +			cend = cstart + PAGE_SIZE;
> 			ret = 1;
> 		} else {
> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> 			entry = pmd_wrprotect(entry);
> 			entry = pmd_mkclean(entry);
> 			set_pmd_at(vma->vm_mm, address, pmd, entry);
> +			cstart &= PMD_MASK;
> +			cend = cstart + PMD_SIZE;
> 			ret = 1;
> #else
> 			/* unexpected pmd-mapped page? */
> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> 		}
> 
> 		if (ret) {
> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> 			(*cleaned)++;
> 		}
> 	}
> 
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
> 	return true;
> }
> 
> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> 	pte_t pteval;
> 	struct page *subpage;
> 	bool ret = true;
> +	unsigned long start = address, end;
> 	enum ttu_flags flags = (enum ttu_flags)arg;
> 
> 	/* munlock has nothing to gain from examining un-locked vmas */
> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> 				flags & TTU_MIGRATION, page);
> 	}
> 
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free in this function as call of try_to_unmap()
> +	 * must hold a reference on the page.
> +	 */
> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
> 	while (page_vma_mapped_walk(&pvmw)) {
> 		/*
> 		 * If the page is mlock()d, we cannot swap it out.
> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> 				set_huge_swap_pte_at(mm, address,
> 						     pvmw.pte, pteval,
> 						     vma_mmu_pagesize(vma));
> +				mmu_notifier_invalidate_range(mm, address,
> +					address + vma_mmu_pagesize(vma));

I don’t think that the notifier should be called after the PTE is set, but
after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
is changed without clearing the PTE before, but it does not apply here, and
there is a different notifier for this case.]

Therefore, IIUC, try_to_unmap_one() should only call
mmu_notifier_invalidate_range() after ptep_get_and_clear() and
ptep_clear_flush() are called. All the other calls to
mmu_notifier_invalidate_range() in this function can be removed.
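
Concretely, the ordering described above would look roughly like this
(sketch only, reusing the variables already present in try_to_unmap_one()):

	pteval = ptep_clear_flush(vma, address, pvmw.pte);
	/* The CPU PTE is now gone: this is the point where secondary MMUs
	 * have to be told, still inside the start()/end() bracket. */
	mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);
	/* Installing the replacement entry (swap, migration, ...) afterwards
	 * needs no further notification. */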

Regards,
Nadav

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30  2:46     ` Nadav Amit
@ 2017-08-30  2:59       ` Jerome Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-30  2:59 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
	Dan Williams, Ross Zwisler, Linus Torvalds, Bernhard Held,
	Adam Borowski, Andrea Arcangeli, Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> Jérôme Glisse <jglisse@redhat.com> wrote:
> 
> > Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> > and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> > end.
> > 
> > Note that because we can not presume the pmd value or pte value we have to
> > assume the worse and unconditionaly report an invalidation as happening.
> > 
> > Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Bernhard Held <berny156@gmx.de>
> > Cc: Adam Borowski <kilobyte@angband.pl>
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Radim Krčmář <rkrcmar@redhat.com>
> > Cc: Wanpeng Li <kernellwp@gmail.com>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Takashi Iwai <tiwai@suse.de>
> > Cc: Nadav Amit <nadav.amit@gmail.com>
> > Cc: Mike Galbraith <efault@gmx.de>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: axie <axie@amd.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > ---
> > mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
> > 1 file changed, 41 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index c8993c63eb25..da97ed525088 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > 		.address = address,
> > 		.flags = PVMW_SYNC,
> > 	};
> > +	unsigned long start = address, end;
> > 	int *cleaned = arg;
> > 
> > +	/*
> > +	 * We have to assume the worse case ie pmd for invalidation. Note that
> > +	 * the page can not be free from this function.
> > +	 */
> > +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> > +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > +
> > 	while (page_vma_mapped_walk(&pvmw)) {
> > +		unsigned long cstart, cend;
> > 		int ret = 0;
> > -		address = pvmw.address;
> > +
> > +		cstart = address = pvmw.address;
> > 		if (pvmw.pte) {
> > 			pte_t entry;
> > 			pte_t *pte = pvmw.pte;
> > @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > 			entry = pte_wrprotect(entry);
> > 			entry = pte_mkclean(entry);
> > 			set_pte_at(vma->vm_mm, address, pte, entry);
> > +			cend = cstart + PAGE_SIZE;
> > 			ret = 1;
> > 		} else {
> > #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
> > @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > 			entry = pmd_wrprotect(entry);
> > 			entry = pmd_mkclean(entry);
> > 			set_pmd_at(vma->vm_mm, address, pmd, entry);
> > +			cstart &= PMD_MASK;
> > +			cend = cstart + PMD_SIZE;
> > 			ret = 1;
> > #else
> > 			/* unexpected pmd-mapped page? */
> > @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
> > 		}
> > 
> > 		if (ret) {
> > -			mmu_notifier_invalidate_page(vma->vm_mm, address);
> > +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
> > 			(*cleaned)++;
> > 		}
> > 	}
> > 
> > +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> > +
> > 	return true;
> > }
> > 
> > @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > 	pte_t pteval;
> > 	struct page *subpage;
> > 	bool ret = true;
> > +	unsigned long start = address, end;
> > 	enum ttu_flags flags = (enum ttu_flags)arg;
> > 
> > 	/* munlock has nothing to gain from examining un-locked vmas */
> > @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > 				flags & TTU_MIGRATION, page);
> > 	}
> > 
> > +	/*
> > +	 * We have to assume the worse case ie pmd for invalidation. Note that
> > +	 * the page can not be free in this function as call of try_to_unmap()
> > +	 * must hold a reference on the page.
> > +	 */
> > +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> > +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > +
> > 	while (page_vma_mapped_walk(&pvmw)) {
> > 		/*
> > 		 * If the page is mlock()d, we cannot swap it out.
> > @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > 				set_huge_swap_pte_at(mm, address,
> > 						     pvmw.pte, pteval,
> > 						     vma_mmu_pagesize(vma));
> > +				mmu_notifier_invalidate_range(mm, address,
> > +					address + vma_mmu_pagesize(vma));
> 
> I don’t think that the notifier should be called after the PTE is set, but
> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
> is changed without clearing the PTE before, but it does not apply here, and
> there is a different notifier for this case.]
> 
> Therefore, IIUC, try_to_umap_one() should only call
> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
> ptep_clear_flush() are called. All the other calls to
> mmu_notifier_invalidate_range() in this function can be removed.

Yes, it would simplify the patch. I was trying to optimize for the case
where we restore the pte to its original value after ptep_clear_flush()
or ptep_get_and_clear(), as in this case there is no need to invalidate
any secondary page table, but that's an overkill optimization that just
adds too much complexity.
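
Roughly, the idea was (sketch only; restore_orig_pte is a made-up
condition, not code from the series):

	pteval = ptep_clear_flush(vma, address, pvmw.pte);
	if (restore_orig_pte) {
		/* Same pfn and same permissions are put back, so secondary
		 * page tables never became stale: no invalidate needed. */
		set_pte_at(mm, address, pvmw.pte, pteval);
	} else {
		mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);
	}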

Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30  2:59       ` Jerome Glisse
@ 2017-08-30  3:16         ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-30  3:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
	Dan Williams, Ross Zwisler, Linus Torvalds, Bernhard Held,
	Adam Borowski, Andrea Arcangeli, Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

Jerome Glisse <jglisse@redhat.com> wrote:

> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>> Jérôme Glisse <jglisse@redhat.com> wrote:
>> 
>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
>>> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
>>> end.
>>> 
>>> Note that because we can not presume the pmd value or pte value we have to
>>> assume the worse and unconditionaly report an invalidation as happening.
>>> 
>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>> Cc: Bernhard Held <berny156@gmx.de>
>>> Cc: Adam Borowski <kilobyte@angband.pl>
>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>> Cc: Wanpeng Li <kernellwp@gmail.com>
>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>> Cc: Takashi Iwai <tiwai@suse.de>
>>> Cc: Nadav Amit <nadav.amit@gmail.com>
>>> Cc: Mike Galbraith <efault@gmx.de>
>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>> Cc: axie <axie@amd.com>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> ---
>>> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 41 insertions(+), 3 deletions(-)
>>> 
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index c8993c63eb25..da97ed525088 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> 		.address = address,
>>> 		.flags = PVMW_SYNC,
>>> 	};
>>> +	unsigned long start = address, end;
>>> 	int *cleaned = arg;
>>> 
>>> +	/*
>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>> +	 * the page can not be free from this function.
>>> +	 */
>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>> +
>>> 	while (page_vma_mapped_walk(&pvmw)) {
>>> +		unsigned long cstart, cend;
>>> 		int ret = 0;
>>> -		address = pvmw.address;
>>> +
>>> +		cstart = address = pvmw.address;
>>> 		if (pvmw.pte) {
>>> 			pte_t entry;
>>> 			pte_t *pte = pvmw.pte;
>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> 			entry = pte_wrprotect(entry);
>>> 			entry = pte_mkclean(entry);
>>> 			set_pte_at(vma->vm_mm, address, pte, entry);
>>> +			cend = cstart + PAGE_SIZE;
>>> 			ret = 1;
>>> 		} else {
>>> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> 			entry = pmd_wrprotect(entry);
>>> 			entry = pmd_mkclean(entry);
>>> 			set_pmd_at(vma->vm_mm, address, pmd, entry);
>>> +			cstart &= PMD_MASK;
>>> +			cend = cstart + PMD_SIZE;
>>> 			ret = 1;
>>> #else
>>> 			/* unexpected pmd-mapped page? */
>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>> 		}
>>> 
>>> 		if (ret) {
>>> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
>>> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>> 			(*cleaned)++;
>>> 		}
>>> 	}
>>> 
>>> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>> +
>>> 	return true;
>>> }
>>> 
>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> 	pte_t pteval;
>>> 	struct page *subpage;
>>> 	bool ret = true;
>>> +	unsigned long start = address, end;
>>> 	enum ttu_flags flags = (enum ttu_flags)arg;
>>> 
>>> 	/* munlock has nothing to gain from examining un-locked vmas */
>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> 				flags & TTU_MIGRATION, page);
>>> 	}
>>> 
>>> +	/*
>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>> +	 * the page can not be free in this function as call of try_to_unmap()
>>> +	 * must hold a reference on the page.
>>> +	 */
>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>> +
>>> 	while (page_vma_mapped_walk(&pvmw)) {
>>> 		/*
>>> 		 * If the page is mlock()d, we cannot swap it out.
>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>> 				set_huge_swap_pte_at(mm, address,
>>> 						     pvmw.pte, pteval,
>>> 						     vma_mmu_pagesize(vma));
>>> +				mmu_notifier_invalidate_range(mm, address,
>>> +					address + vma_mmu_pagesize(vma));
>> 
>> I don’t think that the notifier should be called after the PTE is set, but
>> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
>> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
>> is changed without clearing the PTE before, but it does not apply here, and
>> there is a different notifier for this case.]
>> 
>> Therefore, IIUC, try_to_unmap_one() should only call
>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>> ptep_clear_flush() are called. All the other calls to
>> mmu_notifier_invalidate_range() in this function can be removed.
> 
> Yes it would simplify the patch, i was trying to optimize for the case
> where we restore the pte to its original value after ptep_clear_flush()
> or ptep_get_and_clear() as in this case there is no need to invalidate
> any secondary page table but that's an overkill optimization that just
> add too much complexity.

Interesting. Actually, prior to your changes, it seems that the break
statements would skip mmu_notifier_invalidate_page() when the PTE is not
changed. So why not just change mmu_notifier_invalidate_page() into
mmu_notifier_invalidate_range() ?
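
Just to make the question concrete, I mean something as simple as the
following at the discard label (sketch only, not a tested patch, and a
real version would need the actual mapping size instead of PAGE_SIZE
for huge pages):

discard:
	page_remove_rmap(subpage, PageHuge(page));
	put_page(page);
	/* straight replacement of the old single-page notifier call */
	mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);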

Regards,
Nadav

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30  3:16         ` Nadav Amit
@ 2017-08-30  3:18           ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-30  3:18 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linux Kernel Mailing List, open list:MEMORY MANAGEMENT,
	Dan Williams, Ross Zwisler, Linus Torvalds, Bernhard Held,
	Adam Borowski, Andrea Arcangeli, Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

Nadav Amit <nadav.amit@gmail.com> wrote:

> Jerome Glisse <jglisse@redhat.com> wrote:
> 
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Jérôme Glisse <jglisse@redhat.com> wrote:
>>> 
>>>> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
>>>> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
>>>> end.
>>>> 
>>>> Note that because we can not presume the pmd value or pte value we have to
>>>> assume the worse and unconditionaly report an invalidation as happening.
>>>> 
>>>> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
>>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>>>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>>>> Cc: Bernhard Held <berny156@gmx.de>
>>>> Cc: Adam Borowski <kilobyte@angband.pl>
>>>> Cc: Andrea Arcangeli <aarcange@redhat.com>
>>>> Cc: Radim Krčmář <rkrcmar@redhat.com>
>>>> Cc: Wanpeng Li <kernellwp@gmail.com>
>>>> Cc: Paolo Bonzini <pbonzini@redhat.com>
>>>> Cc: Takashi Iwai <tiwai@suse.de>
>>>> Cc: Nadav Amit <nadav.amit@gmail.com>
>>>> Cc: Mike Galbraith <efault@gmx.de>
>>>> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>>>> Cc: axie <axie@amd.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> ---
>>>> mm/rmap.c | 44 +++++++++++++++++++++++++++++++++++++++++---
>>>> 1 file changed, 41 insertions(+), 3 deletions(-)
>>>> 
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index c8993c63eb25..da97ed525088 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -887,11 +887,21 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 		.address = address,
>>>> 		.flags = PVMW_SYNC,
>>>> 	};
>>>> +	unsigned long start = address, end;
>>>> 	int *cleaned = arg;
>>>> 
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free from this function.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>> 	while (page_vma_mapped_walk(&pvmw)) {
>>>> +		unsigned long cstart, cend;
>>>> 		int ret = 0;
>>>> -		address = pvmw.address;
>>>> +
>>>> +		cstart = address = pvmw.address;
>>>> 		if (pvmw.pte) {
>>>> 			pte_t entry;
>>>> 			pte_t *pte = pvmw.pte;
>>>> @@ -904,6 +914,7 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 			entry = pte_wrprotect(entry);
>>>> 			entry = pte_mkclean(entry);
>>>> 			set_pte_at(vma->vm_mm, address, pte, entry);
>>>> +			cend = cstart + PAGE_SIZE;
>>>> 			ret = 1;
>>>> 		} else {
>>>> #ifdef CONFIG_TRANSPARENT_HUGE_PAGECACHE
>>>> @@ -918,6 +929,8 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 			entry = pmd_wrprotect(entry);
>>>> 			entry = pmd_mkclean(entry);
>>>> 			set_pmd_at(vma->vm_mm, address, pmd, entry);
>>>> +			cstart &= PMD_MASK;
>>>> +			cend = cstart + PMD_SIZE;
>>>> 			ret = 1;
>>>> #else
>>>> 			/* unexpected pmd-mapped page? */
>>>> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>>>> 		}
>>>> 
>>>> 		if (ret) {
>>>> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
>>>> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>>>> 			(*cleaned)++;
>>>> 		}
>>>> 	}
>>>> 
>>>> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
>>>> +
>>>> 	return true;
>>>> }
>>>> 
>>>> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> 	pte_t pteval;
>>>> 	struct page *subpage;
>>>> 	bool ret = true;
>>>> +	unsigned long start = address, end;
>>>> 	enum ttu_flags flags = (enum ttu_flags)arg;
>>>> 
>>>> 	/* munlock has nothing to gain from examining un-locked vmas */
>>>> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> 				flags & TTU_MIGRATION, page);
>>>> 	}
>>>> 
>>>> +	/*
>>>> +	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>> +	 * the page can not be free in this function as call of try_to_unmap()
>>>> +	 * must hold a reference on the page.
>>>> +	 */
>>>> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
>>>> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>> +
>>>> 	while (page_vma_mapped_walk(&pvmw)) {
>>>> 		/*
>>>> 		 * If the page is mlock()d, we cannot swap it out.
>>>> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>> 				set_huge_swap_pte_at(mm, address,
>>>> 						     pvmw.pte, pteval,
>>>> 						     vma_mmu_pagesize(vma));
>>>> +				mmu_notifier_invalidate_range(mm, address,
>>>> +					address + vma_mmu_pagesize(vma));
>>> 
>>> I don’t think that the notifier should be called after the PTE is set, but
>>> after the PTE is cleared, PTE permissions are demoted (e.g., RW->RO) or
>>> access/dirty bits are cleared. [There is an exception: if the PFN in the PTE
>>> is changed without clearing the PTE before, but it does not apply here, and
>>> there is a different notifier for this case.]
>>> 
>>> Therefore, IIUC, try_to_unmap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>>> ptep_clear_flush() are called. All the other calls to
>>> mmu_notifier_invalidate_range() in this function can be removed.
>> 
>> Yes it would simplify the patch, i was trying to optimize for the case
>> where we restore the pte to its original value after ptep_clear_flush()
>> or ptep_get_and_clear() as in this case there is no need to invalidate
>> any secondary page table but that's an overkill optimization that just
>> add too much complexity.
> 
> Interesting. Actually, prior to your changes, it seems that the break
> statements would skip mmu_notifier_invalidate_page() when the PTE is not
> changed. So why not just change mmu_notifier_invalidate_page() into
> mmu_notifier_invalidate_range() ?

Sorry - I noticed it was actually wrong before (as you noted) at least
in one case:
	(unlikely(PageSwapBacked(page) != PageSwapCache(page)))

Regards,
Nadav

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 05/13] IB/umem: update to new mmu_notifier semantic
  2017-08-29 23:54   ` Jérôme Glisse
  (?)
  (?)
@ 2017-08-30  6:13   ` Leon Romanovsky
  -1 siblings, 0 replies; 160+ messages in thread
From: Leon Romanovsky @ 2017-08-30  6:13 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, linux-rdma, Artemy Kovalyov,
	Doug Ledford, Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli


On Tue, Aug 29, 2017 at 07:54:39PM -0400, Jérôme Glisse wrote:
> Call to mmu_notifier_invalidate_page() are replaced by call to
> mmu_notifier_invalidate_range() and thus call are bracketed by
> call to mmu_notifier_invalidate_range_start()/end()
>
> Remove now useless invalidate_page callback.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Leon Romanovsky <leonro@mellanox.com>
> Cc: linux-rdma@vger.kernel.org
> Cc: Artemy Kovalyov <artemyko@mellanox.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  drivers/infiniband/core/umem_odp.c | 19 -------------------
>  1 file changed, 19 deletions(-)
>

Hi Jerome,

I took this series for testing on Mellanox ConnectX-4/5 cards, which
are the devices beneath this UMEM ODP code.

As a reference, I took latest Doug's for-next + Linus's master
(36fde05f3fb5) + whole series.

Thanks


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 04/13] drm/amdgpu: update to new mmu_notifier semantic
  2017-08-29 23:54   ` Jérôme Glisse
@ 2017-08-30  6:18     ` Christian König
  -1 siblings, 0 replies; 160+ messages in thread
From: Christian König @ 2017-08-30  6:18 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: amd-gfx, Felix Kuehling, Alex Deucher, Kirill A . Shutemov,
	Andrew Morton, Linus Torvalds, Andrea Arcangeli

On 30.08.2017 at 01:54, Jérôme Glisse wrote:
> Call to mmu_notifier_invalidate_page() are replaced by call to
> mmu_notifier_invalidate_range() and thus call are bracketed by
> call to mmu_notifier_invalidate_range_start()/end()
>
> Remove now useless invalidate_page callback.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

The general approach is Acked-by: Christian König 
<christian.koenig@amd.com>.

It's something very welcome since I was one of the people (together with
the Intel guys) who failed to recognize what this callback really does.

Regards,
Christian.

> Cc: amd-gfx@lists.freedesktop.org
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c | 31 -------------------------------
>   1 file changed, 31 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 6558a3ed57a7..e1cde6b80027 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -147,36 +147,6 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>   }
>   
>   /**
> - * amdgpu_mn_invalidate_page - callback to notify about mm change
> - *
> - * @mn: our notifier
> - * @mn: the mm this callback is about
> - * @address: address of invalidate page
> - *
> - * Invalidation of a single page. Blocks for all BOs mapping it
> - * and unmap them by move them into system domain again.
> - */
> -static void amdgpu_mn_invalidate_page(struct mmu_notifier *mn,
> -				      struct mm_struct *mm,
> -				      unsigned long address)
> -{
> -	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> -	struct interval_tree_node *it;
> -
> -	mutex_lock(&rmn->lock);
> -
> -	it = interval_tree_iter_first(&rmn->objects, address, address);
> -	if (it) {
> -		struct amdgpu_mn_node *node;
> -
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		amdgpu_mn_invalidate_node(node, address, address);
> -	}
> -
> -	mutex_unlock(&rmn->lock);
> -}
> -
> -/**
>    * amdgpu_mn_invalidate_range_start - callback to notify about mm change
>    *
>    * @mn: our notifier
> @@ -215,7 +185,6 @@ static void amdgpu_mn_invalidate_range_start(struct mmu_notifier *mn,
>   
>   static const struct mmu_notifier_ops amdgpu_mn_ops = {
>   	.release = amdgpu_mn_release,
> -	.invalidate_page = amdgpu_mn_invalidate_page,
>   	.invalidate_range_start = amdgpu_mn_invalidate_range_start,
>   };
>   

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-30  0:56       ` Jerome Glisse
  (?)
  (?)
@ 2017-08-30  8:40         ` Mike Galbraith
  -1 siblings, 0 replies; 160+ messages in thread
From: Mike Galbraith @ 2017-08-30  8:40 UTC (permalink / raw)
  To: Jerome Glisse, Linus Torvalds
  Cc: Bernhard Held, Adam Borowski, Linux Kernel Mailing List,
	linux-mm, Kirill A . Shutemov, Andrew Morton, Andrea Arcangeli,
	Joerg Roedel, Dan Williams, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Jack Steiner, Paolo Bonzini,
	Radim Krčmář,
	ppc-dev, DRI, amd-gfx, linux-rdma, open list:AMD IOMMU (AMD-VI),
	xen-devel

On Tue, 2017-08-29 at 20:56 -0400, Jerome Glisse wrote:
> On Tue, Aug 29, 2017 at 05:11:24PM -0700, Linus Torvalds wrote:
> 
> > People - *especially* the people who saw issues under KVM - can you
> > try out Jérôme's patch-series? I aded some people to the cc, the full
> > series is on lkml. Jérôme - do you have a git branch for people to
> > test that they could easily pull and try out?
> 
> https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> git://people.freedesktop.org/~glisse/linux

Looks good here.

I reproduced fairly quickly with RT host and 1 RT guest by just having
the guest do a parallel kbuild over NFS (the guest had to be restored
afterward, was corrupted).  I'm currently flogging 2 guests as well as
the host, whimper free.  I'll let the lot broil for a while longer, but
at this point, smoke/flame appearance seems comfortingly unlikely.

	-Mike



^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-30  0:56       ` Jerome Glisse
  (?)
@ 2017-08-30 14:57         ` Adam Borowski
  -1 siblings, 0 replies; 160+ messages in thread
From: Adam Borowski @ 2017-08-30 14:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Linus Torvalds, Bernhard Held, Linux Kernel Mailing List,
	linux-mm, Kirill A . Shutemov, Andrew Morton, Andrea Arcangeli,
	Joerg Roedel, Dan Williams, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Jack Steiner, Paolo Bonzini,
	Radim Krčmář,
	ppc-dev, DRI, amd-gfx, linux-rdma, open list:AMD IOMMU (AMD-VI)

On Tue, Aug 29, 2017 at 08:56:15PM -0400, Jerome Glisse wrote:
> I will wait for people to test and for result of my own test before
> reposting if need be, otherwise i will post as separate patch.
>
> > But from a _very_ quick read-through this looks fine. But it obviously
> > needs testing.
> > 
> > People - *especially* the people who saw issues under KVM - can you
> > try out Jérôme's patch-series? I aded some people to the cc, the full
> > series is on lkml. Jérôme - do you have a git branch for people to
> > test that they could easily pull and try out?
> 
> https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> git://people.freedesktop.org/~glisse/linux

Tested your branch as of 10f07641, on a long list of guest VMs.
No earth-shattering kaboom.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀                                 -- Genghis Ht'rok'din
⠈⠳⣄⠀⠀⠀⠀ 


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-29 23:54   ` Jérôme Glisse
@ 2017-08-30 16:52     ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 16:52 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

Hello Jerome,

On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> end.
> 
> Note that because we can not presume the pmd value or pte value we have to
> assume the worse and unconditionaly report an invalidation as happening.

I pointed out in an earlier email that ->invalidate_range can only be
implemented (as a mutually exclusive alternative to
->invalidate_range_start/end) by secondary MMUs that share the very
same pagetables with the core Linux VM of the primary MMU, and those
invalidate_range callbacks are already called by
__mmu_notifier_invalidate_range_end. The other bit is done by the MMU
gather (because mmu_notifier_invalidate_range_start is a noop for
drivers that implement ->invalidate_range).

The difference between sharing the same pagetables or not is what
allows ->invalidate_range to work: when the Linux MM changes the
primary MMU pagetables it also automatically updates the secondary
MMU at the same time (because of the pagetable sharing between
primary and secondary MMUs). So then all that is left to do is an
invalidate_range to flush the secondary MMU TLBs.
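
In other words a pagetable-sharing secondary MMU only needs an
->invalidate_range implementation that flushes its TLB, something
along these lines (sketch with made-up names, not taken from any
existing driver):

static void my_iommu_invalidate_range(struct mmu_notifier *mn,
				      struct mm_struct *mm,
				      unsigned long start,
				      unsigned long end)
{
	/*
	 * The pagetables are shared with the CPU so they are already
	 * up to date here, only the IOTLB needs flushing.
	 */
	my_iommu_flush_iotlb(mn, start, end);
}

static const struct mmu_notifier_ops my_iommu_mn_ops = {
	.invalidate_range = my_iommu_invalidate_range,
};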

There's no need for action in mmu_notifier_invalidate_range_start for
those pagetable sharing drivers because there's no risk of a secondary
MMU shadow pagetable layer to be re-created in between
mmu_notifier_invalidate_range_start and the actual pagetable
invalidate because again the pagetables are shared.

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	struct mmu_notifier *mn;
	int id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
		/*
		 * Call invalidate_range here too to avoid the need for the
		 * subsystem of having to register an invalidate_range_end
		 * call-back when there is invalidate_range already. Usually a
		 * subsystem registers either invalidate_range_start()/end() or
		 * invalidate_range(), so this will be no additional overhead
		 * (besides the pointer check).
		 */
		if (mn->ops->invalidate_range)
			mn->ops->invalidate_range(mn, mm, start, end);
			^^^^^^^^^^^^^^^^^^^^^^^^^
		if (mn->ops->invalidate_range_end)
			mn->ops->invalidate_range_end(mn, mm, start, end);
	}
	srcu_read_unlock(&srcu, id);
}

So this conversion from invalidate_page to invalidate_range looks
superfluous and the final mmu_notifier_invalidate_range_end should be
enough.

AFAIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
use it.

My suggestion is to remove from below all
mmu_notifier_invalidate_range calls and keep only the
mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
intel-svm with it under heavy swapping.

The only critical constraint to keep for invalidate_range to stay safe
with a single call of mmu_notifier_invalidate_range_end after put_page
is that the put_page cannot be the last put_page. That only applies to
the case where the page isn't freed through MMU gather (MMU gather
calls mmu_notifier_invalidate_range of its own before freeing the
page, whereas MMU gather does nothing for drivers using
invalidate_range_start/end because invalidate_range_start acted as a
barrier to avoid establishing mappings on the secondary MMUs for
those).

Not strictly required but I think it would be safer and more efficient
to replace the put_page with something like:

static inline void put_page_not_freeing(struct page *page)
{
	page = compound_head(page);

	/*
	 * The caller is expected to still hold another reference, so
	 * this can never be the final put: warn if it would have been.
	 */
	if (put_page_testzero(page))
		VM_WARN_ON_PAGE(1, page);
}
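
so that the discard path of try_to_unmap_one() would end up looking
roughly like this (sketch, untested):

discard:
	page_remove_rmap(subpage, PageHuge(page));
	/*
	 * Callers of try_to_unmap() hold a reference on the page, so
	 * this cannot be the put that frees it; warn if it ever is.
	 */
	put_page_not_freeing(page);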

Thanks!
Andrea

> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  		}
>  
>  		if (ret) {
> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>  			(*cleaned)++;
>  		}
>  	}
>  
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
>  	return true;
>  }
>  
> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	struct page *subpage;
>  	bool ret = true;
> +	unsigned long start = address, end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				flags & TTU_MIGRATION, page);
>  	}
>  
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free in this function as call of try_to_unmap()
> +	 * must hold a reference on the page.
> +	 */
> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
>  	while (page_vma_mapped_walk(&pvmw)) {
>  		/*
>  		 * If the page is mlock()d, we cannot swap it out.
> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				set_huge_swap_pte_at(mm, address,
>  						     pvmw.pte, pteval,
>  						     vma_mmu_pagesize(vma));
> +				mmu_notifier_invalidate_range(mm, address,
> +					address + vma_mmu_pagesize(vma));
>  			} else {
>  				dec_mm_counter(mm, mm_counter(page));
>  				set_pte_at(mm, address, pvmw.pte, pteval);
> @@ -1435,6 +1461,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else if (PageAnon(page)) {
>  			swp_entry_t entry = { .val = page_private(subpage) };
>  			pte_t swp_pte;
> @@ -1445,6 +1473,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
>  				WARN_ON_ONCE(1);
>  				ret = false;
> +				/* We have to invalidate as we cleared the pte */
> +				mmu_notifier_invalidate_range(mm, address,
> +							address + PAGE_SIZE);
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> @@ -1453,6 +1484,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (!PageSwapBacked(page)) {
>  				if (!PageDirty(page)) {
>  					dec_mm_counter(mm, MM_ANONPAGES);
> +					/* Invalidate as we cleared the pte */
> +					mmu_notifier_invalidate_range(mm,
> +						address, address + PAGE_SIZE);
>  					goto discard;
>  				}
>  
> @@ -1485,13 +1519,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else
>  			dec_mm_counter(mm, mm_counter_file(page));
>  discard:
>  		page_remove_rmap(subpage, PageHuge(page));
>  		put_page(page);
> -		mmu_notifier_invalidate_page(mm, address);
>  	}
> +
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
>  	return ret;
>  }
>  
> -- 
> 2.13.5
> 

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
@ 2017-08-30 16:52     ` Andrea Arcangeli
  0 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 16:52 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

Hello Jerome,

On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> end.
> 
> Note that because we can not presume the pmd value or pte value we have to
> assume the worse and unconditionaly report an invalidation as happening.

I pointed out in an earlier email that ->invalidate_range can only be
implemented (as a mutually exclusive alternative to
->invalidate_range_start/end) by secondary MMUs that share the very
same pagetables with the core Linux VM of the primary MMU, and those
invalidate_range callbacks are already called by
__mmu_notifier_invalidate_range_end. The other bit is done by the MMU
gather (because mmu_notifier_invalidate_range_start is a noop for
drivers that implement ->invalidate_range).

The difference between sharing the same pagetables or not is what allows
->invalidate_range to work: when the Linux MM changes the
primary MMU pagetables it also automatically updates the
secondary MMU at the same time (because of the pagetable sharing
between primary and secondary MMUs). So then all that is left to do is
an invalidate_range to flush the secondary MMU TLBs.

There's no need for action in mmu_notifier_invalidate_range_start for
those pagetable-sharing drivers because there's no risk of a secondary
MMU shadow pagetable layer being re-created between
mmu_notifier_invalidate_range_start and the actual pagetable
invalidation, because again the pagetables are shared.

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	struct mmu_notifier *mn;
	int id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
		/*
		 * Call invalidate_range here too to avoid the need for the
		 * subsystem of having to register an invalidate_range_end
		 * call-back when there is invalidate_range already. Usually a
		 * subsystem registers either invalidate_range_start()/end() or
		 * invalidate_range(), so this will be no additional overhead
		 * (besides the pointer check).
		 */
		if (mn->ops->invalidate_range)
			mn->ops->invalidate_range(mn, mm, start, end);
			^^^^^^^^^^^^^^^^^^^^^^^^^
		if (mn->ops->invalidate_range_end)
			mn->ops->invalidate_range_end(mn, mm, start, end);
	}
	srcu_read_unlock(&srcu, id);
}

So this conversion from invalidate_page to invalidate_range looks
superfluous and the final mmu_notifier_invalidate_range_end should be
enough.

AFAIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
use it.

My suggestion is to remove from below all
mmu_notifier_invalidate_range calls and keep only the
mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
intel-svm with it under heavy swapping.

The only critical constraint for invalidate_range to stay safe
with a single call of mmu_notifier_invalidate_range_end after put_page
is that the put_page cannot be the last put_page. That only applies to
the case where the page isn't freed through MMU gather (MMU gather
calls mmu_notifier_invalidate_range of its own before freeing the
page, whereas MMU gather does nothing for drivers using
invalidate_range_start/end, because invalidate_range_start acted as a
barrier to avoid establishing mappings on the secondary MMUs for
those).

Not strictly required but I think it would be safer and more efficient
to replace the put_page with something like:

static inline void put_page_not_freeing(struct page *page)
{
	page = compound_head(page);

	if (put_page_testzero(page))
		VM_WARN_ON_PAGE(1, page);
}

Thanks!
Andrea

> @@ -926,11 +939,13 @@ static bool page_mkclean_one(struct page *page, struct vm_area_struct *vma,
>  		}
>  
>  		if (ret) {
> -			mmu_notifier_invalidate_page(vma->vm_mm, address);
> +			mmu_notifier_invalidate_range(vma->vm_mm, cstart, cend);
>  			(*cleaned)++;
>  		}
>  	}
>  
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
>  	return true;
>  }
>  
> @@ -1324,6 +1339,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	pte_t pteval;
>  	struct page *subpage;
>  	bool ret = true;
> +	unsigned long start = address, end;
>  	enum ttu_flags flags = (enum ttu_flags)arg;
>  
>  	/* munlock has nothing to gain from examining un-locked vmas */
> @@ -1335,6 +1351,14 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				flags & TTU_MIGRATION, page);
>  	}
>  
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free in this function as call of try_to_unmap()
> +	 * must hold a reference on the page.
> +	 */
> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> +
>  	while (page_vma_mapped_walk(&pvmw)) {
>  		/*
>  		 * If the page is mlock()d, we cannot swap it out.
> @@ -1408,6 +1432,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  				set_huge_swap_pte_at(mm, address,
>  						     pvmw.pte, pteval,
>  						     vma_mmu_pagesize(vma));
> +				mmu_notifier_invalidate_range(mm, address,
> +					address + vma_mmu_pagesize(vma));
>  			} else {
>  				dec_mm_counter(mm, mm_counter(page));
>  				set_pte_at(mm, address, pvmw.pte, pteval);
> @@ -1435,6 +1461,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else if (PageAnon(page)) {
>  			swp_entry_t entry = { .val = page_private(subpage) };
>  			pte_t swp_pte;
> @@ -1445,6 +1473,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
>  				WARN_ON_ONCE(1);
>  				ret = false;
> +				/* We have to invalidate as we cleared the pte */
> +				mmu_notifier_invalidate_range(mm, address,
> +							address + PAGE_SIZE);
>  				page_vma_mapped_walk_done(&pvmw);
>  				break;
>  			}
> @@ -1453,6 +1484,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (!PageSwapBacked(page)) {
>  				if (!PageDirty(page)) {
>  					dec_mm_counter(mm, MM_ANONPAGES);
> +					/* Invalidate as we cleared the pte */
> +					mmu_notifier_invalidate_range(mm,
> +						address, address + PAGE_SIZE);
>  					goto discard;
>  				}
>  
> @@ -1485,13 +1519,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  			if (pte_soft_dirty(pteval))
>  				swp_pte = pte_swp_mksoft_dirty(swp_pte);
>  			set_pte_at(mm, address, pvmw.pte, swp_pte);
> +			mmu_notifier_invalidate_range(mm, address,
> +						      address + PAGE_SIZE);
>  		} else
>  			dec_mm_counter(mm, mm_counter_file(page));
>  discard:
>  		page_remove_rmap(subpage, PageHuge(page));
>  		put_page(page);
> -		mmu_notifier_invalidate_page(mm, address);
>  	}
> +
> +	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +
>  	return ret;
>  }
>  
> -- 
> 2.13.5
> 


* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30  2:46     ` Nadav Amit
@ 2017-08-30 17:27       ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 17:27 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Jérôme Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> Therefore, IIUC, try_to_umap_one() should only call
> mmu_notifier_invalidate_range() after ptep_get_and_clear() and

That would trigger an unnecessarily double call to
->invalidate_range() both from mmu_notifier_invalidate_range() after
ptep_get_and_clear() and later at the end in
mmu_notifier_invalidate_range_end().

The only advantage of adding a mmu_notifier_invalidate_range() after
ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
the pagetables have to be shared with the primary MMU in order to use
the ->invalidate_range()) inside the PT lock.

So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
needed or not, again boils down to the issue if the old code calling
->invalidate_page outside the PT lock was always broken before or
not. I don't see why exactly it was broken, we even hold the page lock
there so I don't see a migration race possible either. Of course the
constraint to be safe is that the put_page in try_to_unmap_one cannot
be the last one, and that had to be enforced by the caller taking an
additional reference on it.

One can wonder if the primary MMU TLB flush in ptep_clear_flush
(should_defer_flush returning false) could be put after releasing the
PT lock too (because that's not different from deferring the secondary
MMU notifier TLB flush in ->invalidate_range down to
mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
which may be safe too for the same reasons.

When should_defer_flush returns true we already defer the primary MMU
TLB flush to much later to even mmu_notifier_invalidate_range_end, not
just after the PT lock release so at least when should_defer_flush is
true, it looks obviously safe to defer the secondary MMU TLB flush to
mmu_notifier_invalidate_range_end for the drivers implementing
->invalidate_range.

If I'm wrong and all TLB flushes must happen inside the PT lock, then
we should at least reconsider the implicit call to ->invalidate_range
method from mmu_notifier_invalidate_range_end or we would call it
twice unnecessarily which doesn't look optimal. Either ways this
doesn't look optimal. We would need to introduce a
mmu_notifier_invalidate_range_end_full that calls also
->invalidate_range in such case so we skip the ->invalidate_range call
in mmu_notifier_invalidate_range_end if we put an explicit
mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
lock like you suggested above.
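
As a rough sketch of that split (the only_end parameter and its name are
hypothetical; the body follows the existing
__mmu_notifier_invalidate_range_end quoted earlier in the thread):

void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
				  unsigned long start, unsigned long end,
				  bool only_end)
{
	struct mmu_notifier *mn;
	int id;

	id = srcu_read_lock(&srcu);
	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
		/*
		 * Skip the implicit ->invalidate_range call when the caller
		 * states it already issued an explicit one under the PT lock.
		 */
		if (!only_end && mn->ops->invalidate_range)
			mn->ops->invalidate_range(mn, mm, start, end);
		if (mn->ops->invalidate_range_end)
			mn->ops->invalidate_range_end(mn, mm, start, end);
	}
	srcu_read_unlock(&srcu, id);
}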

Thanks,
Andrea


* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 16:52     ` Andrea Arcangeli
@ 2017-08-30 17:48       ` Jerome Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-30 17:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-kernel, linux-mm, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 06:52:50PM +0200, Andrea Arcangeli wrote:
> Hello Jerome,
> 
> On Tue, Aug 29, 2017 at 07:54:36PM -0400, Jerome Glisse wrote:
> > Replacing all mmu_notifier_invalidate_page() by mmu_notifier_invalidat_range()
> > and making sure it is bracketed by call to mmu_notifier_invalidate_range_start/
> > end.
> > 
> > Note that because we can not presume the pmd value or pte value we have to
> > assume the worse and unconditionaly report an invalidation as happening.
> 
> I pointed out in an earlier email that ->invalidate_range can only be
> implemented (as a mutually exclusive alternative to
> ->invalidate_range_start/end) by secondary MMUs that share the very
> same pagetables with the core Linux VM of the primary MMU, and those
> invalidate_range callbacks are already called by
> __mmu_notifier_invalidate_range_end. The other bit is done by the MMU
> gather (because mmu_notifier_invalidate_range_start is a noop for
> drivers that implement ->invalidate_range).
> 
> The difference between sharing the same pagetables or not is what allows
> ->invalidate_range to work: when the Linux MM changes the
> primary MMU pagetables it also automatically updates the
> secondary MMU at the same time (because of the pagetable sharing
> between primary and secondary MMUs). So then all that is left to do is
> an invalidate_range to flush the secondary MMU TLBs.
> 
> There's no need for action in mmu_notifier_invalidate_range_start for
> those pagetable-sharing drivers because there's no risk of a secondary
> MMU shadow pagetable layer being re-created between
> mmu_notifier_invalidate_range_start and the actual pagetable
> invalidation, because again the pagetables are shared.

Yes, but we still need to call invalidate_range() while under the page
table spinlock as the hardware sharing the CPU page table has its own
TLB (not even talking about the TLBs of the individual devices that
use PASID/ATS).

If we do not call it under the spinlock then there is a chance that
something else gets mapped between the time we drop the CPU page table
lock and the time we call mmu_notifier_invalidate_range_end(), which
itself calls invalidate_range().

Actually I would remove the call to invalidate_range() from range_end()
and audit all places where we call range_end() to assert that there is
a call to invalidate_range() under the page table spinlock that
precedes it.
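
To make the ordering concrete, here is a minimal sketch (reduced to a
single 4K pte with the locking spelled out by hand; the helper name and
simplifications are mine, this is not the actual rmap code):

static void unmap_pte_and_flush(struct vm_area_struct *vma,
				unsigned long addr, pte_t *ptep,
				spinlock_t *ptl)
{
	struct mm_struct *mm = vma->vm_mm;

	mmu_notifier_invalidate_range_start(mm, addr, addr + PAGE_SIZE);

	spin_lock(ptl);
	ptep_get_and_clear(mm, addr, ptep);
	flush_tlb_page(vma, addr);		/* primary MMU TLB */
	/* secondary TLBs are flushed before anything can be remapped */
	mmu_notifier_invalidate_range(mm, addr, addr + PAGE_SIZE);
	spin_unlock(ptl);

	/* page_remove_rmap()/put_page() go here, page pinned by the caller */

	mmu_notifier_invalidate_range_end(mm, addr, addr + PAGE_SIZE);
}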

> 
> void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> 				  unsigned long start, unsigned long end)
> {
> 	struct mmu_notifier *mn;
> 	int id;
> 
> 	id = srcu_read_lock(&srcu);
> 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> 		/*
> 		 * Call invalidate_range here too to avoid the need for the
> 		 * subsystem of having to register an invalidate_range_end
> 		 * call-back when there is invalidate_range already. Usually a
> 		 * subsystem registers either invalidate_range_start()/end() or
> 		 * invalidate_range(), so this will be no additional overhead
> 		 * (besides the pointer check).
> 		 */
> 		if (mn->ops->invalidate_range)
> 			mn->ops->invalidate_range(mn, mm, start, end);
> 			^^^^^^^^^^^^^^^^^^^^^^^^^
> 		if (mn->ops->invalidate_range_end)
> 			mn->ops->invalidate_range_end(mn, mm, start, end);
> 	}
> 	srcu_read_unlock(&srcu, id);
> }
> 
> So this conversion from invalidate_page to invalidate_range looks
> superfluous and the final mmu_notifier_invalidate_range_end should be
> enough.

See above.

> AFAIK only amd_iommu_v2 and intel-svm (svm as in shared virtual memory)
> use it.

powerpc has something similar too but I don't know its status

> My suggestion is to remove from below all
> mmu_notifier_invalidate_range calls and keep only the
> mmu_notifier_invalidate_range_end and test both amd_iommu_v2 and
> intel-svm with it under heavy swapping.
> 
> The only critical constraint for invalidate_range to stay safe
> with a single call of mmu_notifier_invalidate_range_end after put_page
> is that the put_page cannot be the last put_page. That only applies to
> the case where the page isn't freed through MMU gather (MMU gather
> calls mmu_notifier_invalidate_range of its own before freeing the
> page, whereas MMU gather does nothing for drivers using
> invalidate_range_start/end, because invalidate_range_start acted as a
> barrier to avoid establishing mappings on the secondary MMUs for
> those).
> 
> Not strictly required but I think it would be safer and more efficient
> to replace the put_page with something like:
> 
> static inline void put_page_not_freeing(struct page *page)
> {
> 	page = compound_head(page);
> 
> 	if (put_page_testzero(page))
> 		VM_WARN_ON_PAGE(1, page);
> }

Yes, adding such a check makes sense.

Thanks for looking into this too

Jérôme


* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 17:27       ` Andrea Arcangeli
@ 2017-08-30 18:00         ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-30 18:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jérôme Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>> Therefore, IIUC, try_to_umap_one() should only call
>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
> 
> That would trigger an unnecessarily double call to
> ->invalidate_range() both from mmu_notifier_invalidate_range() after
> ptep_get_and_clear() and later at the end in
> mmu_notifier_invalidate_range_end().
> 
> The only advantage of adding a mmu_notifier_invalidate_range() after
> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
> the pagetables have to be shared with the primary MMU in order to use
> the ->invalidate_range()) inside the PT lock.
> 
> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
> needed or not, again boils down to the issue if the old code calling
> ->invalidate_page outside the PT lock was always broken before or
> not. I don't see why exactly it was broken, we even hold the page lock
> there so I don't see a migration race possible either. Of course the
> constraint to be safe is that the put_page in try_to_unmap_one cannot
> be the last one, and that had to be enforced by the caller taking an
> additional reference on it.
> 
> One can wonder if the primary MMU TLB flush in ptep_clear_flush
> (should_defer_flush returning false) could be put after releasing the
> PT lock too (because that's not different from deferring the secondary
> MMU notifier TLB flush in ->invalidate_range down to
> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
> which may be safe too for the same reasons.
> 
> When should_defer_flush returns true we already defer the primary MMU
> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
> just after the PT lock release so at least when should_defer_flush is
> true, it looks obviously safe to defer the secondary MMU TLB flush to
> mmu_notifier_invalidate_range_end for the drivers implementing
> ->invalidate_range.

Agreed. It just seems like an enhancement and not directly a bug fix.

> 
> If I'm wrong and all TLB flushes must happen inside the PT lock, then
> we should at least reconsider the implicit call to ->invalidate_range
> method from mmu_notifier_invalidate_range_end or we would call it
> twice unnecessarily which doesn't look optimal. Either ways this
> doesn't look optimal. We would need to introduce a
> mmu_notifier_invalidate_range_end_full that calls also
> ->invalidate_range in such case so we skip the ->invalidate_range call
> in mmu_notifier_invalidate_range_end if we put an explicit
> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
> lock like you suggested above.

It is not trivial to flush TLBs (primary or secondary) without holding the
page-table lock, and as we recently encountered, this has resulted in several
bugs [1]. The main problem is that even if you perform the TLB flush
immediately after the PT-lock is released, you cause a situation in which
other threads may make decisions based on the updated PTE value, without
being aware that a TLB flush is needed.

For example, we recently encountered a Linux bug when two threads run
MADV_DONTNEED concurrently on the same address range [2]. One of the threads
may update a PTE, setting it as non-present, and then defer the TLB
flush (for batching). This, however, causes the second thread, which also
changes the PTEs, to assume that the PTE is already non-present and that a
TLB flush is not necessary. As a result the second core may
still hold stale PTEs in its TLB following MADV_DONTNEED.
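
In userspace terms the trigger is just two threads zapping the same range
concurrently; a minimal sketch of that access pattern (illustrative only,
it does not deterministically expose the stale-TLB window):

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(2UL << 20)

static void *zap(void *mem)
{
	/* may race with the same call on the other thread */
	madvise(mem, LEN, MADV_DONTNEED);
	return NULL;
}

int main(void)
{
	void *mem = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pthread_t t1, t2;

	memset(mem, 1, LEN);	/* populate the ptes */
	pthread_create(&t1, NULL, zap, mem);
	pthread_create(&t2, NULL, zap, mem);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}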

There is a swarm of such problems, and some are not too trivial. Deferring
TLB flushes needs to be done in a very careful manner.


[1] https://marc.info/?t=149973438800002&r=1
[2] http://www.spinics.net/lists/linux-mm/msg131525.html


* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 17:27       ` Andrea Arcangeli
@ 2017-08-30 18:20         ` Jerome Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-30 18:20 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Nadav Amit, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 07:27:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> > Therefore, IIUC, try_to_umap_one() should only call
> > mmu_notifier_invalidate_range() after ptep_get_and_clear() and
> 
> That would trigger an unnecessarily double call to
> ->invalidate_range() both from mmu_notifier_invalidate_range() after
> ptep_get_and_clear() and later at the end in
> mmu_notifier_invalidate_range_end().
> 
> The only advantage of adding a mmu_notifier_invalidate_range() after
> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
> the pagetables have to be shared with the primary MMU in order to use
> the ->invalidate_range()) inside the PT lock.
> 
> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
> needed or not, again boils down to the issue if the old code calling
> ->invalidate_page outside the PT lock was always broken before or
> not. I don't see why exactly it was broken, we even hold the page lock
> there so I don't see a migration race possible either. Of course the
> constraint to be safe is that the put_page in try_to_unmap_one cannot
> be the last one, and that had to be enforced by the caller taking an
> additional reference on it.
> 
> One can wonder if the primary MMU TLB flush in ptep_clear_flush
> (should_defer_flush returning false) could be put after releasing the
> PT lock too (because that's not different from deferring the secondary
> MMU notifier TLB flush in ->invalidate_range down to
> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
> which may be safe too for the same reasons.
> 
> When should_defer_flush returns true we already defer the primary MMU
> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
> just after the PT lock release so at least when should_defer_flush is
> true, it looks obviously safe to defer the secondary MMU TLB flush to
> mmu_notifier_invalidate_range_end for the drivers implementing
> ->invalidate_range.
> 
> If I'm wrong and all TLB flushes must happen inside the PT lock, then
> we should at least reconsider the implicit call to ->invalidate_range
> method from mmu_notifier_invalidate_range_end or we would call it
> twice unnecessarily which doesn't look optimal. Either ways this
> doesn't look optimal. We would need to introduce a
> mmu_notifier_invalidate_range_end_full that calls also
> ->invalidate_range in such case so we skip the ->invalidate_range call
> in mmu_notifier_invalidate_range_end if we put an explicit
> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
> lock like you suggested above.

So I went over the calls to try_to_unmap() (try_to_munlock() is fine as it
does not clear the CPU page table entry). I believe there are 2 cases
where you can get a new pte entry after we release the spinlock and before
we call mmu_notifier_invalidate_range_end()

First case is :
if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
  ...
  break;
}

The pte is clear, there was an error condition and this should never
happen but a racing thread might install a new pte in the meantime.
Maybe we should restore the pte value here. Anyway when this happens
bad things are going on.

The second case is non dirty anonymous page and MADV_FREE. But here
the application is telling us that no one should care about that
virtual address any more. So i am not sure how much we should care.


If we ignore these 2 cases then the CPU pte can never be replaced by
something else between the time we release the spinlock and the time
we call mmu_notifier_invalidate_range_end(), so not invalidating the
device TLBs is ok here. Do we want this kind of optimization?

Jérôme


* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 18:20         ` Jerome Glisse
@ 2017-08-30 18:40           ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-30 18:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrea Arcangeli, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

Jerome Glisse <jglisse@redhat.com> wrote:

> On Wed, Aug 30, 2017 at 07:27:47PM +0200, Andrea Arcangeli wrote:
>> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
>>> Therefore, IIUC, try_to_umap_one() should only call
>>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
>> 
>> That would trigger an unnecessarily double call to
>> ->invalidate_range() both from mmu_notifier_invalidate_range() after
>> ptep_get_and_clear() and later at the end in
>> mmu_notifier_invalidate_range_end().
>> 
>> The only advantage of adding a mmu_notifier_invalidate_range() after
>> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
>> the pagetables have to be shared with the primary MMU in order to use
>> the ->invalidate_range()) inside the PT lock.
>> 
>> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
>> needed or not, again boils down to the issue if the old code calling
>> ->invalidate_page outside the PT lock was always broken before or
>> not. I don't see why exactly it was broken, we even hold the page lock
>> there so I don't see a migration race possible either. Of course the
>> constraint to be safe is that the put_page in try_to_unmap_one cannot
>> be the last one, and that had to be enforced by the caller taking an
>> additional reference on it.
>> 
>> One can wonder if the primary MMU TLB flush in ptep_clear_flush
>> (should_defer_flush returning false) could be put after releasing the
>> PT lock too (because that's not different from deferring the secondary
>> MMU notifier TLB flush in ->invalidate_range down to
>> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
>> which may be safe too for the same reasons.
>> 
>> When should_defer_flush returns true we already defer the primary MMU
>> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
>> just after the PT lock release so at least when should_defer_flush is
>> true, it looks obviously safe to defer the secondary MMU TLB flush to
>> mmu_notifier_invalidate_range_end for the drivers implementing
>> ->invalidate_range.
>> 
>> If I'm wrong and all TLB flushes must happen inside the PT lock, then
>> we should at least reconsider the implicit call to ->invalidate_range
>> method from mmu_notifier_invalidate_range_end or we would call it
>> twice unnecessarily which doesn't look optimal. Either ways this
>> doesn't look optimal. We would need to introduce a
>> mmu_notifier_invalidate_range_end_full that calls also
>> ->invalidate_range in such case so we skip the ->invalidate_range call
>> in mmu_notifier_invalidate_range_end if we put an explicit
>> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
>> lock like you suggested above.
> 
> So I went over the calls to try_to_unmap() (try_to_munlock() is fine as it
> does not clear the CPU page table entry). I believe there are 2 cases
> where you can get a new pte entry after we release the spinlock and before
> we call mmu_notifier_invalidate_range_end()
> 
> First case is :
> if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
>  ...
>  break;
> }
> 
> The pte is clear, there was an error condition and this should never
> happen but a racing thread might install a new pte in the meantime.
> Maybe we should restore the pte value here. Anyway when this happens
> bad things are going on.
> 
> The second case is non dirty anonymous page and MADV_FREE. But here
> the application is telling us that no one should care about that
> virtual address any more. So i am not sure how much we should care.
> 
> 
> If we ignore these 2 cases then the CPU pte can never be replaced by
> something else between the time we release the spinlock and the time
> we call mmu_notifier_invalidate_range_end(), so not invalidating the
> device TLBs is ok here. Do we want this kind of optimization?

The mmu_notifier users would have to be aware that invalidations may be
deferred. If they perform their "invalidations" unconditionally, it may be
ok. If the notifier users avoid invalidations based on the PTE in the
secondary page-table, it can be a problem.

On another note, you may want to consider combining the secondary page-table
mechanisms with the existing TLB-flush mechanisms. Right now, it is
partially done: tlb_flush_mmu_tlbonly(), for example, calls
mmu_notifier_invalidate_range(). However, tlb_gather_mmu() does not call
mmu_notifier_invalidate_range_start().

This can also prevent all kinds of inconsistencies and potential bugs. For
instance, clear_refs_write() calls mmu_notifier_invalidate_range_start/end()
but in between there is no call to mmu_notifier_invalidate_range().
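
A rough sketch of the kind of combination suggested above (the wrapper
names are hypothetical, not existing kernel APIs):

/*
 * Hypothetical wrappers, only to illustrate letting the mmu_gather own
 * the notifier bracketing so callers cannot forget it.
 */
static void tlb_gather_mmu_notify(struct mmu_gather *tlb, struct mm_struct *mm,
				  unsigned long start, unsigned long end)
{
	tlb_gather_mmu(tlb, mm, start, end);
	mmu_notifier_invalidate_range_start(mm, start, end);
}

static void tlb_finish_mmu_notify(struct mmu_gather *tlb,
				  unsigned long start, unsigned long end)
{
	tlb_finish_mmu(tlb, start, end);
	mmu_notifier_invalidate_range_end(tlb->mm, start, end);
}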

Regards,
Nadav


* Re: [PATCH 11/13] xen/gntdev: update to new mmu_notifier semantic
  2017-08-29 23:54   ` Jérôme Glisse
  (?)
@ 2017-08-30 19:32   ` Boris Ostrovsky
  -1 siblings, 0 replies; 160+ messages in thread
From: Boris Ostrovsky @ 2017-08-30 19:32 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: Andrea Arcangeli, xen-devel, Andrew Morton, Linus Torvalds,
	Kirill A . Shutemov, Roger Pau Monné

On 08/29/2017 07:54 PM, Jérôme Glisse wrote:
> Call to mmu_notifier_invalidate_page() are replaced by call to
> mmu_notifier_invalidate_range() and thus call are bracketed by
> call to mmu_notifier_invalidate_range_start()/end()
>
> Remove now useless invalidate_page callback.
>
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Cc: Roger Pau Monné <roger.pau@citrix.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: xen-devel@lists.xenproject.org
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  drivers/xen/gntdev.c | 8 --------
>  1 file changed, 8 deletions(-)

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

I also ran a bunch of tests (mostly bringing up/tearing down various Xen
guests). Haven't seen any issues.

-boris


* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 18:40           ` Nadav Amit
@ 2017-08-30 20:45             ` Jerome Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-30 20:45 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrea Arcangeli, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 11:40:08AM -0700, Nadav Amit wrote:
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > On Wed, Aug 30, 2017 at 07:27:47PM +0200, Andrea Arcangeli wrote:
> >> On Tue, Aug 29, 2017 at 07:46:07PM -0700, Nadav Amit wrote:
> >>> Therefore, IIUC, try_to_umap_one() should only call
> >>> mmu_notifier_invalidate_range() after ptep_get_and_clear() and
> >> 
> >> That would trigger an unnecessarily double call to
> >> ->invalidate_range() both from mmu_notifier_invalidate_range() after
> >> ptep_get_and_clear() and later at the end in
> >> mmu_notifier_invalidate_range_end().
> >> 
> >> The only advantage of adding a mmu_notifier_invalidate_range() after
> >> ptep_get_and_clear() is to flush the secondary MMU TLBs (keep in mind
> >> the pagetables have to be shared with the primary MMU in order to use
> >> the ->invalidate_range()) inside the PT lock.
> >> 
> >> So if mmu_notifier_invalidate_range() after ptep_get_and_clear() is
> >> needed or not, again boils down to the issue if the old code calling
> >> ->invalidate_page outside the PT lock was always broken before or
> >> not. I don't see why exactly it was broken, we even hold the page lock
> >> there so I don't see a migration race possible either. Of course the
> >> constraint to be safe is that the put_page in try_to_unmap_one cannot
> >> be the last one, and that had to be enforced by the caller taking an
> >> additional reference on it.
> >> 
> >> One can wonder if the primary MMU TLB flush in ptep_clear_flush
> >> (should_defer_flush returning false) could be put after releasing the
> >> PT lock too (because that's not different from deferring the secondary
> >> MMU notifier TLB flush in ->invalidate_range down to
> >> mmu_notifier_invalidate_range_end) even if TTU_BATCH_FLUSH isn't set,
> >> which may be safe too for the same reasons.
> >> 
> >> When should_defer_flush returns true we already defer the primary MMU
> >> TLB flush to much later to even mmu_notifier_invalidate_range_end, not
> >> just after the PT lock release so at least when should_defer_flush is
> >> true, it looks obviously safe to defer the secondary MMU TLB flush to
> >> mmu_notifier_invalidate_range_end for the drivers implementing
> >> ->invalidate_range.
> >> 
> >> If I'm wrong and all TLB flushes must happen inside the PT lock, then
> >> we should at least reconsider the implicit call to ->invalidate_range
> >> method from mmu_notifier_invalidate_range_end or we would call it
> >> twice unnecessarily which doesn't look optimal. Either ways this
> >> doesn't look optimal. We would need to introduce a
> >> mmu_notifier_invalidate_range_end_full that calls also
> >> ->invalidate_range in such case so we skip the ->invalidate_range call
> >> in mmu_notifier_invalidate_range_end if we put an explicit
> >> mmu_notifier_invalidate_range() after ptep_get_and_clear inside the PT
> >> lock like you suggested above.
> > 
> > So i went over call to try_to_unmap() (try_to_munlock() is fine as it
> > does not clear the CPU page table entry). I believe they are 2 cases
> > where you can get a new pte entry after we release spinlock and before
> > we call mmu_notifier_invalidate_range_end()
> > 
> > First case is :
> > if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
> >  ...
> >  break;
> > }
> > 
> > The pte is clear, there was an error condition and this should never
> > happen but a racing thread might install a new pte in the meantime.
> > Maybe we should restore the pte value here. Anyway when this happens
> > bad things are going on.
> > 
> > The second case is non dirty anonymous page and MADV_FREE. But here
> > the application is telling us that no one should care about that
> > virtual address any more. So i am not sure how much we should care.
> > 
> > 
> > If we ignore this 2 cases than the CPU pte can never be replace by
> > something else between the time we release the spinlock and the time
> > we call mmu_notifier_invalidate_range_end() so not invalidating the
> > devices tlb is ok here. Do we want this kind of optimization ?
> 
> The mmu_notifier users would have to be aware that invalidations may be
> deferred. If they perform their ``invalidations’’ unconditionally, it may be
> ok. If the notifier users avoid invalidations based on the PTE in the
> secondary page-table, it can be a problem.

So I looked at both the AMD and Intel IOMMUs. AMD always flushes and the current
pte value does not matter AFAICT (I doubt the hardware re-walks the page table
just to decide not to flush; that would be a terribly dumb thing for a hardware
engineer to do).

Intel has a deferred flush mechanism: basically, if no device is actively using
the page table then there is no flush (see deferred invalidation in [1]). But I
am unsure about the hardware, ie does it also mean that when a PASID is not
actively used then the IOMMU TLB is also invalid for that PASID. I am also a bit
unsure about the ATS/PASID specification with respect to devices always reporting
when they are done with a given PASID (ie does the spec say that the device TLB
must be invalidated when the device stops using a PASID).

[1] https://www.intel.com/content/www/us/en/embedded/technology/virtualization/vt-directed-io-spec.html

So I think we can err on the side of caution here and call invalidate_range() under
the page table lock. If IOMMU folks see a performance issue with a real workload due
to the double invalidation that takes place then we can work on that.
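
To make the ordering concrete, here is a minimal sketch of what that looks like in
try_to_unmap_one(); this is hand-simplified pseudo kernel code following the
4.13-era API, not the actual patch, and the local variable names are placeholders:

	mmu_notifier_invalidate_range_start(mm, start, end);
	...
	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
	pteval = ptep_get_and_clear(mm, address, pte);
	flush_tlb_page(vma, address);		/* primary MMU TLB flush */
	set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
	/* secondary MMU (iommuv2/intel-svm) flush while still holding the PT lock */
	mmu_notifier_invalidate_range(mm, address, address + PAGE_SIZE);
	pte_unmap_unlock(pte, ptl);
	...
	/* ->invalidate_range() also runs again from here, hence the double call */
	mmu_notifier_invalidate_range_end(mm, start, end);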

KVM and Xen are not impacted by this as they only care about start/end with this
patchset.

Andrea, is that in line with your assessment?


> On another note, you may want to consider combining the secondary page-table
> mechanisms with the existing TLB-flush mechanisms. Right now, it is
> partially done: tlb_flush_mmu_tlbonly(), for example, calls
> mmu_notifier_invalidate_range(). However, tlb_gather_mmu() does not call
> mmu_notifier_invalidate_range_start().

tlb_gather_mmu() is always called from places that are themselves bracketed by
mmu_notifier_invalidate_range_start()/end().
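
A sketch of the usual shape of such a caller, loosely modeled on zap_page_range();
simplified, not verbatim kernel code:

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);
	mmu_notifier_invalidate_range_start(mm, start, end);
	unmap_single_vma(&tlb, vma, start, end, NULL);
	/* ->invalidate_range() for iommuv2/intel-svm is also called from range_end() */
	mmu_notifier_invalidate_range_end(mm, start, end);
	tlb_finish_mmu(&tlb, start, end);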

> 
> This can also prevent all kind of inconsistencies, and potential bugs. For
> instance, clear_refs_write() calls mmu_notifier_invalidate_range_start/end()
> but in between there is no call for mmu_notifier_invalidate_range().

It is for soft-dirty; we should probably call invalidate_range() in that case. I
need to check how dirtiness is handled with ATS/PASID, ie does the device update
the dirty bit of the CPU page table on write, or maybe the device doesn't update
the dirty flag at all.

Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 18:40           ` Nadav Amit
@ 2017-08-30 20:55             ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 20:55 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Jerome Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 11:40:08AM -0700, Nadav Amit wrote:
> The mmu_notifier users would have to be aware that invalidations may be
> deferred. If they perform their ``invalidations’’ unconditionally, it may be
> ok. If the notifier users avoid invalidations based on the PTE in the
> secondary page-table, it can be a problem.

invalidate_page was always deferred post PT lock release.

This ->invalidate_range post PT lock release is not a new thing; we're
still back to square one to find out whether the invalidate_page callout
after PT lock release has always been broken here or not.

> On another note, you may want to consider combining the secondary page-table
> mechanisms with the existing TLB-flush mechanisms. Right now, it is
> partially done: tlb_flush_mmu_tlbonly(), for example, calls
> mmu_notifier_invalidate_range(). However, tlb_gather_mmu() does not call
> mmu_notifier_invalidate_range_start().

If you implement ->invalidate_range_start you don't care about tlb
gather at all and you must not implement ->invalidate_range.
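
In other words a registrant picks one of two shapes for its mmu_notifier_ops.
This is an illustrative sketch only; the callback fields come from the 4.13-era
struct, while the my_* handlers are made-up names:

	/* sleepable users (KVM, ODP, ...): bracketed invalidation, no tlb gather */
	static const struct mmu_notifier_ops sleepable_ops = {
		.invalidate_range_start	= my_range_start,
		.invalidate_range_end	= my_range_end,
	};

	/* users sharing the CPU page tables (amd_iommu_v2, intel-svm) */
	static const struct mmu_notifier_ops shared_pagetable_ops = {
		.invalidate_range	= my_range_flush,
	};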

> This can also prevent all kind of inconsistencies, and potential bugs. For
> instance, clear_refs_write() calls mmu_notifier_invalidate_range_start/end()
> but in between there is no call for mmu_notifier_invalidate_range().

It's done in mmu_notifier_invalidate_range_end which is again fully
equivalent except that it runs after PT lock release.
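
That is the key point: __mmu_notifier_invalidate_range_end() has roughly the
shape below (a simplified sketch; the real function also takes the SRCU read
lock), so every range_end already implies a range flush:

	void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
						 unsigned long start, unsigned long end)
	{
		struct mmu_notifier *mn;

		hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
			if (mn->ops->invalidate_range)
				mn->ops->invalidate_range(mn, mm, start, end);
			if (mn->ops->invalidate_range_end)
				mn->ops->invalidate_range_end(mn, mm, start, end);
		}
	}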

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 18:00         ` Nadav Amit
@ 2017-08-30 21:25           ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 21:25 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Jérôme Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
> It is not trivial to flush TLBs (primary or secondary) without holding the
> page-table lock, and as we recently encountered this resulted in several
> bugs [1]. The main problem is that even if you perform the TLB flush
> immediately after the PT-lock is released, you cause a situation in which
> other threads may make decisions based on the updated PTE value, without
> being aware that a TLB flush is needed.
> 
> For example, we recently encountered a Linux bug when two threads run
> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
> may update a PTE, setting it as non-present, and then deferring the TLB
> flush (for batching). As a result, however, it would cause the second
> thread, which also changes the PTEs to assume that the PTE is already
> non-present and TLB flush is not necessary. As a result the second core may
> still hold stale PTEs in its TLB following MADV_DONTNEED.

The source of those complex races, which require taking nested tlb gather
into account to solve, is skipping primary MMU tlb flushes depending on the
value of the pagetables (as an optimization).

For mmu_notifier_invalidate_range_end we always ignore the value of
the pagetables and mmu_notifier_invalidate_range_end always runs
unconditionally, invalidating the secondary MMU for the whole range
under consideration. There are no optimizations that attempt to skip
mmu_notifier_invalidate_range_end depending on the pagetable value and
there's no TLB gather for secondary MMUs either. That is to keep it
simple of course.

As long as those mmu_notifier_invalidate_range_end calls stay unconditional,
I don't see how the races you linked can possibly be relevant in
evaluating whether ->invalidate_range (again only for iommuv2 and
intel-svm) has to run inside the PT lock or not.

> There is a swarm of such problems, and some are not too trivial. Deferring
> TLB flushes needs to be done in a very careful manner.

I agree it's not trivial, but I don't think any complexity comes from
above.

The only complexity is: what if the page is copied to some other page
and replaced? The copy is the only case where coherency could be
retained by the primary MMU. What if the primary MMU starts working on
the new page in between PT lock release and
mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
the old page? That is the only problem we deal with here, the
copy-to-another-page-and-replace case. Any other case that doesn't
involve the copy seems non-coherent by definition, and you couldn't
measure it.

I can't think of a scenario that requires the explicit
mmu_notifier_invalidate_range call before releasing the PT lock, at
least for try_to_unmap_one.

Could you think of a scenario where calling ->invalidate_range inside
mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
intel-svm? Those two are the only ones requiring
->invalidate_range calls, all other mmu notifier users are safe
without running mmu_notifier_invalidate_range_end under PT lock thanks
to mmu_notifier_invalidate_range_start blocking the secondary MMU.

Could you post a tangible scenario that invalidates my theory that
those mmu_notifier_invalidate_range calls inside PT lock would be
superfluous?

Some of the scenarios under consideration:

1) migration entry -> migration_entry_wait -> page lock, plus
   migrate_pages taking the lock so it can't race with try_to_unmap
   from other places
2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
3) CoW -> do_wp_page -> page lock on old page
4) KSM -> replace_page -> page lock on old page
5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
   it's not measurable that we let the guest run a bit longer on the
   old page past PT lock release

If you could post a multi CPU trace that shows how iommuv2 or
intel-svm are broken if ->invalidate_range is run inside
mmu_notifier_invalidate_range_end post PT lock in try_to_unmap_one it
would help.

Of course if we run mmu_notifier_invalidate_range inside the PT lock and
we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
all will be obviously safe, so we could just do it to avoid thinking
about the above, but the resulting code will be uglier and less
optimal (even when disarmed there will be dummy branches I wouldn't
love) and I currently can't see why it wouldn't be safe.

Replacing a page completely without any relation to the old page
content allows no coherency anyway, so even if it breaks you cannot
measure it because it's undefined.

If the page has a relation with the old contents and coherency has to
be provided for both the primary MMU and secondary MMUs, this relation
between the old and new page during the replacement is enforced by some
other means besides the PT lock: a migration entry on the locked old page
with migration_entry_wait, the page lock in migrate_pages, etc.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
       [not found] ` <20170829235447.10050-1-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-08-30  0:11     ` Linus Torvalds
@ 2017-08-30 21:51   ` Felix Kuehling
  2017-08-31 13:59     ` Jerome Glisse
  1 sibling, 1 reply; 160+ messages in thread
From: Felix Kuehling @ 2017-08-30 21:51 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	jglisse-H+wXaHxf7aLQT0dZR+AlfA, Christian König

Hi Jérôme,

I have some questions about the potential range-start-end race you
mentioned.

On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
> Note that a lot of existing user feels broken in respect to range_start/
> range_end. Many user only have range_start() callback but there is nothing
> preventing them to undo what was invalidated in their range_start() callback
> after it returns but before any CPU page table update take place.
>
> The code pattern use in kvm or umem odp is an example on how to properly
> avoid such race. In a nutshell use some kind of sequence number and active
> range invalidation counter to block anything that might undo what the
> range_start() callback did.
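
(A heavily simplified sketch of that sequence/counter pattern, loosely modeled
on what kvm does with mmu_notifier_count/mmu_notifier_seq; apart from
get_user_pages() the names below are made up for illustration:)

	/* ->invalidate_range_start(): mark an invalidation as in flight */
	spin_lock(&lock);
	active_invalidations++;
	spin_unlock(&lock);

	/* ->invalidate_range_end(): invalidation finished, bump the generation */
	spin_lock(&lock);
	seq++;
	active_invalidations--;
	spin_unlock(&lock);

	/* mapping path: retry if an invalidation raced with get_user_pages() */
again:
	saved_seq = seq;
	npages = get_user_pages(addr, 1, FOLL_WRITE, &page, NULL);
	spin_lock(&lock);
	if (active_invalidations || saved_seq != seq) {
		spin_unlock(&lock);
		put_page(page);
		goto again;
	}
	/* install the secondary MMU mapping while still holding the lock */
	spin_unlock(&lock);
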
What happens when we start monitoring an address range after
invalidate_range_start was called? Sounds like we have to keep track of
all active invalidations just for such a case, even in address ranges
that we don't currently care about.

What are the things we cannot do between invalidate_range_start and
invalidate_range_end? amdgpu calls get_user_pages to re-validate our
userptr mappings after the invalidate_range_start notifier invalidated
it. Do we have to wait for invalidate_range_end before we can call
get_user_pages safely?

Regards,
  Felix

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 16:52     ` Andrea Arcangeli
@ 2017-08-30 21:53       ` Linus Torvalds
  -1 siblings, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2017-08-30 21:53 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jérôme Glisse, Linux Kernel Mailing List, linux-mm,
	Dan Williams, Ross Zwisler, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> I pointed out in earlier email ->invalidate_range can only be
> implemented (as mutually exclusive alternative to
> ->invalidate_range_start/end) by secondary MMUs that shares the very
> same pagetables with the core linux VM of the primary MMU, and those
> invalidate_range are already called by
> __mmu_notifier_invalidate_range_end.

I have to admit that I didn't notice that fact - that we are already
in the situation that invalidate_range is called by the range_end()
notifier.

I agree that that should simplify all the code, and means that we
don't have to worry about the few cases that already implemented only
the "invalidate_page()" and "invalidate_range()" cases.

So I think that simplifies Jérôme's patch further - once you have put
the range_start/end() cases around the inner loop, you can just drop
the invalidate_page() things entirely.

> So this conversion from invalidate_page to invalidate_range looks
> superflous and the final mmu_notifier_invalidate_range_end should be
> enough.

Yes. I missed the fact that we already called range() from range_end().

That said, the double call shouldn't hurt correctness, and it's
"closer" to old behavior for those people who only did the range/page
ones, so I wonder if we can keep Jérôme's patch in its current state
for 4.13.

Because I still want to release 4.13 this weekend, despite this
upheaval. Otherwise I'll have timing problems during the next merge
window.

Andrea, do you otherwise agree with the whole series as is?

            Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 20:45             ` Jerome Glisse
@ 2017-08-30 22:17               ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 22:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Nadav Amit, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 04:45:49PM -0400, Jerome Glisse wrote:
> So i look at both AMD and Intel IOMMU. AMD always flush and current pte value
> do not matter AFAICT (i doubt that hardware rewalk the page table just to
> decide not to flush that would be terribly dumb for hardware engineer to do
> so).
> 
> Intel have a deferred flush mecanism, basicly if no device is actively using
> the page table then there is no flush (see deferred invalidation in [1]). But
> i am unsure about the hardware ie does it also means that when a PASID is not
> actively use then the IOMMU TLB is also invalid for that PASID. Also i am bit
> unsure about ATS/PASID specification in respect to having device always report
> when they are done with a given PASID (ie does the spec say that device tlb
> must be invalidated when device stop using a pasid).
> 
> https://www.intel.com/content/www/us/en/embedded/technology/virtualization/vt-directed-io-spec.html
> 
> So i think we can side with caution here and call invalidate_range() under the
> page table lock. If IOMMU folks see performance issue with real workload due
> to the double invalidation that take place then we can work on that.
> 
> KVM or XEN are not impacted by this as they only care about start/end with this
> patchset.
> 
> Andrea is that inline with your assessment ?

That is obviously safe. The mmu_notifier_invalidate_range()
calls under PT lock could always be removed later.

However I'm afraid I have to figure out whether
mmu_notifier_invalidate_range() must run inside the PT lock
regardless, because that's just the very same problem as
->invalidate_page running outside the PT lock in the majority of
production kernels out there:

	pte_unmap_unlock(pte, ptl);
	if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
		mmu_notifier_invalidate_page(mm, address);

So even if we take the easy route for mmu_notifier_invalidate_range,
I'm still forced to think about this issue.

Currently I tend to believe the old code is safe and in turn I'm more
inclined to skip the explicit mmu_notifier_invalidate_range() inside
the PT lock for amd_iommu_v2 and intel-svm, and add it later if it's
truly proven unsafe.

However, as long as the reason for this is just to keep it simpler and
more robust, I don't mind either way. The double call is not nice though,
and not doing the double call will mess up the code more.

So I thought it was attractive to explain why it was safe not to run
mmu_notifier_invalidate_range() inside the PT lock, which would then
allow for the most straightforward and most optimal upstream
implementation (in addition to not having to fix the old code).

> It is for softdirty, we should probably invalidate_range() in that case i
> need to check how dirtyness is handled in ATS/PASID ie does device update
> the dirty bit of the CPU page table on write. Or maybe device don't update
> the dirty flag.

They both call handle_mm_fault and establish a readonly secondary MMU
mapping and then call handle_mm_fault again before there's a DMA to
memory, to establish a writable mapping and set the dirty bit here at
the first write fault post read fault:

	bool write = vmf->flags & FAULT_FLAG_WRITE;

	vmf->ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
		goto unlock;

	entry = pmd_mkyoung(orig_pmd);
	if (write)
		entry = pmd_mkdirty(entry);
[..]
	if (vmf->flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(vmf);
		entry = pte_mkdirty(entry);
	}

I doubt the PT lock is relevant for how the dirty bit gets set by
those two faulting-capable iommus.

All that matters is that after clear_refs_write there's a
mmu_notifier_invalidate_range_end, so ->invalidate_range is called and
at the next write they call handle_mm_fault(FAULT_FLAG_WRITE) once
again. Where exactly the invalidate runs (i.e. post PT lock) I don't
see a big issue there; clear_refs definitely wouldn't change the
pte/hugepmd to point to a different page, so any coherency issue with
the primary and secondary MMU temporarily being out of sync doesn't
exist there.

Kirill is the last committer to the handle_mm_fault line in both
drivers, so it'd be great if he could double-check to confirm that's
the way the dirty bit is handled and, in turn, that it causes no
concern at ->invalidate_range time.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 21:53       ` Linus Torvalds
@ 2017-08-30 23:01         ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-30 23:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jérôme Glisse, Linux Kernel Mailing List, linux-mm,
	Dan Williams, Ross Zwisler, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

On Wed, Aug 30, 2017 at 02:53:38PM -0700, Linus Torvalds wrote:
> On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> >
> > I pointed out in earlier email ->invalidate_range can only be
> > implemented (as mutually exclusive alternative to
> > ->invalidate_range_start/end) by secondary MMUs that shares the very
> > same pagetables with the core linux VM of the primary MMU, and those
> > invalidate_range are already called by
> > __mmu_notifier_invalidate_range_end.
> 
> I have to admit that I didn't notice that fact - that we are already
> in the situation that
> invalidate_range is called by by the rand_end() nofifier.
> 
> I agree that that should simplify all the code, and means that we
> don't have to worry about the few cases that already implemented only
> the "invalidate_page()" and "invalidate_range()" cases.
> 
> So I think that simplifies Jérôme's patch further - once you have put
> the range_start/end() cases around the inner loop, you can just drop
> the invalidate_page() things entirely.
> 
> > So this conversion from invalidate_page to invalidate_range looks
> > superflous and the final mmu_notifier_invalidate_range_end should be
> > enough.
> 
> Yes. I missed the fact that we already called range() from range_end().
> 
> That said, the double call shouldn't hurt correctness, and it's
> "closer" to old behavior for those people who only did the range/page
> ones, so I wonder if we can keep Jérôme's patch in its current state
> for 4.13.

Yes, the double call doesn't hurt correctness. Keeping it in its current
state is safer if anything, so I've no objection to it other than that I'd
like to optimize it further if possible, but that can be done later.

We're already running the double call in various fast paths too in
fact, and rmap walk isn't the fastest path that would be doing such
double call, so it's not a major concern.

Also not a bug, but one further (but more obviously safe) enhancement
I would like is to restrict those rmap invalidation ranges to
PAGE_SIZE << compound_order(page) instead of PMD_SIZE/PMD_MASK.

+	/*
+	 * We have to assume the worse case ie pmd for invalidation. Note that
+	 * the page can not be free in this function as call of try_to_unmap()
+	 * must hold a reference on the page.
+	 */
+	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
+	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

We don't need to invalidate 2MB of secondary MMU mappings surrounding
a 4KB page, just to swapout a 4k page. split_huge_page can't run while
holding the rmap locks, so compound_order(page) is safe to use there.

It can also be optimized incrementally later.
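
(Roughly, the incremental change being suggested, as an untested sketch:)

	end = min(vma->vm_end,
		  start + (PAGE_SIZE << compound_order(page)));
	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);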

> Because I still want to release 4.13 this weekend, despite this
> upheaval. Otherwise I'll have timing problems during the next merge
> window.
> 
> Andrea, do you otherwise agree with the whole series as is?

I only wish we had more time to test Jerome's patchset, but I sure
agree in principle and I don't see regressions in it.

The callouts to ->invalidate_page seem to have diminished over time
(for the various reasons we know), so if we don't use it for the fast
paths, using it only in rmap walk slow paths probably wasn't providing
much performance benefit.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 21:25           ` Andrea Arcangeli
@ 2017-08-30 23:25             ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-30 23:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jérôme Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton, Joerg Roedel, iommu

[cc’ing IOMMU people, who for some reason are not cc’d]

Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
>> It is not trivial to flush TLBs (primary or secondary) without holding the
>> page-table lock, and as we recently encountered this resulted in several
>> bugs [1]. The main problem is that even if you perform the TLB flush
>> immediately after the PT-lock is released, you cause a situation in which
>> other threads may make decisions based on the updated PTE value, without
>> being aware that a TLB flush is needed.
>> 
>> For example, we recently encountered a Linux bug when two threads run
>> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
>> may update a PTE, setting it as non-present, and then deferring the TLB
>> flush (for batching). As a result, however, it would cause the second
>> thread, which also changes the PTEs to assume that the PTE is already
>> non-present and TLB flush is not necessary. As a result the second core may
>> still hold stale PTEs in its TLB following MADV_DONTNEED.
> 
> The source of those complex races that requires taking into account
> nested tlb gather to solve it, originates from skipping primary MMU
> tlb flushes depending on the value of the pagetables (as an
> optimization).
> 
> For mmu_notifier_invalidate_range_end we always ignore the value of
> the pagetables and mmu_notifier_invalidate_range_end always runs
> unconditionally invalidating the secondary MMU for the whole range
> under consideration. There are no optimizations that attempts to skip
> mmu_notifier_invalidate_range_end depending on the pagetable value and
> there's no TLB gather for secondary MMUs either. That is to keep it
> simple of course.
> 
> As long as those mmu_notifier_invalidate_range_end stay unconditional,
> I don't see how those races you linked, can be possibly relevant in
> evaluating if ->invalidate_range (again only for iommuv2 and
> intel-svm) has to run inside the PT lock or not.

Thanks for the clarifications. It now makes much more sense.

> 
>> There is a swarm of such problems, and some are not too trivial. Deferring
>> TLB flushes needs to be done in a very careful manner.
> 
> I agree it's not trivial, but I don't think any complexity comes from
> above.
> 
> The only complexity is about, what if the page is copied to some other
> page and replaced, because the copy is the only case where coherency
> could be retained by the primary MMU. What if the primary MMU starts
> working on the new page in between PT lock release and
> mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
> the old page? That is the only problem we deal with here, the copy to
> other page and replace. Any other case that doesn't involve the copy
> seems non coherent by definition, and you couldn't measure it.
> 
> I can't think of a scenario that requires the explicit
> mmu_notifier_invalidate_range call before releasing the PT lock, at
> least for try_to_unmap_one.
> 
> Could you think of a scenario where calling ->invalidate_range inside
> mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
> intel-svm? Those two are the only ones requiring
> ->invalidate_range calls, all other mmu notifier users are safe
> without running mmu_notifier_invalidate_range_end under PT lock thanks
> to mmu_notifier_invalidate_range_start blocking the secondary MMU.
> 
> Could you post a tangible scenario that invalidates my theory that
> those mmu_notifier_invalidate_range calls inside PT lock would be
> superfluous?
> 
> Some of the scenarios under consideration:
> 
> 1) migration entry -> migration_entry_wait -> page lock, plus
>   migrate_pages taking the lock so it can't race with try_to_unmap
>   from other places
> 2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
> 3) CoW -> do_wp_page -> page lock on old page
> 4) KSM -> replace_page -> page lock on old page
> 5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
>   it's not measurable that we let the guest run a bit longer on the
>   old page past PT lock release

For both CoW and KSM, correctness is maintained by calling
ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
(i.e., replace ptep_clear_flush_notify() with ptep_clear_flush()), you
will cause memory corruption, and the page lock would not be enough.

BTW: I see some calls to ptep_clear_flush_notify() which are immediately
followed by set_pte_at_notify(). I do not understand why that makes sense,
as both notifications end up doing the same thing - invalidating the
secondary MMU. The set_pte_at_notify() in these cases can be changed to
set_pte(). No?

> If you could post a multi CPU trace that shows how iommuv2 or
> intel-svm are broken if ->invalidate_range is run inside
> mmu_notifier_invalidate_range_end post PT lock in try_to_unmap_one it
> would help.
> 
> Of course if we run mmu_notifier_invalidate_range inside PT lock and
> we remove ->invalidate_range from mmu_notifier_invalidate_range_stop
> all will be obviously safe, so we could just do it to avoid thinking
> about the above, but the resulting code will be uglier and less
> optimal (even when disarmed there will be dummy branches I wouldn't
> love) and I currently can't see why it wouldn't be safe.
> 
> Replacing a page completely without any relation to the old page
> content allows no coherency anyway, so even if it breaks you cannot
> measure it because it's undefined.
> 
> If the page has a relation with the old contents and coherency has to
> be provided for both primary MMU and secondary MMUs, this relation
> between old and new page during the replacement, is enforced by some
> other mean besides the PT lock, migration entry on locked old page
> with migration_entry_wait and page lock in migrate_pages etc..

I think that you are basically correct, and assuming that you always
notify/invalidate unconditionally any PTE range you read or write, you
are safe. Yet, I want to have another look at the code. Anyhow, just
deferring all the TLB flushes, including those of set_pte_at_notify(),
is likely to result in errors.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 23:25             ` Nadav Amit
  (?)
@ 2017-08-31  0:47               ` Jerome Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-31  0:47 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrea Arcangeli, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton, Joerg Roedel, iommu

On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
> [cc’ing IOMMU people, which for some reason are not cc’d]
> 
> Andrea Arcangeli <aarcange@redhat.com> wrote:
> 
> > On Wed, Aug 30, 2017 at 11:00:32AM -0700, Nadav Amit wrote:
> >> It is not trivial to flush TLBs (primary or secondary) without holding the
> >> page-table lock, and as we recently encountered this resulted in several
> >> bugs [1]. The main problem is that even if you perform the TLB flush
> >> immediately after the PT-lock is released, you cause a situation in which
> >> other threads may make decisions based on the updated PTE value, without
> >> being aware that a TLB flush is needed.
> >> 
> >> For example, we recently encountered a Linux bug when two threads run
> >> MADV_DONTNEED concurrently on the same address range [2]. One of the threads
> >> may update a PTE, setting it as non-present, and then deferring the TLB
> >> flush (for batching). As a result, however, it would cause the second
> >> thread, which also changes the PTEs to assume that the PTE is already
> >> non-present and TLB flush is not necessary. As a result the second core may
> >> still hold stale PTEs in its TLB following MADV_DONTNEED.
> > 
> > The source of those complex races that requires taking into account
> > nested tlb gather to solve it, originates from skipping primary MMU
> > tlb flushes depending on the value of the pagetables (as an
> > optimization).
> > 
> > For mmu_notifier_invalidate_range_end we always ignore the value of
> > the pagetables and mmu_notifier_invalidate_range_end always runs
> > unconditionally invalidating the secondary MMU for the whole range
> > under consideration. There are no optimizations that attempts to skip
> > mmu_notifier_invalidate_range_end depending on the pagetable value and
> > there's no TLB gather for secondary MMUs either. That is to keep it
> > simple of course.
> > 
> > As long as those mmu_notifier_invalidate_range_end stay unconditional,
> > I don't see how those races you linked, can be possibly relevant in
> > evaluating if ->invalidate_range (again only for iommuv2 and
> > intel-svm) has to run inside the PT lock or not.
> 
> Thanks for the clarifications. It now makes much more sense.
> 
> > 
> >> There is a swarm of such problems, and some are not too trivial. Deferring
> >> TLB flushes needs to be done in a very careful manner.
> > 
> > I agree it's not trivial, but I don't think any complexity comes from
> > above.
> > 
> > The only complexity is about, what if the page is copied to some other
> > page and replaced, because the copy is the only case where coherency
> > could be retained by the primary MMU. What if the primary MMU starts
> > working on the new page in between PT lock release and
> > mmu_notifier_invalidate_range_end, while the secondary MMU is stuck on
> > the old page? That is the only problem we deal with here, the copy to
> > other page and replace. Any other case that doesn't involve the copy
> > seems non coherent by definition, and you couldn't measure it.
> > 
> > I can't think of a scenario that requires the explicit
> > mmu_notifier_invalidate_range call before releasing the PT lock, at
> > least for try_to_unmap_one.
> > 
> > Could you think of a scenario where calling ->invalidate_range inside
> > mmu_notifier_invalidate_range_end post PT lock breaks iommuv2 or
> > intel-svm? Those two are the only ones requiring
> > ->invalidate_range calls, all other mmu notifier users are safe
> > without running mmu_notifier_invalidate_range_end under PT lock thanks
> > to mmu_notifier_invalidate_range_start blocking the secondary MMU.
> > 
> > Could you post a tangible scenario that invalidates my theory that
> > those mmu_notifier_invalidate_range calls inside PT lock would be
> > superfluous?
> > 
> > Some of the scenarios under consideration:
> > 
> > 1) migration entry -> migration_entry_wait -> page lock, plus
> >   migrate_pages taking the lock so it can't race with try_to_unmap
> >   from other places
> > 2) swap entry -> lookup_swap_cache -> page lock (page not really replaced)
> > 3) CoW -> do_wp_page -> page lock on old page
> > 4) KSM -> replace_page -> page lock on old page
> > 5) if the pte is zapped as in MADV_DONTNEED, no coherency possible so
> >   it's not measurable that we let the guest run a bit longer on the
> >   old page past PT lock release
> 
> For both CoW and KSM, the correctness is maintained by calling
> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
> (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
> will cause memory corruption, and page-lock would not be enough.

Just to add to this: the IOMMU has its own CPU page table walker and it can
walk the page table at any time (not just the page table currently active on
a CPU; the IOMMU has an array that matches a PASID with a page table, and
devices request translation of a given virtual address against a PASID).

So this means the following can happen with ptep_clear_flush() only:

  CPU                          | IOMMU
                               | - walk page table populate tlb at addr A
  - clear pte at addr A        |
  - set new pte                |

The device is using the old page while the CPU uses the new page :(

But with ptep_clear_flush_notify()

  CPU                          | IOMMU
                               | - walk page table populate tlb at addr A
  - clear pte at addr A        |
  - notify -> invalidate_range | > flush IOMMU/device tlb
  - set new pte                |

So now the IOMMU either sees the empty pte and triggers a page fault (this is
if there is a racing IOMMU ATS right after the IOMMU/device TLB flush but
before setting the new pte) or it sees the new pte. Either way both IOMMU
and CPU have a coherent view of what a virtual address points to.
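
To make the "notify" step concrete, ptep_clear_flush_notify() is roughly
the following (a simplified sketch from memory of
include/linux/mmu_notifier.h, not a verbatim copy):

	#define ptep_clear_flush_notify(__vma, __address, __ptep)	\
	({								\
		unsigned long ___addr = (__address) & PAGE_MASK;	\
		struct mm_struct *___mm = (__vma)->vm_mm;		\
		pte_t ___pte;						\
									\
		/* clear the pte and flush the primary MMU TLB */	\
		___pte = ptep_clear_flush(__vma, __address, __ptep);	\
		/* flush the secondary (IOMMU/device) TLBs right away */\
		mmu_notifier_invalidate_range(___mm, ___addr,		\
					      ___addr + PAGE_SIZE);	\
		___pte;							\
	})

So the secondary TLB shootdown happens right at the clear, before any new
pte can be installed.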

> 
> BTW: I see some calls to ptep_clear_flush_notify() which are followed
> immediately after by set_pte_at_notify(). I do not understand why it makes
> sense, as both notifications end up doing the same thing - invalidating the
> secondary MMU. The set_pte_at_notify() in these cases can be changed to
> set_pte(). No?

Andrea explained to me the historical reason set_pte_at_notify() calls the
change_pte callback: it was intended so that KVM could update the secondary
page table directly without having to fault. It is now a pointless
optimization, as the call to range_start() that happens everywhere before
any set_pte_at_notify() already invalidates the secondary page table and
thus will lead to a page fault for the VM anyway. I have talked with Andrea
about ways to bring back this optimization.
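
For reference, set_pte_at_notify() is basically set_pte_at() plus the
change_pte callback (again a simplified sketch from memory, not a
verbatim copy):

	#define set_pte_at_notify(__mm, __address, __ptep, __pte)	\
	({								\
		/* let listeners (historically KVM) update their */	\
		/* secondary page table directly, without faulting */	\
		mmu_notifier_change_pte(__mm, __address, __pte);	\
		set_pte_at(__mm, __address, __ptep, __pte);		\
	})

With range_start() having already torn down the secondary page table,
the extra ->change_pte() notification currently buys nothing, which is
why the optimization is pointless today.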

> > If you could post a multi CPU trace that shows how iommuv2 or
> > intel-svm are broken if ->invalidate_range is run inside
> > mmu_notifier_invalidate_range_end post PT lock in try_to_unmap_one it
> > would help.
> > 
> > Of course if we run mmu_notifier_invalidate_range inside PT lock and
> > we remove ->invalidate_range from mmu_notifier_invalidate_range_stop
> > all will be obviously safe, so we could just do it to avoid thinking
> > about the above, but the resulting code will be uglier and less
> > optimal (even when disarmed there will be dummy branches I wouldn't
> > love) and I currently can't see why it wouldn't be safe.
> > 
> > Replacing a page completely without any relation to the old page
> > content allows no coherency anyway, so even if it breaks you cannot
> > measure it because it's undefined.
> > 
> > If the page has a relation with the old contents and coherency has to
> > be provided for both primary MMU and secondary MMUs, this relation
> > between old and new page during the replacement, is enforced by some
> > other mean besides the PT lock, migration entry on locked old page
> > with migration_entry_wait and page lock in migrate_pages etc..
> 
> I think that basically you are correct, and assuming that you always
> notify/invalidate unconditionally any PTE range you read/write, you are
> safe. Yet, I want to have another look at the code. Anyhow, just deferring
> all the TLB flushes, including those of set_pte_at_notify(), is likely to
> result in errors.

Yes we need the following sequence for IOMMU:
 - clear pte
 - invalidate IOMMU/device TLB
 - set new pte

Otherwise the IOMMU page table walker can populate the IOMMU/device TLB
with a stale entry.

Note that this is not necessary in every case. For try_to_unmap, for
instance, it is fine to move the IOMMU TLB shootdown to after changing the
CPU page table, as we are not pointing the pte to a different page. Either
we clear the pte or we set a swap entry, and as long as the page that used
to be pointed to by the pte is not freed before the IOMMU TLB flush, we
are fine.

In fact I think the only case where we need the above sequence (clear,
flush secondary TLB, set new pte) is CoW; see the sketch below. For all
other cases I think we can get rid of invalidate_range() from inside the
page table lock and rely on invalidate_range_end(), which is called
unconditionally.
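
A minimal sketch of that CoW ordering (modeled loosely on the
do_wp_page()/wp_page_copy() flow; simplified, with error handling and
accounting left out):

	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	/*
	 * 1) clear the old pte and flush the primary TLB; thanks to
	 *    _notify this also flushes the IOMMU/device TLB before the
	 *    new page can become visible.
	 */
	ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
	/* 2) only now point the pte at the new (copied) page */
	entry = mk_pte(new_page, vma->vm_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
	pte_unmap_unlock(vmf->pte, vmf->ptl);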

I might ponder all of this for further mmu_notifier cleanups, as I have
some optimizations lined up for it, but that is next-cycle material. For
4.13 I believe the current patchset is the safest way to go.

Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-30 21:51   ` Felix Kuehling
@ 2017-08-31 13:59     ` Jerome Glisse
       [not found]       ` <20170831135953.GA9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 160+ messages in thread
From: Jerome Glisse @ 2017-08-31 13:59 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: intel-gfx, Christian König, amd-gfx

[Adding Intel folks as they might be interested in this discussion]

On Wed, Aug 30, 2017 at 05:51:52PM -0400, Felix Kuehling wrote:
> Hi Jérôme,
> 
> I have some questions about the potential range-start-end race you
> mentioned.
> 
> On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
> > Note that a lot of existing user feels broken in respect to range_start/
> > range_end. Many user only have range_start() callback but there is nothing
> > preventing them to undo what was invalidated in their range_start() callback
> > after it returns but before any CPU page table update take place.
> >
> > The code pattern use in kvm or umem odp is an example on how to properly
> > avoid such race. In a nutshell use some kind of sequence number and active
> > range invalidation counter to block anything that might undo what the
> > range_start() callback did.
> What happens when we start monitoring an address range after
> invaligate_range_start was called? Sounds like we have to keep track of
> all active invalidations for just such a case, even in address ranges
> that we don't currently care about.
> 
> What are the things we cannot do between invalidate_range_start and
> invalidate_range_end? amdgpu calls get_user_pages to re-validate our
> userptr mappings after the invalidate_range_start notifier invalidated
> it. Do we have to wait for invalidate_range_end before we can call
> get_user_pages safely?

Well, the whole userptr bo object is somewhat broken from the start.
You never defined its semantics, i.e. what is expected. I can think
of 2 different semantics:
  A) a uptr buffer object is a snapshot of memory at the time of
     uptr buffer object creation
  B) a uptr buffer object allows the GPU to access a range of virtual
     addresses of a process and share a coherent view of that range
     between CPU and GPU

As implemented it is more in line with B, but that is not defined
anywhere AFAICT.

Anyway, getting back to your questions: it kind of doesn't matter, as
you are using GUP, i.e. you are pinning pages, except for one scenario
(at least I can only think of one).

The problematic case is a race between a CPU write to a zero page or a
CoW page and the GPU driver doing a read-only GUP:

    CPU thread 1                       | CPU thread 2
    ---------------------------------------------------------------------
                                       |
                                       | uptr covering addr A read only
                                       | .... do stuff with A
    write fault to addr A              |
    invalidate_range_start([A, A+1])   | unbind_ttm -> unpin
                                       | validate bo -> GUP -> zero page
    lock page table                    |
    replace zero pfn/COW with new page |
    unlock page table                  |
    invalidate_range_end([A, A+1])     |

So here the GPU would be using the wrong page for the address. How bad
that is, is undefined, because the semantics of uptr are undefined. Given
how it has been used so far this race is unlikely (I don't think we have
many userspace users that use that feature and also fork).


So I would first define the semantics of the uptr bo and then fix the
code accordingly. Semantic A is easier to implement and you could just
drop the whole mmu_notifier. Maybe it is better to create a uptr buffer
object every time you want to snapshot a range of addresses. I don't
think the overhead of buffer creation would matter.


If you want to close the race for CoW and the zero page in the case of
read-only GUP, there is no other way than what KVM or ODP is doing; see
the sketch below. I had a patchset to simplify all this, but I need to
bring it back to life.
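
Roughly the shape of that pattern (all names here are made up for
illustration and modeled on KVM's mmu_notifier_seq/mmu_notifier_count;
this is not an existing amdgpu API):

	/* per-driver tracking of active invalidations */
	struct uptr_mn_state {
		spinlock_t lock;
		long invalidate_count;	/* range invalidations in flight */
		unsigned long seq;	/* bumped on every range_end() */
	};

	static void uptr_range_start(struct uptr_mn_state *s)
	{
		spin_lock(&s->lock);
		s->invalidate_count++;	/* block (re)validation */
		spin_unlock(&s->lock);
	}

	static void uptr_range_end(struct uptr_mn_state *s)
	{
		spin_lock(&s->lock);
		s->seq++;		/* anything sampled before is stale */
		s->invalidate_count--;
		spin_unlock(&s->lock);
	}

	/*
	 * Sample seq before get_user_pages(), then call this under the
	 * driver lock before publishing the pages to the GPU; if it
	 * returns false, drop the pages and redo the GUP.
	 */
	static bool uptr_still_valid(struct uptr_mn_state *s,
				     unsigned long seq)
	{
		bool valid;

		spin_lock(&s->lock);
		valid = !s->invalidate_count && s->seq == seq;
		spin_unlock(&s->lock);
		return valid;
	}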


Note that other things might race, but as you pin the pages they do
not matter. It just means that if you GUP after range_start() but
before range_end() and before the CPU page table update, then you pinned
the same old page again and nothing will happen (migrate will fail,
MADV_FREE will be a nop, ...). So you just did the range_start() callback
for nothing in those cases.

(Sorry for taking so long to answer, I forgot your mail yesterday with
 all the other discussion going on.)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
       [not found]       ` <20170831135953.GA9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-08-31 14:14         ` Christian König
  2017-08-31 18:39         ` Felix Kuehling
  1 sibling, 0 replies; 160+ messages in thread
From: Christian König @ 2017-08-31 14:14 UTC (permalink / raw)
  To: Jerome Glisse, Felix Kuehling
  Cc: intel-gfx, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 31.08.2017 um 15:59 schrieb Jerome Glisse:
> [Adding Intel folks as they might be interested in this discussion]
>
> On Wed, Aug 30, 2017 at 05:51:52PM -0400, Felix Kuehling wrote:
>> Hi Jérôme,
>>
>> I have some questions about the potential range-start-end race you
>> mentioned.
>>
>> On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
>>> Note that a lot of existing user feels broken in respect to range_start/
>>> range_end. Many user only have range_start() callback but there is nothing
>>> preventing them to undo what was invalidated in their range_start() callback
>>> after it returns but before any CPU page table update take place.
>>>
>>> The code pattern use in kvm or umem odp is an example on how to properly
>>> avoid such race. In a nutshell use some kind of sequence number and active
>>> range invalidation counter to block anything that might undo what the
>>> range_start() callback did.
>> What happens when we start monitoring an address range after
>> invaligate_range_start was called? Sounds like we have to keep track of
>> all active invalidations for just such a case, even in address ranges
>> that we don't currently care about.
>>
>> What are the things we cannot do between invalidate_range_start and
>> invalidate_range_end? amdgpu calls get_user_pages to re-validate our
>> userptr mappings after the invalidate_range_start notifier invalidated
>> it. Do we have to wait for invalidate_range_end before we can call
>> get_user_pages safely?
> Well the whole userptr bo object is somewhat broken from the start.
> You never defined the semantic of it ie what is expected. I can
> think of 2 differents semantics:
>    A) a uptr buffer object is a snapshot of a memory at the time of
>       uptr buffer object creation
>    B) a uptr buffer object allow GPU to access a range of virtual
>       address of a process an share coherent view of that range
>       between CPU and GPU
>
> As it was implemented it is more inline with B but it is not defined
> anywhere AFAICT.

Well it is not documented, but the userspace APIs built on top of that
require semantics B.

Essentially you could have cases where the GPU or the CPU is waiting in 
a busy loop for the other one to change some memory address.

> Anyway getting back to your questions, it kind of doesn't matter as
> you are using GUP ie you are pinning pages except for one scenario
> (at least i can only think of one).
>
> Problematic case is race between CPU write to zero page or COW and
> GPU driver doing read only GUP:
>
>      CPU thread 1                       | CPU thread 2
>      ---------------------------------------------------------------------
>                                         |
>                                         | uptr covering addr A read only
>                                         | .... do stuff with A
>      write fault to addr A              |
>      invalidate_range_start([A, A+1])   | unbind_ttm -> unpin
>                                         | validate bo -> GUP -> zero page
>      lock page table                    |
>      replace zero pfn/COW with new page |
>      unlock page table                  |
>      invalidate_range_end([A, A+1])     |
>
> So here the GPU would be using wrong page for the address. How bad
> is it is undefined as the semantic of uptr is undefine. Given how it
> as been use so far this race is unlikely (i don't think we have many
> userspace that use that feature and do fork).
>
>
> So i would first define the semantic of uptr bo and then i would fix
> accordingly the code. Semantic A is easier to implement and you could
> just drop the whole mmu_notifier. Maybe it is better to create uptr
> buffer object everytime you want to snapshot a range of address. I
> don't think the overhead of buffer creation would matter.

We do support creating userptrs without mmu_notifier for exactly that
purpose, e.g. uploads of snapshots of what a userspace address range
looked like at a certain moment.

Unfortunately we found that the overhead of buffer creation (and the
related gup) is way too high to be useful as a throw-away object. Just
doing a memcpy into a BO has less latency overall.

And yeah, at least I'm perfectly aware of the problems with fork() and
COW. BTW: it becomes really, really ugly if you think about what happens
when the parent writes to a page first and the GPU then ends up with the
child's copy.

Regards,
Christian.

> If you want to close the race for COW and zero page in case of read
> only GUP there is no other way than what KVM or ODP is doing. I had
> patchset to simplify all this but i need to bring it back to life.
>
>
> Note that other thing might race but as you pin the pages they do
> not matter. It just mean that if you GUP after range_start() but
> before range_end() and before CPU page table update then you pinned
> the same old page again and nothing will happen (migrate will fail,
> MADV_FREE will nop, ...). So you just did the range_start() callback
> for nothing in those cases.
>
> (Sorry for taking so long to answer i forgot your mail yesterday with
>   all the other discussion going on).
>
> Cheers,
> Jérôme


^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-31  0:47               ` Jerome Glisse
  (?)
@ 2017-08-31 17:12                 ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-08-31 17:12 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Nadav Amit, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton, Joerg Roedel, iommu

On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
> > For both CoW and KSM, the correctness is maintained by calling
> > ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
> > (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
> > will cause memory corruption, and page-lock would not be enough.
> 
> Just to add up, the IOMMU have its own CPU page table walker and it can
> walk the page table at any time (not the page table current to current
> CPU, IOMMU have an array that match a PASID with a page table and device
> request translation for a given virtual address against a PASID).
> 
> So this means the following can happen with ptep_clear_flush() only:
> 
>   CPU                          | IOMMU
>                                | - walk page table populate tlb at addr A
>   - clear pte at addr A        |
>   - set new pte                |
      mmu_notifier_invalidate_range_end | -flush IOMMU/device tlb

> 
> Device is using old page and CPU new page :(

That condition won't be persistent.

> 
> But with ptep_clear_flush_notify()
> 
>   CPU                          | IOMMU
>                                | - walk page table populate tlb at addr A
>   - clear pte at addr A        |
>   - notify -> invalidate_range | > flush IOMMU/device tlb
>   - set new pte                |
> 
> So now either the IOMMU see the empty pte and trigger a page fault (this is
> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
> before setting the new pte) or it see the new pte. Either way both IOMMU
> and CPU have a coherent view of what a virtual address points to.

Sure, the _notify version is obviously safe.

> Andrea explained to me the historical reasons set_pte_at_notify call the
> change_pte callback and it was intended so that KVM could update the
> secondary page table directly without having to fault. It is now a pointless
> optimization as the call to range_start() happening in all the places before
> any set_pte_at_notify() invalidate the secondary page table and thus will
> lead to page fault for the vm. I have talk with Andrea on way to bring back
> this optimization.

Yes, we have known for a long time; the optimization got basically disabled
when range_start/end expanded. It'd be nice to optimize change_pte
again but that is for later.

> Yes we need the following sequence for IOMMU:
>  - clear pte
>  - invalidate IOMMU/device TLB
>  - set new pte
> 
> Otherwise the IOMMU page table walker can populate IOMMU/device tlb with
> stall entry.
> 
> Note that this is not necessary for all the case. For try_to_unmap it
> is fine for instance to move the IOMMU tlb shoot down after changing the
> CPU page table as we are not pointing the pte to a different page. Either
> we clear the pte or we set a swap entry and as long as the page that use
> to be pointed by the pte is not free before the IOMMU tlb flush then we
> are fine.
> 
> In fact i think the only case where we need the above sequence (clear,
> flush secondary tlb, set new pte) is for COW. I think all other cases
> we can get rid of invalidate_range() from inside the page table lock
> and rely on invalidate_range_end() to call unconditionaly.

Even with CoW, it's not a big issue if the IOMMU keeps reading from the
old page for a little while longer (in between the PT lock release and
mmu_notifier_invalidate_range_end).

How can you tell that you read the old data a bit longer because the
IOMMU noticed the new page only when mmu_notifier_invalidate_range_end
ran, and not because the CPU was simply faster at writing than the IOMMU
was at loading the new pagetable?

I figure the only detectable effect would be that the CPU could see pageA
changing before pageB, while iommu-v2 could see pageB changing before
pageA. If that's a concern, it is the only good reason I can think of
right now for requiring ->invalidate_range inside the CoW PT lock to
enforce the same ordering. However, if the modifications happen
simultaneously and it's a correct runtime, the order must not matter; but
still it would be detectable, which may not be desirable.

Currently the _notify is absolutely needed to run ->invalidate_range
before put_page is run in the CoW code below. Because that put_page is
done in a place that doesn't scale, it would be better moved down
regardless of ->invalidate_range, to reduce the size of the PT-lock
protected critical section.

-	if (new_page)
-		put_page(new_page);

	pte_unmap_unlock(vmf->pte, vmf->ptl);
	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+	if (new_page)
+		put_page(new_page);

Of course the IOMMU will not immediately start reading from the new
page, but even if it triggers a write protect fault and calls
handle_mm_fault, it will find a writable pte already there and it'll
get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
runs, so it is sure to notice the new page.

Now write_protect_page in KSM...

What bad can happen if the iommu keeps writing to the write-protected
page for a little while longer? As long as nothing writes to the page
anymore by the time write_protect_page() returns, pages_identical will
work. How do you know the IOMMU was just a bit faster and wrote a few
more bytes, and that it wasn't mmu_notifier_invalidate_range_end that
ran a bit later after dropping the PT lock?

Now replace_page in KSM...

	ptep_clear_flush_notify(vma, addr, ptep);
	set_pte_at_notify(mm, addr, ptep, newpte);

	page_remove_rmap(page, false);
	if (!page_mapped(page))
		try_to_free_swap(page);
-	put_page(page);

	pte_unmap_unlock(ptep, ptl);
	err = 0;
out_mn:
	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
out:
+	put_page(page); /* TODO: broken of course, fix error cases */
	return err;
}

If we free the old page after mmu_notifier_invalidate_range_end
(fixing up the error case; the above change ignores the error paths),
the contents of the old and new page are identical for replace_page.

Even if the new page immediately takes a COW, how do you know whether
the IOMMU was just a bit slower and read the old page content pre-COW?
The two pages are guaranteed identical and both readonly already at
that point.

All the above considered, if this is getting way too complex, it may
be preferable to keep things obviously safe: always run
mmu_notifier_invalidate_range inside the PT lock and be done with it,
and remove the ->invalidate_range from
mmu_notifier_invalidate_range_end too, to avoid the double invalidates
for the secondary MMUs with hardware pagetable walkers and shared
pagetables with the primary MMU.

In principle the primary reason for doing _notify or an explicit
mmu_notifier_invalidate_range() is to keep things simpler and to avoid
having to care exactly where pages get freed (i.e. before or after
mmu_notifier_invalidate_range_end).

For example zap_page_range tlb gather freeing strictly requires an
explicit mmu_notifier_invalidate_range before the page is actually
freed (because the tlb gather will free the pages well before
mmu_notifier_invalidate_range_end can run).

The concern that an ->invalidate_range is always needed before the PT
lock is released, if the primary TLB was flushed inside the PT lock, is
a more recent one, and it looks to me like it's not always needed but
perhaps only in some cases.

An example where the ->invalidate_range inside
mmu_notifier_invalidate_range_end pays off is madvise_free_pte_range.
That doesn't flush the TLB before marking the pagetable entry clean. So
the primary MMU can still write through the dirty primary TLB without
setting the dirty/accessed bit after madvise_free_pte_range returns.

			/* clear the pte; no primary TLB flush here */
			ptent = ptep_get_and_clear_full(mm, addr, pte,
							tlb->fullmm);

			/* re-install it with accessed/dirty bits cleared */
			ptent = pte_mkold(ptent);
			ptent = pte_mkclean(ptent);
			set_pte_at(mm, addr, pte, ptent);
			/* the TLB flush is deferred to the tlb gather */
			tlb_remove_tlb_entry(tlb, pte, addr);

Not even the primary TLB is flushed here. All concurrent writes of the
primary MMU can still get lost while MADV_FREE runs. All that is
guaranteed is that after the madvise MADV_FREE syscall returns to
userland, the new writes will stick, so it's also enough to call
->invalidate_range inside the single
mmu_notifier_invalidate_range_end:

	madvise_free_page_range(&tlb, vma, start, end);
	mmu_notifier_invalidate_range_end(mm, start, end);
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is why we have both _notify and mmu_notifier_invalidate_range_end.
If we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
we'll have to add an explicit mmu_notifier_invalidate_range in places
like the above (just before or just after the
mmu_notifier_invalidate_range_end call), as sketched below.
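
A minimal sketch of what that would then look like for the madvise_free
path (assuming ->invalidate_range were dropped from _end(); not an
actual patch):

	madvise_free_page_range(&tlb, vma, start, end);
	/* explicit secondary-TLB flush, since _end() would no longer do it */
	mmu_notifier_invalidate_range(mm, start, end);
	mmu_notifier_invalidate_range_end(mm, start, end);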

So with ptep_clear_flush_notify, which avoids any issue with page
freeing in places like CoW, and the explicit
mmu_notifier_invalidate_range in the tlb gather, the rest gets covered
automatically by mmu_notifier_invalidate_range_end. And again, this
only started to be needed when we added support for hardware pagetable
walkers that cannot stop the pagetable walking (unless they break the
sharing of the pagetable with the primary MMU, which of course is not
desirable and would cause unnecessary overhead).

The current ->invalidate_range handling however results in double
calls here and there when armed, but it reduces the number of explicit
hooks required in the common code and it keeps the mmu_notifier code
less intrusive and more optimal when disarmed (but less optimal when
armed). So the current state is a reasonable tradeoff, but there's
room for optimization.

> I might ponder all this for more mmu_notifier cleanup, as I have some
> optimizations lined up for it, but this is next-cycle material. For
> 4.13 I believe the current patchset is the safest way to go.

Yes, lots of material above for pondering, but the current status is
fine.

Did you look into reducing the range flushed in try_to_unmap_one to
PAGE_SIZE << compound_order(page)? That looked like a straightforward
optimization that won't add any complexity, and it would avoid
worsening the granularity of the invalidates during try_to_unmap
compared to the previous code.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-30 23:01         ` Andrea Arcangeli
@ 2017-08-31 18:25           ` Jerome Glisse
  -1 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-31 18:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Dan Williams, Ross Zwisler, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

On Thu, Aug 31, 2017 at 01:01:25AM +0200, Andrea Arcangeli wrote:
> On Wed, Aug 30, 2017 at 02:53:38PM -0700, Linus Torvalds wrote:
> > On Wed, Aug 30, 2017 at 9:52 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > >
> > > I pointed out in earlier email ->invalidate_range can only be
> > > implemented (as mutually exclusive alternative to
> > > ->invalidate_range_start/end) by secondary MMUs that shares the very
> > > same pagetables with the core linux VM of the primary MMU, and those
> > > invalidate_range are already called by
> > > __mmu_notifier_invalidate_range_end.
> > 
> > I have to admit that I didn't notice that fact - that we are already
> > in the situation that
> > invalidate_range is called by the range_end() notifier.
> > 
> > I agree that that should simplify all the code, and means that we
> > don't have to worry about the few cases that already implemented only
> > the "invalidate_page()" and "invalidate_range()" cases.
> > 
> > So I think that simplifies Jérôme's patch further - once you have put
> > the range_start/end() cases around the inner loop, you can just drop
> > the invalidate_page() things entirely.
> > 
> > > So this conversion from invalidate_page to invalidate_range looks
> > > superfluous and the final mmu_notifier_invalidate_range_end should be
> > > enough.
> > 
> > Yes. I missed the fact that we already called range() from range_end().
> > 
> > That said, the double call shouldn't hurt correctness, and it's
> > "closer" to old behavior for those people who only did the range/page
> > ones, so I wonder if we can keep Jérôme's patch in its current state
> > for 4.13.
> 
> Yes, the double call doesn't hurt correctness. Keeping it in current
> state is safer if something, so I've no objection to it other than I'd
> like to optimize it further if possible, but it can be done later.
> 
> We're already running the double call in various fast paths too in
> fact, and rmap walk isn't the fastest path that would be doing such
> double call, so it's not a major concern.
> 
> Also not a bug, but one further (but more obviously safe) enhancement
> I would like is to restrict those rmap invalidation ranges to
> PAGE_SIZE << compound_order(page) instead of PMD_SIZE/PMD_MASK.
> 
> +	/*
> +	 * We have to assume the worse case ie pmd for invalidation. Note that
> +	 * the page can not be free in this function as call of try_to_unmap()
> +	 * must hold a reference on the page.
> +	 */
> +	end = min(vma->vm_end, (start & PMD_MASK) + PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> 
> We don't need to invalidate 2MB of secondary MMU mappings surrounding
> a 4KB page, just to swapout a 4k page. split_huge_page can't run while
> holding the rmap locks, so compound_order(page) is safe to use there.
> 
> It can also be optimized incrementally later.

This optimization is safe, I believe. Linus, I can respin with that and
with further kvm dead code removal.
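
For reference, a minimal sketch of the narrowed invalidation (the exact
respin may differ; only the range computation changes):

	/*
	 * Invalidate only the span actually covered by the page being
	 * unmapped instead of a worst-case pmd-sized range.
	 */
	end = min(vma->vm_end,
		  start + (PAGE_SIZE << compound_order(page)));
	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);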

Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
       [not found]       ` <20170831135953.GA9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2017-08-31 14:14         ` Christian König
@ 2017-08-31 18:39         ` Felix Kuehling
  2017-08-31 19:00           ` Jerome Glisse
  1 sibling, 1 reply; 160+ messages in thread
From: Felix Kuehling @ 2017-08-31 18:39 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: intel-gfx, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2017-08-31 09:59 AM, Jerome Glisse wrote:
> [Adding Intel folks as they might be interested in this discussion]
>
> On Wed, Aug 30, 2017 at 05:51:52PM -0400, Felix Kuehling wrote:
>> Hi Jérôme,
>>
>> I have some questions about the potential range-start-end race you
>> mentioned.
>>
>> On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
>>> Note that a lot of existing user feels broken in respect to range_start/
>>> range_end. Many user only have range_start() callback but there is nothing
>>> preventing them to undo what was invalidated in their range_start() callback
>>> after it returns but before any CPU page table update take place.
>>>
>>> The code pattern use in kvm or umem odp is an example on how to properly
>>> avoid such race. In a nutshell use some kind of sequence number and active
>>> range invalidation counter to block anything that might undo what the
>>> range_start() callback did.
>> What happens when we start monitoring an address range after
>> invaligate_range_start was called? Sounds like we have to keep track of
>> all active invalidations for just such a case, even in address ranges
>> that we don't currently care about.
>>
>> What are the things we cannot do between invalidate_range_start and
>> invalidate_range_end? amdgpu calls get_user_pages to re-validate our
>> userptr mappings after the invalidate_range_start notifier invalidated
>> it. Do we have to wait for invalidate_range_end before we can call
>> get_user_pages safely?
> Well, the whole userptr bo object is somewhat broken from the start.
> You never defined its semantics, i.e. what is expected. I can
> think of 2 different semantics:
>   A) a uptr buffer object is a snapshot of memory at the time of
>      uptr buffer object creation
>   B) a uptr buffer object allows the GPU to access a range of virtual
>      addresses of a process and share a coherent view of that range
>      between CPU and GPU
>
> As it was implemented it is more in line with B, but it is not defined
> anywhere AFAICT.

Yes, we're trying to achieve B, that's why we have an MMU notifier in
the first place. But it's been a struggle getting it to work properly,
and we're still dealing with some locking issues and now this one.

>
> Anyway getting back to your questions, it kind of doesn't matter as
> you are using GUP ie you are pinning pages except for one scenario
> (at least i can only think of one).
>
> Problematic case is race between CPU write to zero page or COW and
> GPU driver doing read only GUP:
[...]

Thanks, I was aware of COW but not of the zero-page case. I believe in
most practical cases our userptr mappings are read-write, so this is
probably not causing us any real trouble at the moment.

> So i would first define the semantic of uptr bo and then i would fix
> accordingly the code. Semantic A is easier to implement and you could
> just drop the whole mmu_notifier. Maybe it is better to create uptr
> buffer object everytime you want to snapshot a range of address. I
> don't think the overhead of buffer creation would matter.

That doesn't work for KFD and our compute memory model where CPU and GPU
expect to share the same address space.

>
> If you want to close the race for COW and zero page in case of read
> only GUP there is no other way than what KVM or ODP is doing. I had
> patchset to simplify all this but i need to bring it back to life.

OK. I'll look at these to get an idea.

> Note that other thing might race but as you pin the pages they do
> not matter. It just mean that if you GUP after range_start() but
> before range_end() and before CPU page table update then you pinned
> the same old page again and nothing will happen (migrate will fail,
> MADV_FREE will nop, ...). So you just did the range_start() callback
> for nothing in those cases.

We pin the memory because the GPU wants to access it. So this is
probably OK if the migration fails. However, your statement that we
"just did the range_start() callback for nothing" implies that we could
as well have ignored the range_start callback. But I don't think that's
true. That way we'd keep a page pinned that is no longer mapped in the
CPU address space. So the GPU mapping would be out of sync with the CPU
mapping.

>
> (Sorry for taking so long to answer i forgot your mail yesterday with
>  all the other discussion going on).

Thanks. Your reply helps. I'll take a look at KVM or ODP and see how
this is applicable to our driver.

Regards,
  Felix

>
> Cheers,
> Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-31 18:39         ` Felix Kuehling
@ 2017-08-31 19:00           ` Jerome Glisse
       [not found]             ` <20170831190021.GG9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 160+ messages in thread
From: Jerome Glisse @ 2017-08-31 19:00 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: intel-gfx, Christian König, amd-gfx

On Thu, Aug 31, 2017 at 02:39:17PM -0400, Felix Kuehling wrote:
> On 2017-08-31 09:59 AM, Jerome Glisse wrote:
> > [Adding Intel folks as they might be interested in this discussion]
> >
> > On Wed, Aug 30, 2017 at 05:51:52PM -0400, Felix Kuehling wrote:
> >> Hi Jérôme,
> >>
> >> I have some questions about the potential range-start-end race you
> >> mentioned.
> >>
> >> On 2017-08-29 07:54 PM, Jérôme Glisse wrote:
> >>> Note that a lot of existing user feels broken in respect to range_start/
> >>> range_end. Many user only have range_start() callback but there is nothing
> >>> preventing them to undo what was invalidated in their range_start() callback
> >>> after it returns but before any CPU page table update take place.
> >>>
> >>> The code pattern use in kvm or umem odp is an example on how to properly
> >>> avoid such race. In a nutshell use some kind of sequence number and active
> >>> range invalidation counter to block anything that might undo what the
> >>> range_start() callback did.
> >> What happens when we start monitoring an address range after
> >> invaligate_range_start was called? Sounds like we have to keep track of
> >> all active invalidations for just such a case, even in address ranges
> >> that we don't currently care about.
> >>
> >> What are the things we cannot do between invalidate_range_start and
> >> invalidate_range_end? amdgpu calls get_user_pages to re-validate our
> >> userptr mappings after the invalidate_range_start notifier invalidated
> >> it. Do we have to wait for invalidate_range_end before we can call
> >> get_user_pages safely?
> > Well the whole userptr bo object is somewhat broken from the start.
> > You never defined the semantic of it ie what is expected. I can
> > think of 2 differents semantics:
> >   A) a uptr buffer object is a snapshot of a memory at the time of
> >      uptr buffer object creation
> >   B) a uptr buffer object allow GPU to access a range of virtual
> >      address of a process an share coherent view of that range
> >      between CPU and GPU
> >
> > As it was implemented it is more inline with B but it is not defined
> > anywhere AFAICT.
> 
> Yes, we're trying to achieve B, that's why we have an MMU notifier in
> the first place. But it's been a struggle getting it to work properly,
> and we're still dealing with some locking issues and now this one.
> 
> >
> > Anyway getting back to your questions, it kind of doesn't matter as
> > you are using GUP ie you are pinning pages except for one scenario
> > (at least i can only think of one).
> >
> > Problematic case is race between CPU write to zero page or COW and
> > GPU driver doing read only GUP:
> [...]
> 
> Thanks, I was aware of COW but not of the zero-page case. I believe in
> most practical cases our userptr mappings are read-write, so this is
> probably not causing us any real trouble at the moment.
> 
> > So i would first define the semantic of uptr bo and then i would fix
> > accordingly the code. Semantic A is easier to implement and you could
> > just drop the whole mmu_notifier. Maybe it is better to create uptr
> > buffer object everytime you want to snapshot a range of address. I
> > don't think the overhead of buffer creation would matter.
> 
> That doesn't work for KFD and our compute memory model where CPU and GPU
> expect to share the same address space.
> 
> >
> > If you want to close the race for COW and zero page in case of read
> > only GUP there is no other way than what KVM or ODP is doing. I had
> > patchset to simplify all this but i need to bring it back to life.
> 
> OK. I'll look at these to get an idea.
> 
> > Note that other thing might race but as you pin the pages they do
> > not matter. It just mean that if you GUP after range_start() but
> > before range_end() and before CPU page table update then you pinned
> > the same old page again and nothing will happen (migrate will fail,
> > MADV_FREE will nop, ...). So you just did the range_start() callback
> > for nothing in those cases.
> 
> We pin the memory because the GPU wants to access it. So this is
> probably OK if the migration fails. However, your statement that we
> "just did the range_start() callback for nothing" implies that we could
> as well have ignored the range_start callback. But I don't think that's
> true. That way we'd keep a page pinned that is no longer mapped in the
> CPU address space. So the GPU mapping would be out of sync with the CPU
> mapping.

Here is an example of "for nothing":
    CPU0                                CPU1
    > migrate page at addr A
      > invalidate_start addr A
        > unbind_ttm(for addr A)
                                        > use ttm object for addr A
                                        > GUP addr A
      > page table update
      > invalidate_end addr A
      > refcount check fails because
        of GUP
      > restore page table to A

This is what I meant by range_start() for nothing: on CPU0 you invalidated
the ttm object for nothing, but this is because of a race. Fixing the COW
race would also fix this race and avoid the migration failing because of
the GPU GUP.

I was not saying you should not use mmu_notifier. For achieving B you need
mmu_notifier. Note that if you do it like ODP/KVM then you do not need to
pin pages. A rough sketch of that pattern is below.
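
A minimal sketch of the sequence-count pattern (hypothetical driver-side
structure and function names, not the actual KVM/ODP code):

	#include <linux/spinlock.h>

	struct uptr_mirror {
		spinlock_t	lock;
		unsigned long	notifier_seq;	/* bumped by range_end() */
		unsigned int	notifier_count;	/* range_start()s in flight */
	};

	static void mirror_range_start(struct uptr_mirror *m)
	{
		spin_lock(&m->lock);
		m->notifier_count++;
		spin_unlock(&m->lock);
		/* ... tear down GPU mappings covering the range ... */
	}

	static void mirror_range_end(struct uptr_mirror *m)
	{
		spin_lock(&m->lock);
		m->notifier_seq++;
		m->notifier_count--;
		spin_unlock(&m->lock);
	}

	/*
	 * Sample notifier_seq before looking up the pages; right before
	 * installing the GPU mappings, retry if an invalidation completed
	 * since then or is still in flight.
	 */
	static bool mirror_needs_retry(struct uptr_mirror *m, unsigned long seq)
	{
		bool retry;

		spin_lock(&m->lock);
		retry = m->notifier_count || m->notifier_seq != seq;
		spin_unlock(&m->lock);
		return retry;
	}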

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-31 17:12                 ` Andrea Arcangeli
@ 2017-08-31 19:15                   ` Nadav Amit
  -1 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-31 19:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jerome Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton, Joerg Roedel, iommu

Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
>> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
>>> For both CoW and KSM, the correctness is maintained by calling
>>> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
>>> (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
>>> will cause memory corruption, and page-lock would not be enough.
>> 
>> Just to add up, the IOMMU have its own CPU page table walker and it can
>> walk the page table at any time (not the page table current to current
>> CPU, IOMMU have an array that match a PASID with a page table and device
>> request translation for a given virtual address against a PASID).
>> 
>> So this means the following can happen with ptep_clear_flush() only:
>> 
>>  CPU                          | IOMMU
>>                               | - walk page table populate tlb at addr A
>>  - clear pte at addr A        |
>>  - set new pte                |
>      mmu_notifier_invalidate_range_end | -flush IOMMU/device tlb
> 
>> Device is using old page and CPU new page :(
> 
> That condition won't be persistent.
> 
>> But with ptep_clear_flush_notify()
>> 
>>  CPU                          | IOMMU
>>                               | - walk page table populate tlb at addr A
>>  - clear pte at addr A        |
>>  - notify -> invalidate_range | > flush IOMMU/device tlb
>>  - set new pte                |
>> 
>> So now either the IOMMU see the empty pte and trigger a page fault (this is
>> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
>> before setting the new pte) or it see the new pte. Either way both IOMMU
>> and CPU have a coherent view of what a virtual address points to.
> 
> Sure, the _notify version is obviously safe.
> 
>> Andrea explained to me the historical reasons set_pte_at_notify call the
>> change_pte callback and it was intended so that KVM could update the
>> secondary page table directly without having to fault. It is now a pointless
>> optimization as the call to range_start() happening in all the places before
>> any set_pte_at_notify() invalidate the secondary page table and thus will
>> lead to page fault for the vm. I have talk with Andrea on way to bring back
>> this optimization.
> 
> Yes, we known for a long time, the optimization got basically disabled
> when range_start/end expanded. It'd be nice to optimize change_pte
> again but this is for later.
> 
>> Yes we need the following sequence for IOMMU:
>> - clear pte
>> - invalidate IOMMU/device TLB
>> - set new pte
>> 
>> Otherwise the IOMMU page table walker can populate IOMMU/device tlb with
>> stall entry.
>> 
>> Note that this is not necessary for all the case. For try_to_unmap it
>> is fine for instance to move the IOMMU tlb shoot down after changing the
>> CPU page table as we are not pointing the pte to a different page. Either
>> we clear the pte or we set a swap entry and as long as the page that use
>> to be pointed by the pte is not free before the IOMMU tlb flush then we
>> are fine.
>> 
>> In fact i think the only case where we need the above sequence (clear,
>> flush secondary tlb, set new pte) is for COW. I think all other cases
>> we can get rid of invalidate_range() from inside the page table lock
>> and rely on invalidate_range_end() to call unconditionaly.
> 
> Even with CoW, it's not big issue if the IOMMU keeps reading from the
> old page for a little while longer (in between PT lock release and
> mmu_notifier_invalidate_range_end).
> 
> How can you tell you read the old data a bit longer, because it
> noticed the new page only when mmu_notifier_invalidate_range_end run,
> and not because the CPU was faster at writing than the IOMMU was fast
> at loading the new pagetable?
> 
> I figure it would be detectable only that the CPU could see pageA
> changing before pageB. The iommu-v2 could see pageB changing before
> pageA. If that's a concern that is the only good reason I can think of
> right now, for requiring ->invalidate_page inside the CoW PT lock to
> enforce the same ordering. However if the modifications happens
> simultaneously and it's a correct runtime, the order must not matter,
> but still it would be detectable which may not be desirable.

I don’t know what memory model SVM provides, but what you describe here
potentially violates it. I don’t think user software should be expected
to deal with that.

> 
> Currently the _notify is absolutely needed to run ->invalidate_range
> before put_page is run in the CoW code below, because of the put_page
> that is done in a not scalable place, it would be better moved down
> regardless of ->invalidate_range to reduce the size of the PT lock
> protected critical section.
> 
> -	if (new_page)
> -		put_page(new_page);
> 
> 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	if (new_page)
> +		put_page(new_page);
> 
> Of course the iommu will not immediately start reading from the new
> page, but even if it triggers a write protect fault and calls
> handle_mm_fault, it will find a writeable pte already there and it'll
> get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
> runs so it can sure notice the new page.
> 
> Now write_protect_page in KSM...
> 
> What bad can happen if iommu keeps writing to the write protected
> page, for a little while longer? As long as nothing writes to the page
> anymore by the time write_protect_page() returns, pages_identical will
> work. How do you know the IOMMU was just a bit faster and wrote a few
> more bytes and it wasn't mmu_notifier_invalidate_range_end  that run a
> bit later after dropping the PT lock?

I guess it should be ok in this case.

> 
> Now replace_page in KSM...
> 
> 	ptep_clear_flush_notify(vma, addr, ptep);
> 	set_pte_at_notify(mm, addr, ptep, newpte);
> 
> 	page_remove_rmap(page, false);
> 	if (!page_mapped(page))
> 		try_to_free_swap(page);
> -	put_page(page);
> 
> 	pte_unmap_unlock(ptep, ptl);
> 	err = 0;
> out_mn:
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> out:
> +	put_page(page); /* TODO: broken of course, fix error cases */
> 	return err;
> }
> 
> If we free the old page after mmu_notifier_invalidate_range_end
> (fixing up the error case, the above change ignores the error paths),
> the content of the old and new page are identical for replace_page.
> 
> Even if new page takes immediately a COW, how do you know the IOMMU
> was just a bit slower and read the old page content pre-COW? They're
> guaranteed identical and both readonly already at that point.

A potential problem might be that it requires that there be no
promotion of PTE access-rights (e.g., RO->RW), since in these cases there
would be no invalidation. I did not find such a code-path, but I might have
missed one, or a new one could be introduced in the future. For example,
someone may optimize KSM not to COW the last reference of a KSM'd page upon
page-fault.

Therefore, such optimizations may require some sort of a checker, or at the
very least clear documentation of when an invalidation is required,
especially since they would require invalidation on access-rights
promotion, which is currently not needed.
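
A simplistic illustration of the kind of checker I mean (hypothetical,
not existing kernel code; it would merely flag in-place RO->RW
promotions for audit):

	/*
	 * Hypothetical debug helper: flag any in-place promotion of a
	 * read-only pte to writable, since a shared secondary TLB would
	 * not be invalidated on such a transition.
	 */
	static void check_pte_promotion(pte_t oldpte, pte_t newpte)
	{
		WARN_ON_ONCE(pte_present(oldpte) && pte_present(newpte) &&
			     !pte_write(oldpte) && pte_write(newpte));
	}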

> All the above considered, if this is getting way too complex, it may
> be preferable to keep things obviously safe and always run
* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
@ 2017-08-31 19:15                   ` Nadav Amit
  0 siblings, 0 replies; 160+ messages in thread
From: Nadav Amit @ 2017-08-31 19:15 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Jerome Glisse, Linux Kernel Mailing List,
	open list:MEMORY MANAGEMENT, Dan Williams, Ross Zwisler,
	Linus Torvalds, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Mike Galbraith,
	Kirill A . Shutemov, axie, Andrew Morton, Joerg Roedel, iommu

Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Wed, Aug 30, 2017 at 08:47:19PM -0400, Jerome Glisse wrote:
>> On Wed, Aug 30, 2017 at 04:25:54PM -0700, Nadav Amit wrote:
>>> For both CoW and KSM, the correctness is maintained by calling
>>> ptep_clear_flush_notify(). If you defer the secondary MMU invalidation
>>> (i.e., replacing ptep_clear_flush_notify() with ptep_clear_flush() ), you
>>> will cause memory corruption, and page-lock would not be enough.
>> 
>> Just to add up, the IOMMU has its own CPU page table walker and it can
>> walk the page table at any time (not just the page table currently active
>> on the CPU; the IOMMU has an array that matches a PASID with a page table,
>> and devices request translation for a given virtual address against a PASID).
>> 
>> So this means the following can happen with ptep_clear_flush() only:
>> 
>>  CPU                          | IOMMU
>>                               | - walk page table populate tlb at addr A
>>  - clear pte at addr A        |
>>  - set new pte                |
>      mmu_notifier_invalidate_range_end | -flush IOMMU/device tlb
> 
>> Device is using old page and CPU new page :(
> 
> That condition won't be persistent.
> 
>> But with ptep_clear_flush_notify()
>> 
>>  CPU                          | IOMMU
>>                               | - walk page table populate tlb at addr A
>>  - clear pte at addr A        |
>>  - notify -> invalidate_range | > flush IOMMU/device tlb
>>  - set new pte                |
>> 
>> So now either the IOMMU sees the empty pte and triggers a page fault (this is
>> if there is a racing IOMMU ATS right after the IOMMU/device tlb flush but
>> before setting the new pte) or it sees the new pte. Either way both IOMMU
>> and CPU have a coherent view of what a virtual address points to.
> 
> Sure, the _notify version is obviously safe.
> 
>> Andrea explained to me the historical reason set_pte_at_notify() calls the
>> change_pte callback: it was intended so that KVM could update the
>> secondary page table directly without having to fault. It is now a pointless
>> optimization, as the call to range_start() that happens in all the places before
>> any set_pte_at_notify() invalidates the secondary page table and thus will
>> lead to a page fault for the VM. I have talked with Andrea about ways to
>> bring back this optimization.
> 
> Yes, we known for a long time, the optimization got basically disabled
> when range_start/end expanded. It'd be nice to optimize change_pte
> again but this is for later.
> 
>> Yes we need the following sequence for IOMMU:
>> - clear pte
>> - invalidate IOMMU/device TLB
>> - set new pte
>> 
>> Otherwise the IOMMU page table walker can populate the IOMMU/device tlb with
>> a stale entry.
>> 
>> Note that this is not necessary in all cases. For try_to_unmap, for
>> instance, it is fine to move the IOMMU tlb shootdown after changing the
>> CPU page table, as we are not pointing the pte to a different page. Either
>> we clear the pte or we set a swap entry, and as long as the page that used
>> to be pointed to by the pte is not freed before the IOMMU tlb flush, we
>> are fine.
>> 
>> In fact i think the only case where we need the above sequence (clear,
>> flush secondary tlb, set new pte) is for COW. I think in all other cases
>> we can get rid of invalidate_range() from inside the page table lock
>> and rely on invalidate_range_end() being called unconditionally.
> 
> Even with CoW, it's not big issue if the IOMMU keeps reading from the
> old page for a little while longer (in between PT lock release and
> mmu_notifier_invalidate_range_end).
> 
> How can you tell whether you read the old data a bit longer because the
> IOMMU noticed the new page only when mmu_notifier_invalidate_range_end ran,
> and not because the CPU was faster at writing than the IOMMU was at
> loading the new pagetable?
> 
> I figure it would be detectable only in that the CPU could see pageA
> changing before pageB, while the iommu-v2 could see pageB changing before
> pageA. If that's a concern, it is the only good reason I can think of
> right now for requiring ->invalidate_page inside the CoW PT lock to
> enforce the same ordering. However if the modifications happen
> simultaneously and it's a correct runtime, the order must not matter,
> but still it would be detectable, which may not be desirable.

I don’t know what memory model SVM provides, but what you describe here
potentially violates it. I don’t think user software should be expected to
deal with that.
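
For reference, the safe ordering Jerome lists above -- clear the pte,
invalidate the IOMMU/device TLB, then install the new pte -- maps to
roughly the following pattern in the CoW path. This is only a sketch
(assuming the usual locals ptl, ptep, new_page from that path), not the
actual mm/memory.c code:

	mmu_notifier_invalidate_range_start(mm, start, end);
	...
	spin_lock(ptl);
	entry = ptep_clear_flush_notify(vma, addr, ptep);
	/* clears the pte, flushes the primary TLB and calls
	 * ->invalidate_range(), shooting down the IOMMU/device TLB
	 * before the new pte becomes visible */
	set_pte_at_notify(mm, addr, ptep, mk_pte(new_page, vma->vm_page_prot));
	spin_unlock(ptl);
	mmu_notifier_invalidate_range_end(mm, start, end);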

> 
> Currently the _notify is absolutely needed to run ->invalidate_range
> before put_page is run in the CoW code below. Because the put_page
> is done in a not very scalable place, it would be better moved down
> regardless of ->invalidate_range, to reduce the size of the PT-lock
> protected critical section.
> 
> -	if (new_page)
> -		put_page(new_page);
> 
> 	pte_unmap_unlock(vmf->pte, vmf->ptl);
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> +	if (new_page)
> +		put_page(new_page);
> 
> Of course the iommu will not immediately start reading from the new
> page, but even if it triggers a write protect fault and calls
> handle_mm_fault, it will find a writeable pte already there and it'll
> get an ->invalidate_range as soon as mmu_notifier_invalidate_range_end
> runs, so it is sure to notice the new page.
> 
> Now write_protect_page in KSM...
> 
> What bad can happen if the iommu keeps writing to the write-protected
> page for a little while longer? As long as nothing writes to the page
> anymore by the time write_protect_page() returns, pages_identical will
> work. How do you know the IOMMU was just a bit faster and wrote a few
> more bytes, and it wasn't mmu_notifier_invalidate_range_end that ran a
> bit later after dropping the PT lock?

I guess it should be ok in this case.

> 
> Now replace_page in KSM...
> 
> 	ptep_clear_flush_notify(vma, addr, ptep);
> 	set_pte_at_notify(mm, addr, ptep, newpte);
> 
> 	page_remove_rmap(page, false);
> 	if (!page_mapped(page))
> 		try_to_free_swap(page);
> -	put_page(page);
> 
> 	pte_unmap_unlock(ptep, ptl);
> 	err = 0;
> out_mn:
> 	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
> out:
> +	put_page(page); /* TODO: broken of course, fix error cases */
> 	return err;
> }
> 
> If we free the old page after mmu_notifier_invalidate_range_end
> (fixing up the error case; the above change ignores the error paths),
> the contents of the old and new page are identical for replace_page.
> 
> Even if the new page immediately takes a COW, how do you know the IOMMU
> was just a bit slower and read the old page content pre-COW? They're
> guaranteed identical and both readonly already at that point.

A potential problem might be that it requires that there would be no
promotion of PTE access-rights (e.g., RO->RW), since in these cases there
would be no invalidation. I did not find such a code-path, but I might have
missed one, or a new one could be introduced in the future. For example,
someone may optimize KSM not to COW the last reference of a KSM’d page upon
page-fault.

Therefore, such optimizations may require some sort of checker, or at the
very least clear documentation of when an invalidation is required, especially
since it would require invalidation on access-rights promotion, which is
currently not needed.
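
To make that concrete, here is a purely hypothetical sketch of such a
checker; mmu_notifier_range_was_invalidated() does not exist, it only
stands in for whatever bookkeeping the checker would need:

	/* Hypothetical debug check: warn if a pte is promoted (e.g. RO->RW)
	 * without a secondary-MMU invalidate having covered that address. */
	static inline void check_pte_promotion(struct mm_struct *mm,
					       unsigned long addr,
					       pte_t old, pte_t new)
	{
		if (!pte_write(old) && pte_write(new))
			WARN_ONCE(!mmu_notifier_range_was_invalidated(mm, addr),
				  "pte promotion without secondary MMU invalidate\n");
	}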

> All the above considered, if this is getting way too complex, it may
> be preferable to keep things obviously safe and always run
> mmu_notifier_invalidate_range inside the PT lock and be done with it,
> and remove the ->invalidate_range from
> mmu_notifier_invalidate_range_end too to avoid the double invalidates
> for the secondary MMUs with hardware pagetable walkers and shared
> pagetables with the primary MMU.
> 
> In principle the primary reason for doing _notify or explicit
> mmu_notifier_invalidate_range() is to keep things simpler and to avoid
> having to care where pages exactly get freed (i.e. before or after
> mmu_notifier_invalidate_range_end).
> 
> For example zap_page_range tlb gather freeing strictly requires an
> explicit mmu_notifier_invalidate_range before the page is actually
> freed (because the tlb gather will free the pages well before
> mmu_notifier_invalidate_range_end can run).
> 
> The concern that an ->invalidate_range is always needed before the PT lock
> is released (if the primary TLB was flushed inside the PT lock) is a more
> recent concern, and it looks to me like it's not always needed but
> perhaps only in some cases.
> 
> An example where the ->invalidate_range inside
> mmu_notifier_invalidate_range_end pays off, is madvise_free_pte_range.
> That doesn't flush the TLB before setting the pagetable clean. So the
> primary MMU can still write through the dirty primary TLB without
> setting the dirty/accessed bit after madvise_free_pte_range returns.
> 
> 			ptent = ptep_get_and_clear_full(mm, addr, pte,
> 							tlb->fullmm);
> 
> 			ptent = pte_mkold(ptent);
> 			ptent = pte_mkclean(ptent);
> 			set_pte_at(mm, addr, pte, ptent);
> 			tlb_remove_tlb_entry(tlb, pte, addr);
> 
> Not even the primary TLB is flushed here. All concurrent writes of the
> primary MMU can still get lost while MADV_FREE runs. All that is
> guaranteed is that after madvise MADV_FREE syscall returns to
> userland, the new writes will stick, so then it's also enough to call
> ->invalidate_range inside the single
> mmu_notifier_invalidate_range_end:
> 
> 	madvise_free_page_range(&tlb, vma, start, end);
> 	mmu_notifier_invalidate_range_end(mm, start, end);
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> This is why we got both _notify and mmu_notifier_invalidate_range_end. If
> we remove ->invalidate_range from mmu_notifier_invalidate_range_end,
> we'll have to add a mmu_notifier_invalidate_range in places like above
> (just before or just after mmu_notifier_invalidate_range_end above).
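
[ For illustration only: if ->invalidate_range were dropped from
  mmu_notifier_invalidate_range_end, the madvise_free case above would
  need an explicit call next to it, roughly (a sketch, not a proposed
  patch):

 	madvise_free_page_range(&tlb, vma, start, end);
+	mmu_notifier_invalidate_range(mm, start, end);
 	mmu_notifier_invalidate_range_end(mm, start, end);
  ]
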
> 
> So with ptep_clear_flush_notify that avoids any issue with page
> freeing in places like CoW, and the explicit
> mmu_notifier_invalidate_range in the tlb gather, the rest got covered
> automatically by mmu_notifier_invalidate_range_end. And again this
> only started to be needed when we added support for hardware pagetable
> walkers that cannot stop the pagetable walking (unless they break the
> sharing of the pagetable with the primary MMU which of course is not
> desirable and it would cause unnecessary overhead).
> 
> The current ->invalidate_range handling however results in double
> calls here and there when armed, but it reduces the number of explicit
> hooks required in the common code and it keeps the mmu_notifier code
> less intrusive and more optimal when disarmed (but less optimal when
> armed). So the current state is a reasonable tradeoff, but there's
> room for optimization.

Everything you say makes sense, but at the end of the day you end up with
two different schemes for flushing the primary and secondary TLBs. For
example, the primary TLB flush mechanism is much more selective in flushing
than the secondary one. The discrepancy between the two surely does not
make things simpler.
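
As a rough illustration of that asymmetry (a sketch only): the primary MMU
can be flushed one page at a time, while the secondary MMUs are handed a
whole range to drop:

	flush_tlb_page(vma, addr);                      /* primary MMU: one page */
	mmu_notifier_invalidate_range(mm, start, end);  /* secondary MMUs: whole range */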

Indeed, there might be different costs and trade-offs in the primary and
secondary TLBs (e.g., TLB flush/shootdown time, TLB miss latency, whether an
address-space is always loaded), but I have some doubts whether this
decision was data-driven.

Regards,
Nadav

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 02/13] mm/rmap: update to new mmu_notifier semantic
  2017-08-31 18:25           ` Jerome Glisse
@ 2017-08-31 19:40             ` Linus Torvalds
  -1 siblings, 0 replies; 160+ messages in thread
From: Linus Torvalds @ 2017-08-31 19:40 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Andrea Arcangeli, Linux Kernel Mailing List, linux-mm,
	Dan Williams, Ross Zwisler, Bernhard Held, Adam Borowski,
	Radim Krčmář,
	Wanpeng Li, Paolo Bonzini, Takashi Iwai, Nadav Amit,
	Mike Galbraith, Kirill A . Shutemov, axie, Andrew Morton

On Thu, Aug 31, 2017 at 11:25 AM, Jerome Glisse <jglisse@redhat.com> wrote:
>
> This optimization is safe i believe. Linus i can respin with that and
> with further kvm dead code removal.

Yes, please.

And if you've been tracking the reviewed-by and tested-by's that would
be really good too. I lost much of yesterday to a sudden kidney stone
emergency, so I might not have tracked every ack.

                Linus

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
       [not found]             ` <20170831190021.GG9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-08-31 23:19               ` Felix Kuehling
  2017-08-31 23:29                 ` Jerome Glisse
  0 siblings, 1 reply; 160+ messages in thread
From: Felix Kuehling @ 2017-08-31 23:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: intel-gfx, Christian König,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 2017-08-31 03:00 PM, Jerome Glisse wrote:
> I was not saying you should not use mmu_notifier. For achieving B you need
> mmu_notifier. Note that if you do it like ODP/KVM then you do not need to
> pin pages.
I would like that. I've thought about it before. The one problem I
couldn't figure out is where to set the accessed and dirty bits for the
pages. Now we do it when we unpin. If we don't pin the pages in the
first place, we don't have a good place for this.

Our hardware doesn't give us notifications or accessed/dirty bits, so we
have to assume the worst case that the pages are continuously
accessed/dirty.

I'd appreciate any advice how to handle that. (Sorry, I realize this is
going a bit off topic.) A pointer to a document or source code would be
great. :)

Thanks,
  Felix

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-31 23:19               ` Felix Kuehling
@ 2017-08-31 23:29                 ` Jerome Glisse
  0 siblings, 0 replies; 160+ messages in thread
From: Jerome Glisse @ 2017-08-31 23:29 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: intel-gfx, Christian König, amd-gfx

On Thu, Aug 31, 2017 at 07:19:19PM -0400, Felix Kuehling wrote:
> On 2017-08-31 03:00 PM, Jerome Glisse wrote:
> > I was not saying you should not use mmu_notifier. For achieving B you need
> > mmu_notifier. Note that if you do like ODP/KVM then you do not need to
> > pin page.
> I would like that. I've thought about it before. The one problem I
> couldn't figure out is, where to set the accessed and dirty bits for the
> pages. Now we do it when we unpin. If we don't pin the pages in the
> first place, we don't have a good place for this.
> 
> Our hardware doesn't give us notifications or accessed/dirty bits, so we
> have to assume the worst case that the pages are continuously
> accessed/dirty.
> 
> I'd appreciate any advice how to handle that. (Sorry, I realize this is
> going a bit off topic.) A pointer to a document or source code would be
> great. :)

In invalidate_range_start(), i.e. the same place where you unpin, but instead
of unpinning you would just call set_page_dirty().
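
A minimal sketch of what that could look like in a driver's notifier; the
my_gpu_mn container and its page list are made-up names, this is not actual
amdgpu code, just the shape of it:

	static void my_invalidate_range_start(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
	{
		struct my_gpu_mn *gmn = container_of(mn, struct my_gpu_mn, mn);
		struct page *page;

		/* pages in [start, end) that the device may have written */
		list_for_each_entry(page, &gmn->device_pages, lru)
			set_page_dirty(page);
		/* ... then tear down the device mapping as before ... */
	}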

Jérôme

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-30 14:57         ` Adam Borowski
  (?)
  (?)
@ 2017-09-01 14:47           ` Jeff Cook
  -1 siblings, 0 replies; 160+ messages in thread
From: Jeff Cook @ 2017-09-01 14:47 UTC (permalink / raw)
  To: Adam Borowski, Jerome Glisse
  Cc: Linus Torvalds, Bernhard Held, Linux Kernel Mailing List,
	linux-mm, Kirill A . Shutemov, Andrew Morton, Andrea Arcangeli,
	Joerg Roedel, Dan Williams, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Jack Steiner, Paolo Bonzini,
	Radim Krčmář,
	ppc-dev, DRI, amd-gfx, linux-rdma, open list:AMD IOMMU (AMD-VI),
	xen-devel

On Wed, Aug 30, 2017, at 10:57 AM, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 08:56:15PM -0400, Jerome Glisse wrote:
> > I will wait for people to test and for result of my own test before
> > reposting if need be, otherwise i will post as separate patch.
> >
> > > But from a _very_ quick read-through this looks fine. But it obviously
> > > needs testing.
> > > 
> > > People - *especially* the people who saw issues under KVM - can you
> > > try out Jérôme's patch-series? I added some people to the cc, the full
> > > series is on lkml. Jérôme - do you have a git branch for people to
> > > test that they could easily pull and try out?
> > 
> > https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> > git://people.freedesktop.org/~glisse/linux
> 
> Tested your branch as of 10f07641, on a long list of guest VMs.
> No earth-shattering kaboom.

I've been using the mmu_notifier branch @ a3d944233bcf8c for the last 36
hours or so, also without incident.

Unlike most other reporters, I experienced a similar splat on 4.12:

Aug 03 15:02:47 kvm_master kernel: ------------[ cut here ]------------
Aug 03 15:02:47 kvm_master kernel: WARNING: CPU: 13 PID: 1653 at
arch/x86/kvm/mmu.c:682 mmu_spte_clear_track_bits+0xfb/0x100 [kvm]
Aug 03 15:02:47 kvm_master kernel: Modules linked in: vhost_net vhost
tap xt_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4
xt_tcpudp tun ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter msr nls_iso8859_1 nls_cp437 intel_rapl ipt_
MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel input_leds pcbc aesni_intel led_class
aes_x86_6
4 mxm_wmi crypto_simd glue_helper uvcvideo cryptd videobuf2_vmalloc
videobuf2_memops igb videobuf2_v4l2 videobuf2_core snd_usb_audio
videodev media joydev ptp evdev mousedev intel_cstate pps_core mac_hid
intel_rapl_perf snd_hda_intel snd_virtuoso snd_usbmidi_lib snd_hda_codec
snd_oxygen_lib snd_hda_core                        
Aug 03 15:02:47 kvm_master kernel:  snd_mpu401_uart snd_rawmidi
snd_hwdep snd_seq_device snd_pcm snd_timer snd soundcore i2c_algo_bit
pcspkr i2c_i801 lpc_ich ioatdma shpchp dca wmi acpi_power_meter tpm_tis
tpm_tis_core tpm button bridge stp llc sch_fq_codel virtio_pci
virtio_blk virtio_balloon virtio_net virtio_ring virtio kvm_intel kvm sg
ip_tables x_tables hid_logitech_hidpp hid_logitech_dj hid_generic
hid_microsoft usbhid hid sr_mod cdrom sd_mod xhci_pci ahci libahci
xhci_hcd libata usbcore scsi_mod usb_common zfs(PO) zunicode(PO)
zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops drm vfio_pci irqbypass
vfio_virqfd vfio_iommu_type1 vfio vfat fat ext4 crc16 jbd2 fscrypto
mbcache dm_thin_pool dm_cache dm_persistent_data dm_bio_prison dm_bufio
dm_raid raid456 libcrc32c                 
Aug 03 15:02:47 kvm_master kernel:  crc32c_generic crc32c_intel
async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
dm_mod dax raid1 md_mod                                                  
Aug 03 15:02:47 kvm_master kernel: CPU: 13 PID: 1653 Comm: kworker/13:2
Tainted: P    B D W  O    4.12.3-1-ARCH #1                 
Aug 03 15:02:47 kvm_master kernel: Hardware name: Supermicro
SYS-7038A-I/X10DAI, BIOS 2.0a 11/09/2016                              
Aug 03 15:02:47 kvm_master kernel: Workqueue: events mmput_async_fn      
Aug 03 15:02:47 kvm_master kernel: task: ffff9fa89751b900 task.stack:
ffffc179880d8000                                             
Aug 03 15:02:47 kvm_master kernel: RIP:
0010:mmu_spte_clear_track_bits+0xfb/0x100 [kvm]                          
Aug 03 15:02:47 kvm_master kernel: RSP: 0018:ffffc179880dbc20 EFLAGS:
00010246                                                     
Aug 03 15:02:47 kvm_master kernel: RAX: 0000000000000000 RBX:
00000009c07cce77 RCX: dead0000000000ff                               
Aug 03 15:02:47 kvm_master kernel: RDX: 0000000000000000 RSI:
ffff9fa82d6d6f08 RDI: fffff6e76701f300                               
Aug 03 15:02:47 kvm_master kernel: RBP: ffffc179880dbc38 R08:
0000000000100000 R09: 000000000000000d                               
Aug 03 15:02:47 kvm_master kernel: R10: ffff9fa0a56b0008 R11:
ffff9fa0a56b0000 R12: 00000000009c07cc                               
Aug 03 15:02:47 kvm_master kernel: R13: ffff9fa88b990000 R14:
ffff9f9e19dbb1b8 R15: 0000000000000000                               
Aug 03 15:02:47 kvm_master kernel: FS:  0000000000000000(0000)
GS:ffff9fac5f340000(0000) knlGS:0000000000000000                    
Aug 03 15:02:47 kvm_master kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033                                               
Aug 03 15:02:47 kvm_master kernel: CR2: ffffd1b542d71000 CR3:
0000000570a09000 CR4: 00000000003426e0                               
Aug 03 15:02:47 kvm_master kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000                               
Aug 03 15:02:47 kvm_master kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400                               
Aug 03 15:02:47 kvm_master kernel: Call Trace:                   
Aug 03 15:02:47 kvm_master kernel:  drop_spte+0x1a/0xb0 [kvm]    
Aug 03 15:02:47 kvm_master kernel:  mmu_page_zap_pte+0x9c/0xe0 [kvm]     
Aug 03 15:02:47 kvm_master kernel:  kvm_mmu_prepare_zap_page+0x65/0x310
[kvm]
Aug 03 15:02:47 kvm_master kernel: 
kvm_mmu_invalidate_zap_all_pages+0x10d/0x160 [kvm]
Aug 03 15:02:47 kvm_master kernel:  kvm_arch_flush_shadow_all+0xe/0x10
[kvm]
Aug 03 15:02:47 kvm_master kernel:  kvm_mmu_notifier_release+0x2c/0x40
[kvm]
Aug 03 15:02:47 kvm_master kernel:  __mmu_notifier_release+0x44/0xc0
Aug 03 15:02:47 kvm_master kernel:  exit_mmap+0x142/0x150
Aug 03 15:02:47 kvm_master kernel:  ? kfree+0x175/0x190
Aug 03 15:02:47 kvm_master kernel:  ? kfree+0x175/0x190
Aug 03 15:02:47 kvm_master kernel:  ? exit_aio+0xc6/0x100
Aug 03 15:02:47 kvm_master kernel:  mmput_async_fn+0x4c/0x130
Aug 03 15:02:47 kvm_master kernel:  process_one_work+0x1de/0x430
Aug 03 15:02:47 kvm_master kernel:  worker_thread+0x47/0x3f0
Aug 03 15:02:47 kvm_master kernel:  kthread+0x125/0x140
Aug 03 15:02:47 kvm_master kernel:  ? process_one_work+0x430/0x430
Aug 03 15:02:47 kvm_master kernel:  ? kthread_create_on_node+0x70/0x70
Aug 03 15:02:47 kvm_master kernel:  ret_from_fork+0x25/0x30
Aug 03 15:02:47 kvm_master kernel: Code: ec 75 04 00 48 b8 00 00 00 00
00 00 00 40 48 21 da 48 39 c2 0f 95 c0 eb b2 48 d1 eb 83 e3 01 eb c0 4c
89 e7 e8 f7 3d fe ff eb a4 <0f> ff eb 8a 90 0f 1f 44 00 00 55 48 89 e5
53 89 d3 e8 ff 4a fe 
Aug 03 15:02:47 kvm_master kernel: ---[ end trace 8710f4d700a7d36e ]---

This would typically take 36-48 hours to surface, so we're good so far,
but not completely out of the woods yet. I'm optimistic that since this
patchset changes the mmu_notifier behavior to something safer in
general, this issue will also be resolved by it.

Jeff

> 
> 
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀ 
> ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
> ⢿⡄⠘⠷⠚⠋⠀                                 -- Genghis Ht'rok'din
> ⠈⠳⣄⠀⠀⠀⠀ 

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-09-01 14:47           ` Jeff Cook
@ 2017-09-01 14:50             ` taskboxtester
  -1 siblings, 0 replies; 160+ messages in thread
From: taskboxtester @ 2017-09-01 14:50 UTC (permalink / raw)
  To: Jeff Cook
  Cc: open list:AMD IOMMU (AMD-VI),
	Paolo Bonzini, Dan Williams, linux-rdma, Jack Steiner,
	Dimitri Sivanich, Bernhard Held, Andrew Morton,
	Radim Krčmář,
	amd-gfx, DRI, xen-devel, Joerg Roedel, Jerome Glisse, ppc-dev,
	Linus Torvalds, Kirill A . Shutemov, Andrea Arcangeli,
	Sudeep Dutt, KVM list, Adam Borowski

[-- Attachment #1: Type: text/plain, Size: 7403 bytes --]

taskboxtester@gmail.com liked your message with Boxer for Android.


On Sep 1, 2017 10:48 AM, Jeff Cook <jeff@jeffcook.io> wrote:

On Wed, Aug 30, 2017, at 10:57 AM, Adam Borowski wrote:
> On Tue, Aug 29, 2017 at 08:56:15PM -0400, Jerome Glisse wrote:
> > I will wait for people to test and for result of my own test before
> > reposting if need be, otherwise i will post as separate patch.
> >
> > > But from a _very_ quick read-through this looks fine. But it obviously
> > > needs testing.
> > > 
> > > People - *especially* the people who saw issues under KVM - can you
> > > try out Jérôme's patch-series? I added some people to the cc, the full
> > > series is on lkml. Jérôme - do you have a git branch for people to
> > > test that they could easily pull and try out?
> > 
> > https://cgit.freedesktop.org/~glisse/linux mmu-notifier branch
> > git://people.freedesktop.org/~glisse/linux
> 
> Tested your branch as of 10f07641, on a long list of guest VMs.
> No earth-shattering kaboom.

I've been using the mmu_notifier branch @ a3d944233bcf8c for the last 36
hours or so, also without incident.

Unlike most other reporters, I experienced a similar splat on 4.12:

Aug 03 15:02:47 kvm_master kernel: ------------[ cut here ]------------
Aug 03 15:02:47 kvm_master kernel: WARNING: CPU: 13 PID: 1653 at
arch/x86/kvm/mmu.c:682 mmu_spte_clear_track_bits+0xfb/0x100 [kvm]
Aug 03 15:02:47 kvm_master kernel: Modules linked in: vhost_net vhost
tap xt_conntrack xt_CHECKSUM iptable_mangle ipt_REJECT nf_reject_ipv4
xt_tcpudp tun ebtable_filter ebtables ip6table_filter ip6_tables
iptable_filter msr nls_iso8859_1 nls_cp437 intel_rapl ipt_
MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul
crc32_pclmul ghash_clmulni_intel input_leds pcbc aesni_intel led_class
aes_x86_6
4 mxm_wmi crypto_simd glue_helper uvcvideo cryptd videobuf2_vmalloc
videobuf2_memops igb videobuf2_v4l2 videobuf2_core snd_usb_audio
videodev media joydev ptp evdev mousedev intel_cstate pps_core mac_hid
intel_rapl_perf snd_hda_intel snd_virtuoso snd_usbmidi_lib snd_hda_codec
snd_oxygen_lib snd_hda_core                        
Aug 03 15:02:47 kvm_master kernel:  snd_mpu401_uart snd_rawmidi
snd_hwdep snd_seq_device snd_pcm snd_timer snd soundcore i2c_algo_bit
pcspkr i2c_i801 lpc_ich ioatdma shpchp dca wmi acpi_power_meter tpm_tis
tpm_tis_core tpm button bridge stp llc sch_fq_codel virtio_pci
virtio_blk virtio_balloon virtio_net virtio_ring virtio kvm_intel kvm sg
ip_tables x_tables hid_logitech_hidpp hid_logitech_dj hid_generic
hid_microsoft usbhid hid sr_mod cdrom sd_mod xhci_pci ahci libahci
xhci_hcd libata usbcore scsi_mod usb_common zfs(PO) zunicode(PO)
zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops drm vfio_pci irqbypass
vfio_virqfd vfio_iommu_type1 vfio vfat fat ext4 crc16 jbd2 fscrypto
mbcache dm_thin_pool dm_cache dm_persistent_data dm_bio_prison dm_bufio
dm_raid raid456 libcrc32c                 
Aug 03 15:02:47 kvm_master kernel:  crc32c_generic crc32c_intel
async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
dm_mod dax raid1 md_mod                                                  
Aug 03 15:02:47 kvm_master kernel: CPU: 13 PID: 1653 Comm: kworker/13:2
Tainted: P    B D W  O    4.12.3-1-ARCH #1                 
Aug 03 15:02:47 kvm_master kernel: Hardware name: Supermicro
SYS-7038A-I/X10DAI, BIOS 2.0a 11/09/2016                              
Aug 03 15:02:47 kvm_master kernel: Workqueue: events mmput_async_fn      
Aug 03 15:02:47 kvm_master kernel: task: ffff9fa89751b900 task.stack:
ffffc179880d8000                                             
Aug 03 15:02:47 kvm_master kernel: RIP:
0010:mmu_spte_clear_track_bits+0xfb/0x100 [kvm]                          
Aug 03 15:02:47 kvm_master kernel: RSP: 0018:ffffc179880dbc20 EFLAGS:
00010246                                                     
Aug 03 15:02:47 kvm_master kernel: RAX: 0000000000000000 RBX:
00000009c07cce77 RCX: dead0000000000ff                               
Aug 03 15:02:47 kvm_master kernel: RDX: 0000000000000000 RSI:
ffff9fa82d6d6f08 RDI: fffff6e76701f300                               
Aug 03 15:02:47 kvm_master kernel: RBP: ffffc179880dbc38 R08:
0000000000100000 R09: 000000000000000d                               
Aug 03 15:02:47 kvm_master kernel: R10: ffff9fa0a56b0008 R11:
ffff9fa0a56b0000 R12: 00000000009c07cc                               
Aug 03 15:02:47 kvm_master kernel: R13: ffff9fa88b990000 R14:
ffff9f9e19dbb1b8 R15: 0000000000000000                               
Aug 03 15:02:47 kvm_master kernel: FS:  0000000000000000(0000)
GS:ffff9fac5f340000(0000) knlGS:0000000000000000                    
Aug 03 15:02:47 kvm_master kernel: CS:  0010 DS: 0000 ES: 0000 CR0:
0000000080050033                                               
Aug 03 15:02:47 kvm_master kernel: CR2: ffffd1b542d71000 CR3:
0000000570a09000 CR4: 00000000003426e0                               
Aug 03 15:02:47 kvm_master kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000                               
Aug 03 15:02:47 kvm_master kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400                               
Aug 03 15:02:47 kvm_master kernel: Call Trace:                   
Aug 03 15:02:47 kvm_master kernel:  drop_spte+0x1a/0xb0 [kvm]    
Aug 03 15:02:47 kvm_master kernel:  mmu_page_zap_pte+0x9c/0xe0 [kvm]     
Aug 03 15:02:47 kvm_master kernel:  kvm_mmu_prepare_zap_page+0x65/0x310
[kvm]
Aug 03 15:02:47 kvm_master kernel: 
kvm_mmu_invalidate_zap_all_pages+0x10d/0x160 [kvm]
Aug 03 15:02:47 kvm_master kernel:  kvm_arch_flush_shadow_all+0xe/0x10
[kvm]
Aug 03 15:02:47 kvm_master kernel:  kvm_mmu_notifier_release+0x2c/0x40
[kvm]
Aug 03 15:02:47 kvm_master kernel:  __mmu_notifier_release+0x44/0xc0
Aug 03 15:02:47 kvm_master kernel:  exit_mmap+0x142/0x150
Aug 03 15:02:47 kvm_master kernel:  ? kfree+0x175/0x190
Aug 03 15:02:47 kvm_master kernel:  ? kfree+0x175/0x190
Aug 03 15:02:47 kvm_master kernel:  ? exit_aio+0xc6/0x100
Aug 03 15:02:47 kvm_master kernel:  mmput_async_fn+0x4c/0x130
Aug 03 15:02:47 kvm_master kernel:  process_one_work+0x1de/0x430
Aug 03 15:02:47 kvm_master kernel:  worker_thread+0x47/0x3f0
Aug 03 15:02:47 kvm_master kernel:  kthread+0x125/0x140
Aug 03 15:02:47 kvm_master kernel:  ? process_one_work+0x430/0x430
Aug 03 15:02:47 kvm_master kernel:  ? kthread_create_on_node+0x70/0x70
Aug 03 15:02:47 kvm_master kernel:  ret_from_fork+0x25/0x30
Aug 03 15:02:47 kvm_master kernel: Code: ec 75 04 00 48 b8 00 00 00 00
00 00 00 40 48 21 da 48 39 c2 0f 95 c0 eb b2 48 d1 eb 83 e3 01 eb c0 4c
89 e7 e8 f7 3d fe ff eb a4 <0f> ff eb 8a 90 0f 1f 44 00 00 55 48 89 e5
53 89 d3 e8 ff 4a fe 
Aug 03 15:02:47 kvm_master kernel: ---[ end trace 8710f4d700a7d36e ]---

This would typically take 36-48 hours to surface, so we're good so far,
but not completely out of the woods yet. I'm optimistic that since this
patchset changes the mmu_notifier behavior to something safer in
general, this issue will also be resolved by it.

Jeff

> 
> 
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀ 
> ⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
> ⢿⡄⠘⠷⠚⠋⠀                                 -- Genghis Ht'rok'din
> ⠈⠳⣄⠀⠀⠀⠀ 


[-- Attachment #2: Type: text/html, Size: 12378 bytes --]

^ permalink raw reply	[flat|nested] 160+ messages in thread

* RE: [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic
  2017-08-29 23:54   ` Jérôme Glisse
  (?)
  (?)
@ 2017-09-06 14:08   ` Arumugam, Kamenee
  -1 siblings, 0 replies; 160+ messages in thread
From: Arumugam, Kamenee @ 2017-09-06 14:08 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm
  Cc: linux-rdma, Luick, Dean, Weiny, Ira, Doug Ledford,
	Kirill A . Shutemov, Andrew Morton, Linus Torvalds,
	Andrea Arcangeli

Tested this patch with the hfi1 driver and no issues were found.

Tested-by: Kamenee Arumugam <kamenee.arumugam@intel.com>

-----Original Message-----
From: linux-rdma-owner@vger.kernel.org [mailto:linux-rdma-owner@vger.kernel.org] On Behalf Of Jérôme Glisse
Sent: Tuesday, August 29, 2017 7:55 PM
To: linux-kernel@vger.kernel.org; linux-mm@kvack.org
Cc: Jérôme Glisse <jglisse@redhat.com>; linux-rdma@vger.kernel.org; Luick, Dean <dean.luick@intel.com>; Weiny, Ira <ira.weiny@intel.com>; Doug Ledford <dledford@redhat.com>; Kirill A . Shutemov <kirill.shutemov@linux.intel.com>; Andrew Morton <akpm@linux-foundation.org>; Linus Torvalds <torvalds@linux-foundation.org>; Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH 06/13] IB/hfi1: update to new mmu_notifier semantic

Calls to mmu_notifier_invalidate_page() are replaced by calls to
mmu_notifier_invalidate_range(), and those calls are bracketed by calls
to mmu_notifier_invalidate_range_start()/end().

Remove the now-useless invalidate_page callback.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: linux-rdma@vger.kernel.org
Cc: Dean Luick <dean.luick@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 drivers/infiniband/hw/hfi1/mmu_rb.c | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index ccbf52c8ff6f..e4b56a0dd6d0 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,8 +67,6 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static inline void mmu_notifier_page(struct mmu_notifier *, struct mm_struct *,
-				     unsigned long);
 static inline void mmu_notifier_range_start(struct mmu_notifier *,
 					    struct mm_struct *,
 					    unsigned long, unsigned long);
@@ -82,7 +80,6 @@ static void do_remove(struct mmu_rb_handler *handler,
 static void handle_remove(struct work_struct *work);
 
 static const struct mmu_notifier_ops mn_opts = {
-	.invalidate_page = mmu_notifier_page,
 	.invalidate_range_start = mmu_notifier_range_start,
 };
 
@@ -285,12 +282,6 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static inline void mmu_notifier_page(struct mmu_notifier *mn,
-				     struct mm_struct *mm, unsigned long addr)
-{
-	mmu_notifier_mem_invalidate(mn, mm, addr, addr + PAGE_SIZE);
-}
-
 static inline void mmu_notifier_range_start(struct mmu_notifier *mn,
 					    struct mm_struct *mm,
 					    unsigned long start,
--
2.13.5
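
The caller-side shape of the conversion described in the changelog above
is roughly the following (a minimal sketch for illustration only;
example_unmap_one_page() is a hypothetical helper, not code from this
series):

static void example_unmap_one_page(struct mm_struct *mm, unsigned long address)
{
	unsigned long start = address & PAGE_MASK;
	unsigned long end = start + PAGE_SIZE;

	/* Bracketing that replaces the old per-page invalidate_page(). */
	mmu_notifier_invalidate_range_start(mm, start, end);
	/* ... clear or replace the PTE under the page table lock ... */
	mmu_notifier_invalidate_range(mm, start, end);
	mmu_notifier_invalidate_range_end(mm, start, end);
}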

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-08-29 23:54 ` Jérôme Glisse
@ 2017-11-30  9:33   ` Fabian Grünbichler
  -1 siblings, 0 replies; 160+ messages in thread
From: Fabian Grünbichler @ 2017-11-30  9:33 UTC (permalink / raw)
  To: Jérôme Glisse
  Cc: linux-kernel, linux-mm, Kirill A . Shutemov, Linus Torvalds,
	Andrew Morton, Andrea Arcangeli, Joerg Roedel, Dan Williams,
	Sudeep Dutt, Ashutosh Dixit, Dimitri Sivanich, Jack Steiner,
	Paolo Bonzini, Radim Krčmář,
	linuxppc-dev, iommu, xen-devel, kvm

On Tue, Aug 29, 2017 at 07:54:34PM -0400, Jérôme Glisse wrote:
> (Sorry for so many list cross-posting and big cc)

Ditto (trimmed it a bit already, feel free to limit replies as you see
fit).

> 
> Please help testing !
> 

Kernels 4.13 and 4.14 (which both contain this patch series in its final
form) are affected by a bug that triggers BSODs
(CRITICAL_STRUCTURE_CORRUPTION) in Windows 10/2016 VMs running under Qemu
under certain conditions on certain hardware/microcode versions (see
below for details).

Testing this proved to be quite cumbersome, as only some systems are
affected and it took a while to find a semi-reliable test setup. Some
users reported that microcode updates made the problem disappear on some
affected systems[1].

Bisecting the 4.13 release cycle first pointed to

aac2fea94f7a3df8ad1eeb477eb2643f81fd5393 rmap: do not call mmu_notifier_invalidate_page() under ptl

as the likely culprit (although it was not possible to bisect exactly down
to this commit).

It was reverted in 785373b4c38719f4af6775845df6be1dfaea120f after which
the symptoms disappeared until this series was merged, which contains

369ea8242c0fb5239b4ddf0dc568f694bd244de4 mm/rmap: update to new mmu_notifier semantic v2

We haven't bisected the individual commits of the series yet, but the
commit immediately preceding its merge exhibits no problems, while
everything after does. It is not known whether the bug is actually in
the series itself, or whether increasing the likelihood of triggering it
is just a side-effect. There is a similar report[2] concerning an
upgrade from 4.12.12 to 4.12.13, which does not contain this series in
any form AFAICT but might be worth another look as well.

Our test setup consists of the following:
CPU: Intel(R) Xeon(R) CPU D-1528 @ 1.90GHz (single socket)
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx
est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm
3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin intel_pt
tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2
smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc
cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm arat pln pts
microcode: 0x700000e
Mainboard: Supermicro X10SDV-6C+-TLN4F
RAM: 64G DDR4 [3]
Swap: 8G on an LV
Qemu: 2.9.1 with some patches on top ([4])
OS: Debian Stretch based (PVE 5.1)
KSM is enabled, but turning it off just increases the number of test
iterations needed to trigger the bug.
Kernel config: [5]

VMs:
A: Windows 2016 with virtio-blk disks, 6G RAM
B: Windows 10 with (virtual) IDE disks, 6G RAM
C: Debian Stretch, ~55G RAM

fio config:
[global]
thread=2
runtime=1800

[write1]
ioengine=windowsaio
sync=0
direct=0
bs=4k
size=30G
rw=randwrite
iodepth=16

[read1]
ioengine=windowsaio
sync=0
direct=0
bs=4k
size=30G
rw=randread
iodepth=16

Test run:

- Start all three VMs
- run 'stress-ng --vm-bytes 1G --vm 52 -t 6000' in VM C
- wait until swap is (almost) full and KSM starts to merge pages
- start fio in VM A and B
- stop stress-ng in VM C, power off VM C
- run 'swapoff -av' on host
- wait until swap content has been swapped in again (this takes a while)
- observe BSOD in at least one of A / B around 30% of the time

While this test case is pretty artificial, the BSOD issue does affect
users in the wild running regular work loads (where it can take from
multiple hours up to several days to trigger).

We have reverted this patch series in our 4.13 based kernel for now,
with positive feedback from users and our own testing. If more detailed
traces or data from a test run on an affected system is needed, we will
of course provide it.

Any further input / pointers are highly appreciated!

1: https://forum.proxmox.com/threads/blue-screen-with-5-1.37664/
2: http://www.spinics.net/lists/kvm/msg159179.html
https://bugs.launchpad.net/qemu/+bug/1728256
https://bugzilla.kernel.org/show_bug.cgi?id=197951
3: http://www.samsung.com/semiconductor/products/dram/server-dram/ddr4-registered-dimm/M393A2K40BB1?ia=2503
4: https://git.proxmox.com/?p=pve-qemu.git;a=tree;f=debian/patches;h=2c516be8e69a033d14809b17e8a661b3808257f7;hb=8d4a2d3f5569817221c19a91f763964c40e00292
5: https://gist.github.com/Fabian-Gruenbichler/5c3af22ac7e6faae46840bdcebd7df14

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback
@ 2017-11-30 11:20     ` Paolo Bonzini
  0 siblings, 0 replies; 160+ messages in thread
From: Paolo Bonzini @ 2017-11-30 11:20 UTC (permalink / raw)
  To: Jérôme Glisse, linux-kernel, linux-mm,
	Kirill A . Shutemov, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Joerg Roedel, Dan Williams, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Jack Steiner,
	Radim Krčmář,
	linuxppc-dev, iommu, xen-devel, kvm

On 30/11/2017 10:33, Fabian Grünbichler wrote:
> 
> It was reverted in 785373b4c38719f4af6775845df6be1dfaea120f after which
> the symptoms disappeared until this series was merged, which contains
> 
> 369ea8242c0fb5239b4ddf0dc568f694bd244de4 mm/rmap: update to new mmu_notifier semantic v2
> 
> We haven't bisected the individual commits of the series yet, but the
> commit immediately preceding its merge exhibits no problems, while
> everything after does. It is not known whether the bug is actually in
> the series itself, or whether increasing the likelihood of triggering it
> is just a side-effect. There is a similar report[2] concerning an
> upgrade from 4.12.12 to 4.12.13, which does not contain this series in
> any form AFAICT but might be worth another look as well.

I know of one issue in this series (invalidate_page was removed from KVM
without reimplementing it as invalidate_range).  I'll try to prioritize
the fix, but I don't think I can do it before Monday.

Thanks,

Paolo
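
A rough idea of what such a range-based reimplementation could look like
(an illustrative sketch only, assuming the existing kvm_unmap_hva_range()
arch hook is reused; this is not the fix that was eventually posted in
this thread):

static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
{
	struct kvm *kvm = mmu_notifier_to_kvm(mn);
	int idx;

	/*
	 * ->invalidate_range() can be called under the page table lock,
	 * so only non-sleeping work is allowed here.
	 */
	idx = srcu_read_lock(&kvm->srcu);
	spin_lock(&kvm->mmu_lock);
	if (kvm_unmap_hva_range(kvm, start, end))
		kvm_flush_remote_tlbs(kvm);
	spin_unlock(&kvm->mmu_lock);
	srcu_read_unlock(&kvm->srcu, idx);
}

static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
	/* existing callbacks elided */
	.invalidate_range	= kvm_mmu_notifier_invalidate_range,
};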

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback
  2017-11-30 11:20     ` Paolo Bonzini
@ 2017-11-30 16:19       ` Radim Krčmář
  -1 siblings, 0 replies; 160+ messages in thread
From: Radim Krčmář @ 2017-11-30 16:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Jérôme Glisse, linux-kernel, linux-mm,
	Kirill A . Shutemov, Linus Torvalds, Andrew Morton,
	Andrea Arcangeli, Joerg Roedel, Dan Williams, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Jack Steiner, linuxppc-dev,
	iommu, xen-devel, kvm

2017-11-30 12:20+0100, Paolo Bonzini:
> On 30/11/2017 10:33, Fabian Grünbichler wrote:
> > 
> > It was reverted in 785373b4c38719f4af6775845df6be1dfaea120f after which
> > the symptoms disappeared until this series was merged, which contains
> > 
> > 369ea8242c0fb5239b4ddf0dc568f694bd244de4 mm/rmap: update to new mmu_notifier semantic v2
> > 
> > We haven't bisected the individual commits of the series yet, but the
> > commit immediately preceding its merge exhibits no problems, while
> > everything after does. It is not known whether the bug is actually in
> > the series itself, or whether increasing the likelihood of triggering it
> > is just a side-effect. There is a similar report[2] concerning an
> > upgrade from 4.12.12 to 4.12.13, which does not contain this series in
> > any form AFAICT but might be worth another look as well.
> 
> I know of one issue in this series (invalidate_page was removed from KVM
> without reimplementing it as invalidate_range).  I'll try to prioritize
> the fix, but I don't think I can do it before Monday.

The series also dropped the reloading of the APIC access page and we
never had it in invalidate_range_start ... I'll look into it today.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 16:19       ` Radim Krčmář
  (?)
@ 2017-11-30 18:05         ` Radim Krčmář
  -1 siblings, 0 replies; 160+ messages in thread
From: Radim Krčmář @ 2017-11-30 18:05 UTC (permalink / raw)
  To: Fabian Grünbichler
  Cc: linux-kernel, linux-mm, kvm, Paolo Bonzini, Andrea Arcangeli,
	Jérôme Glisse

Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.

Re-introduce the arch-specific helper and call it from ...range_start.

Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c              | 14 ++++++++++++++
 virt/kvm/kvm_main.c             |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..c16c3f924863 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define put_smstate(type, buf, offset, val)                      \
 	*(type *)((buf) + (offset) - 0x7e00) = val
 
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..a219974cdb89 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->tlb_flush(vcpu);
 }
 
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end)
+{
+	unsigned long apic_address;
+
+	/*
+	 * The physical address of apic access page is stored in the VMCS.
+	 * Update it when it becomes invalid.
+	 */
+	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+	if (start <= apic_address && apic_address < end)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+}
+
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	struct page *page = NULL;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c01cff064ec5..b7f4689e373f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end)
+{
+}
+
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
@@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		kvm_flush_remote_tlbs(kvm);
 
 	spin_unlock(&kvm->mmu_lock);
+
+	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 1/2] KVM: x86: fix APIC page invalidation
@ 2017-11-30 18:05         ` Radim Krčmář
  0 siblings, 0 replies; 160+ messages in thread
From: Radim Krčmář @ 2017-11-30 18:05 UTC (permalink / raw)
  To: Fabian Grünbichler
  Cc: linux-kernel, linux-mm, kvm, Paolo Bonzini, Andrea Arcangeli,
	Jérôme Glisse

Implementation of the unpinned APIC page didn't update the VMCS address
cache when invalidation was done through range mmu notifiers.
This became a problem when the page notifier was removed.

Re-introduce the arch-specific helper and call it from ...range_start.

Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/x86.c              | 14 ++++++++++++++
 virt/kvm/kvm_main.c             |  8 ++++++++
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 977de5fb968b..c16c3f924863 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
 #define put_smstate(type, buf, offset, val)                      \
 	*(type *)((buf) + (offset) - 0x7e00) = val
 
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end);
+
 #endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eee8e7faf1af..a219974cdb89 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->tlb_flush(vcpu);
 }
 
+void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end)
+{
+	unsigned long apic_address;
+
+	/*
+	 * The physical address of apic access page is stored in the VMCS.
+	 * Update it when it becomes invalid.
+	 */
+	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
+	if (start <= apic_address && apic_address < end)
+		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+}
+
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	struct page *page = NULL;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c01cff064ec5..b7f4689e373f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
+__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end)
+{
+}
+
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 {
 	if (pfn_valid(pfn))
@@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 		kvm_flush_remote_tlbs(kvm);
 
 	spin_unlock(&kvm->mmu_lock);
+
+	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* [PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier
  2017-11-30 18:05         ` Radim Krčmář
@ 2017-11-30 18:05           ` Radim Krčmář
  -1 siblings, 0 replies; 160+ messages in thread
From: Radim Krčmář @ 2017-11-30 18:05 UTC (permalink / raw)
  To: Fabian Grünbichler
  Cc: linux-kernel, linux-mm, kvm, Paolo Bonzini, Andrea Arcangeli,
	Jérôme Glisse

Does roughly what kvm_mmu_notifier_invalidate_page did before.

I am not certain why this would be needed.  It might mean that we have
another bug with start/end or just that I missed something.

Please try just [1/2] first and apply this one only if [1/2] still bugs,
thanks!
---
 virt/kvm/kvm_main.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b7f4689e373f..0825ea624f16 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -342,6 +342,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
+static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
+						    struct mm_struct *mm,
+						    unsigned long start,
+						    unsigned long end)
+{
+	struct kvm *kvm = mmu_notifier_to_kvm(mn);
+	int need_tlb_flush = 0, idx;
+
+	idx = srcu_read_lock(&kvm->srcu);
+	spin_lock(&kvm->mmu_lock);
+	kvm->mmu_notifier_seq++;
+	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
+	need_tlb_flush |= kvm->tlbs_dirty;
+	if (need_tlb_flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	spin_unlock(&kvm->mmu_lock);
+
+	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+
+	srcu_read_unlock(&kvm->srcu, idx);
+}
+
 static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
@@ -476,6 +499,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
 }
 
 static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
+	.invalidate_range	= kvm_mmu_notifier_invalidate_range,
 	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
 	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
 	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
-- 
2.14.2

^ permalink raw reply related	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 18:05         ` Radim Krčmář
@ 2017-12-01 12:21           ` Fabian Grünbichler
  -1 siblings, 0 replies; 160+ messages in thread
From: Fabian Grünbichler @ 2017-12-01 12:21 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, linux-mm, kvm, Paolo Bonzini, Andrea Arcangeli,
	Jérôme Glisse

On Thu, Nov 30, 2017 at 07:05:45PM +0100, Radim Krčmář wrote:
> Implementation of the unpinned APIC page didn't update the VMCS address
> cache when invalidation was done through range mmu notifiers.
> This became a problem when the page notifier was removed.
> 
> Re-introduce the arch-specific helper and call it from ...range_start.
> 
> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

Thanks for the fast reaction!

Some initial test rounds with just Patch 1 applied on top of 4.13.8 show
no blue screens; I will run more tests with 4.14.3 on Monday and report
back.

4.15-rc1 crashes for unrelated reasons, but I can re-run the tests once
a stable-enough rc has been cut.

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier
  2017-11-30 18:05           ` Radim Krčmář
  (?)
@ 2017-12-01 15:15             ` Paolo Bonzini
  -1 siblings, 0 replies; 160+ messages in thread
From: Paolo Bonzini @ 2017-12-01 15:15 UTC (permalink / raw)
  To: Radim Krčmář, Fabian Grünbichler
  Cc: linux-kernel, linux-mm, kvm, Andrea Arcangeli, Jérôme Glisse

On 30/11/2017 19:05, Radim Krčmář wrote:
> Does roughly what kvm_mmu_notifier_invalidate_page did before.
> 
> I am not certain why this would be needed.  It might mean that we have
> another bug with start/end or just that I missed something.

I don't think this is needed, because we don't have shared page tables.
My understanding is that without shared page tables, you can assume that
all page modifications go through invalidate_range_start/end.  With
shared page tables, there are additional TLB flushes to take care of,
which require invalidate_range.
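
To make that concrete, here is a rough, hedged sketch of the ordering
being described around a CPU page table update (illustration only, not
code lifted from mm/; the wrapper function name is made up, only the
mmu_notifier_invalidate_range* helpers are real):

#include <linux/mmu_notifier.h>

/* Hedged sketch of the notifier ordering around a page table update. */
static void example_unmap_sequence(struct mm_struct *mm,
				   unsigned long start, unsigned long end)
{
	/* sleepable: a secondary MMU may block new accesses here */
	mmu_notifier_invalidate_range_start(mm, start, end);

	/*
	 * The CPU page tables are modified here under the page table
	 * lock.  A secondary MMU that shares those page tables must be
	 * flushed before the old pages can be freed, which is what the
	 * non-sleeping invalidate_range callback is for:
	 */
	mmu_notifier_invalidate_range(mm, start, end);

	/* sleepable again, after the page table update is complete */
	mmu_notifier_invalidate_range_end(mm, start, end);
}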

Thanks,

Paolo

> Please try just [1/2] first and apply this one only if [1/2] still bugs,
> thanks!
> ---
>  virt/kvm/kvm_main.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b7f4689e373f..0825ea624f16 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -342,6 +342,29 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
> +static void kvm_mmu_notifier_invalidate_range(struct mmu_notifier *mn,
> +						    struct mm_struct *mm,
> +						    unsigned long start,
> +						    unsigned long end)
> +{
> +	struct kvm *kvm = mmu_notifier_to_kvm(mn);
> +	int need_tlb_flush = 0, idx;
> +
> +	idx = srcu_read_lock(&kvm->srcu);
> +	spin_lock(&kvm->mmu_lock);
> +	kvm->mmu_notifier_seq++;
> +	need_tlb_flush = kvm_unmap_hva_range(kvm, start, end);
> +	need_tlb_flush |= kvm->tlbs_dirty;
> +	if (need_tlb_flush)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +
> +	srcu_read_unlock(&kvm->srcu, idx);
> +}
> +
>  static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> @@ -476,6 +499,7 @@ static void kvm_mmu_notifier_release(struct mmu_notifier *mn,
>  }
>  
>  static const struct mmu_notifier_ops kvm_mmu_notifier_ops = {
> +	.invalidate_range	= kvm_mmu_notifier_invalidate_range,
>  	.invalidate_range_start	= kvm_mmu_notifier_invalidate_range_start,
>  	.invalidate_range_end	= kvm_mmu_notifier_invalidate_range_end,
>  	.clear_flush_young	= kvm_mmu_notifier_clear_flush_young,
> 

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 18:05         ` Radim Krčmář
@ 2017-12-01 15:27           ` Paolo Bonzini
  -1 siblings, 0 replies; 160+ messages in thread
From: Paolo Bonzini @ 2017-12-01 15:27 UTC (permalink / raw)
  To: Radim Krčmář, Fabian Grünbichler
  Cc: linux-kernel, linux-mm, kvm, Andrea Arcangeli, Jérôme Glisse

On 30/11/2017 19:05, Radim Krčmář wrote:
> Implementation of the unpinned APIC page didn't update the VMCS address
> cache when invalidation was done through range mmu notifiers.
> This became a problem when the page notifier was removed.
> 
> Re-introduce the arch-specific helper and call it from ...range_start.
> 
> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>  virt/kvm/kvm_main.c             |  8 ++++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 977de5fb968b..c16c3f924863 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>  #define put_smstate(type, buf, offset, val)                      \
>  	*(type *)((buf) + (offset) - 0x7e00) = val
>  
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eee8e7faf1af..a219974cdb89 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
>  	kvm_x86_ops->tlb_flush(vcpu);
>  }
>  
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end)
> +{
> +	unsigned long apic_address;
> +
> +	/*
> +	 * The physical address of apic access page is stored in the VMCS.
> +	 * Update it when it becomes invalid.
> +	 */
> +	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> +	if (start <= apic_address && apic_address < end)
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +}
> +
>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
>  {
>  	struct page *page = NULL;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c01cff064ec5..b7f4689e373f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>  static unsigned long long kvm_createvm_count;
>  static unsigned long long kvm_active_vms;
>  
> +__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end)
> +{
> +}
> +
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
>  	if (pfn_valid(pfn))
> @@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  		kvm_flush_remote_tlbs(kvm);
>  
>  	spin_unlock(&kvm->mmu_lock);
> +
> +	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +
>  	srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
> 

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier
  2017-12-01 15:15             ` Paolo Bonzini
  (?)
@ 2017-12-03 17:24               ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-12-03 17:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Radim Krčmář,
	Fabian Grünbichler, linux-kernel, linux-mm, kvm,
	Jérôme Glisse

On Fri, Dec 01, 2017 at 04:15:37PM +0100, Paolo Bonzini wrote:
> On 30/11/2017 19:05, Radim Krčmář wrote:
> > Does roughly what kvm_mmu_notifier_invalidate_page did before.
> > 
> > I am not certain why this would be needed.  It might mean that we have
> > another bug with start/end or just that I missed something.
> 
> I don't think this is needed, because we don't have shared page tables.
> My understanding is that without shared page tables, you can assume that
> all page modifications go through invalidate_range_start/end.  With
> shared page tables, there are additional TLB flushes to take care of,
> which require invalidate_range.

Agreed, invalidate_range is only ever needed if the secondary MMU
(i.e. KVM) shares the same page tables as the primary MMU in the
host. Only in that case do we need a special secondary MMU invalidate
in the TLB gather before the page is freed, because there is no way to
block the secondary MMU from walking the host page tables in
invalidate_range_start.

In the KVM case the secondary MMU always goes through the shadow page
tables, so all shadow page table invalidates can happen in
invalidate_range_start and patch 2/2 is not needed here.

Note that the host kernel could have always decided to call
invalidate_range_start/end and never to call invalidate_page even
before invalidate_page was removed.

So the problem in practice could only be noticed after the removal of
invalidate_page of course, but in more theoretical terms 1/2 is
actually fixing a longstanding bug. The removal of invalidate_page
made the lack of kvm_arch_mmu_notifier_invalidate_page call in
invalidate_range_start more apparent.
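
To make the distinction concrete, a hedged sketch of the two kinds of
mmu_notifier users (the callback names on the right-hand side are made
up; only the .invalidate_range* hooks themselves are real):

static const struct mmu_notifier_ops shadow_pt_user_ops = {
	/* shadow page tables (KVM-style): start/end bracketing suffices */
	.invalidate_range_start	= my_invalidate_range_start,
	.invalidate_range_end	= my_invalidate_range_end,
};

static const struct mmu_notifier_ops shared_pt_user_ops = {
	/*
	 * Shares the host page tables: also needs the flush done under
	 * the page table lock, before the pages are freed.
	 */
	.invalidate_range	= my_invalidate_range,
	.invalidate_range_start	= my_invalidate_range_start,
	.invalidate_range_end	= my_invalidate_range_end,
};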

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 18:05         ` Radim Krčmář
  (?)
@ 2017-12-03 17:28           ` Andrea Arcangeli
  -1 siblings, 0 replies; 160+ messages in thread
From: Andrea Arcangeli @ 2017-12-03 17:28 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Fabian Grünbichler, linux-kernel, linux-mm, kvm,
	Paolo Bonzini, Jérôme Glisse

On Thu, Nov 30, 2017 at 07:05:45PM +0100, Radim Krčmář wrote:
> Implementation of the unpinned APIC page didn't update the VMCS address
> cache when invalidation was done through range mmu notifiers.
> This became a problem when the page notifier was removed.
> 
> Re-introduce the arch-specific helper and call it from ...range_start.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Cc: <stable@vger.kernel.org>

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 18:05         ` Radim Krčmář
@ 2017-12-06  2:32           ` Wanpeng Li
  -1 siblings, 0 replies; 160+ messages in thread
From: Wanpeng Li @ 2017-12-06  2:32 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: Fabian Grünbichler, linux-kernel, linux-mm, kvm,
	Paolo Bonzini, Andrea Arcangeli, Jérôme Glisse

2017-12-01 2:05 GMT+08:00 Radim Krčmář <rkrcmar@redhat.com>:
> Implementation of the unpinned APIC page didn't update the VMCS address
> cache when invalidation was done through range mmu notifiers.
> This became a problem when the page notifier was removed.
>
> Re-introduce the arch-specific helper and call it from ...range_start.
>
> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>

I observed a Windows 2016 guest hanging during boot on a host with
heavy memory overcommit, and this commit fixes it.

Regards,
Wanpeng Li

> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>  virt/kvm/kvm_main.c             |  8 ++++++++
>  3 files changed, 25 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 977de5fb968b..c16c3f924863 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>  #define put_smstate(type, buf, offset, val)                      \
>         *(type *)((buf) + (offset) - 0x7e00) = val
>
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +               unsigned long start, unsigned long end);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eee8e7faf1af..a219974cdb89 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
>         kvm_x86_ops->tlb_flush(vcpu);
>  }
>
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +               unsigned long start, unsigned long end)
> +{
> +       unsigned long apic_address;
> +
> +       /*
> +        * The physical address of apic access page is stored in the VMCS.
> +        * Update it when it becomes invalid.
> +        */
> +       apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> +       if (start <= apic_address && apic_address < end)
> +               kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +}
> +
>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
>  {
>         struct page *page = NULL;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c01cff064ec5..b7f4689e373f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>  static unsigned long long kvm_createvm_count;
>  static unsigned long long kvm_active_vms;
>
> +__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +               unsigned long start, unsigned long end)
> +{
> +}
> +
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
>         if (pfn_valid(pfn))
> @@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>                 kvm_flush_remote_tlbs(kvm);
>
>         spin_unlock(&kvm->mmu_lock);
> +
> +       kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +
>         srcu_read_unlock(&kvm->srcu, idx);
>  }
>
> --
> 2.14.2
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 18:05         ` Radim Krčmář
  (?)
@ 2017-12-06  8:15           ` Fabian Grünbichler
  -1 siblings, 0 replies; 160+ messages in thread
From: Fabian Grünbichler @ 2017-12-06  8:15 UTC (permalink / raw)
  To: Radim Krčmář
  Cc: linux-kernel, linux-mm, kvm, Paolo Bonzini, Andrea Arcangeli,
	Jérôme Glisse

On Thu, Nov 30, 2017 at 07:05:45PM +0100, Radim Krčmář wrote:
> Implementation of the unpinned APIC page didn't update the VMCS address
> cache when invalidation was done through range mmu notifiers.
> This became a problem when the page notifier was removed.
> 
> Re-introduce the arch-specific helper and call it from ...range_start.
> 
> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

Tested-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>

No further issues observed with this patch applied on top of 4.13 and
4.14 - thanks!

> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>  virt/kvm/kvm_main.c             |  8 ++++++++
>  3 files changed, 25 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 977de5fb968b..c16c3f924863 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>  #define put_smstate(type, buf, offset, val)                      \
>  	*(type *)((buf) + (offset) - 0x7e00) = val
>  
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end);
> +
>  #endif /* _ASM_X86_KVM_HOST_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eee8e7faf1af..a219974cdb89 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
>  	kvm_x86_ops->tlb_flush(vcpu);
>  }
>  
> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end)
> +{
> +	unsigned long apic_address;
> +
> +	/*
> +	 * The physical address of apic access page is stored in the VMCS.
> +	 * Update it when it becomes invalid.
> +	 */
> +	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> +	if (start <= apic_address && apic_address < end)
> +		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +}
> +
>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
>  {
>  	struct page *page = NULL;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index c01cff064ec5..b7f4689e373f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>  static unsigned long long kvm_createvm_count;
>  static unsigned long long kvm_active_vms;
>  
> +__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end)
> +{
> +}
> +
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  {
>  	if (pfn_valid(pfn))
> @@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  		kvm_flush_remote_tlbs(kvm);
>  
>  	spin_unlock(&kvm->mmu_lock);
> +
> +	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +
>  	srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
> -- 
> 2.14.2
> 
> 

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-12-06  2:32           ` Wanpeng Li
@ 2017-12-06  9:50             ` 王金浦
  -1 siblings, 0 replies; 160+ messages in thread
From: 王金浦 @ 2017-12-06  9:50 UTC (permalink / raw)
  To: Wanpeng Li, Radim Krčmář
  Cc: Fabian Grünbichler, linux-kernel, linux-mm, kvm,
	Paolo Bonzini, Andrea Arcangeli, Jérôme Glisse

2017-12-06 3:32 GMT+01:00 Wanpeng Li <kernellwp@gmail.com>:
> 2017-12-01 2:05 GMT+08:00 Radim Krčmář <rkrcmar@redhat.com>:
>> Implementation of the unpinned APIC page didn't update the VMCS address
>> cache when invalidation was done through range mmu notifiers.
>> This became a problem when the page notifier was removed.
>>
>> Re-introduce the arch-specific helper and call it from ...range_start.
>>
>> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
>> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
>> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
So the patch should be backported to v3.18+?

Thanks,
Jack
>
> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
>
> I observe the windows 2016 guest hang during boot on a heavy memory
> overcommit host, and this commit fixes it.
>
> Regards,
> Wanpeng Li
>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  3 +++
>>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>>  virt/kvm/kvm_main.c             |  8 ++++++++
>>  3 files changed, 25 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 977de5fb968b..c16c3f924863 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>>  #define put_smstate(type, buf, offset, val)                      \
>>         *(type *)((buf) + (offset) - 0x7e00) = val
>>
>> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>> +               unsigned long start, unsigned long end);
>> +
>>  #endif /* _ASM_X86_KVM_HOST_H */
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index eee8e7faf1af..a219974cdb89 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
>>         kvm_x86_ops->tlb_flush(vcpu);
>>  }
>>
>> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>> +               unsigned long start, unsigned long end)
>> +{
>> +       unsigned long apic_address;
>> +
>> +       /*
>> +        * The physical address of apic access page is stored in the VMCS.
>> +        * Update it when it becomes invalid.
>> +        */
>> +       apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>> +       if (start <= apic_address && apic_address < end)
>> +               kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
>> +}
>> +
>>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
>>  {
>>         struct page *page = NULL;
>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>> index c01cff064ec5..b7f4689e373f 100644
>> --- a/virt/kvm/kvm_main.c
>> +++ b/virt/kvm/kvm_main.c
>> @@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>>  static unsigned long long kvm_createvm_count;
>>  static unsigned long long kvm_active_vms;
>>
>> +__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>> +               unsigned long start, unsigned long end)
>> +{
>> +}
>> +
>>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>>  {
>>         if (pfn_valid(pfn))
>> @@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>                 kvm_flush_remote_tlbs(kvm);
>>
>>         spin_unlock(&kvm->mmu_lock);
>> +
>> +       kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
>> +
>>         srcu_read_unlock(&kvm->srcu, idx);
>>  }
>>
>> --
>> 2.14.2
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-12-06  9:50             ` 王金浦
  (?)
@ 2017-12-06 10:00               ` Paolo Bonzini
  -1 siblings, 0 replies; 160+ messages in thread
From: Paolo Bonzini @ 2017-12-06 10:00 UTC (permalink / raw)
  To: 王金浦, Wanpeng Li, Radim Krčmář
  Cc: Fabian Grünbichler, linux-kernel, linux-mm, kvm,
	Andrea Arcangeli, Jérôme Glisse

On 06/12/2017 10:50, 王金浦 wrote:
> 2017-12-06 3:32 GMT+01:00 Wanpeng Li <kernellwp@gmail.com>:
>> 2017-12-01 2:05 GMT+08:00 Radim Krčmář <rkrcmar@redhat.com>:
>>> Implementation of the unpinned APIC page didn't update the VMCS address
>>> cache when invalidation was done through range mmu notifiers.
>>> This became a problem when the page notifier was removed.
>>>
>>> Re-introduce the arch-specific helper and call it from ...range_start.
>>>
>>> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
>>> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
>>> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> So the patch should be backported to v3.18+?

In practice the invalidate_page callback was enough because the APIC
page is allocated with alloc_page().  But it doesn't hurt.

Paolo
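
To put that in code: since the page comes from alloc_page() it is a single
order-0 (4KiB) page, so the notifier side of the fix reduces to one
half-open range test against the page's host virtual address. A tiny
sketch of that check, where apic_hva stands in for the gfn_to_hva() result
used in the quoted hunk below (nothing here beyond what the patch already
does):

#include <stdbool.h>

/* Does the invalidated HVA range [start, end) cover the single APIC
 * access page?  apic_hva is assumed to be what gfn_to_hva() returned
 * for APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT. */
static bool range_covers_apic_page(unsigned long start, unsigned long end,
				   unsigned long apic_hva)
{
	return start <= apic_hva && apic_hva < end;
}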

> Thanks,
> Jack
>>
>> Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
>>
>> I observed a Windows 2016 guest hang during boot on a heavily
>> memory-overcommitted host, and this commit fixes it.
>>
>> Regards,
>> Wanpeng Li
>>
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |  3 +++
>>>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>>>  virt/kvm/kvm_main.c             |  8 ++++++++
>>>  3 files changed, 25 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>>> index 977de5fb968b..c16c3f924863 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -1435,4 +1435,7 @@ static inline int kvm_cpu_get_apicid(int mps_cpu)
>>>  #define put_smstate(type, buf, offset, val)                      \
>>>         *(type *)((buf) + (offset) - 0x7e00) = val
>>>
>>> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>> +               unsigned long start, unsigned long end);
>>> +
>>>  #endif /* _ASM_X86_KVM_HOST_H */
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index eee8e7faf1af..a219974cdb89 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -6778,6 +6778,20 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu)
>>>         kvm_x86_ops->tlb_flush(vcpu);
>>>  }
>>>
>>> +void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>> +               unsigned long start, unsigned long end)
>>> +{
>>> +       unsigned long apic_address;
>>> +
>>> +       /*
>>> +        * The physical address of apic access page is stored in the VMCS.
>>> +        * Update it when it becomes invalid.
>>> +        */
>>> +       apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>>> +       if (start <= apic_address && apic_address < end)
>>> +               kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
>>> +}
>>> +
>>>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
>>>  {
>>>         struct page *page = NULL;
>>> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>>> index c01cff064ec5..b7f4689e373f 100644
>>> --- a/virt/kvm/kvm_main.c
>>> +++ b/virt/kvm/kvm_main.c
>>> @@ -135,6 +135,11 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>>>  static unsigned long long kvm_createvm_count;
>>>  static unsigned long long kvm_active_vms;
>>>
>>> +__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>> +               unsigned long start, unsigned long end)
>>> +{
>>> +}
>>> +
>>>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>>>  {
>>>         if (pfn_valid(pfn))
>>> @@ -360,6 +365,9 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>>>                 kvm_flush_remote_tlbs(kvm);
>>>
>>>         spin_unlock(&kvm->mmu_lock);
>>> +
>>> +       kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
>>> +
>>>         srcu_read_unlock(&kvm->srcu, idx);
>>>  }
>>>
>>> --
>>> 2.14.2
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to majordomo@kvack.org.  For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 160+ messages in thread

* Re: [PATCH 1/2] KVM: x86: fix APIC page invalidation
  2017-11-30 18:05         ` Radim Krčmář
  (?)
@ 2017-12-13 12:54           ` Richard Purdie
  -1 siblings, 0 replies; 160+ messages in thread
From: Richard Purdie @ 2017-12-13 12:54 UTC (permalink / raw)
  To: Radim Krčmář, Fabian Grünbichler
  Cc: linux-kernel, linux-mm, kvm, Paolo Bonzini, Andrea Arcangeli,
	Jérôme Glisse, stable

On Thu, 2017-11-30 at 19:05 +0100, Radim Krčmář wrote:
> Implementation of the unpinned APIC page didn't update the VMCS address
> cache when invalidation was done through range mmu notifiers.
> This became a problem when the page notifier was removed.
> 
> Re-introduce the arch-specific helper and call it from ...range_start.
> 
> Fixes: 38b9917350cb ("kvm: vmx: Implement set_apic_access_page_addr")
> Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>  virt/kvm/kvm_main.c             |  8 ++++++++
>  3 files changed, 25 insertions(+)

Thanks for this. I've been chasing APIC-related hangs booting images
with qemu-system-x86_64 on 4.13 and 4.14 host kernels where the guest
doesn't have x2apic enabled.

I can confirm this fixes issues the Yocto Project automated testing
infrastructure was seeing.

I'd like to add my support for backporting this to stable.

Tested-by: Richard Purdie <richard.purdie@linuxfoundation.org>

Cheers,

Richard
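
(For anyone trying to reproduce this: a quick way to confirm from inside a
guest whether x2apic is exposed is CPUID leaf 1, ECX bit 21; the small
standalone check below is illustrative only and is not part of the patch
or of the Yocto test setup. Equivalently, the x2apic flag shows up, or
not, in /proc/cpuinfo inside the guest.)

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;
	/* CPUID.01H:ECX bit 21 advertises x2APIC support. */
	printf("x2apic %s\n", (ecx & (1u << 21)) ? "available" : "not available");
	return 0;
}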

^ permalink raw reply	[flat|nested] 160+ messages in thread

end of thread, other threads:[~2017-12-13 13:07 UTC | newest]

Thread overview: 160+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-29 23:54 [PATCH 00/13] mmu_notifier kill invalidate_page callback Jérôme Glisse
2017-08-29 23:54 ` Jérôme Glisse
2017-08-29 23:54 ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 01/13] dax: update to new mmu_notifier semantic Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 02/13] mm/rmap: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30  2:46   ` Nadav Amit
2017-08-30  2:46     ` Nadav Amit
2017-08-30  2:59     ` Jerome Glisse
2017-08-30  2:59       ` Jerome Glisse
2017-08-30  3:16       ` Nadav Amit
2017-08-30  3:16         ` Nadav Amit
2017-08-30  3:18         ` Nadav Amit
2017-08-30  3:18           ` Nadav Amit
2017-08-30 17:27     ` Andrea Arcangeli
2017-08-30 17:27       ` Andrea Arcangeli
2017-08-30 18:00       ` Nadav Amit
2017-08-30 18:00         ` Nadav Amit
2017-08-30 21:25         ` Andrea Arcangeli
2017-08-30 21:25           ` Andrea Arcangeli
2017-08-30 23:25           ` Nadav Amit
2017-08-30 23:25             ` Nadav Amit
2017-08-31  0:47             ` Jerome Glisse
2017-08-31  0:47               ` Jerome Glisse
2017-08-31  0:47               ` Jerome Glisse
2017-08-31 17:12               ` Andrea Arcangeli
2017-08-31 17:12                 ` Andrea Arcangeli
2017-08-31 17:12                 ` Andrea Arcangeli
2017-08-31 19:15                 ` Nadav Amit
2017-08-31 19:15                   ` Nadav Amit
2017-08-30 18:20       ` Jerome Glisse
2017-08-30 18:20         ` Jerome Glisse
2017-08-30 18:40         ` Nadav Amit
2017-08-30 18:40           ` Nadav Amit
2017-08-30 20:45           ` Jerome Glisse
2017-08-30 20:45             ` Jerome Glisse
2017-08-30 22:17             ` Andrea Arcangeli
2017-08-30 22:17               ` Andrea Arcangeli
2017-08-30 20:55           ` Andrea Arcangeli
2017-08-30 20:55             ` Andrea Arcangeli
2017-08-30 16:52   ` Andrea Arcangeli
2017-08-30 16:52     ` Andrea Arcangeli
2017-08-30 17:48     ` Jerome Glisse
2017-08-30 17:48       ` Jerome Glisse
2017-08-30 21:53     ` Linus Torvalds
2017-08-30 21:53       ` Linus Torvalds
2017-08-30 23:01       ` Andrea Arcangeli
2017-08-30 23:01         ` Andrea Arcangeli
2017-08-31 18:25         ` Jerome Glisse
2017-08-31 18:25           ` Jerome Glisse
2017-08-31 19:40           ` Linus Torvalds
2017-08-31 19:40             ` Linus Torvalds
2017-08-29 23:54 ` [PATCH 03/13] powerpc/powernv: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 04/13] drm/amdgpu: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30  6:18   ` Christian König
2017-08-30  6:18     ` Christian König
2017-08-29 23:54 ` [PATCH 05/13] IB/umem: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30  6:13   ` Leon Romanovsky
2017-08-29 23:54 ` [PATCH 06/13] IB/hfi1: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-09-06 14:08   ` Arumugam, Kamenee
2017-08-29 23:54 ` [PATCH 07/13] iommu/amd: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 08/13] iommu/intel: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 09/13] misc/mic/scif: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 10/13] sgi-gru: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 11/13] xen/gntdev: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-30 19:32   ` Boris Ostrovsky
2017-08-30 19:32   ` Boris Ostrovsky
2017-08-29 23:54 ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 12/13] KVM: " Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
2017-08-29 23:54 ` [PATCH 13/13] mm/mmu_notifier: kill invalidate_page Jérôme Glisse
2017-08-29 23:54   ` Jérôme Glisse
     [not found] ` <20170829235447.10050-1-jglisse-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-30  0:11   ` [PATCH 00/13] mmu_notifier kill invalidate_page callback Linus Torvalds
2017-08-30  0:11     ` Linus Torvalds
2017-08-30  0:11     ` Linus Torvalds
2017-08-30  0:11     ` Linus Torvalds
2017-08-30  0:56     ` Jerome Glisse
2017-08-30  0:56     ` Jerome Glisse
2017-08-30  0:56       ` Jerome Glisse
2017-08-30  0:56       ` Jerome Glisse
2017-08-30  8:40       ` Mike Galbraith
2017-08-30  8:40       ` Mike Galbraith
2017-08-30  8:40         ` Mike Galbraith
2017-08-30  8:40         ` Mike Galbraith
2017-08-30  8:40         ` Mike Galbraith
2017-08-30 14:57       ` Adam Borowski
2017-08-30 14:57       ` Adam Borowski
2017-08-30 14:57         ` Adam Borowski
2017-08-30 14:57         ` Adam Borowski
2017-09-01 14:47         ` Jeff Cook
2017-09-01 14:47         ` Jeff Cook
2017-09-01 14:47           ` Jeff Cook
2017-09-01 14:47           ` Jeff Cook
2017-09-01 14:47           ` Jeff Cook
2017-09-01 14:50           ` taskboxtester
2017-09-01 14:50             ` taskboxtester
2017-08-30 21:51   ` Felix Kuehling
2017-08-31 13:59     ` Jerome Glisse
     [not found]       ` <20170831135953.GA9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-31 14:14         ` Christian König
2017-08-31 18:39         ` Felix Kuehling
2017-08-31 19:00           ` Jerome Glisse
     [not found]             ` <20170831190021.GG9227-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-08-31 23:19               ` Felix Kuehling
2017-08-31 23:29                 ` Jerome Glisse
2017-08-30  0:11 ` Linus Torvalds
2017-11-30  9:33 ` BSOD with " Fabian Grünbichler
2017-11-30  9:33   ` Fabian Grünbichler
2017-11-30 11:20   ` Paolo Bonzini
2017-11-30 11:20     ` Paolo Bonzini
2017-11-30 11:20     ` Paolo Bonzini
2017-11-30 16:19     ` Radim Krčmář
2017-11-30 16:19       ` Radim Krčmář
2017-11-30 18:05       ` [PATCH 1/2] KVM: x86: fix APIC page invalidation Radim Krčmář
2017-11-30 18:05         ` Radim Krčmář
2017-11-30 18:05         ` Radim Krčmář
2017-11-30 18:05         ` [PATCH 2/2] TESTING! KVM: x86: add invalidate_range mmu notifier Radim Krčmář
2017-11-30 18:05           ` Radim Krčmář
2017-12-01 15:15           ` Paolo Bonzini
2017-12-01 15:15             ` Paolo Bonzini
2017-12-01 15:15             ` Paolo Bonzini
2017-12-03 17:24             ` Andrea Arcangeli
2017-12-03 17:24               ` Andrea Arcangeli
2017-12-03 17:24               ` Andrea Arcangeli
2017-12-01 12:21         ` [PATCH 1/2] KVM: x86: fix APIC page invalidation Fabian Grünbichler
2017-12-01 12:21           ` Fabian Grünbichler
2017-12-01 15:27         ` Paolo Bonzini
2017-12-01 15:27           ` Paolo Bonzini
2017-12-03 17:28         ` Andrea Arcangeli
2017-12-03 17:28           ` Andrea Arcangeli
2017-12-03 17:28           ` Andrea Arcangeli
2017-12-06  2:32         ` Wanpeng Li
2017-12-06  2:32           ` Wanpeng Li
2017-12-06  9:50           ` 王金浦
2017-12-06  9:50             ` 王金浦
2017-12-06 10:00             ` Paolo Bonzini
2017-12-06 10:00               ` Paolo Bonzini
2017-12-06 10:00               ` Paolo Bonzini
2017-12-06  8:15         ` Fabian Grünbichler
2017-12-06  8:15           ` Fabian Grünbichler
2017-12-06  8:15           ` Fabian Grünbichler
2017-12-13 12:54         ` Richard Purdie
2017-12-13 12:54           ` Richard Purdie
2017-12-13 12:54           ` Richard Purdie
2017-11-30 16:19     ` BSOD with [PATCH 00/13] mmu_notifier kill invalidate_page callback Radim Krčmář
2017-11-30 11:20   ` Paolo Bonzini
2017-11-30  9:33 ` Fabian Grünbichler
