All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:02 Michal Hocko
  2018-06-22 15:06 ` ✗ Fi.CI.BAT: failure for " Patchwork
                   ` (17 more replies)
  0 siblings, 18 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:02 UTC (permalink / raw)
  To: LKML
  Cc: Michal Hocko, kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro

From: Michal Hocko <mhocko@suse.com>

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks. Currently we simply back off and mark an
oom victim with blockable mmu notifiers as done after a short sleep.
That can result in selecting a new oom victim prematurely because the
previous one still hasn't torn its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover most notifiers only care about a portion of the address
space. This patch handles the first part of the problem.
__mmu_notifier_invalidate_range_start gets a blockable flag and
callbacks are not allowed to sleep if the flag is set to false. This is
achieved by using trylock instead of the sleepable lock for most
callbacks. I think we can improve that even further because there is
a common pattern to do a range lookup first and then do something about
that. The first part can be done without a sleeping lock I presume.

Anyway, what does the oom_reaper do with all that? We do not have to
fail right away. We simply retry if there is at least one notifier which
couldn't make any progress. A retry loop is already implemented to wait
for the mmap_sem and this is basically the same thing.

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is an RFC and not tested at all. I am not very familiar with the
mmu notifiers semantics very much so this is a crude attempt to achieve
what I need basically. It might be completely wrong but I would like
to discuss what would be a better way if that is the case.

get_maintainers gave me quite large list of people to CC so I had to trim
it down. If you think I have forgot somebody, please let me know

Any feedback is highly appreciated.

 arch/x86/kvm/x86.c                      |  7 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
 drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
 drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
 drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
 drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
 drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
 drivers/xen/gntdev.c                    | 14 ++++++++---
 include/linux/kvm_host.h                |  2 +-
 include/linux/mmu_notifier.h            | 15 +++++++++--
 mm/hmm.c                                |  7 ++++--
 mm/mmu_notifier.c                       | 15 ++++++++---
 mm/oom_kill.c                           | 29 +++++++++++-----------
 virt/kvm/kvm_main.c                     | 12 ++++++---
 15 files changed, 137 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..d138a526feff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..5285df9331fa 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		del_object(mo);
 	spin_unlock(&mn->lock);
 
-	if (!list_empty(&cancelled))
+	/* TODO: can we skip waiting here? */
+	if (!list_empty(&cancelled) && blockable)
 		flush_workqueue(mn->wq);
 }
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..b47e828b725d 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
@@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 	}
 	
 	mutex_unlock(&rmn->lock);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..f65f6a29daae 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
+
+	return 0;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..8780560d1623 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..d940568bed87 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..50724d09fe5c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
-	mutex_lock(&priv->lock);
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
+
 	list_for_each_entry(map, &priv->maps, next) {
 		unmap_if_in_range(map, start, end);
 	}
@@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 		unmap_if_in_range(map, start, end);
 	}
 	mutex_unlock(&priv->lock);
+
+	return true;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..e4181063e755 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..369867501bed 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..30cc43121da9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret)
+				ret = _ret;
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..7e0c6e78ae5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto out_unlock;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..6f7e709d2944 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end)
 {
 }
@@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.17.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:02 Michal Hocko
  2018-06-22 15:06 ` ✗ Fi.CI.BAT: failure for " Patchwork
                   ` (17 more replies)
  0 siblings, 18 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:02 UTC (permalink / raw)
  To: LKML
  Cc: Michal Hocko, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, Christian König, David Airlie, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Doug Ledford, Jason Gunthorpe,
	Mike Marciniszyn, Dennis Dalessandro, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

From: Michal Hocko <mhocko@suse.com>

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks. Currently we simply back off and mark an
oom victim with blockable mmu notifiers as done after a short sleep.
That can result in selecting a new oom victim prematurely because the
previous one still hasn't torn its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover most notifiers only care about a portion of the address
space. This patch handles the first part of the problem.
__mmu_notifier_invalidate_range_start gets a blockable flag and
callbacks are not allowed to sleep if the flag is set to false. This is
achieved by using trylock instead of the sleepable lock for most
callbacks. I think we can improve that even further because there is
a common pattern to do a range lookup first and then do something about
that. The first part can be done without a sleeping lock I presume.

Anyway, what does the oom_reaper do with all that? We do not have to
fail right away. We simply retry if there is at least one notifier which
couldn't make any progress. A retry loop is already implemented to wait
for the mmap_sem and this is basically the same thing.

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is an RFC and not tested at all. I am not very familiar with the
mmu notifiers semantics very much so this is a crude attempt to achieve
what I need basically. It might be completely wrong but I would like
to discuss what would be a better way if that is the case.

get_maintainers gave me quite large list of people to CC so I had to trim
it down. If you think I have forgot somebody, please let me know

Any feedback is highly appreciated.

 arch/x86/kvm/x86.c                      |  7 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
 drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
 drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
 drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
 drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
 drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
 drivers/xen/gntdev.c                    | 14 ++++++++---
 include/linux/kvm_host.h                |  2 +-
 include/linux/mmu_notifier.h            | 15 +++++++++--
 mm/hmm.c                                |  7 ++++--
 mm/mmu_notifier.c                       | 15 ++++++++---
 mm/oom_kill.c                           | 29 +++++++++++-----------
 virt/kvm/kvm_main.c                     | 12 ++++++---
 15 files changed, 137 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..d138a526feff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..5285df9331fa 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		del_object(mo);
 	spin_unlock(&mn->lock);
 
-	if (!list_empty(&cancelled))
+	/* TODO: can we skip waiting here? */
+	if (!list_empty(&cancelled) && blockable)
 		flush_workqueue(mn->wq);
 }
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..b47e828b725d 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
@@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 	}
 	
 	mutex_unlock(&rmn->lock);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..f65f6a29daae 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
+
+	return 0;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..8780560d1623 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..d940568bed87 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..50724d09fe5c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
-	mutex_lock(&priv->lock);
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
+
 	list_for_each_entry(map, &priv->maps, next) {
 		unmap_if_in_range(map, start, end);
 	}
@@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 		unmap_if_in_range(map, start, end);
 	}
 	mutex_unlock(&priv->lock);
+
+	return true;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..e4181063e755 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..369867501bed 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..30cc43121da9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret)
+				ret = _ret;
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..7e0c6e78ae5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto out_unlock;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..6f7e709d2944 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end)
 {
 }
@@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 125+ messages in thread

* [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:02 Michal Hocko
  2018-06-22 15:06 ` ✗ Fi.CI.BAT: failure for " Patchwork
                   ` (17 more replies)
  0 siblings, 18 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:02 UTC (permalink / raw)
  To: LKML
  Cc: Michal Hocko, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, Christian König, David Airlie, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Doug Ledford, Jason Gunthorpe,
	Mike Marciniszyn, Dennis Dalessandro, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

From: Michal Hocko <mhocko@suse.com>

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks. Currently we simply back off and mark an
oom victim with blockable mmu notifiers as done after a short sleep.
That can result in selecting a new oom victim prematurely because the
previous one still hasn't torn its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover most notifiers only care about a portion of the address
space. This patch handles the first part of the problem.
__mmu_notifier_invalidate_range_start gets a blockable flag and
callbacks are not allowed to sleep if the flag is set to false. This is
achieved by using trylock instead of the sleepable lock for most
callbacks. I think we can improve that even further because there is
a common pattern to do a range lookup first and then do something about
that. The first part can be done without a sleeping lock I presume.

Anyway, what does the oom_reaper do with all that? We do not have to
fail right away. We simply retry if there is at least one notifier which
couldn't make any progress. A retry loop is already implemented to wait
for the mmap_sem and this is basically the same thing.

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim KrA?mA!A?" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian KA?nig" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "JA(C)rA'me Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
this is an RFC and not tested at all. I am not very familiar with the
mmu notifiers semantics very much so this is a crude attempt to achieve
what I need basically. It might be completely wrong but I would like
to discuss what would be a better way if that is the case.

get_maintainers gave me quite large list of people to CC so I had to trim
it down. If you think I have forgot somebody, please let me know

Any feedback is highly appreciated.

 arch/x86/kvm/x86.c                      |  7 ++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
 drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
 drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
 drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
 drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
 drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
 drivers/xen/gntdev.c                    | 14 ++++++++---
 include/linux/kvm_host.h                |  2 +-
 include/linux/mmu_notifier.h            | 15 +++++++++--
 mm/hmm.c                                |  7 ++++--
 mm/mmu_notifier.c                       | 15 ++++++++---
 mm/oom_kill.c                           | 29 +++++++++++-----------
 virt/kvm/kvm_main.c                     | 12 ++++++---
 15 files changed, 137 insertions(+), 58 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..d138a526feff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..5285df9331fa 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		del_object(mo);
 	spin_unlock(&mn->lock);
 
-	if (!list_empty(&cancelled))
+	/* TODO: can we skip waiting here? */
+	if (!list_empty(&cancelled) && blockable)
 		flush_workqueue(mn->wq);
 }
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..b47e828b725d 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
@@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 	}
 	
 	mutex_unlock(&rmn->lock);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..f65f6a29daae 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
 				      invalidate_range_start_trampoline, NULL);
 	up_read(&context->umem_rwsem);
+
+	return 0;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..8780560d1623 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..d940568bed87 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..50724d09fe5c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
 
-	mutex_lock(&priv->lock);
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
+
 	list_for_each_entry(map, &priv->maps, next) {
 		unmap_if_in_range(map, start, end);
 	}
@@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
 		unmap_if_in_range(map, start, end);
 	}
 	mutex_unlock(&priv->lock);
+
+	return true;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..e4181063e755 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..369867501bed 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..30cc43121da9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret)
+				ret = _ret;
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..7e0c6e78ae5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto out_unlock;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..6f7e709d2944 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		unsigned long start, unsigned long end)
 {
 }
@@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.17.1

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
@ 2018-06-22 15:06 ` Patchwork
  2018-06-22 15:13 ` [RFC PATCH] " Christian König
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-22 15:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers
URL   : https://patchwork.freedesktop.org/series/45263/
State : failure

== Summary ==

Applying: mm, oom: distinguish blockable mode for mmu notifiers
Using index info to reconstruct a base tree...
M	arch/x86/kvm/x86.c
M	drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
M	mm/oom_kill.c
Falling back to patching base and 3-way merge...
Auto-merging mm/oom_kill.c
Auto-merging drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
Auto-merging arch/x86/kvm/x86.c
error: Failed to merge in the changes.
Patch failed at 0001 mm, oom: distinguish blockable mode for mmu notifiers
Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
  2018-06-22 15:06 ` ✗ Fi.CI.BAT: failure for " Patchwork
@ 2018-06-22 15:13 ` Christian König
  2018-06-22 15:24   ` Michal Hocko
  2018-06-22 15:24   ` Michal Hocko
  2018-06-22 15:13 ` Christian König
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 125+ messages in thread
From: Christian König @ 2018-06-22 15:13 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Michal Hocko, kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro

Hi Michal,

[Adding Felix as well]

Well first of all you have a misconception why at least the AMD graphics 
driver need to be able to sleep in an MMU notifier: We need to sleep 
because we need to wait for hardware operations to finish and *NOT* 
because we need to wait for locks.

I'm not sure if your flag now means that you generally can't sleep in 
MMU notifiers any more, but if that's the case at least AMD hardware 
will break badly. In our case the approach of waiting for a short time 
for the process to be reaped and then select another victim actually 
sounds like the right thing to do.

What we also already try to do is to abort hardware operations with the 
address space when we detect that the process is dying, but that can 
certainly be improved.

Regards,
Christian.

Am 22.06.2018 um 17:02 schrieb Michal Hocko:
> From: Michal Hocko <mhocko@suse.com>
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks. Currently we simply back off and mark an
> oom victim with blockable mmu notifiers as done after a short sleep.
> That can result in selecting a new oom victim prematurely because the
> previous one still hasn't torn its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover most notifiers only care about a portion of the address
> space. This patch handles the first part of the problem.
> __mmu_notifier_invalidate_range_start gets a blockable flag and
> callbacks are not allowed to sleep if the flag is set to false. This is
> achieved by using trylock instead of the sleepable lock for most
> callbacks. I think we can improve that even further because there is
> a common pattern to do a range lookup first and then do something about
> that. The first part can be done without a sleeping lock I presume.
>
> Anyway, what does the oom_reaper do with all that? We do not have to
> fail right away. We simply retry if there is at least one notifier which
> couldn't make any progress. A retry loop is already implemented to wait
> for the mmap_sem and this is basically the same thing.
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
>
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know
>
> Any feedback is highly appreciated.
>
>   arch/x86/kvm/x86.c                      |  7 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
>   drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
>   drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
>   drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
>   drivers/xen/gntdev.c                    | 14 ++++++++---
>   include/linux/kvm_host.h                |  2 +-
>   include/linux/mmu_notifier.h            | 15 +++++++++--
>   mm/hmm.c                                |  7 ++++--
>   mm/mmu_notifier.c                       | 15 ++++++++---
>   mm/oom_kill.c                           | 29 +++++++++++-----------
>   virt/kvm/kvm_main.c                     | 12 ++++++---
>   15 files changed, 137 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..d138a526feff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   		del_object(mo);
>   	spin_unlock(&mn->lock);
>   
> -	if (!list_empty(&cancelled))
> +	/* TODO: can we skip waiting here? */
> +	if (!list_empty(&cancelled) && blockable)
>   		flush_workqueue(mn->wq);
>   }
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..b47e828b725d 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
> @@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	}
>   	
>   	mutex_unlock(&rmn->lock);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..f65f6a29daae 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
>   				      invalidate_range_start_trampoline, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return 0;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..8780560d1623 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..d940568bed87 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..50724d09fe5c 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
>   
> -	mutex_lock(&priv->lock);
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
> +
>   	list_for_each_entry(map, &priv->maps, next) {
>   		unmap_if_in_range(map, start, end);
>   	}
> @@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>   		unmap_if_in_range(map, start, end);
>   	}
>   	mutex_unlock(&priv->lock);
> +
> +	return true;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..e4181063e755 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..369867501bed 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
>   extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..30cc43121da9 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret)
> +				ret = _ret;
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..7e0c6e78ae5c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto out_unlock;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..6f7e709d2944 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end)
>   {
>   }
> @@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:13 ` Christian König
  2018-06-22 15:24   ` Michal Hocko
  2018-06-22 15:24   ` Michal Hocko
  0 siblings, 2 replies; 125+ messages in thread
From: Christian König @ 2018-06-22 15:13 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Michal Hocko, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Hi Michal,

[Adding Felix as well]

Well first of all you have a misconception why at least the AMD graphics 
driver need to be able to sleep in an MMU notifier: We need to sleep 
because we need to wait for hardware operations to finish and *NOT* 
because we need to wait for locks.

I'm not sure if your flag now means that you generally can't sleep in 
MMU notifiers any more, but if that's the case at least AMD hardware 
will break badly. In our case the approach of waiting for a short time 
for the process to be reaped and then select another victim actually 
sounds like the right thing to do.

What we also already try to do is to abort hardware operations with the 
address space when we detect that the process is dying, but that can 
certainly be improved.

Regards,
Christian.

Am 22.06.2018 um 17:02 schrieb Michal Hocko:
> From: Michal Hocko <mhocko@suse.com>
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks. Currently we simply back off and mark an
> oom victim with blockable mmu notifiers as done after a short sleep.
> That can result in selecting a new oom victim prematurely because the
> previous one still hasn't torn its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover most notifiers only care about a portion of the address
> space. This patch handles the first part of the problem.
> __mmu_notifier_invalidate_range_start gets a blockable flag and
> callbacks are not allowed to sleep if the flag is set to false. This is
> achieved by using trylock instead of the sleepable lock for most
> callbacks. I think we can improve that even further because there is
> a common pattern to do a range lookup first and then do something about
> that. The first part can be done without a sleeping lock I presume.
>
> Anyway, what does the oom_reaper do with all that? We do not have to
> fail right away. We simply retry if there is at least one notifier which
> couldn't make any progress. A retry loop is already implemented to wait
> for the mmap_sem and this is basically the same thing.
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
>
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know
>
> Any feedback is highly appreciated.
>
>   arch/x86/kvm/x86.c                      |  7 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
>   drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
>   drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
>   drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
>   drivers/xen/gntdev.c                    | 14 ++++++++---
>   include/linux/kvm_host.h                |  2 +-
>   include/linux/mmu_notifier.h            | 15 +++++++++--
>   mm/hmm.c                                |  7 ++++--
>   mm/mmu_notifier.c                       | 15 ++++++++---
>   mm/oom_kill.c                           | 29 +++++++++++-----------
>   virt/kvm/kvm_main.c                     | 12 ++++++---
>   15 files changed, 137 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..d138a526feff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   		del_object(mo);
>   	spin_unlock(&mn->lock);
>   
> -	if (!list_empty(&cancelled))
> +	/* TODO: can we skip waiting here? */
> +	if (!list_empty(&cancelled) && blockable)
>   		flush_workqueue(mn->wq);
>   }
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..b47e828b725d 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
> @@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	}
>   	
>   	mutex_unlock(&rmn->lock);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..f65f6a29daae 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
>   				      invalidate_range_start_trampoline, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return 0;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..8780560d1623 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..d940568bed87 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..50724d09fe5c 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
>   
> -	mutex_lock(&priv->lock);
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
> +
>   	list_for_each_entry(map, &priv->maps, next) {
>   		unmap_if_in_range(map, start, end);
>   	}
> @@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>   		unmap_if_in_range(map, start, end);
>   	}
>   	mutex_unlock(&priv->lock);
> +
> +	return true;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..e4181063e755 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..369867501bed 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
>   extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..30cc43121da9 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret)
> +				ret = _ret;
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..7e0c6e78ae5c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto out_unlock;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..6f7e709d2944 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end)
>   {
>   }
> @@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:13 ` Christian König
  2018-06-22 15:24   ` Michal Hocko
  2018-06-22 15:24   ` Michal Hocko
  0 siblings, 2 replies; 125+ messages in thread
From: Christian König @ 2018-06-22 15:13 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Michal Hocko, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Hi Michal,

[Adding Felix as well]

Well first of all you have a misconception why at least the AMD graphics 
driver need to be able to sleep in an MMU notifier: We need to sleep 
because we need to wait for hardware operations to finish and *NOT* 
because we need to wait for locks.

I'm not sure if your flag now means that you generally can't sleep in 
MMU notifiers any more, but if that's the case at least AMD hardware 
will break badly. In our case the approach of waiting for a short time 
for the process to be reaped and then select another victim actually 
sounds like the right thing to do.

What we also already try to do is to abort hardware operations with the 
address space when we detect that the process is dying, but that can 
certainly be improved.

Regards,
Christian.

Am 22.06.2018 um 17:02 schrieb Michal Hocko:
> From: Michal Hocko <mhocko@suse.com>
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks. Currently we simply back off and mark an
> oom victim with blockable mmu notifiers as done after a short sleep.
> That can result in selecting a new oom victim prematurely because the
> previous one still hasn't torn its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover most notifiers only care about a portion of the address
> space. This patch handles the first part of the problem.
> __mmu_notifier_invalidate_range_start gets a blockable flag and
> callbacks are not allowed to sleep if the flag is set to false. This is
> achieved by using trylock instead of the sleepable lock for most
> callbacks. I think we can improve that even further because there is
> a common pattern to do a range lookup first and then do something about
> that. The first part can be done without a sleeping lock I presume.
>
> Anyway, what does the oom_reaper do with all that? We do not have to
> fail right away. We simply retry if there is at least one notifier which
> couldn't make any progress. A retry loop is already implemented to wait
> for the mmap_sem and this is basically the same thing.
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim KrA?mA!A?" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian KA?nig" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "JA(C)rA'me Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
>
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know
>
> Any feedback is highly appreciated.
>
>   arch/x86/kvm/x86.c                      |  7 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
>   drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
>   drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
>   drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
>   drivers/xen/gntdev.c                    | 14 ++++++++---
>   include/linux/kvm_host.h                |  2 +-
>   include/linux/mmu_notifier.h            | 15 +++++++++--
>   mm/hmm.c                                |  7 ++++--
>   mm/mmu_notifier.c                       | 15 ++++++++---
>   mm/oom_kill.c                           | 29 +++++++++++-----------
>   virt/kvm/kvm_main.c                     | 12 ++++++---
>   15 files changed, 137 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..d138a526feff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   		del_object(mo);
>   	spin_unlock(&mn->lock);
>   
> -	if (!list_empty(&cancelled))
> +	/* TODO: can we skip waiting here? */
> +	if (!list_empty(&cancelled) && blockable)
>   		flush_workqueue(mn->wq);
>   }
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..b47e828b725d 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
> @@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	}
>   	
>   	mutex_unlock(&rmn->lock);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..f65f6a29daae 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
>   				      invalidate_range_start_trampoline, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return 0;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..8780560d1623 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..d940568bed87 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..50724d09fe5c 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
>   
> -	mutex_lock(&priv->lock);
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
> +
>   	list_for_each_entry(map, &priv->maps, next) {
>   		unmap_if_in_range(map, start, end);
>   	}
> @@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>   		unmap_if_in_range(map, start, end);
>   	}
>   	mutex_unlock(&priv->lock);
> +
> +	return true;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..e4181063e755 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..369867501bed 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
>   extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..30cc43121da9 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret)
> +				ret = _ret;
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..7e0c6e78ae5c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto out_unlock;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..6f7e709d2944 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end)
>   {
>   }
> @@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
  2018-06-22 15:06 ` ✗ Fi.CI.BAT: failure for " Patchwork
  2018-06-22 15:13 ` [RFC PATCH] " Christian König
@ 2018-06-22 15:13 ` Christian König
  2018-06-22 15:36 ` [Intel-gfx] " Chris Wilson
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-06-22 15:13 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Michal Hocko, kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross

Hi Michal,

[Adding Felix as well]

Well first of all you have a misconception why at least the AMD graphics 
driver need to be able to sleep in an MMU notifier: We need to sleep 
because we need to wait for hardware operations to finish and *NOT* 
because we need to wait for locks.

I'm not sure if your flag now means that you generally can't sleep in 
MMU notifiers any more, but if that's the case at least AMD hardware 
will break badly. In our case the approach of waiting for a short time 
for the process to be reaped and then select another victim actually 
sounds like the right thing to do.

What we also already try to do is to abort hardware operations with the 
address space when we detect that the process is dying, but that can 
certainly be improved.

Regards,
Christian.

Am 22.06.2018 um 17:02 schrieb Michal Hocko:
> From: Michal Hocko <mhocko@suse.com>
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks. Currently we simply back off and mark an
> oom victim with blockable mmu notifiers as done after a short sleep.
> That can result in selecting a new oom victim prematurely because the
> previous one still hasn't torn its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover most notifiers only care about a portion of the address
> space. This patch handles the first part of the problem.
> __mmu_notifier_invalidate_range_start gets a blockable flag and
> callbacks are not allowed to sleep if the flag is set to false. This is
> achieved by using trylock instead of the sleepable lock for most
> callbacks. I think we can improve that even further because there is
> a common pattern to do a range lookup first and then do something about
> that. The first part can be done without a sleeping lock I presume.
>
> Anyway, what does the oom_reaper do with all that? We do not have to
> fail right away. We simply retry if there is at least one notifier which
> couldn't make any progress. A retry loop is already implemented to wait
> for the mmap_sem and this is basically the same thing.
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
>
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know
>
> Any feedback is highly appreciated.
>
>   arch/x86/kvm/x86.c                      |  7 ++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 33 +++++++++++++++++++------
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 10 +++++---
>   drivers/gpu/drm/radeon/radeon_mn.c      | 15 ++++++++---
>   drivers/infiniband/core/umem_odp.c      | 15 ++++++++---
>   drivers/infiniband/hw/hfi1/mmu_rb.c     |  7 ++++--
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++++--
>   drivers/xen/gntdev.c                    | 14 ++++++++---
>   include/linux/kvm_host.h                |  2 +-
>   include/linux/mmu_notifier.h            | 15 +++++++++--
>   mm/hmm.c                                |  7 ++++--
>   mm/mmu_notifier.c                       | 15 ++++++++---
>   mm/oom_kill.c                           | 29 +++++++++++-----------
>   virt/kvm/kvm_main.c                     | 12 ++++++---
>   15 files changed, 137 insertions(+), 58 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..d138a526feff 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   		del_object(mo);
>   	spin_unlock(&mn->lock);
>   
> -	if (!list_empty(&cancelled))
> +	/* TODO: can we skip waiting here? */
> +	if (!list_empty(&cancelled) && blockable)
>   		flush_workqueue(mn->wq);
>   }
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..b47e828b725d 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
> @@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   	}
>   	
>   	mutex_unlock(&rmn->lock);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..f65f6a29daae 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
>   				      invalidate_range_start_trampoline, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return 0;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..8780560d1623 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..d940568bed87 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..50724d09fe5c 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
>   
> -	mutex_lock(&priv->lock);
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
> +
>   	list_for_each_entry(map, &priv->maps, next) {
>   		unmap_if_in_range(map, start, end);
>   	}
> @@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
>   		unmap_if_in_range(map, start, end);
>   	}
>   	mutex_unlock(&priv->lock);
> +
> +	return true;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..e4181063e755 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..369867501bed 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
>   extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..30cc43121da9 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret)
> +				ret = _ret;
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..7e0c6e78ae5c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto out_unlock;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..6f7e709d2944 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   		unsigned long start, unsigned long end)
>   {
>   }
> @@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:13 ` [RFC PATCH] " Christian König
  2018-06-22 15:24   ` Michal Hocko
@ 2018-06-22 15:24   ` Michal Hocko
  2018-06-22 20:09     ` Felix Kuehling
                       ` (2 more replies)
  1 sibling, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:24 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro, LKML

On Fri 22-06-18 17:13:02, Christian König wrote:
> Hi Michal,
> 
> [Adding Felix as well]
> 
> Well first of all you have a misconception why at least the AMD graphics
> driver need to be able to sleep in an MMU notifier: We need to sleep because
> we need to wait for hardware operations to finish and *NOT* because we need
> to wait for locks.
> 
> I'm not sure if your flag now means that you generally can't sleep in MMU
> notifiers any more, but if that's the case at least AMD hardware will break
> badly. In our case the approach of waiting for a short time for the process
> to be reaped and then select another victim actually sounds like the right
> thing to do.

Well, I do not need to make the notifier code non blocking all the time.
All I need is to ensure that it won't sleep if the flag says so and
return -EAGAIN instead.

So here is what I do for amdgpu:

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 83e344fbb50a..d138a526feff 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >    *
> >    * Take the rmn read side lock.
> >    */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
> >   {
> > -	mutex_lock(&rmn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&rmn->read_lock);
> > +	else if (!mutex_trylock(&rmn->read_lock))
> > +		return -EAGAIN;
> > +
> >   	if (atomic_inc_return(&rmn->recursion) == 1)
> >   		down_read_non_owner(&rmn->lock);
> >   	mutex_unlock(&rmn->read_lock);
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> >    * We block for all BOs between start and end to be idle and
> >    * unmap them by move them into system domain again.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	/* TODO we should be able to split locking for interval tree and
> > +	 * amdgpu_mn_invalidate_node
> > +	 */
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   		amdgpu_mn_invalidate_node(node, start, end);
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >    * necessitates evicting all user-mode queues of the process. The BOs
> >    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   				amdgpu_amdkfd_evict_userptr(mem, mm);
> >   		}
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
-- 
Michal Hocko
SUSE Labs
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:24   ` Michal Hocko
  2018-06-22 20:09     ` Felix Kuehling
                       ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:24 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Fri 22-06-18 17:13:02, Christian König wrote:
> Hi Michal,
> 
> [Adding Felix as well]
> 
> Well first of all you have a misconception why at least the AMD graphics
> driver need to be able to sleep in an MMU notifier: We need to sleep because
> we need to wait for hardware operations to finish and *NOT* because we need
> to wait for locks.
> 
> I'm not sure if your flag now means that you generally can't sleep in MMU
> notifiers any more, but if that's the case at least AMD hardware will break
> badly. In our case the approach of waiting for a short time for the process
> to be reaped and then select another victim actually sounds like the right
> thing to do.

Well, I do not need to make the notifier code non blocking all the time.
All I need is to ensure that it won't sleep if the flag says so and
return -EAGAIN instead.

So here is what I do for amdgpu:

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 83e344fbb50a..d138a526feff 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >    *
> >    * Take the rmn read side lock.
> >    */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
> >   {
> > -	mutex_lock(&rmn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&rmn->read_lock);
> > +	else if (!mutex_trylock(&rmn->read_lock))
> > +		return -EAGAIN;
> > +
> >   	if (atomic_inc_return(&rmn->recursion) == 1)
> >   		down_read_non_owner(&rmn->lock);
> >   	mutex_unlock(&rmn->read_lock);
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> >    * We block for all BOs between start and end to be idle and
> >    * unmap them by move them into system domain again.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	/* TODO we should be able to split locking for interval tree and
> > +	 * amdgpu_mn_invalidate_node
> > +	 */
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   		amdgpu_mn_invalidate_node(node, start, end);
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >    * necessitates evicting all user-mode queues of the process. The BOs
> >    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   				amdgpu_amdkfd_evict_userptr(mem, mm);
> >   		}
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:24   ` Michal Hocko
  2018-06-22 20:09     ` Felix Kuehling
                       ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:24 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Fri 22-06-18 17:13:02, Christian Konig wrote:
> Hi Michal,
> 
> [Adding Felix as well]
> 
> Well first of all you have a misconception why at least the AMD graphics
> driver need to be able to sleep in an MMU notifier: We need to sleep because
> we need to wait for hardware operations to finish and *NOT* because we need
> to wait for locks.
> 
> I'm not sure if your flag now means that you generally can't sleep in MMU
> notifiers any more, but if that's the case at least AMD hardware will break
> badly. In our case the approach of waiting for a short time for the process
> to be reaped and then select another victim actually sounds like the right
> thing to do.

Well, I do not need to make the notifier code non blocking all the time.
All I need is to ensure that it won't sleep if the flag says so and
return -EAGAIN instead.

So here is what I do for amdgpu:

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 83e344fbb50a..d138a526feff 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >    *
> >    * Take the rmn read side lock.
> >    */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
> >   {
> > -	mutex_lock(&rmn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&rmn->read_lock);
> > +	else if (!mutex_trylock(&rmn->read_lock))
> > +		return -EAGAIN;
> > +
> >   	if (atomic_inc_return(&rmn->recursion) == 1)
> >   		down_read_non_owner(&rmn->lock);
> >   	mutex_unlock(&rmn->read_lock);
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> >    * We block for all BOs between start and end to be idle and
> >    * unmap them by move them into system domain again.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	/* TODO we should be able to split locking for interval tree and
> > +	 * amdgpu_mn_invalidate_node
> > +	 */
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   		amdgpu_mn_invalidate_node(node, start, end);
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >    * necessitates evicting all user-mode queues of the process. The BOs
> >    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   				amdgpu_amdkfd_evict_userptr(mem, mm);
> >   		}
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:13 ` [RFC PATCH] " Christian König
@ 2018-06-22 15:24   ` Michal Hocko
  2018-06-22 15:24   ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:24 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Fri 22-06-18 17:13:02, Christian König wrote:
> Hi Michal,
> 
> [Adding Felix as well]
> 
> Well first of all you have a misconception why at least the AMD graphics
> driver need to be able to sleep in an MMU notifier: We need to sleep because
> we need to wait for hardware operations to finish and *NOT* because we need
> to wait for locks.
> 
> I'm not sure if your flag now means that you generally can't sleep in MMU
> notifiers any more, but if that's the case at least AMD hardware will break
> badly. In our case the approach of waiting for a short time for the process
> to be reaped and then select another victim actually sounds like the right
> thing to do.

Well, I do not need to make the notifier code non blocking all the time.
All I need is to ensure that it won't sleep if the flag says so and
return -EAGAIN instead.

So here is what I do for amdgpu:

> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > index 83e344fbb50a..d138a526feff 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> > @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> >    *
> >    * Take the rmn read side lock.
> >    */
> > -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> > +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
> >   {
> > -	mutex_lock(&rmn->read_lock);
> > +	if (blockable)
> > +		mutex_lock(&rmn->read_lock);
> > +	else if (!mutex_trylock(&rmn->read_lock))
> > +		return -EAGAIN;
> > +
> >   	if (atomic_inc_return(&rmn->recursion) == 1)
> >   		down_read_non_owner(&rmn->lock);
> >   	mutex_unlock(&rmn->read_lock);
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> >    * We block for all BOs between start and end to be idle and
> >    * unmap them by move them into system domain again.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	/* TODO we should be able to split locking for interval tree and
> > +	 * amdgpu_mn_invalidate_node
> > +	 */
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >   		amdgpu_mn_invalidate_node(node, start, end);
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
> > @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> >    * necessitates evicting all user-mode queues of the process. The BOs
> >    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> >    */
> > -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> > +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   						 struct mm_struct *mm,
> >   						 unsigned long start,
> > -						 unsigned long end)
> > +						 unsigned long end,
> > +						 bool blockable)
> >   {
> >   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
> >   	struct interval_tree_node *it;
> > @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   	/* notification is exclusive, but interval is inclusive */
> >   	end -= 1;
> > -	amdgpu_mn_read_lock(rmn);
> > +	if (amdgpu_mn_read_lock(rmn, blockable))
> > +		return -EAGAIN;
> >   	it = interval_tree_iter_first(&rmn->objects, start, end);
> >   	while (it) {
> > @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> >   				amdgpu_amdkfd_evict_userptr(mem, mm);
> >   		}
> >   	}
> > +
> > +	return 0;
> >   }
> >   /**
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found] ` <20180622150242.16558-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-06-22 15:36   ` Chris Wilson
  2018-06-27  7:44 ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Chris Wilson @ 2018-06-22 15:36 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	Jason Gunthorpe,
	Michal Hocko  <mhocko-IBi9RG/b67k@public.gmane.org>,
	kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	  " Radim Krčmář
	<rkrcmar-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>,
	 David Airlie, intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Sudeep Dutt, dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Doug Ledford,
	" Jérôme Glisse,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, David Rientjes,
	Rodrigo-CC+yJ3UmIYqDUpFQwHEjaQ,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Quoting Michal Hocko (2018-06-22 16:02:42)
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
> 
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know

> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>         mo->attached = false;
>  }
>  
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>                                                        struct mm_struct *mm,
>                                                        unsigned long start,
> -                                                      unsigned long end)
> +                                                      unsigned long end,
> +                                                      bool blockable)
>  {
>         struct i915_mmu_notifier *mn =
>                 container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>         LIST_HEAD(cancelled);
>  
>         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -               return;
> +               return 0;

The principle wait here is for the HW (even after fixing all the locks
to be not so coarse, we still have to wait for the HW to finish its
access). The first pass would be then to not do anything here if
!blockable.

Jerome keeps on shaking his head and telling us we're doing it all
wrong, so maybe it'll all fall out of HMM before we have to figure out
how to differentiate between objects that can be invalidated immediately
and those that need to acquire locks and/or wait.
-Chris
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:36   ` Chris Wilson
  0 siblings, 0 replies; 125+ messages in thread
From: Chris Wilson @ 2018-06-22 15:36 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Michal Hocko  <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	 David Airlie, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx,
	" Jérôme Glisse, Rodrigo, Vivi,
	 <rodrigo.vivi@intel.com>,  ,
	Boris, Ostrovsky,  <boris.ostrovsky@oracle.com>,  ,
	Juergen, Gross,  <jgross@suse.com>,  ,
	Mike, Marciniszyn,  <mike.marciniszyn@intel.com>,  ,
	Dennis, Dalessandro,  <dennis.dalessandro@intel.com>,  ,
	Ashutosh, Dixit,  <ashutosh.dixit@intel.com>,  ,
	Alex, Deucher,  <alexander.deucher@amd.com>,  ,
	Paolo, Bonzini,  <pbonzini@redhat.com>,
	 =?utf-8?q?=22_Christian_K=C3=B6nig?=
	<christian.koenig@amd.com>

Quoting Michal Hocko (2018-06-22 16:02:42)
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
> 
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know

> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>         mo->attached = false;
>  }
>  
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>                                                        struct mm_struct *mm,
>                                                        unsigned long start,
> -                                                      unsigned long end)
> +                                                      unsigned long end,
> +                                                      bool blockable)
>  {
>         struct i915_mmu_notifier *mn =
>                 container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>         LIST_HEAD(cancelled);
>  
>         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -               return;
> +               return 0;

The principle wait here is for the HW (even after fixing all the locks
to be not so coarse, we still have to wait for the HW to finish its
access). The first pass would be then to not do anything here if
!blockable.

Jerome keeps on shaking his head and telling us we're doing it all
wrong, so maybe it'll all fall out of HMM before we have to figure out
how to differentiate between objects that can be invalidated immediately
and those that need to acquire locks and/or wait.
-Chris

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (2 preceding siblings ...)
  2018-06-22 15:13 ` Christian König
@ 2018-06-22 15:36 ` Chris Wilson
  2018-06-22 15:36   ` [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Chris Wilson
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Chris Wilson @ 2018-06-22 15:36 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	Jason Gunthorpe, Michal Hocko  <mhocko@suse.com>,
	kvm@vger.kernel.org,
	" Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, intel-gfx, Sudeep Dutt, dri-devel, linux-mm,
	Doug Ledford, " Jérôme Glisse, amd-gfx,
	David Rientjes, xen-devel, linux-rdma

Quoting Michal Hocko (2018-06-22 16:02:42)
> Hi,
> this is an RFC and not tested at all. I am not very familiar with the
> mmu notifiers semantics very much so this is a crude attempt to achieve
> what I need basically. It might be completely wrong but I would like
> to discuss what would be a better way if that is the case.
> 
> get_maintainers gave me quite large list of people to CC so I had to trim
> it down. If you think I have forgot somebody, please let me know

> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..5285df9331fa 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>         mo->attached = false;
>  }
>  
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>                                                        struct mm_struct *mm,
>                                                        unsigned long start,
> -                                                      unsigned long end)
> +                                                      unsigned long end,
> +                                                      bool blockable)
>  {
>         struct i915_mmu_notifier *mn =
>                 container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>         LIST_HEAD(cancelled);
>  
>         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -               return;
> +               return 0;

The principle wait here is for the HW (even after fixing all the locks
to be not so coarse, we still have to wait for the HW to finish its
access). The first pass would be then to not do anything here if
!blockable.

Jerome keeps on shaking his head and telling us we're doing it all
wrong, so maybe it'll all fall out of HMM before we have to figure out
how to differentiate between objects that can be invalidated immediately
and those that need to acquire locks and/or wait.
-Chris

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:36   ` [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Chris Wilson
  (?)
@ 2018-06-22 15:57   ` Michal Hocko
  2018-06-22 16:18     ` Jerome Glisse
                       ` (2 more replies)
  -1 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:57 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Rodrigo, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, Deucher, linux-mm, Mike,
	Vivi, Juergen, Andrea Arcangeli, Dimitri Sivanich, Paolo, Dennis,
	linux-rdma, amd-gfx, Boris, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, Ashutosh, Marciniszyn, Alex,
	intel-gfx, Dalessandro, " Jérôme Glisse, Ostrovsky

On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:02:42)
> > Hi,
> > this is an RFC and not tested at all. I am not very familiar with the
> > mmu notifiers semantics very much so this is a crude attempt to achieve
> > what I need basically. It might be completely wrong but I would like
> > to discuss what would be a better way if that is the case.
> > 
> > get_maintainers gave me quite large list of people to CC so I had to trim
> > it down. If you think I have forgot somebody, please let me know
> 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 854bd51b9478..5285df9331fa 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> >         mo->attached = false;
> >  }
> >  
> > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >                                                        struct mm_struct *mm,
> >                                                        unsigned long start,
> > -                                                      unsigned long end)
> > +                                                      unsigned long end,
> > +                                                      bool blockable)
> >  {
> >         struct i915_mmu_notifier *mn =
> >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >         LIST_HEAD(cancelled);
> >  
> >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > -               return;
> > +               return 0;
> 
> The principle wait here is for the HW (even after fixing all the locks
> to be not so coarse, we still have to wait for the HW to finish its
> access).

Is this wait bound or it can take basically arbitrary amount of time?

> The first pass would be then to not do anything here if
> !blockable.

something like this? (incremental diff)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 5285df9331fa..e9ed0d2cfabc 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -122,6 +122,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		container_of(_mn, struct i915_mmu_notifier, mn);
 	struct i915_mmu_object *mo;
 	struct interval_tree_node *it;
+	int ret = 0;
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
@@ -133,6 +134,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,8 +159,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_unlock(&mn->lock);
 
 	/* TODO: can we skip waiting here? */
-	if (!list_empty(&cancelled) && blockable)
+	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
-- 
Michal Hocko
SUSE Labs
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 15:57   ` Michal Hocko
  2018-06-22 16:18     ` Jerome Glisse
                       ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:57 UTC (permalink / raw)
  To: Chris Wilson
  Cc: LKML, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, " Jérôme Glisse, Rodrigo, Vivi, Boris,
	Ostrovsky, Juergen, Gross, Mike, Marciniszyn, Dennis,
	Dalessandro, Ashutosh, Dixit, Alex, Deucher, Paolo, Bonzini

On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:02:42)
> > Hi,
> > this is an RFC and not tested at all. I am not very familiar with the
> > mmu notifiers semantics very much so this is a crude attempt to achieve
> > what I need basically. It might be completely wrong but I would like
> > to discuss what would be a better way if that is the case.
> > 
> > get_maintainers gave me quite large list of people to CC so I had to trim
> > it down. If you think I have forgot somebody, please let me know
> 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 854bd51b9478..5285df9331fa 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> >         mo->attached = false;
> >  }
> >  
> > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >                                                        struct mm_struct *mm,
> >                                                        unsigned long start,
> > -                                                      unsigned long end)
> > +                                                      unsigned long end,
> > +                                                      bool blockable)
> >  {
> >         struct i915_mmu_notifier *mn =
> >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >         LIST_HEAD(cancelled);
> >  
> >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > -               return;
> > +               return 0;
> 
> The principle wait here is for the HW (even after fixing all the locks
> to be not so coarse, we still have to wait for the HW to finish its
> access).

Is this wait bound or it can take basically arbitrary amount of time?

> The first pass would be then to not do anything here if
> !blockable.

something like this? (incremental diff)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 5285df9331fa..e9ed0d2cfabc 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -122,6 +122,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		container_of(_mn, struct i915_mmu_notifier, mn);
 	struct i915_mmu_object *mo;
 	struct interval_tree_node *it;
+	int ret = 0;
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
@@ -133,6 +134,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,8 +159,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_unlock(&mn->lock);
 
 	/* TODO: can we skip waiting here? */
-	if (!list_empty(&cancelled) && blockable)
+	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:36   ` [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Chris Wilson
@ 2018-06-22 15:57   ` Michal Hocko
  -1 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 15:57 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Rodrigo, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, Deucher, linux-mm, Mike,
	Vivi, Juergen, Andrea Arcangeli, David (ChunMing) Zhou,
	Dimitri Sivanich, Paolo, Dennis, linux-rdma, amd-gfx, Boris,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	Ashutosh, Marciniszyn, Alex, intel-gfx, Dalessandro

On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:02:42)
> > Hi,
> > this is an RFC and not tested at all. I am not very familiar with the
> > mmu notifiers semantics very much so this is a crude attempt to achieve
> > what I need basically. It might be completely wrong but I would like
> > to discuss what would be a better way if that is the case.
> > 
> > get_maintainers gave me quite large list of people to CC so I had to trim
> > it down. If you think I have forgot somebody, please let me know
> 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 854bd51b9478..5285df9331fa 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> >         mo->attached = false;
> >  }
> >  
> > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >                                                        struct mm_struct *mm,
> >                                                        unsigned long start,
> > -                                                      unsigned long end)
> > +                                                      unsigned long end,
> > +                                                      bool blockable)
> >  {
> >         struct i915_mmu_notifier *mn =
> >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >         LIST_HEAD(cancelled);
> >  
> >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > -               return;
> > +               return 0;
> 
> The principle wait here is for the HW (even after fixing all the locks
> to be not so coarse, we still have to wait for the HW to finish its
> access).

Is this wait bound or it can take basically arbitrary amount of time?

> The first pass would be then to not do anything here if
> !blockable.

something like this? (incremental diff)

diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 5285df9331fa..e9ed0d2cfabc 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -122,6 +122,7 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 		container_of(_mn, struct i915_mmu_notifier, mn);
 	struct i915_mmu_object *mo;
 	struct interval_tree_node *it;
+	int ret = 0;
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
@@ -133,6 +134,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,8 +159,10 @@ static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_unlock(&mn->lock);
 
 	/* TODO: can we skip waiting here? */
-	if (!list_empty(&cancelled) && blockable)
+	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers (rev2)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (4 preceding siblings ...)
  2018-06-22 15:36   ` [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Chris Wilson
@ 2018-06-22 16:01 ` Patchwork
  2018-06-24  8:11 ` [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Paolo Bonzini
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-22 16:01 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev2)
URL   : https://patchwork.freedesktop.org/series/45263/
State : failure

== Summary ==

Applying: mm, oom: distinguish blockable mode for mmu notifiers
Using index info to reconstruct a base tree...
M	drivers/gpu/drm/i915/i915_gem_userptr.c
Falling back to patching base and 3-way merge...
Auto-merging drivers/gpu/drm/i915/i915_gem_userptr.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/i915/i915_gem_userptr.c
error: Failed to merge in the changes.
Patch failed at 0001 mm, oom: distinguish blockable mode for mmu notifiers
Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:57   ` Michal Hocko
  2018-06-22 16:18     ` Jerome Glisse
@ 2018-06-22 16:18     ` Jerome Glisse
       [not found]       ` <20180622164026.GA23674@dhcp22.suse.cz>
       [not found]     ` <152968364170.11773.4392861266443293819@mail.alporthouse.com>
  2 siblings, 1 reply; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rodrigo, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Mike, Vivi,
	Juergen, Andrea Arcangeli, Dimitri Sivanich, Paolo, Dennis,
	linux-rdma, amd-gfx, Boris, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, Ashutosh, Marciniszyn, Alex,
	intel-gfx, Dalessandro, Deucher, Ostrovsky, Bonzini, LKML

On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > Hi,
> > > this is an RFC and not tested at all. I am not very familiar with the
> > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > what I need basically. It might be completely wrong but I would like
> > > to discuss what would be a better way if that is the case.
> > > 
> > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > it down. If you think I have forgot somebody, please let me know
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index 854bd51b9478..5285df9331fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > >         mo->attached = false;
> > >  }
> > >  
> > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >                                                        struct mm_struct *mm,
> > >                                                        unsigned long start,
> > > -                                                      unsigned long end)
> > > +                                                      unsigned long end,
> > > +                                                      bool blockable)
> > >  {
> > >         struct i915_mmu_notifier *mn =
> > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >         LIST_HEAD(cancelled);
> > >  
> > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > -               return;
> > > +               return 0;
> > 
> > The principle wait here is for the HW (even after fixing all the locks
> > to be not so coarse, we still have to wait for the HW to finish its
> > access).
> 
> Is this wait bound or it can take basically arbitrary amount of time?

Arbitrary amount of time but in desktop use case you can assume that
it should never go above 16ms for a 60frame per second rendering of
your desktop (in GPU compute case this kind of assumption does not
hold). Is the process exit_state already updated by the time this mmu
notifier callbacks happen ?

> 
> > The first pass would be then to not do anything here if
> > !blockable.
> 
> something like this? (incremental diff)

What i wanted to do with HMM and mmu notifier is split the invalidation
in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
depends on the range and invalidate internal driver states (like clear
buffer object pages array in case of GPU but not GPU page table). While
the second callback would do the actual wait on the GPU to be done and
update the GPU page table.

Now in this scheme in case the task is already in some exit state and
that all CPU threads are frozen/kill then we can probably find a way to
do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
share userptr bo hence a uptr bo should only ever be access through
ioctl submited by the process.

The second call can then be delayed and ping from time to time to see
if GPU jobs are done.


Note that what you propose might still be useful as in case there is
no buffer object for a range then OOM can make progress in freeing a
range of memory. It is very likely that significant virtual address
range of a process and backing memory can be reclaim that way. This
assume OOM reclaim vma by vma or in some form of granularity like
reclaiming 1GB by 1GB. Or we could also update blocking callback to
return range that are blocking that way OOM can reclaim around.

Cheers,
Jérôme
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 16:18     ` Jerome Glisse
       [not found]       ` <20180622164026.GA23674@dhcp22.suse.cz>
  0 siblings, 1 reply; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Wilson, LKML, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Rodrigo, Vivi, Boris, Ostrovsky, Juergen, Gross, Mike,
	Marciniszyn, Dennis, Dalessandro, Ashutosh, Dixit, Alex, Deucher,
	Paolo, Bonzini

On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > Hi,
> > > this is an RFC and not tested at all. I am not very familiar with the
> > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > what I need basically. It might be completely wrong but I would like
> > > to discuss what would be a better way if that is the case.
> > > 
> > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > it down. If you think I have forgot somebody, please let me know
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index 854bd51b9478..5285df9331fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > >         mo->attached = false;
> > >  }
> > >  
> > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >                                                        struct mm_struct *mm,
> > >                                                        unsigned long start,
> > > -                                                      unsigned long end)
> > > +                                                      unsigned long end,
> > > +                                                      bool blockable)
> > >  {
> > >         struct i915_mmu_notifier *mn =
> > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >         LIST_HEAD(cancelled);
> > >  
> > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > -               return;
> > > +               return 0;
> > 
> > The principle wait here is for the HW (even after fixing all the locks
> > to be not so coarse, we still have to wait for the HW to finish its
> > access).
> 
> Is this wait bound or it can take basically arbitrary amount of time?

Arbitrary amount of time but in desktop use case you can assume that
it should never go above 16ms for a 60frame per second rendering of
your desktop (in GPU compute case this kind of assumption does not
hold). Is the process exit_state already updated by the time this mmu
notifier callbacks happen ?

> 
> > The first pass would be then to not do anything here if
> > !blockable.
> 
> something like this? (incremental diff)

What i wanted to do with HMM and mmu notifier is split the invalidation
in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
depends on the range and invalidate internal driver states (like clear
buffer object pages array in case of GPU but not GPU page table). While
the second callback would do the actual wait on the GPU to be done and
update the GPU page table.

Now in this scheme in case the task is already in some exit state and
that all CPU threads are frozen/kill then we can probably find a way to
do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
share userptr bo hence a uptr bo should only ever be access through
ioctl submited by the process.

The second call can then be delayed and ping from time to time to see
if GPU jobs are done.


Note that what you propose might still be useful as in case there is
no buffer object for a range then OOM can make progress in freeing a
range of memory. It is very likely that significant virtual address
range of a process and backing memory can be reclaim that way. This
assume OOM reclaim vma by vma or in some form of granularity like
reclaiming 1GB by 1GB. Or we could also update blocking callback to
return range that are blocking that way OOM can reclaim around.

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 16:18     ` Jerome Glisse
       [not found]       ` <20180622164026.GA23674@dhcp22.suse.cz>
  0 siblings, 1 reply; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Chris Wilson, LKML, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Rodrigo, Vivi, Boris, Ostrovsky, Juergen, Gross, Mike,
	Marciniszyn, Dennis, Dalessandro, Ashutosh, Dixit, Alex, Deucher,
	Paolo, Bonzini

On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > Hi,
> > > this is an RFC and not tested at all. I am not very familiar with the
> > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > what I need basically. It might be completely wrong but I would like
> > > to discuss what would be a better way if that is the case.
> > > 
> > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > it down. If you think I have forgot somebody, please let me know
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index 854bd51b9478..5285df9331fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > >         mo->attached = false;
> > >  }
> > >  
> > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >                                                        struct mm_struct *mm,
> > >                                                        unsigned long start,
> > > -                                                      unsigned long end)
> > > +                                                      unsigned long end,
> > > +                                                      bool blockable)
> > >  {
> > >         struct i915_mmu_notifier *mn =
> > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >         LIST_HEAD(cancelled);
> > >  
> > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > -               return;
> > > +               return 0;
> > 
> > The principle wait here is for the HW (even after fixing all the locks
> > to be not so coarse, we still have to wait for the HW to finish its
> > access).
> 
> Is this wait bound or it can take basically arbitrary amount of time?

Arbitrary amount of time but in desktop use case you can assume that
it should never go above 16ms for a 60frame per second rendering of
your desktop (in GPU compute case this kind of assumption does not
hold). Is the process exit_state already updated by the time this mmu
notifier callbacks happen ?

> 
> > The first pass would be then to not do anything here if
> > !blockable.
> 
> something like this? (incremental diff)

What i wanted to do with HMM and mmu notifier is split the invalidation
in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
depends on the range and invalidate internal driver states (like clear
buffer object pages array in case of GPU but not GPU page table). While
the second callback would do the actual wait on the GPU to be done and
update the GPU page table.

Now in this scheme in case the task is already in some exit state and
that all CPU threads are frozen/kill then we can probably find a way to
do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
share userptr bo hence a uptr bo should only ever be access through
ioctl submited by the process.

The second call can then be delayed and ping from time to time to see
if GPU jobs are done.


Note that what you propose might still be useful as in case there is
no buffer object for a range then OOM can make progress in freeing a
range of memory. It is very likely that significant virtual address
range of a process and backing memory can be reclaim that way. This
assume OOM reclaim vma by vma or in some form of granularity like
reclaiming 1GB by 1GB. Or we could also update blocking callback to
return range that are blocking that way OOM can reclaim around.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:57   ` Michal Hocko
@ 2018-06-22 16:18     ` Jerome Glisse
  2018-06-22 16:18     ` Jerome Glisse
       [not found]     ` <152968364170.11773.4392861266443293819@mail.alporthouse.com>
  2 siblings, 0 replies; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 16:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Rodrigo, Michal Hocko <mhocko@suse.com>,
	kvm@vger.kernel.org,
	 " Radim Krčmář <rkrcmar@redhat.com>,
	David Airlie, Sudeep Dutt, dri-devel, Chris Wilson, linux-mm,
	Mike, Vivi, Juergen, Andrea Arcangeli, David (ChunMing) Zhou,
	Dimitri Sivanich, Paolo, Dennis, linux-rdma, amd-gfx, Boris,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	Ashutosh, Marciniszyn, Alex, intel-gfx, Dalessandro

On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > Hi,
> > > this is an RFC and not tested at all. I am not very familiar with the
> > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > what I need basically. It might be completely wrong but I would like
> > > to discuss what would be a better way if that is the case.
> > > 
> > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > it down. If you think I have forgot somebody, please let me know
> > 
> > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > index 854bd51b9478..5285df9331fa 100644
> > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > >         mo->attached = false;
> > >  }
> > >  
> > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >                                                        struct mm_struct *mm,
> > >                                                        unsigned long start,
> > > -                                                      unsigned long end)
> > > +                                                      unsigned long end,
> > > +                                                      bool blockable)
> > >  {
> > >         struct i915_mmu_notifier *mn =
> > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > >         LIST_HEAD(cancelled);
> > >  
> > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > -               return;
> > > +               return 0;
> > 
> > The principle wait here is for the HW (even after fixing all the locks
> > to be not so coarse, we still have to wait for the HW to finish its
> > access).
> 
> Is this wait bound or it can take basically arbitrary amount of time?

Arbitrary amount of time but in desktop use case you can assume that
it should never go above 16ms for a 60frame per second rendering of
your desktop (in GPU compute case this kind of assumption does not
hold). Is the process exit_state already updated by the time this mmu
notifier callbacks happen ?

> 
> > The first pass would be then to not do anything here if
> > !blockable.
> 
> something like this? (incremental diff)

What i wanted to do with HMM and mmu notifier is split the invalidation
in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
depends on the range and invalidate internal driver states (like clear
buffer object pages array in case of GPU but not GPU page table). While
the second callback would do the actual wait on the GPU to be done and
update the GPU page table.

Now in this scheme in case the task is already in some exit state and
that all CPU threads are frozen/kill then we can probably find a way to
do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
share userptr bo hence a uptr bo should only ever be access through
ioctl submited by the process.

The second call can then be delayed and ping from time to time to see
if GPU jobs are done.


Note that what you propose might still be useful as in case there is
no buffer object for a range then OOM can make progress in freeing a
range of memory. It is very likely that significant virtual address
range of a process and backing memory can be reclaim that way. This
assume OOM reclaim vma by vma or in some form of granularity like
reclaiming 1GB by 1GB. Or we could also update blocking callback to
return range that are blocking that way OOM can reclaim around.

Cheers,
Jérôme

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]     ` <152968364170.11773.4392861266443293819@mail.alporthouse.com>
@ 2018-06-22 16:19       ` Michal Hocko
  2018-06-22 16:19       ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:19 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Michal Hocko, kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	Dimitri Sivanich, linux-rdma, amd-gfx, Jason Gunthorpe,
	Doug Ledford, David Rientjes, xen-devel, intel-gfx,
	=?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
	Rodrigo Vivi, Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro

[Hmm, the cc list got mangled somehow - you have just made many people
to work for suse ;) and to kvack.org in the preious one - fixed up
hopefully]

On Fri 22-06-18 17:07:21, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:57:16)
> > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > Hi,
> > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > what I need basically. It might be completely wrong but I would like
> > > > to discuss what would be a better way if that is the case.
> > > > 
> > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > it down. If you think I have forgot somebody, please let me know
> > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > index 854bd51b9478..5285df9331fa 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > >         mo->attached = false;
> > > >  }
> > > >  
> > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >                                                        struct mm_struct *mm,
> > > >                                                        unsigned long start,
> > > > -                                                      unsigned long end)
> > > > +                                                      unsigned long end,
> > > > +                                                      bool blockable)
> > > >  {
> > > >         struct i915_mmu_notifier *mn =
> > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >         LIST_HEAD(cancelled);
> > > >  
> > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > -               return;
> > > > +               return 0;
> > > 
> > > The principle wait here is for the HW (even after fixing all the locks
> > > to be not so coarse, we still have to wait for the HW to finish its
> > > access).
> > 
> > Is this wait bound or it can take basically arbitrary amount of time?
> 
> Arbitrary. It waits for the last operation in the queue that needs that
> set of backing pages, and that queue is unbounded and not even confined
> to the local driver. (Though each operation should be bounded to be
> completed within an interval or be cancelled, that interval is on the
> order of 10s!)

OK, I see. We should rather not wait that long so backoff is just
better. The whole point of the oom_reaper is to tear down and free some
memory. We do not really need to reclaim all of it.

It would be great if we could do something like - kick the tear down of
the device memory but have it done in the background. We wouldn't tear
the vma down in that case but the whole process would start at least.
I am not sure something like that is possible.
 
> > > The first pass would be then to not do anything here if
> > > !blockable.
> > 
> > something like this? (incremental diff)
> 
> Yup.

Cool, I will start with that because even that is an improvement from
the oom_reaper POV.

Thanks!
-- 
Michal Hocko
SUSE Labs
_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 16:19       ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:19 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Michal Hocko, David (ChunMing) Zhou, Paolo Bonzini,
	=?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	Alex Deucher, =?UTF-8?q?Christian=20K=C3=B6nig?=,
	David Airlie, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	=?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
	Andrea Arcangeli, kvm, amd-gfx, dri-devel, intel-gfx, linux-rdma,
	xen-devel, linux-mm, David Rientjes

[Hmm, the cc list got mangled somehow - you have just made many people
to work for suse ;) and to kvack.org in the preious one - fixed up
hopefully]

On Fri 22-06-18 17:07:21, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:57:16)
> > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > Hi,
> > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > what I need basically. It might be completely wrong but I would like
> > > > to discuss what would be a better way if that is the case.
> > > > 
> > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > it down. If you think I have forgot somebody, please let me know
> > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > index 854bd51b9478..5285df9331fa 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > >         mo->attached = false;
> > > >  }
> > > >  
> > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >                                                        struct mm_struct *mm,
> > > >                                                        unsigned long start,
> > > > -                                                      unsigned long end)
> > > > +                                                      unsigned long end,
> > > > +                                                      bool blockable)
> > > >  {
> > > >         struct i915_mmu_notifier *mn =
> > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >         LIST_HEAD(cancelled);
> > > >  
> > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > -               return;
> > > > +               return 0;
> > > 
> > > The principle wait here is for the HW (even after fixing all the locks
> > > to be not so coarse, we still have to wait for the HW to finish its
> > > access).
> > 
> > Is this wait bound or it can take basically arbitrary amount of time?
> 
> Arbitrary. It waits for the last operation in the queue that needs that
> set of backing pages, and that queue is unbounded and not even confined
> to the local driver. (Though each operation should be bounded to be
> completed within an interval or be cancelled, that interval is on the
> order of 10s!)

OK, I see. We should rather not wait that long so backoff is just
better. The whole point of the oom_reaper is to tear down and free some
memory. We do not really need to reclaim all of it.

It would be great if we could do something like - kick the tear down of
the device memory but have it done in the background. We wouldn't tear
the vma down in that case but the whole process would start at least.
I am not sure something like that is possible.
 
> > > The first pass would be then to not do anything here if
> > > !blockable.
> > 
> > something like this? (incremental diff)
> 
> Yup.

Cool, I will start with that because even that is an improvement from
the oom_reaper POV.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]     ` <152968364170.11773.4392861266443293819@mail.alporthouse.com>
  2018-06-22 16:19       ` Michal Hocko
@ 2018-06-22 16:19       ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:19 UTC (permalink / raw)
  To: Chris Wilson
  Cc: Michal Hocko, kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	=?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
	Rodrigo Vivi

[Hmm, the cc list got mangled somehow - you have just made many people
to work for suse ;) and to kvack.org in the preious one - fixed up
hopefully]

On Fri 22-06-18 17:07:21, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:57:16)
> > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > Hi,
> > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > what I need basically. It might be completely wrong but I would like
> > > > to discuss what would be a better way if that is the case.
> > > > 
> > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > it down. If you think I have forgot somebody, please let me know
> > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > index 854bd51b9478..5285df9331fa 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > >         mo->attached = false;
> > > >  }
> > > >  
> > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >                                                        struct mm_struct *mm,
> > > >                                                        unsigned long start,
> > > > -                                                      unsigned long end)
> > > > +                                                      unsigned long end,
> > > > +                                                      bool blockable)
> > > >  {
> > > >         struct i915_mmu_notifier *mn =
> > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > >         LIST_HEAD(cancelled);
> > > >  
> > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > -               return;
> > > > +               return 0;
> > > 
> > > The principle wait here is for the HW (even after fixing all the locks
> > > to be not so coarse, we still have to wait for the HW to finish its
> > > access).
> > 
> > Is this wait bound or it can take basically arbitrary amount of time?
> 
> Arbitrary. It waits for the last operation in the queue that needs that
> set of backing pages, and that queue is unbounded and not even confined
> to the local driver. (Though each operation should be bounded to be
> completed within an interval or be cancelled, that interval is on the
> order of 10s!)

OK, I see. We should rather not wait that long so backoff is just
better. The whole point of the oom_reaper is to tear down and free some
memory. We do not really need to reclaim all of it.

It would be great if we could do something like - kick the tear down of
the device memory but have it done in the background. We wouldn't tear
the vma down in that case but the whole process would start at least.
I am not sure something like that is possible.
 
> > > The first pass would be then to not do anything here if
> > > !blockable.
> > 
> > something like this? (incremental diff)
> 
> Yup.

Cool, I will start with that because even that is an improvement from
the oom_reaper POV.

Thanks!
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:36   ` [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Chris Wilson
  (?)
  (?)
@ 2018-06-22 16:25   ` Jerome Glisse
  -1 siblings, 0 replies; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 16:25 UTC (permalink / raw)
  To: Chris Wilson; +Cc: Michal Hocko, LKML

On Fri, Jun 22, 2018 at 04:36:49PM +0100, Chris Wilson wrote:
> Quoting Michal Hocko (2018-06-22 16:02:42)
> > Hi,
> > this is an RFC and not tested at all. I am not very familiar with the
> > mmu notifiers semantics very much so this is a crude attempt to achieve
> > what I need basically. It might be completely wrong but I would like
> > to discuss what would be a better way if that is the case.
> > 
> > get_maintainers gave me quite large list of people to CC so I had to trim
> > it down. If you think I have forgot somebody, please let me know
> 
> > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > index 854bd51b9478..5285df9331fa 100644
> > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> >         mo->attached = false;
> >  }
> >  
> > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >                                                        struct mm_struct *mm,
> >                                                        unsigned long start,
> > -                                                      unsigned long end)
> > +                                                      unsigned long end,
> > +                                                      bool blockable)
> >  {
> >         struct i915_mmu_notifier *mn =
> >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> >         LIST_HEAD(cancelled);
> >  
> >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > -               return;
> > +               return 0;
> 
> The principle wait here is for the HW (even after fixing all the locks
> to be not so coarse, we still have to wait for the HW to finish its
> access). The first pass would be then to not do anything here if
> !blockable.
> 
> Jerome keeps on shaking his head and telling us we're doing it all
> wrong, so maybe it'll all fall out of HMM before we have to figure out
> how to differentiate between objects that can be invalidated immediately
> and those that need to acquire locks and/or wait.

Intel and AMD are doing it right nowadays (IIRC AMD had a bug a while
back). What i want is to replace GUP and notifier with HMM, with the
intention that we can mitigate in more clever way thing like OOM or
other mm aspect inside HMM and thus isolating mm folks from ever having
to decipher GPU or other weird drivers :)

I also want to do that for optimization purposes to allow to share
more thing accross multiple GPU that mirror same range of address.

Finaly another motiviation is to avoid the pin GUP implies and only
rely on mmu notification. This would unlock some memory migration
from ever backing of early when they see the pin.

I intend to post patches sometime before XDC this year and discuss
them at XDC see how people on driver side feel about that. I also
want to use that as an excuse to gather features request and other
Santa wishlist for HMM ;)

Cheers,
Jérôme

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]       ` <20180622164026.GA23674@dhcp22.suse.cz>
@ 2018-06-22 16:42         ` Michal Hocko
  2018-06-22 17:26           ` Jerome Glisse
  2018-06-22 17:26           ` Jerome Glisse
  2018-06-22 16:42         ` Michal Hocko
  1 sibling, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:42 UTC (permalink / raw)
  Cc: kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
	Rodrigo Vivi, Boris Ostrovsky, Juergen Gross, Mike Marciniszyn

[Resnding with the CC list fixed]

On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > Hi,
> > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > what I need basically. It might be completely wrong but I would like
> > > > > to discuss what would be a better way if that is the case.
> > > > > 
> > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > it down. If you think I have forgot somebody, please let me know
> > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > >         mo->attached = false;
> > > > >  }
> > > > >  
> > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >                                                        struct mm_struct *mm,
> > > > >                                                        unsigned long start,
> > > > > -                                                      unsigned long end)
> > > > > +                                                      unsigned long end,
> > > > > +                                                      bool blockable)
> > > > >  {
> > > > >         struct i915_mmu_notifier *mn =
> > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >         LIST_HEAD(cancelled);
> > > > >  
> > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > -               return;
> > > > > +               return 0;
> > > > 
> > > > The principle wait here is for the HW (even after fixing all the locks
> > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > access).
> > > 
> > > Is this wait bound or it can take basically arbitrary amount of time?
> > 
> > Arbitrary amount of time but in desktop use case you can assume that
> > it should never go above 16ms for a 60frame per second rendering of
> > your desktop (in GPU compute case this kind of assumption does not
> > hold). Is the process exit_state already updated by the time this mmu
> > notifier callbacks happen ?
> 
> What do you mean? The process is killed (by SIGKILL) at the time but we
> do not know much more than that. The task might be stuck anywhere in the
> kernel before handling that signal.
> 
> > > > The first pass would be then to not do anything here if
> > > > !blockable.
> > > 
> > > something like this? (incremental diff)
> > 
> > What i wanted to do with HMM and mmu notifier is split the invalidation
> > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > depends on the range and invalidate internal driver states (like clear
> > buffer object pages array in case of GPU but not GPU page table). While
> > the second callback would do the actual wait on the GPU to be done and
> > update the GPU page table.
> 
> What can you do after the first phase? Can I unmap the range?
> 
> > Now in this scheme in case the task is already in some exit state and
> > that all CPU threads are frozen/kill then we can probably find a way to
> > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > share userptr bo hence a uptr bo should only ever be access through
> > ioctl submited by the process.
> > 
> > The second call can then be delayed and ping from time to time to see
> > if GPU jobs are done.
> > 
> > 
> > Note that what you propose might still be useful as in case there is
> > no buffer object for a range then OOM can make progress in freeing a
> > range of memory. It is very likely that significant virtual address
> > range of a process and backing memory can be reclaim that way. This
> > assume OOM reclaim vma by vma or in some form of granularity like
> > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > return range that are blocking that way OOM can reclaim around.
> 
> Exactly my point. What we have right now is all or nothing which is
> obviously too coarse to be useful.

-- 
Michal Hocko
SUSE Labs
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 16:42         ` Michal Hocko
  2018-06-22 17:26           ` Jerome Glisse
  2018-06-22 17:26           ` Jerome Glisse
  0 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
	Rodrigo Vivi, Boris Ostrovsky, Juergen Gross, Mike Marciniszyn

[Resnding with the CC list fixed]

On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > Hi,
> > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > what I need basically. It might be completely wrong but I would like
> > > > > to discuss what would be a better way if that is the case.
> > > > > 
> > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > it down. If you think I have forgot somebody, please let me know
> > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > >         mo->attached = false;
> > > > >  }
> > > > >  
> > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >                                                        struct mm_struct *mm,
> > > > >                                                        unsigned long start,
> > > > > -                                                      unsigned long end)
> > > > > +                                                      unsigned long end,
> > > > > +                                                      bool blockable)
> > > > >  {
> > > > >         struct i915_mmu_notifier *mn =
> > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >         LIST_HEAD(cancelled);
> > > > >  
> > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > -               return;
> > > > > +               return 0;
> > > > 
> > > > The principle wait here is for the HW (even after fixing all the locks
> > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > access).
> > > 
> > > Is this wait bound or it can take basically arbitrary amount of time?
> > 
> > Arbitrary amount of time but in desktop use case you can assume that
> > it should never go above 16ms for a 60frame per second rendering of
> > your desktop (in GPU compute case this kind of assumption does not
> > hold). Is the process exit_state already updated by the time this mmu
> > notifier callbacks happen ?
> 
> What do you mean? The process is killed (by SIGKILL) at the time but we
> do not know much more than that. The task might be stuck anywhere in the
> kernel before handling that signal.
> 
> > > > The first pass would be then to not do anything here if
> > > > !blockable.
> > > 
> > > something like this? (incremental diff)
> > 
> > What i wanted to do with HMM and mmu notifier is split the invalidation
> > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > depends on the range and invalidate internal driver states (like clear
> > buffer object pages array in case of GPU but not GPU page table). While
> > the second callback would do the actual wait on the GPU to be done and
> > update the GPU page table.
> 
> What can you do after the first phase? Can I unmap the range?
> 
> > Now in this scheme in case the task is already in some exit state and
> > that all CPU threads are frozen/kill then we can probably find a way to
> > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > share userptr bo hence a uptr bo should only ever be access through
> > ioctl submited by the process.
> > 
> > The second call can then be delayed and ping from time to time to see
> > if GPU jobs are done.
> > 
> > 
> > Note that what you propose might still be useful as in case there is
> > no buffer object for a range then OOM can make progress in freeing a
> > range of memory. It is very likely that significant virtual address
> > range of a process and backing memory can be reclaim that way. This
> > assume OOM reclaim vma by vma or in some form of granularity like
> > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > return range that are blocking that way OOM can reclaim around.
> 
> Exactly my point. What we have right now is all or nothing which is
> obviously too coarse to be useful.

-- 
Michal Hocko
SUSE Labs
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 16:42         ` Michal Hocko
  2018-06-22 17:26           ` Jerome Glisse
  2018-06-22 17:26           ` Jerome Glisse
  0 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: David (ChunMing) Zhou, Paolo Bonzini,
	=?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	Alex Deucher, =?UTF-8?q?Christian=20K=C3=B6nig?=,
	David Airlie, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Andrea Arcangeli, kvm, amd-gfx, dri-devel, intel-gfx, linux-rdma,
	xen-devel, linux-mm, David Rientjes

[Resnding with the CC list fixed]

On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > Hi,
> > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > what I need basically. It might be completely wrong but I would like
> > > > > to discuss what would be a better way if that is the case.
> > > > > 
> > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > it down. If you think I have forgot somebody, please let me know
> > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > >         mo->attached = false;
> > > > >  }
> > > > >  
> > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >                                                        struct mm_struct *mm,
> > > > >                                                        unsigned long start,
> > > > > -                                                      unsigned long end)
> > > > > +                                                      unsigned long end,
> > > > > +                                                      bool blockable)
> > > > >  {
> > > > >         struct i915_mmu_notifier *mn =
> > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >         LIST_HEAD(cancelled);
> > > > >  
> > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > -               return;
> > > > > +               return 0;
> > > > 
> > > > The principle wait here is for the HW (even after fixing all the locks
> > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > access).
> > > 
> > > Is this wait bound or it can take basically arbitrary amount of time?
> > 
> > Arbitrary amount of time but in desktop use case you can assume that
> > it should never go above 16ms for a 60frame per second rendering of
> > your desktop (in GPU compute case this kind of assumption does not
> > hold). Is the process exit_state already updated by the time this mmu
> > notifier callbacks happen ?
> 
> What do you mean? The process is killed (by SIGKILL) at the time but we
> do not know much more than that. The task might be stuck anywhere in the
> kernel before handling that signal.
> 
> > > > The first pass would be then to not do anything here if
> > > > !blockable.
> > > 
> > > something like this? (incremental diff)
> > 
> > What i wanted to do with HMM and mmu notifier is split the invalidation
> > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > depends on the range and invalidate internal driver states (like clear
> > buffer object pages array in case of GPU but not GPU page table). While
> > the second callback would do the actual wait on the GPU to be done and
> > update the GPU page table.
> 
> What can you do after the first phase? Can I unmap the range?
> 
> > Now in this scheme in case the task is already in some exit state and
> > that all CPU threads are frozen/kill then we can probably find a way to
> > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > share userptr bo hence a uptr bo should only ever be access through
> > ioctl submited by the process.
> > 
> > The second call can then be delayed and ping from time to time to see
> > if GPU jobs are done.
> > 
> > 
> > Note that what you propose might still be useful as in case there is
> > no buffer object for a range then OOM can make progress in freeing a
> > range of memory. It is very likely that significant virtual address
> > range of a process and backing memory can be reclaim that way. This
> > assume OOM reclaim vma by vma or in some form of granularity like
> > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > return range that are blocking that way OOM can reclaim around.
> 
> Exactly my point. What we have right now is all or nothing which is
> obviously too coarse to be useful.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]       ` <20180622164026.GA23674@dhcp22.suse.cz>
  2018-06-22 16:42         ` Michal Hocko
@ 2018-06-22 16:42         ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-22 16:42 UTC (permalink / raw)
  Cc: kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	=?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?=,
	Rodrigo Vivi, Boris Ostrovsky

[Resnding with the CC list fixed]

On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > Hi,
> > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > what I need basically. It might be completely wrong but I would like
> > > > > to discuss what would be a better way if that is the case.
> > > > > 
> > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > it down. If you think I have forgot somebody, please let me know
> > > > 
> > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > >         mo->attached = false;
> > > > >  }
> > > > >  
> > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >                                                        struct mm_struct *mm,
> > > > >                                                        unsigned long start,
> > > > > -                                                      unsigned long end)
> > > > > +                                                      unsigned long end,
> > > > > +                                                      bool blockable)
> > > > >  {
> > > > >         struct i915_mmu_notifier *mn =
> > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > >         LIST_HEAD(cancelled);
> > > > >  
> > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > -               return;
> > > > > +               return 0;
> > > > 
> > > > The principle wait here is for the HW (even after fixing all the locks
> > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > access).
> > > 
> > > Is this wait bound or it can take basically arbitrary amount of time?
> > 
> > Arbitrary amount of time but in desktop use case you can assume that
> > it should never go above 16ms for a 60frame per second rendering of
> > your desktop (in GPU compute case this kind of assumption does not
> > hold). Is the process exit_state already updated by the time this mmu
> > notifier callbacks happen ?
> 
> What do you mean? The process is killed (by SIGKILL) at the time but we
> do not know much more than that. The task might be stuck anywhere in the
> kernel before handling that signal.
> 
> > > > The first pass would be then to not do anything here if
> > > > !blockable.
> > > 
> > > something like this? (incremental diff)
> > 
> > What i wanted to do with HMM and mmu notifier is split the invalidation
> > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > depends on the range and invalidate internal driver states (like clear
> > buffer object pages array in case of GPU but not GPU page table). While
> > the second callback would do the actual wait on the GPU to be done and
> > update the GPU page table.
> 
> What can you do after the first phase? Can I unmap the range?
> 
> > Now in this scheme in case the task is already in some exit state and
> > that all CPU threads are frozen/kill then we can probably find a way to
> > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > share userptr bo hence a uptr bo should only ever be access through
> > ioctl submited by the process.
> > 
> > The second call can then be delayed and ping from time to time to see
> > if GPU jobs are done.
> > 
> > 
> > Note that what you propose might still be useful as in case there is
> > no buffer object for a range then OOM can make progress in freeing a
> > range of memory. It is very likely that significant virtual address
> > range of a process and backing memory can be reclaim that way. This
> > assume OOM reclaim vma by vma or in some form of granularity like
> > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > return range that are blocking that way OOM can reclaim around.
> 
> Exactly my point. What we have right now is all or nothing which is
> obviously too coarse to be useful.

-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 16:42         ` Michal Hocko
  2018-06-22 17:26           ` Jerome Glisse
@ 2018-06-22 17:26           ` Jerome Glisse
  1 sibling, 0 replies; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 17:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Rodrigo Vivi, Boris Ostrovsky, Juergen Gross,
	Mike Marciniszyn, Dennis Dalessandro, Ashutosh Dixit

On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote:
> [Resnding with the CC list fixed]
> 
> On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> > On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > > Hi,
> > > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > > what I need basically. It might be completely wrong but I would like
> > > > > > to discuss what would be a better way if that is the case.
> > > > > > 
> > > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > > it down. If you think I have forgot somebody, please let me know
> > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > > >         mo->attached = false;
> > > > > >  }
> > > > > >  
> > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > >                                                        struct mm_struct *mm,
> > > > > >                                                        unsigned long start,
> > > > > > -                                                      unsigned long end)
> > > > > > +                                                      unsigned long end,
> > > > > > +                                                      bool blockable)
> > > > > >  {
> > > > > >         struct i915_mmu_notifier *mn =
> > > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > >         LIST_HEAD(cancelled);
> > > > > >  
> > > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > > -               return;
> > > > > > +               return 0;
> > > > > 
> > > > > The principle wait here is for the HW (even after fixing all the locks
> > > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > > access).
> > > > 
> > > > Is this wait bound or it can take basically arbitrary amount of time?
> > > 
> > > Arbitrary amount of time but in desktop use case you can assume that
> > > it should never go above 16ms for a 60frame per second rendering of
> > > your desktop (in GPU compute case this kind of assumption does not
> > > hold). Is the process exit_state already updated by the time this mmu
> > > notifier callbacks happen ?
> > 
> > What do you mean? The process is killed (by SIGKILL) at the time but we
> > do not know much more than that. The task might be stuck anywhere in the
> > kernel before handling that signal.

I was wondering if another thread might still be dereferencing any of
the structure concurrently with the OOM mmu notifier callback. Saddly
yes, it would be simpler if we could make such assumption.

> > 
> > > > > The first pass would be then to not do anything here if
> > > > > !blockable.
> > > > 
> > > > something like this? (incremental diff)
> > > 
> > > What i wanted to do with HMM and mmu notifier is split the invalidation
> > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > > depends on the range and invalidate internal driver states (like clear
> > > buffer object pages array in case of GPU but not GPU page table). While
> > > the second callback would do the actual wait on the GPU to be done and
> > > update the GPU page table.
> > 
> > What can you do after the first phase? Can I unmap the range?

No you can't do anything but this force synchronization as any other
thread that concurrently trie to do something with those would queue
up. So it would serialize thing. Also main motivation on my side is
multi-GPU, right now multi-GPU are not that common but this is changing
quickly and what we see on high end (4, 8 or 16 GPUs per socket) is
spreading into more configurations. Here in mutli GPU case splitting
in two would avoid having to fully wait on first GPU before trying to
invaidate on second GPU, so on and so forth.

> > 
> > > Now in this scheme in case the task is already in some exit state and
> > > that all CPU threads are frozen/kill then we can probably find a way to
> > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > > share userptr bo hence a uptr bo should only ever be access through
> > > ioctl submited by the process.
> > > 
> > > The second call can then be delayed and ping from time to time to see
> > > if GPU jobs are done.
> > > 
> > > 
> > > Note that what you propose might still be useful as in case there is
> > > no buffer object for a range then OOM can make progress in freeing a
> > > range of memory. It is very likely that significant virtual address
> > > range of a process and backing memory can be reclaim that way. This
> > > assume OOM reclaim vma by vma or in some form of granularity like
> > > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > > return range that are blocking that way OOM can reclaim around.
> > 
> > Exactly my point. What we have right now is all or nothing which is
> > obviously too coarse to be useful.

Yes i think it is a good step in the right direction.

Cheers,
Jérôme
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 17:26           ` Jerome Glisse
  0 siblings, 0 replies; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 17:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David (ChunMing) Zhou, Paolo Bonzini,
	=?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	Alex Deucher, =?UTF-8?q?Christian=20K=C3=B6nig?=,
	David Airlie, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Andrea Arcangeli, kvm, amd-gfx, dri-devel, intel-gfx, linux-rdma,
	xen-devel, linux-mm, David Rientjes

On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote:
> [Resnding with the CC list fixed]
> 
> On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> > On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > > Hi,
> > > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > > what I need basically. It might be completely wrong but I would like
> > > > > > to discuss what would be a better way if that is the case.
> > > > > > 
> > > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > > it down. If you think I have forgot somebody, please let me know
> > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > > >         mo->attached = false;
> > > > > >  }
> > > > > >  
> > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > >                                                        struct mm_struct *mm,
> > > > > >                                                        unsigned long start,
> > > > > > -                                                      unsigned long end)
> > > > > > +                                                      unsigned long end,
> > > > > > +                                                      bool blockable)
> > > > > >  {
> > > > > >         struct i915_mmu_notifier *mn =
> > > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > >         LIST_HEAD(cancelled);
> > > > > >  
> > > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > > -               return;
> > > > > > +               return 0;
> > > > > 
> > > > > The principle wait here is for the HW (even after fixing all the locks
> > > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > > access).
> > > > 
> > > > Is this wait bound or it can take basically arbitrary amount of time?
> > > 
> > > Arbitrary amount of time but in desktop use case you can assume that
> > > it should never go above 16ms for a 60frame per second rendering of
> > > your desktop (in GPU compute case this kind of assumption does not
> > > hold). Is the process exit_state already updated by the time this mmu
> > > notifier callbacks happen ?
> > 
> > What do you mean? The process is killed (by SIGKILL) at the time but we
> > do not know much more than that. The task might be stuck anywhere in the
> > kernel before handling that signal.

I was wondering if another thread might still be dereferencing any of
the structure concurrently with the OOM mmu notifier callback. Saddly
yes, it would be simpler if we could make such assumption.

> > 
> > > > > The first pass would be then to not do anything here if
> > > > > !blockable.
> > > > 
> > > > something like this? (incremental diff)
> > > 
> > > What i wanted to do with HMM and mmu notifier is split the invalidation
> > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > > depends on the range and invalidate internal driver states (like clear
> > > buffer object pages array in case of GPU but not GPU page table). While
> > > the second callback would do the actual wait on the GPU to be done and
> > > update the GPU page table.
> > 
> > What can you do after the first phase? Can I unmap the range?

No you can't do anything but this force synchronization as any other
thread that concurrently trie to do something with those would queue
up. So it would serialize thing. Also main motivation on my side is
multi-GPU, right now multi-GPU are not that common but this is changing
quickly and what we see on high end (4, 8 or 16 GPUs per socket) is
spreading into more configurations. Here in mutli GPU case splitting
in two would avoid having to fully wait on first GPU before trying to
invaidate on second GPU, so on and so forth.

> > 
> > > Now in this scheme in case the task is already in some exit state and
> > > that all CPU threads are frozen/kill then we can probably find a way to
> > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > > share userptr bo hence a uptr bo should only ever be access through
> > > ioctl submited by the process.
> > > 
> > > The second call can then be delayed and ping from time to time to see
> > > if GPU jobs are done.
> > > 
> > > 
> > > Note that what you propose might still be useful as in case there is
> > > no buffer object for a range then OOM can make progress in freeing a
> > > range of memory. It is very likely that significant virtual address
> > > range of a process and backing memory can be reclaim that way. This
> > > assume OOM reclaim vma by vma or in some form of granularity like
> > > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > > return range that are blocking that way OOM can reclaim around.
> > 
> > Exactly my point. What we have right now is all or nothing which is
> > obviously too coarse to be useful.

Yes i think it is a good step in the right direction.

Cheers,
Jerome

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 16:42         ` Michal Hocko
@ 2018-06-22 17:26           ` Jerome Glisse
  2018-06-22 17:26           ` Jerome Glisse
  1 sibling, 0 replies; 125+ messages in thread
From: Jerome Glisse @ 2018-06-22 17:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, =?UTF-8?q?Radim=20Kr=C4=8Dm=C3=A1=C5=99?=,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn

On Fri, Jun 22, 2018 at 06:42:43PM +0200, Michal Hocko wrote:
> [Resnding with the CC list fixed]
> 
> On Fri 22-06-18 18:40:26, Michal Hocko wrote:
> > On Fri 22-06-18 12:18:46, Jerome Glisse wrote:
> > > On Fri, Jun 22, 2018 at 05:57:16PM +0200, Michal Hocko wrote:
> > > > On Fri 22-06-18 16:36:49, Chris Wilson wrote:
> > > > > Quoting Michal Hocko (2018-06-22 16:02:42)
> > > > > > Hi,
> > > > > > this is an RFC and not tested at all. I am not very familiar with the
> > > > > > mmu notifiers semantics very much so this is a crude attempt to achieve
> > > > > > what I need basically. It might be completely wrong but I would like
> > > > > > to discuss what would be a better way if that is the case.
> > > > > > 
> > > > > > get_maintainers gave me quite large list of people to CC so I had to trim
> > > > > > it down. If you think I have forgot somebody, please let me know
> > > > > 
> > > > > > diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > index 854bd51b9478..5285df9331fa 100644
> > > > > > --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
> > > > > >         mo->attached = false;
> > > > > >  }
> > > > > >  
> > > > > > -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > > +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > >                                                        struct mm_struct *mm,
> > > > > >                                                        unsigned long start,
> > > > > > -                                                      unsigned long end)
> > > > > > +                                                      unsigned long end,
> > > > > > +                                                      bool blockable)
> > > > > >  {
> > > > > >         struct i915_mmu_notifier *mn =
> > > > > >                 container_of(_mn, struct i915_mmu_notifier, mn);
> > > > > > @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> > > > > >         LIST_HEAD(cancelled);
> > > > > >  
> > > > > >         if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> > > > > > -               return;
> > > > > > +               return 0;
> > > > > 
> > > > > The principle wait here is for the HW (even after fixing all the locks
> > > > > to be not so coarse, we still have to wait for the HW to finish its
> > > > > access).
> > > > 
> > > > Is this wait bound or it can take basically arbitrary amount of time?
> > > 
> > > Arbitrary amount of time but in desktop use case you can assume that
> > > it should never go above 16ms for a 60frame per second rendering of
> > > your desktop (in GPU compute case this kind of assumption does not
> > > hold). Is the process exit_state already updated by the time this mmu
> > > notifier callbacks happen ?
> > 
> > What do you mean? The process is killed (by SIGKILL) at the time but we
> > do not know much more than that. The task might be stuck anywhere in the
> > kernel before handling that signal.

I was wondering if another thread might still be dereferencing any of
the structure concurrently with the OOM mmu notifier callback. Saddly
yes, it would be simpler if we could make such assumption.

> > 
> > > > > The first pass would be then to not do anything here if
> > > > > !blockable.
> > > > 
> > > > something like this? (incremental diff)
> > > 
> > > What i wanted to do with HMM and mmu notifier is split the invalidation
> > > in 2 pass. First pass tell the drivers to stop/cancel pending jobs that
> > > depends on the range and invalidate internal driver states (like clear
> > > buffer object pages array in case of GPU but not GPU page table). While
> > > the second callback would do the actual wait on the GPU to be done and
> > > update the GPU page table.
> > 
> > What can you do after the first phase? Can I unmap the range?

No you can't do anything but this force synchronization as any other
thread that concurrently trie to do something with those would queue
up. So it would serialize thing. Also main motivation on my side is
multi-GPU, right now multi-GPU are not that common but this is changing
quickly and what we see on high end (4, 8 or 16 GPUs per socket) is
spreading into more configurations. Here in mutli GPU case splitting
in two would avoid having to fully wait on first GPU before trying to
invaidate on second GPU, so on and so forth.

> > 
> > > Now in this scheme in case the task is already in some exit state and
> > > that all CPU threads are frozen/kill then we can probably find a way to
> > > do the first path mostly lock less. AFAICR nor AMD nor Intel allow to
> > > share userptr bo hence a uptr bo should only ever be access through
> > > ioctl submited by the process.
> > > 
> > > The second call can then be delayed and ping from time to time to see
> > > if GPU jobs are done.
> > > 
> > > 
> > > Note that what you propose might still be useful as in case there is
> > > no buffer object for a range then OOM can make progress in freeing a
> > > range of memory. It is very likely that significant virtual address
> > > range of a process and backing memory can be reclaim that way. This
> > > assume OOM reclaim vma by vma or in some form of granularity like
> > > reclaiming 1GB by 1GB. Or we could also update blocking callback to
> > > return range that are blocking that way OOM can reclaim around.
> > 
> > Exactly my point. What we have right now is all or nothing which is
> > obviously too coarse to be useful.

Yes i think it is a good step in the right direction.

Cheers,
Jérôme

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]     ` <20180622152444.GC10465-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-06-22 20:09       ` Felix Kuehling
  0 siblings, 0 replies; 125+ messages in thread
From: Felix Kuehling @ 2018-06-22 20:09 UTC (permalink / raw)
  To: Michal Hocko, Christian König
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On 2018-06-22 11:24 AM, Michal Hocko wrote:
> On Fri 22-06-18 17:13:02, Christian König wrote:
>> Hi Michal,
>>
>> [Adding Felix as well]
>>
>> Well first of all you have a misconception why at least the AMD graphics
>> driver need to be able to sleep in an MMU notifier: We need to sleep because
>> we need to wait for hardware operations to finish and *NOT* because we need
>> to wait for locks.
>>
>> I'm not sure if your flag now means that you generally can't sleep in MMU
>> notifiers any more, but if that's the case at least AMD hardware will break
>> badly. In our case the approach of waiting for a short time for the process
>> to be reaped and then select another victim actually sounds like the right
>> thing to do.
> Well, I do not need to make the notifier code non blocking all the time.
> All I need is to ensure that it won't sleep if the flag says so and
> return -EAGAIN instead.
>
> So here is what I do for amdgpu:

In the case of KFD we also need to take the DQM lock:

amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch

So we'd need to pass the blockable parameter all the way through that
call chain.

Regards,
  Felix

>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 83e344fbb50a..d138a526feff 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>>    *
>>>    * Take the rmn read side lock.
>>>    */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>>>   {
>>> -	mutex_lock(&rmn->read_lock);
>>> +	if (blockable)
>>> +		mutex_lock(&rmn->read_lock);
>>> +	else if (!mutex_trylock(&rmn->read_lock))
>>> +		return -EAGAIN;
>>> +
>>>   	if (atomic_inc_return(&rmn->recursion) == 1)
>>>   		down_read_non_owner(&rmn->lock);
>>>   	mutex_unlock(&rmn->read_lock);
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>>>    * We block for all BOs between start and end to be idle and
>>>    * unmap them by move them into system domain again.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	/* TODO we should be able to split locking for interval tree and
>>> +	 * amdgpu_mn_invalidate_node
>>> +	 */
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   		amdgpu_mn_invalidate_node(node, start, end);
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>    * necessitates evicting all user-mode queues of the process. The BOs
>>>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>>>   		}
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 20:09       ` Felix Kuehling
  0 siblings, 0 replies; 125+ messages in thread
From: Felix Kuehling @ 2018-06-22 20:09 UTC (permalink / raw)
  To: Michal Hocko, Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

On 2018-06-22 11:24 AM, Michal Hocko wrote:
> On Fri 22-06-18 17:13:02, Christian König wrote:
>> Hi Michal,
>>
>> [Adding Felix as well]
>>
>> Well first of all you have a misconception why at least the AMD graphics
>> driver need to be able to sleep in an MMU notifier: We need to sleep because
>> we need to wait for hardware operations to finish and *NOT* because we need
>> to wait for locks.
>>
>> I'm not sure if your flag now means that you generally can't sleep in MMU
>> notifiers any more, but if that's the case at least AMD hardware will break
>> badly. In our case the approach of waiting for a short time for the process
>> to be reaped and then select another victim actually sounds like the right
>> thing to do.
> Well, I do not need to make the notifier code non blocking all the time.
> All I need is to ensure that it won't sleep if the flag says so and
> return -EAGAIN instead.
>
> So here is what I do for amdgpu:

In the case of KFD we also need to take the DQM lock:

amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch

So we'd need to pass the blockable parameter all the way through that
call chain.

Regards,
  Felix

>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 83e344fbb50a..d138a526feff 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>>    *
>>>    * Take the rmn read side lock.
>>>    */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>>>   {
>>> -	mutex_lock(&rmn->read_lock);
>>> +	if (blockable)
>>> +		mutex_lock(&rmn->read_lock);
>>> +	else if (!mutex_trylock(&rmn->read_lock))
>>> +		return -EAGAIN;
>>> +
>>>   	if (atomic_inc_return(&rmn->recursion) == 1)
>>>   		down_read_non_owner(&rmn->lock);
>>>   	mutex_unlock(&rmn->read_lock);
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>>>    * We block for all BOs between start and end to be idle and
>>>    * unmap them by move them into system domain again.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	/* TODO we should be able to split locking for interval tree and
>>> +	 * amdgpu_mn_invalidate_node
>>> +	 */
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   		amdgpu_mn_invalidate_node(node, start, end);
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>    * necessitates evicting all user-mode queues of the process. The BOs
>>>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>>>   		}
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-22 20:09       ` Felix Kuehling
  0 siblings, 0 replies; 125+ messages in thread
From: Felix Kuehling @ 2018-06-22 20:09 UTC (permalink / raw)
  To: Michal Hocko, Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

On 2018-06-22 11:24 AM, Michal Hocko wrote:
> On Fri 22-06-18 17:13:02, Christian KA?nig wrote:
>> Hi Michal,
>>
>> [Adding Felix as well]
>>
>> Well first of all you have a misconception why at least the AMD graphics
>> driver need to be able to sleep in an MMU notifier: We need to sleep because
>> we need to wait for hardware operations to finish and *NOT* because we need
>> to wait for locks.
>>
>> I'm not sure if your flag now means that you generally can't sleep in MMU
>> notifiers any more, but if that's the case at least AMD hardware will break
>> badly. In our case the approach of waiting for a short time for the process
>> to be reaped and then select another victim actually sounds like the right
>> thing to do.
> Well, I do not need to make the notifier code non blocking all the time.
> All I need is to ensure that it won't sleep if the flag says so and
> return -EAGAIN instead.
>
> So here is what I do for amdgpu:

In the case of KFD we also need to take the DQM lock:

amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch

So we'd need to pass the blockable parameter all the way through that
call chain.

Regards,
A  Felix

>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 83e344fbb50a..d138a526feff 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>>    *
>>>    * Take the rmn read side lock.
>>>    */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>>>   {
>>> -	mutex_lock(&rmn->read_lock);
>>> +	if (blockable)
>>> +		mutex_lock(&rmn->read_lock);
>>> +	else if (!mutex_trylock(&rmn->read_lock))
>>> +		return -EAGAIN;
>>> +
>>>   	if (atomic_inc_return(&rmn->recursion) == 1)
>>>   		down_read_non_owner(&rmn->lock);
>>>   	mutex_unlock(&rmn->read_lock);
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>>>    * We block for all BOs between start and end to be idle and
>>>    * unmap them by move them into system domain again.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	/* TODO we should be able to split locking for interval tree and
>>> +	 * amdgpu_mn_invalidate_node
>>> +	 */
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   		amdgpu_mn_invalidate_node(node, start, end);
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>    * necessitates evicting all user-mode queues of the process. The BOs
>>>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>>>   		}
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:24   ` Michal Hocko
@ 2018-06-22 20:09     ` Felix Kuehling
  2018-06-22 20:09       ` Felix Kuehling
       [not found]     ` <20180622152444.GC10465-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2 siblings, 0 replies; 125+ messages in thread
From: Felix Kuehling @ 2018-06-22 20:09 UTC (permalink / raw)
  To: Michal Hocko, Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On 2018-06-22 11:24 AM, Michal Hocko wrote:
> On Fri 22-06-18 17:13:02, Christian König wrote:
>> Hi Michal,
>>
>> [Adding Felix as well]
>>
>> Well first of all you have a misconception why at least the AMD graphics
>> driver need to be able to sleep in an MMU notifier: We need to sleep because
>> we need to wait for hardware operations to finish and *NOT* because we need
>> to wait for locks.
>>
>> I'm not sure if your flag now means that you generally can't sleep in MMU
>> notifiers any more, but if that's the case at least AMD hardware will break
>> badly. In our case the approach of waiting for a short time for the process
>> to be reaped and then select another victim actually sounds like the right
>> thing to do.
> Well, I do not need to make the notifier code non blocking all the time.
> All I need is to ensure that it won't sleep if the flag says so and
> return -EAGAIN instead.
>
> So here is what I do for amdgpu:

In the case of KFD we also need to take the DQM lock:

amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch

So we'd need to pass the blockable parameter all the way through that
call chain.

Regards,
  Felix

>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> index 83e344fbb50a..d138a526feff 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>>> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>>>    *
>>>    * Take the rmn read side lock.
>>>    */
>>> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
>>> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>>>   {
>>> -	mutex_lock(&rmn->read_lock);
>>> +	if (blockable)
>>> +		mutex_lock(&rmn->read_lock);
>>> +	else if (!mutex_trylock(&rmn->read_lock))
>>> +		return -EAGAIN;
>>> +
>>>   	if (atomic_inc_return(&rmn->recursion) == 1)
>>>   		down_read_non_owner(&rmn->lock);
>>>   	mutex_unlock(&rmn->read_lock);
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>>>    * We block for all BOs between start and end to be idle and
>>>    * unmap them by move them into system domain again.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	/* TODO we should be able to split locking for interval tree and
>>> +	 * amdgpu_mn_invalidate_node
>>> +	 */
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>   		amdgpu_mn_invalidate_node(node, start, end);
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**
>>> @@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>>>    * necessitates evicting all user-mode queues of the process. The BOs
>>>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>>>    */
>>> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   						 struct mm_struct *mm,
>>>   						 unsigned long start,
>>> -						 unsigned long end)
>>> +						 unsigned long end,
>>> +						 bool blockable)
>>>   {
>>>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>>>   	struct interval_tree_node *it;
>>> @@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   	/* notification is exclusive, but interval is inclusive */
>>>   	end -= 1;
>>> -	amdgpu_mn_read_lock(rmn);
>>> +	if (amdgpu_mn_read_lock(rmn, blockable))
>>> +		return -EAGAIN;
>>>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>>>   	while (it) {
>>> @@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>>>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>>>   		}
>>>   	}
>>> +
>>> +	return 0;
>>>   }
>>>   /**


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (5 preceding siblings ...)
  2018-06-22 16:01 ` ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers (rev2) Patchwork
@ 2018-06-24  8:11 ` Paolo Bonzini
  2018-06-25  7:57   ` Michal Hocko
  2018-06-25 10:23 ` ✗ Fi.CI.CHECKPATCH: warning for mm, oom: distinguish blockable mode for mmu notifiers (rev3) Patchwork
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 125+ messages in thread
From: Paolo Bonzini @ 2018-06-24  8:11 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: Michal Hocko, kvm, Radim Krčmář,
	linux-mm, Andrea Arcangeli, Jérôme Glisse

On 22/06/2018 17:02, Michal Hocko wrote:
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>  	if (start <= apic_address && apic_address < end)
>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;

This is wrong, gfn_to_hva can sleep.

You could do the the kvm_make_all_cpus_request unconditionally, but only
if !blockable is a really rare thing.  OOM would be fine, since the
request actually would never be processed, but I'm afraid of more uses
of !blockable being introduced later.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-24  8:11 ` [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Paolo Bonzini
@ 2018-06-25  7:57   ` Michal Hocko
  2018-06-25  8:10     ` Paolo Bonzini
  0 siblings, 1 reply; 125+ messages in thread
From: Michal Hocko @ 2018-06-25  7:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Radim Krčmář,
	linux-mm, Andrea Arcangeli, Jérôme Glisse

On Sun 24-06-18 10:11:21, Paolo Bonzini wrote:
> On 22/06/2018 17:02, Michal Hocko wrote:
> > @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >  	if (start <= apic_address && apic_address < end)
> >  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> > +
> > +	return 0;
> 
> This is wrong, gfn_to_hva can sleep.

Hmm, I have tried to crawl the call chain and haven't found any
sleepable locks taken. Maybe I am just missing something.
__kvm_memslots has a complex locking assert. I do not see we would take
slots_lock anywhere from the notifier call path. IIUC that means that
users_count has to be zero at that time. I have no idea how that is
guaranteed.
 
> You could do the the kvm_make_all_cpus_request unconditionally, but only
> if !blockable is a really rare thing.  OOM would be fine, since the
> request actually would never be processed, but I'm afraid of more uses
> of !blockable being introduced later.

Well, if this is not generally guaranteed then I have to come up with a
different flag. E.g. OOM_CONTEXT that would be more specific to
contrains for the callback.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 20:09       ` Felix Kuehling
  (?)
@ 2018-06-25  8:01       ` Michal Hocko
  2018-06-25 13:31         ` Michal Hocko
                           ` (2 more replies)
  -1 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25  8:01 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christian König, LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich

On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > On Fri 22-06-18 17:13:02, Christian König wrote:
> >> Hi Michal,
> >>
> >> [Adding Felix as well]
> >>
> >> Well first of all you have a misconception why at least the AMD graphics
> >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> >> we need to wait for hardware operations to finish and *NOT* because we need
> >> to wait for locks.
> >>
> >> I'm not sure if your flag now means that you generally can't sleep in MMU
> >> notifiers any more, but if that's the case at least AMD hardware will break
> >> badly. In our case the approach of waiting for a short time for the process
> >> to be reaped and then select another victim actually sounds like the right
> >> thing to do.
> > Well, I do not need to make the notifier code non blocking all the time.
> > All I need is to ensure that it won't sleep if the flag says so and
> > return -EAGAIN instead.
> >
> > So here is what I do for amdgpu:
> 
> In the case of KFD we also need to take the DQM lock:
> 
> amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> 
> So we'd need to pass the blockable parameter all the way through that
> call chain.

Thanks, I have missed that part. So I guess I will start with something
similar to intel-gfx and back off when the current range needs some
treatment. So this on top. Does it look correct?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock();
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-25  8:01       ` Michal Hocko
  2018-06-25 13:31         ` Michal Hocko
                           ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25  8:01 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christian König, LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > On Fri 22-06-18 17:13:02, Christian König wrote:
> >> Hi Michal,
> >>
> >> [Adding Felix as well]
> >>
> >> Well first of all you have a misconception why at least the AMD graphics
> >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> >> we need to wait for hardware operations to finish and *NOT* because we need
> >> to wait for locks.
> >>
> >> I'm not sure if your flag now means that you generally can't sleep in MMU
> >> notifiers any more, but if that's the case at least AMD hardware will break
> >> badly. In our case the approach of waiting for a short time for the process
> >> to be reaped and then select another victim actually sounds like the right
> >> thing to do.
> > Well, I do not need to make the notifier code non blocking all the time.
> > All I need is to ensure that it won't sleep if the flag says so and
> > return -EAGAIN instead.
> >
> > So here is what I do for amdgpu:
> 
> In the case of KFD we also need to take the DQM lock:
> 
> amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> 
> So we'd need to pass the blockable parameter all the way through that
> call chain.

Thanks, I have missed that part. So I guess I will start with something
similar to intel-gfx and back off when the current range needs some
treatment. So this on top. Does it look correct?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock();
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-25  8:01       ` Michal Hocko
  2018-06-25 13:31         ` Michal Hocko
                           ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25  8:01 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christian König, LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > On Fri 22-06-18 17:13:02, Christian Konig wrote:
> >> Hi Michal,
> >>
> >> [Adding Felix as well]
> >>
> >> Well first of all you have a misconception why at least the AMD graphics
> >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> >> we need to wait for hardware operations to finish and *NOT* because we need
> >> to wait for locks.
> >>
> >> I'm not sure if your flag now means that you generally can't sleep in MMU
> >> notifiers any more, but if that's the case at least AMD hardware will break
> >> badly. In our case the approach of waiting for a short time for the process
> >> to be reaped and then select another victim actually sounds like the right
> >> thing to do.
> > Well, I do not need to make the notifier code non blocking all the time.
> > All I need is to ensure that it won't sleep if the flag says so and
> > return -EAGAIN instead.
> >
> > So here is what I do for amdgpu:
> 
> In the case of KFD we also need to take the DQM lock:
> 
> amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> 
> So we'd need to pass the blockable parameter all the way through that
> call chain.

Thanks, I have missed that part. So I guess I will start with something
similar to intel-gfx and back off when the current range needs some
treatment. So this on top. Does it look correct?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock();
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 20:09       ` Felix Kuehling
@ 2018-06-25  8:01       ` Michal Hocko
  -1 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25  8:01 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > On Fri 22-06-18 17:13:02, Christian König wrote:
> >> Hi Michal,
> >>
> >> [Adding Felix as well]
> >>
> >> Well first of all you have a misconception why at least the AMD graphics
> >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> >> we need to wait for hardware operations to finish and *NOT* because we need
> >> to wait for locks.
> >>
> >> I'm not sure if your flag now means that you generally can't sleep in MMU
> >> notifiers any more, but if that's the case at least AMD hardware will break
> >> badly. In our case the approach of waiting for a short time for the process
> >> to be reaped and then select another victim actually sounds like the right
> >> thing to do.
> > Well, I do not need to make the notifier code non blocking all the time.
> > All I need is to ensure that it won't sleep if the flag says so and
> > return -EAGAIN instead.
> >
> > So here is what I do for amdgpu:
> 
> In the case of KFD we also need to take the DQM lock:
> 
> amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> 
> So we'd need to pass the blockable parameter all the way through that
> call chain.

Thanks, I have missed that part. So I guess I will start with something
similar to intel-gfx and back off when the current range needs some
treatment. So this on top. Does it look correct?

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..e2d422b3eb0b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock();
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-25  7:57   ` Michal Hocko
@ 2018-06-25  8:10     ` Paolo Bonzini
  2018-06-25  8:45       ` Michal Hocko
  0 siblings, 1 reply; 125+ messages in thread
From: Paolo Bonzini @ 2018-06-25  8:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, kvm, Radim Krčmář,
	linux-mm, Andrea Arcangeli, Jérôme Glisse

On 25/06/2018 09:57, Michal Hocko wrote:
> On Sun 24-06-18 10:11:21, Paolo Bonzini wrote:
>> On 22/06/2018 17:02, Michal Hocko wrote:
>>> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>>>  	if (start <= apic_address && apic_address < end)
>>>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
>>> +
>>> +	return 0;
>>
>> This is wrong, gfn_to_hva can sleep.
> 
> Hmm, I have tried to crawl the call chain and haven't found any
> sleepable locks taken. Maybe I am just missing something.
> __kvm_memslots has a complex locking assert. I do not see we would take
> slots_lock anywhere from the notifier call path. IIUC that means that
> users_count has to be zero at that time. I have no idea how that is
> guaranteed.

Nevermind, ENOCOFFEE.  This is gfn_to_hva, not gfn_to_pfn.  It only
needs SRCU.

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-25  8:10     ` Paolo Bonzini
@ 2018-06-25  8:45       ` Michal Hocko
  2018-06-25 10:34         ` Paolo Bonzini
  0 siblings, 1 reply; 125+ messages in thread
From: Michal Hocko @ 2018-06-25  8:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Radim Krčmář,
	linux-mm, Andrea Arcangeli, Jérôme Glisse

On Mon 25-06-18 10:10:18, Paolo Bonzini wrote:
> On 25/06/2018 09:57, Michal Hocko wrote:
> > On Sun 24-06-18 10:11:21, Paolo Bonzini wrote:
> >> On 22/06/2018 17:02, Michal Hocko wrote:
> >>> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >>>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >>>  	if (start <= apic_address && apic_address < end)
> >>>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> >>> +
> >>> +	return 0;
> >>
> >> This is wrong, gfn_to_hva can sleep.
> > 
> > Hmm, I have tried to crawl the call chain and haven't found any
> > sleepable locks taken. Maybe I am just missing something.
> > __kvm_memslots has a complex locking assert. I do not see we would take
> > slots_lock anywhere from the notifier call path. IIUC that means that
> > users_count has to be zero at that time. I have no idea how that is
> > guaranteed.
> 
> Nevermind, ENOCOFFEE.  This is gfn_to_hva, not gfn_to_pfn.  It only
> needs SRCU.

OK, so just the make sure I follow, the change above is correct?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for mm, oom: distinguish blockable mode for mmu notifiers (rev3)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (6 preceding siblings ...)
  2018-06-24  8:11 ` [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Paolo Bonzini
@ 2018-06-25 10:23 ` Patchwork
  2018-06-25 10:56 ` ✓ Fi.CI.BAT: success " Patchwork
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-25 10:23 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev3)
URL   : https://patchwork.freedesktop.org/series/45263/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
c36cf27f4057 mm, oom: distinguish blockable mode for mmu notifiers
-:16: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#16: 
> >> Well first of all you have a misconception why at least the AMD graphics

-:59: ERROR:MISSING_SIGN_OFF: Missing Signed-off-by: line(s)

total: 1 errors, 1 warnings, 0 checks, 11 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-25  8:45       ` Michal Hocko
@ 2018-06-25 10:34         ` Paolo Bonzini
  2018-06-25 11:08           ` Michal Hocko
  0 siblings, 1 reply; 125+ messages in thread
From: Paolo Bonzini @ 2018-06-25 10:34 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, kvm, Radim Krčmář,
	linux-mm, Andrea Arcangeli, Jérôme Glisse

On 25/06/2018 10:45, Michal Hocko wrote:
> On Mon 25-06-18 10:10:18, Paolo Bonzini wrote:
>> On 25/06/2018 09:57, Michal Hocko wrote:
>>> On Sun 24-06-18 10:11:21, Paolo Bonzini wrote:
>>>> On 22/06/2018 17:02, Michal Hocko wrote:
>>>>> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>>>>>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>>>>>  	if (start <= apic_address && apic_address < end)
>>>>>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
>>>>> +
>>>>> +	return 0;
>>>>
>>>> This is wrong, gfn_to_hva can sleep.
>>>
>>> Hmm, I have tried to crawl the call chain and haven't found any
>>> sleepable locks taken. Maybe I am just missing something.
>>> __kvm_memslots has a complex locking assert. I do not see we would take
>>> slots_lock anywhere from the notifier call path. IIUC that means that
>>> users_count has to be zero at that time. I have no idea how that is
>>> guaranteed.
>>
>> Nevermind, ENOCOFFEE.  This is gfn_to_hva, not gfn_to_pfn.  It only
>> needs SRCU.
> 
> OK, so just the make sure I follow, the change above is correct?

Yes.  It's only gfn_to_pfn that calls get_user_pages and therefore can
sleep.

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✓ Fi.CI.BAT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev3)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (7 preceding siblings ...)
  2018-06-25 10:23 ` ✗ Fi.CI.CHECKPATCH: warning for mm, oom: distinguish blockable mode for mmu notifiers (rev3) Patchwork
@ 2018-06-25 10:56 ` Patchwork
  2018-06-25 13:50 ` ✗ Fi.CI.CHECKPATCH: warning for mm, oom: distinguish blockable mode for mmu notifiers (rev4) Patchwork
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-25 10:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev3)
URL   : https://patchwork.freedesktop.org/series/45263/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4373 -> Patchwork_9410 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/45263/revisions/3/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9410 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@kms_pipe_crc_basic@read-crc-pipe-c-frame-sequence:
      fi-hsw-peppy:       PASS -> FAIL (fdo#103481)

    
    ==== Possible fixes ====

    igt@gem_exec_gttfill@basic:
      fi-byt-n2820:       FAIL (fdo#106744) -> PASS

    igt@gem_exec_suspend@basic-s4-devices:
      fi-kbl-7500u:       DMESG-WARN (fdo#105128) -> PASS

    
  fdo#103481 https://bugs.freedesktop.org/show_bug.cgi?id=103481
  fdo#105128 https://bugs.freedesktop.org/show_bug.cgi?id=105128
  fdo#106744 https://bugs.freedesktop.org/show_bug.cgi?id=106744


== Participating hosts (43 -> 36) ==

  Missing    (7): fi-ilk-m540 fi-hsw-4200u fi-byt-squawks fi-glk-dsi fi-bsw-cyan fi-ctg-p8600 fi-kbl-x1275 


== Build changes ==

    * Linux: CI_DRM_4373 -> Patchwork_9410

  CI_DRM_4373: be7193758db79443ad5dc45072a166746819ba7e @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4529: 23d50a49413aff619d00ec50fc2e051e9b45baa5 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9410: c36cf27f405785df4030ad8b5d129f853f4a96ef @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

c36cf27f4057 mm, oom: distinguish blockable mode for mmu notifiers

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9410/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-25 10:34         ` Paolo Bonzini
@ 2018-06-25 11:08           ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25 11:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: LKML, kvm, Radim Krčmář,
	linux-mm, Andrea Arcangeli, Jérôme Glisse

On Mon 25-06-18 12:34:43, Paolo Bonzini wrote:
> On 25/06/2018 10:45, Michal Hocko wrote:
> > On Mon 25-06-18 10:10:18, Paolo Bonzini wrote:
> >> On 25/06/2018 09:57, Michal Hocko wrote:
> >>> On Sun 24-06-18 10:11:21, Paolo Bonzini wrote:
> >>>> On 22/06/2018 17:02, Michal Hocko wrote:
> >>>>> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >>>>>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> >>>>>  	if (start <= apic_address && apic_address < end)
> >>>>>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> >>>>> +
> >>>>> +	return 0;
> >>>>
> >>>> This is wrong, gfn_to_hva can sleep.
> >>>
> >>> Hmm, I have tried to crawl the call chain and haven't found any
> >>> sleepable locks taken. Maybe I am just missing something.
> >>> __kvm_memslots has a complex locking assert. I do not see we would take
> >>> slots_lock anywhere from the notifier call path. IIUC that means that
> >>> users_count has to be zero at that time. I have no idea how that is
> >>> guaranteed.
> >>
> >> Nevermind, ENOCOFFEE.  This is gfn_to_hva, not gfn_to_pfn.  It only
> >> needs SRCU.
> > 
> > OK, so just the make sure I follow, the change above is correct?
> 
> Yes.  It's only gfn_to_pfn that calls get_user_pages and therefore can
> sleep.

OK, thanks for the confirmation!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]         ` <20180625080103.GB28965-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-06-25 13:31           ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25 13:31 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 25-06-18 10:01:03, Michal Hocko wrote:
> On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> > On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > > On Fri 22-06-18 17:13:02, Christian König wrote:
> > >> Hi Michal,
> > >>
> > >> [Adding Felix as well]
> > >>
> > >> Well first of all you have a misconception why at least the AMD graphics
> > >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> > >> we need to wait for hardware operations to finish and *NOT* because we need
> > >> to wait for locks.
> > >>
> > >> I'm not sure if your flag now means that you generally can't sleep in MMU
> > >> notifiers any more, but if that's the case at least AMD hardware will break
> > >> badly. In our case the approach of waiting for a short time for the process
> > >> to be reaped and then select another victim actually sounds like the right
> > >> thing to do.
> > > Well, I do not need to make the notifier code non blocking all the time.
> > > All I need is to ensure that it won't sleep if the flag says so and
> > > return -EAGAIN instead.
> > >
> > > So here is what I do for amdgpu:
> > 
> > In the case of KFD we also need to take the DQM lock:
> > 
> > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> > 
> > So we'd need to pass the blockable parameter all the way through that
> > call chain.
> 
> Thanks, I have missed that part. So I guess I will start with something
> similar to intel-gfx and back off when the current range needs some
> treatment. So this on top. Does it look correct?
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index d138a526feff..e2d422b3eb0b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  		struct amdgpu_mn_node *node;
>  		struct amdgpu_bo *bo;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock();
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);

Ble, just noticed that half of the change didn't get to git index...
This is what I have
commit c4701b36ac2802b903db3d05cf77c030fccce3a8
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Jun 25 15:24:03 2018 +0200

    fold me
    
    - amd gpu notifiers can sleep deeper in the callchain (evict_process_queues_cpsch
      on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
      we bail out when we have an intersecting range for starter

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-25 13:31           ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25 13:31 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christian König, LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

On Mon 25-06-18 10:01:03, Michal Hocko wrote:
> On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> > On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > > On Fri 22-06-18 17:13:02, Christian König wrote:
> > >> Hi Michal,
> > >>
> > >> [Adding Felix as well]
> > >>
> > >> Well first of all you have a misconception why at least the AMD graphics
> > >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> > >> we need to wait for hardware operations to finish and *NOT* because we need
> > >> to wait for locks.
> > >>
> > >> I'm not sure if your flag now means that you generally can't sleep in MMU
> > >> notifiers any more, but if that's the case at least AMD hardware will break
> > >> badly. In our case the approach of waiting for a short time for the process
> > >> to be reaped and then select another victim actually sounds like the right
> > >> thing to do.
> > > Well, I do not need to make the notifier code non blocking all the time.
> > > All I need is to ensure that it won't sleep if the flag says so and
> > > return -EAGAIN instead.
> > >
> > > So here is what I do for amdgpu:
> > 
> > In the case of KFD we also need to take the DQM lock:
> > 
> > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> > 
> > So we'd need to pass the blockable parameter all the way through that
> > call chain.
> 
> Thanks, I have missed that part. So I guess I will start with something
> similar to intel-gfx and back off when the current range needs some
> treatment. So this on top. Does it look correct?
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index d138a526feff..e2d422b3eb0b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  		struct amdgpu_mn_node *node;
>  		struct amdgpu_bo *bo;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock();
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);

Ble, just noticed that half of the change didn't get to git index...
This is what I have
commit c4701b36ac2802b903db3d05cf77c030fccce3a8
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Jun 25 15:24:03 2018 +0200

    fold me
    
    - amd gpu notifiers can sleep deeper in the callchain (evict_process_queues_cpsch
      on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
      we bail out when we have an intersecting range for starter

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-25 13:31           ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25 13:31 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christian König, LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes

On Mon 25-06-18 10:01:03, Michal Hocko wrote:
> On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> > On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > > On Fri 22-06-18 17:13:02, Christian Konig wrote:
> > >> Hi Michal,
> > >>
> > >> [Adding Felix as well]
> > >>
> > >> Well first of all you have a misconception why at least the AMD graphics
> > >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> > >> we need to wait for hardware operations to finish and *NOT* because we need
> > >> to wait for locks.
> > >>
> > >> I'm not sure if your flag now means that you generally can't sleep in MMU
> > >> notifiers any more, but if that's the case at least AMD hardware will break
> > >> badly. In our case the approach of waiting for a short time for the process
> > >> to be reaped and then select another victim actually sounds like the right
> > >> thing to do.
> > > Well, I do not need to make the notifier code non blocking all the time.
> > > All I need is to ensure that it won't sleep if the flag says so and
> > > return -EAGAIN instead.
> > >
> > > So here is what I do for amdgpu:
> > 
> > In the case of KFD we also need to take the DQM lock:
> > 
> > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> > 
> > So we'd need to pass the blockable parameter all the way through that
> > call chain.
> 
> Thanks, I have missed that part. So I guess I will start with something
> similar to intel-gfx and back off when the current range needs some
> treatment. So this on top. Does it look correct?
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index d138a526feff..e2d422b3eb0b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  		struct amdgpu_mn_node *node;
>  		struct amdgpu_bo *bo;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock();
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);

Ble, just noticed that half of the change didn't get to git index...
This is what I have
commit c4701b36ac2802b903db3d05cf77c030fccce3a8
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Jun 25 15:24:03 2018 +0200

    fold me
    
    - amd gpu notifiers can sleep deeper in the callchain (evict_process_queues_cpsch
      on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
      we bail out when we have an intersecting range for starter

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-25  8:01       ` Michal Hocko
@ 2018-06-25 13:31         ` Michal Hocko
  2018-06-25 13:31           ` Michal Hocko
       [not found]         ` <20180625080103.GB28965-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-25 13:31 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 25-06-18 10:01:03, Michal Hocko wrote:
> On Fri 22-06-18 16:09:06, Felix Kuehling wrote:
> > On 2018-06-22 11:24 AM, Michal Hocko wrote:
> > > On Fri 22-06-18 17:13:02, Christian König wrote:
> > >> Hi Michal,
> > >>
> > >> [Adding Felix as well]
> > >>
> > >> Well first of all you have a misconception why at least the AMD graphics
> > >> driver need to be able to sleep in an MMU notifier: We need to sleep because
> > >> we need to wait for hardware operations to finish and *NOT* because we need
> > >> to wait for locks.
> > >>
> > >> I'm not sure if your flag now means that you generally can't sleep in MMU
> > >> notifiers any more, but if that's the case at least AMD hardware will break
> > >> badly. In our case the approach of waiting for a short time for the process
> > >> to be reaped and then select another victim actually sounds like the right
> > >> thing to do.
> > > Well, I do not need to make the notifier code non blocking all the time.
> > > All I need is to ensure that it won't sleep if the flag says so and
> > > return -EAGAIN instead.
> > >
> > > So here is what I do for amdgpu:
> > 
> > In the case of KFD we also need to take the DQM lock:
> > 
> > amdgpu_mn_invalidate_range_start_hsa -> amdgpu_amdkfd_evict_userptr ->
> > kgd2kfd_quiesce_mm -> kfd_process_evict_queues -> evict_process_queues_cpsch
> > 
> > So we'd need to pass the blockable parameter all the way through that
> > call chain.
> 
> Thanks, I have missed that part. So I guess I will start with something
> similar to intel-gfx and back off when the current range needs some
> treatment. So this on top. Does it look correct?
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index d138a526feff..e2d422b3eb0b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -266,6 +266,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  		struct amdgpu_mn_node *node;
>  		struct amdgpu_bo *bo;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock();
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);

Ble, just noticed that half of the change didn't get to git index...
This is what I have
commit c4701b36ac2802b903db3d05cf77c030fccce3a8
Author: Michal Hocko <mhocko@suse.com>
Date:   Mon Jun 25 15:24:03 2018 +0200

    fold me
    
    - amd gpu notifiers can sleep deeper in the callchain (evict_process_queues_cpsch
      on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
      we bail out when we have an intersecting range for starter

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index d138a526feff..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -225,6 +225,11 @@ static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -266,6 +271,11 @@ static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✗ Fi.CI.CHECKPATCH: warning for mm, oom: distinguish blockable mode for mmu notifiers (rev4)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (8 preceding siblings ...)
  2018-06-25 10:56 ` ✓ Fi.CI.BAT: success " Patchwork
@ 2018-06-25 13:50 ` Patchwork
  2018-06-25 14:00 ` ✓ Fi.CI.IGT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev3) Patchwork
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-25 13:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev4)
URL   : https://patchwork.freedesktop.org/series/45263/
State : warning

== Summary ==

$ dim checkpatch origin/drm-tip
debbcdd2c559 mm, oom: distinguish blockable mode for mmu notifiers
-:17: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line)
#17: 
> > >> Well first of all you have a misconception why at least the AMD graphics

-:63: ERROR:GIT_COMMIT_ID: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit fatal: bad o ("5cf77c030fccce3a8")'
#63: 
commit c4701b36ac2802b903db3d05cf77c030fccce3a8

-:100: ERROR:MISSING_SIGN_OFF: Missing Signed-off-by: line(s)

total: 2 errors, 1 warnings, 0 checks, 22 lines checked

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✓ Fi.CI.IGT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev3)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (9 preceding siblings ...)
  2018-06-25 13:50 ` ✗ Fi.CI.CHECKPATCH: warning for mm, oom: distinguish blockable mode for mmu notifiers (rev4) Patchwork
@ 2018-06-25 14:00 ` Patchwork
  2018-06-25 14:10 ` ✓ Fi.CI.BAT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev4) Patchwork
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-25 14:00 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev3)
URL   : https://patchwork.freedesktop.org/series/45263/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4373_full -> Patchwork_9410_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9410_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9410_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9410_full:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_schedule@deep-bsd1:
      shard-kbl:          PASS -> SKIP

    igt@gem_exec_schedule@deep-bsd2:
      shard-kbl:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9410_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@drv_suspend@shrink:
      shard-apl:          PASS -> FAIL (fdo#106886)

    igt@kms_atomic_transition@1x-modeset-transitions-nonblocking:
      shard-glk:          PASS -> FAIL (fdo#105703)

    igt@kms_flip@flip-vs-expired-vblank:
      shard-glk:          PASS -> FAIL (fdo#105363)

    igt@kms_flip@plain-flip-ts-check:
      shard-glk:          PASS -> FAIL (fdo#100368) +1

    igt@kms_flip_tiling@flip-y-tiled:
      shard-glk:          PASS -> FAIL (fdo#104724, fdo#103822)

    igt@kms_setmode@basic:
      shard-kbl:          PASS -> FAIL (fdo#99912)

    igt@perf@polling:
      shard-hsw:          PASS -> FAIL (fdo#102252)

    igt@perf_pmu@idle-no-semaphores-rcs0:
      shard-snb:          NOTRUN -> INCOMPLETE (fdo#105411)

    
    ==== Possible fixes ====

    igt@drv_selftest@live_hangcheck:
      shard-apl:          DMESG-FAIL (fdo#106560, fdo#106947) -> PASS

    igt@gem_exec_params@rs-invalid-on-bsd-ring:
      shard-snb:          INCOMPLETE (fdo#105411) -> SKIP

    igt@kms_atomic_transition@1x-modeset-transitions-nonblocking-fencing:
      shard-glk:          FAIL (fdo#105703) -> PASS

    igt@kms_flip_tiling@flip-to-x-tiled:
      shard-glk:          FAIL (fdo#104724) -> PASS

    igt@kms_rotation_crc@sprite-rotation-180:
      shard-hsw:          FAIL (fdo#104724, fdo#103925) -> PASS

    
    ==== Warnings ====

    igt@drv_selftest@live_gtt:
      shard-glk:          FAIL (fdo#105347) -> INCOMPLETE (k.org#198133, fdo#103359)

    
  fdo#100368 https://bugs.freedesktop.org/show_bug.cgi?id=100368
  fdo#102252 https://bugs.freedesktop.org/show_bug.cgi?id=102252
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#103925 https://bugs.freedesktop.org/show_bug.cgi?id=103925
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105347 https://bugs.freedesktop.org/show_bug.cgi?id=105347
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105411 https://bugs.freedesktop.org/show_bug.cgi?id=105411
  fdo#105703 https://bugs.freedesktop.org/show_bug.cgi?id=105703
  fdo#106560 https://bugs.freedesktop.org/show_bug.cgi?id=106560
  fdo#106886 https://bugs.freedesktop.org/show_bug.cgi?id=106886
  fdo#106947 https://bugs.freedesktop.org/show_bug.cgi?id=106947
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4373 -> Patchwork_9410

  CI_DRM_4373: be7193758db79443ad5dc45072a166746819ba7e @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4529: 23d50a49413aff619d00ec50fc2e051e9b45baa5 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9410: c36cf27f405785df4030ad8b5d129f853f4a96ef @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9410/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✓ Fi.CI.BAT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev4)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (10 preceding siblings ...)
  2018-06-25 14:00 ` ✓ Fi.CI.IGT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev3) Patchwork
@ 2018-06-25 14:10 ` Patchwork
  2018-06-25 19:22 ` ✓ Fi.CI.IGT: " Patchwork
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-25 14:10 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev4)
URL   : https://patchwork.freedesktop.org/series/45263/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4373 -> Patchwork_9413 =

== Summary - SUCCESS ==

  No regressions found.

  External URL: https://patchwork.freedesktop.org/api/1.0/series/45263/revisions/4/mbox/

== Known issues ==

  Here are the changes found in Patchwork_9413 that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@gem_ringfill@basic-default-hang:
      fi-pnv-d510:        NOTRUN -> DMESG-WARN (fdo#101600)

    
    ==== Possible fixes ====

    igt@gem_exec_gttfill@basic:
      fi-byt-n2820:       FAIL (fdo#106744) -> PASS

    igt@gem_exec_suspend@basic-s4-devices:
      fi-kbl-7500u:       DMESG-WARN (fdo#105128) -> PASS

    
  fdo#101600 https://bugs.freedesktop.org/show_bug.cgi?id=101600
  fdo#105128 https://bugs.freedesktop.org/show_bug.cgi?id=105128
  fdo#106744 https://bugs.freedesktop.org/show_bug.cgi?id=106744


== Participating hosts (43 -> 39) ==

  Additional (1): fi-pnv-d510 
  Missing    (5): fi-ctg-p8600 fi-ilk-m540 fi-byt-squawks fi-bsw-cyan fi-hsw-4200u 


== Build changes ==

    * Linux: CI_DRM_4373 -> Patchwork_9413

  CI_DRM_4373: be7193758db79443ad5dc45072a166746819ba7e @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4529: 23d50a49413aff619d00ec50fc2e051e9b45baa5 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9413: debbcdd2c559a0ff0bfdc8cc5ce38935be009eff @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

debbcdd2c559 mm, oom: distinguish blockable mode for mmu notifiers

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9413/issues.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✓ Fi.CI.IGT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev4)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (11 preceding siblings ...)
  2018-06-25 14:10 ` ✓ Fi.CI.BAT: success for mm, oom: distinguish blockable mode for mmu notifiers (rev4) Patchwork
@ 2018-06-25 19:22 ` Patchwork
       [not found] ` <20180622150242.16558-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-25 19:22 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev4)
URL   : https://patchwork.freedesktop.org/series/45263/
State : success

== Summary ==

= CI Bug Log - changes from CI_DRM_4373_full -> Patchwork_9413_full =

== Summary - WARNING ==

  Minor unknown changes coming with Patchwork_9413_full need to be verified
  manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_9413_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

== Possible new issues ==

  Here are the unknown changes that may have been introduced in Patchwork_9413_full:

  === IGT changes ===

    ==== Warnings ====

    igt@gem_exec_schedule@deep-bsd1:
      shard-kbl:          PASS -> SKIP +1

    igt@gem_exec_schedule@deep-bsd2:
      shard-kbl:          SKIP -> PASS

    igt@pm_rc6_residency@rc6-accuracy:
      shard-snb:          SKIP -> PASS

    
== Known issues ==

  Here are the changes found in Patchwork_9413_full that come from known issues:

  === IGT changes ===

    ==== Issues hit ====

    igt@drv_suspend@shrink:
      shard-snb:          PASS -> FAIL (fdo#106886)

    igt@kms_flip@flip-vs-expired-vblank:
      shard-glk:          PASS -> FAIL (fdo#102887, fdo#105363)

    igt@kms_flip_tiling@flip-y-tiled:
      shard-glk:          PASS -> FAIL (fdo#104724, fdo#103822)

    igt@kms_setmode@basic:
      shard-kbl:          PASS -> FAIL (fdo#99912)

    
    ==== Possible fixes ====

    igt@kms_atomic_transition@1x-modeset-transitions-nonblocking-fencing:
      shard-glk:          FAIL (fdo#105703) -> PASS

    igt@kms_flip_tiling@flip-x-tiled:
      shard-glk:          FAIL (fdo#104724, fdo#103822) -> PASS

    igt@kms_rotation_crc@sprite-rotation-180:
      shard-hsw:          FAIL (fdo#104724, fdo#103925) -> PASS

    
    ==== Warnings ====

    igt@drv_selftest@live_gtt:
      shard-glk:          FAIL (fdo#105347) -> INCOMPLETE (k.org#198133, fdo#103359)

    
  fdo#102887 https://bugs.freedesktop.org/show_bug.cgi?id=102887
  fdo#103359 https://bugs.freedesktop.org/show_bug.cgi?id=103359
  fdo#103822 https://bugs.freedesktop.org/show_bug.cgi?id=103822
  fdo#103925 https://bugs.freedesktop.org/show_bug.cgi?id=103925
  fdo#104724 https://bugs.freedesktop.org/show_bug.cgi?id=104724
  fdo#105347 https://bugs.freedesktop.org/show_bug.cgi?id=105347
  fdo#105363 https://bugs.freedesktop.org/show_bug.cgi?id=105363
  fdo#105703 https://bugs.freedesktop.org/show_bug.cgi?id=105703
  fdo#106886 https://bugs.freedesktop.org/show_bug.cgi?id=106886
  fdo#99912 https://bugs.freedesktop.org/show_bug.cgi?id=99912
  k.org#198133 https://bugzilla.kernel.org/show_bug.cgi?id=198133


== Participating hosts (5 -> 5) ==

  No changes in participating hosts


== Build changes ==

    * Linux: CI_DRM_4373 -> Patchwork_9413

  CI_DRM_4373: be7193758db79443ad5dc45072a166746819ba7e @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_4529: 23d50a49413aff619d00ec50fc2e051e9b45baa5 @ git://anongit.freedesktop.org/xorg/app/intel-gpu-tools
  Patchwork_9413: debbcdd2c559a0ff0bfdc8cc5ce38935be009eff @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_9413/shards.html
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (14 preceding siblings ...)
  2018-06-27  7:44 ` Michal Hocko
@ 2018-06-27  7:44 ` Michal Hocko
  2018-07-02  9:14     ` Christian König
                     ` (4 more replies)
  2018-06-27  9:05 ` ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers (rev5) Patchwork
  2018-07-11 10:57 ` ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers (rev6) Patchwork
  17 siblings, 5 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-27  7:44 UTC (permalink / raw)
  To: LKML
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Felix Kuehling,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross

This is the v2 of RFC based on the feedback I've received so far. The
code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
because I have no idea how.

Any further feedback is highly appreciated of course.
---
From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 20 Jun 2018 15:03:20 +0200
Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note that a notifier failed to the log for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/kvm/x86.c                      |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
 drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
 drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
 drivers/infiniband/hw/mlx5/odp.c        |  2 +-
 drivers/misc/mic/scif/scif_dma.c        |  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
 drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
 include/linux/kvm_host.h                |  4 +--
 include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
 include/linux/oom.h                     |  2 +-
 include/rdma/ib_umem_odp.h              |  3 +-
 mm/hmm.c                                |  7 ++--
 mm/mmap.c                               |  2 +-
 mm/mmu_notifier.c                       | 19 ++++++++---
 mm/oom_kill.c                           | 29 ++++++++--------
 virt/kvm/kvm_main.c                     | 15 ++++++---
 19 files changed, 225 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..9cbff68f6b41 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			spin_unlock(&mn->lock);
+			return -EAGAIN;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 
 	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..f8b35df44c60 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
 	struct interval_tree_node *it;
+	int ret = 0;
 
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		struct radeon_bo *bo;
 		long r;
 
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
+
 		node = container_of(it, struct radeon_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		}
 	}
 	
+out_unlock:
 	mutex_unlock(&rmn->lock);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..6ec748eccff7 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
 				      ULLONG_MAX,
 				      ib_umem_notifier_release_trampoline,
+				      true,
 				      NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+	int ret;
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
+	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_start_trampoline, NULL);
+				      invalidate_range_start_trampoline,
+				      blockable, NULL);
 	up_read(&context->umem_rwsem);
+
+	return ret;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
@@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	if (!context->invalidate_range)
 		return;
 
+	/*
+	 * TODO: we currently bail out if there is any sleepable work to be done
+	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
+	 * here. But this is ugly and fragile.
+	 */
 	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_end_trampoline, NULL);
+				      invalidate_range_end_trampoline, true, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
 }
@@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 last,
 				  umem_call_back cb,
+				  bool blockable,
 				  void *cookie)
 {
 	int ret_val = 0;
@@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 
 	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
 			node; node = next) {
+		/* TODO move the blockable decision up to the callback */
+		if (!blockable)
+			return -EAGAIN;
 		next = rbt_ib_umem_iter_next(node, start, last - 1);
 		umem = container_of(node, struct ib_umem_odp, interval_tree);
 		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..e1c7996c018e 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,9 +67,9 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static void mmu_notifier_range_start(struct mmu_notifier *,
+static int mmu_notifier_range_start(struct mmu_notifier *,
 				     struct mm_struct *,
-				     unsigned long, unsigned long);
+				     unsigned long, unsigned long, bool);
 static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
 					   unsigned long, unsigned long);
 static void do_remove(struct mmu_rb_handler *handler,
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f1a87a690a4c..d216e0d2921d 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
 
 	down_read(&ctx->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
-				      mr_leaf_free, imr);
+				      mr_leaf_free, true, imr);
 	up_read(&ctx->umem_rwsem);
 
 	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..6369aeaa7056 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0;
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..55b4f0e3f4d6 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
 
 /* ------------------------------------------------------------------ */
 
+static bool in_range(struct grant_map *map,
+			      unsigned long start, unsigned long end)
+{
+	if (!map->vma)
+		return false;
+	if (map->vma->vm_start >= end)
+		return false;
+	if (map->vma->vm_end <= start)
+		return false;
+
+	return true;
+}
+
 static void unmap_if_in_range(struct grant_map *map,
 			      unsigned long start, unsigned long end)
 {
 	unsigned long mstart, mend;
 	int err;
 
-	if (!map->vma)
-		return;
-	if (map->vma->vm_start >= end)
-		return;
-	if (map->vma->vm_end <= start)
-		return;
 	mstart = max(start, map->vma->vm_start);
 	mend   = min(end,   map->vma->vm_end);
 	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
@@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
+	int ret = 0;
+
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
 
-	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
+
+out_unlock:
 	mutex_unlock(&priv->lock);
+
+	return ret;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..148935085194 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end);
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..2eb1a2d01759 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -151,13 +151,15 @@ struct mmu_notifier_ops {
 	 * address space but may still be referenced by sptes until
 	 * the last refcount is dropped.
 	 *
-	 * If both of these callbacks cannot block, and invalidate_range
-	 * cannot block, mmu_notifier_ops.flags should have
-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
+	 * If blockable argument is set to false then the callback cannot
+	 * sleep and has to return with -EAGAIN. 0 should be returned
+	 * otherwise.
+	 *
 	 */
-	void (*invalidate_range_start)(struct mmu_notifier *mn,
+	int (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start, unsigned long end,
+				       bool blockable);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
@@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
@@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 {
 }
 
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac113e96d..92f70e4c6252 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-void __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 6a17f856f841..381cdf5a9bd1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
  */
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 end,
-				  umem_call_back cb, void *cookie);
+				  umem_call_back cb,
+				  bool blockable, void *cookie);
 
 /*
  * Find first region intersecting with address range.
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87ef4b1a..336bee8c4e25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
 		 * reliably test it.
 		 */
 		mutex_lock(&oom_lock);
-		__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..103b2b450043 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret) {
+				pr_info("%pS callback failed with %d in %sblockable context.\n",
+						mn->ops->invalidate_range_start, _ret,
+						!blockable ? "non-": "");
+				ret = _ret;
+			}
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..5a936cf24d79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto unlock_oom;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..16ce38f178d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable)
 {
+	return 0;
 }
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
@@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-27  7:44 ` Michal Hocko
  2018-07-02  9:14     ` Christian König
                     ` (4 more replies)
  0 siblings, 5 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-27  7:44 UTC (permalink / raw)
  To: LKML
  Cc: David (ChunMing) Zhou, Paolo Bonzini, Radim Krčmář,
	Alex Deucher, Christian König, David Airlie, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Doug Ledford, Jason Gunthorpe,
	Mike Marciniszyn, Dennis Dalessandro, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

This is the v2 of RFC based on the feedback I've received so far. The
code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
because I have no idea how.

Any further feedback is highly appreciated of course.
---
From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 20 Jun 2018 15:03:20 +0200
Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note that a notifier failed to the log for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/kvm/x86.c                      |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
 drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
 drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
 drivers/infiniband/hw/mlx5/odp.c        |  2 +-
 drivers/misc/mic/scif/scif_dma.c        |  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
 drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
 include/linux/kvm_host.h                |  4 +--
 include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
 include/linux/oom.h                     |  2 +-
 include/rdma/ib_umem_odp.h              |  3 +-
 mm/hmm.c                                |  7 ++--
 mm/mmap.c                               |  2 +-
 mm/mmu_notifier.c                       | 19 ++++++++---
 mm/oom_kill.c                           | 29 ++++++++--------
 virt/kvm/kvm_main.c                     | 15 ++++++---
 19 files changed, 225 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..9cbff68f6b41 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			spin_unlock(&mn->lock);
+			return -EAGAIN;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 
 	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..f8b35df44c60 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
 	struct interval_tree_node *it;
+	int ret = 0;
 
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		struct radeon_bo *bo;
 		long r;
 
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
+
 		node = container_of(it, struct radeon_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		}
 	}
 	
+out_unlock:
 	mutex_unlock(&rmn->lock);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..6ec748eccff7 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
 				      ULLONG_MAX,
 				      ib_umem_notifier_release_trampoline,
+				      true,
 				      NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+	int ret;
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
+	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_start_trampoline, NULL);
+				      invalidate_range_start_trampoline,
+				      blockable, NULL);
 	up_read(&context->umem_rwsem);
+
+	return ret;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
@@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	if (!context->invalidate_range)
 		return;
 
+	/*
+	 * TODO: we currently bail out if there is any sleepable work to be done
+	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
+	 * here. But this is ugly and fragile.
+	 */
 	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_end_trampoline, NULL);
+				      invalidate_range_end_trampoline, true, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
 }
@@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 last,
 				  umem_call_back cb,
+				  bool blockable,
 				  void *cookie)
 {
 	int ret_val = 0;
@@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 
 	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
 			node; node = next) {
+		/* TODO move the blockable decision up to the callback */
+		if (!blockable)
+			return -EAGAIN;
 		next = rbt_ib_umem_iter_next(node, start, last - 1);
 		umem = container_of(node, struct ib_umem_odp, interval_tree);
 		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..e1c7996c018e 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,9 +67,9 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static void mmu_notifier_range_start(struct mmu_notifier *,
+static int mmu_notifier_range_start(struct mmu_notifier *,
 				     struct mm_struct *,
-				     unsigned long, unsigned long);
+				     unsigned long, unsigned long, bool);
 static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
 					   unsigned long, unsigned long);
 static void do_remove(struct mmu_rb_handler *handler,
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f1a87a690a4c..d216e0d2921d 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
 
 	down_read(&ctx->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
-				      mr_leaf_free, imr);
+				      mr_leaf_free, true, imr);
 	up_read(&ctx->umem_rwsem);
 
 	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..6369aeaa7056 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0;
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..55b4f0e3f4d6 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
 
 /* ------------------------------------------------------------------ */
 
+static bool in_range(struct grant_map *map,
+			      unsigned long start, unsigned long end)
+{
+	if (!map->vma)
+		return false;
+	if (map->vma->vm_start >= end)
+		return false;
+	if (map->vma->vm_end <= start)
+		return false;
+
+	return true;
+}
+
 static void unmap_if_in_range(struct grant_map *map,
 			      unsigned long start, unsigned long end)
 {
 	unsigned long mstart, mend;
 	int err;
 
-	if (!map->vma)
-		return;
-	if (map->vma->vm_start >= end)
-		return;
-	if (map->vma->vm_end <= start)
-		return;
 	mstart = max(start, map->vma->vm_start);
 	mend   = min(end,   map->vma->vm_end);
 	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
@@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
+	int ret = 0;
+
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
 
-	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
+
+out_unlock:
 	mutex_unlock(&priv->lock);
+
+	return ret;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..148935085194 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end);
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..2eb1a2d01759 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -151,13 +151,15 @@ struct mmu_notifier_ops {
 	 * address space but may still be referenced by sptes until
 	 * the last refcount is dropped.
 	 *
-	 * If both of these callbacks cannot block, and invalidate_range
-	 * cannot block, mmu_notifier_ops.flags should have
-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
+	 * If blockable argument is set to false then the callback cannot
+	 * sleep and has to return with -EAGAIN. 0 should be returned
+	 * otherwise.
+	 *
 	 */
-	void (*invalidate_range_start)(struct mmu_notifier *mn,
+	int (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start, unsigned long end,
+				       bool blockable);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
@@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
@@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 {
 }
 
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac113e96d..92f70e4c6252 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-void __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 6a17f856f841..381cdf5a9bd1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
  */
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 end,
-				  umem_call_back cb, void *cookie);
+				  umem_call_back cb,
+				  bool blockable, void *cookie);
 
 /*
  * Find first region intersecting with address range.
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87ef4b1a..336bee8c4e25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
 		 * reliably test it.
 		 */
 		mutex_lock(&oom_lock);
-		__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..103b2b450043 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret) {
+				pr_info("%pS callback failed with %d in %sblockable context.\n",
+						mn->ops->invalidate_range_start, _ret,
+						!blockable ? "non-": "");
+				ret = _ret;
+			}
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..5a936cf24d79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto unlock_oom;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..16ce38f178d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable)
 {
+	return 0;
 }
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
@@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-06-27  7:44 ` Michal Hocko
  2018-07-02  9:14     ` Christian König
                     ` (4 more replies)
  0 siblings, 5 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-27  7:44 UTC (permalink / raw)
  To: LKML
  Cc: David (ChunMing) Zhou, Paolo Bonzini, Radim Krčmář,
	Alex Deucher, Christian König, David Airlie, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Doug Ledford, Jason Gunthorpe,
	Mike Marciniszyn, Dennis Dalessandro, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

This is the v2 of RFC based on the feedback I've received so far. The
code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
because I have no idea how.

Any further feedback is highly appreciated of course.
---

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (13 preceding siblings ...)
       [not found] ` <20180622150242.16558-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
@ 2018-06-27  7:44 ` Michal Hocko
  2018-06-27  7:44 ` Michal Hocko
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-06-27  7:44 UTC (permalink / raw)
  To: LKML
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Felix Kuehling, Andrea Arcangeli, David (ChunMing) Zhou,
	Dimitri Sivanich, linux-rdma, amd-gfx, Jason Gunthorpe,
	Doug Ledford, David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross

This is the v2 of RFC based on the feedback I've received so far. The
code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
because I have no idea how.

Any further feedback is highly appreciated of course.
---
From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 20 Jun 2018 15:03:20 +0200
Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a
new oom victim prematurely because the previous one still hasn't torn
its memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held.
Moreover majority of notifiers only care about a portion of the address
space and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW which is
harder to handle and we have to bail out though.

This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
gets a blockable flag and callbacks are not allowed to sleep if the
flag is set to false. This is achieved by using trylock instead of the
sleepable lock for most callbacks and continue as long as we do not
block down the call chain.

I think we can improve that even further because there is a common
pattern to do a range lookup first and then do something about that.
The first part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

Changes since rfc v1
- gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
  on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
  we bail out when we have an intersecting range for starter
- note that a notifier failed to the log for easier debugging
- back off early in ib_umem_notifier_invalidate_range_start if the
  callback is called
- mn_invl_range_start waits for completion down the unmap_grant_pages
  path so we have to back off early on overlapping ranges

Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Cc: Sudeep Dutt <sudeep.dutt@intel.com>
Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/x86/kvm/x86.c                      |  7 ++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
 drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
 drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
 drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
 drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
 drivers/infiniband/hw/mlx5/odp.c        |  2 +-
 drivers/misc/mic/scif/scif_dma.c        |  7 ++--
 drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
 drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
 include/linux/kvm_host.h                |  4 +--
 include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
 include/linux/oom.h                     |  2 +-
 include/rdma/ib_umem_odp.h              |  3 +-
 mm/hmm.c                                |  7 ++--
 mm/mmap.c                               |  2 +-
 mm/mmu_notifier.c                       | 19 ++++++++---
 mm/oom_kill.c                           | 29 ++++++++--------
 virt/kvm/kvm_main.c                     | 15 ++++++---
 19 files changed, 225 insertions(+), 80 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
 }
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end,
+		bool blockable)
 {
 	unsigned long apic_address;
 
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
 	if (start <= apic_address && apic_address < end)
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+	return 0;
 }
 
 void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..3399a4a927fb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  *
  * Take the rmn read side lock.
  */
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
 {
-	mutex_lock(&rmn->read_lock);
+	if (blockable)
+		mutex_lock(&rmn->read_lock);
+	else if (!mutex_trylock(&rmn->read_lock))
+		return -EAGAIN;
+
 	if (atomic_inc_return(&rmn->recursion) == 1)
 		down_read_non_owner(&rmn->lock);
 	mutex_unlock(&rmn->read_lock);
+
+	return 0;
 }
 
 /**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	/* TODO we should be able to split locking for interval tree and
+	 * amdgpu_mn_invalidate_node
+	 */
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
 		amdgpu_mn_invalidate_node(node, start, end);
 	}
+
+	return 0;
 }
 
 /**
@@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
  * necessitates evicting all user-mode queues of the process. The BOs
  * are restorted in amdgpu_mn_invalidate_range_end_hsa.
  */
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 						 struct mm_struct *mm,
 						 unsigned long start,
-						 unsigned long end)
+						 unsigned long end,
+						 bool blockable)
 {
 	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
 	struct interval_tree_node *it;
@@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	amdgpu_mn_read_lock(rmn);
+	if (amdgpu_mn_read_lock(rmn, blockable))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
 		struct amdgpu_mn_node *node;
 		struct amdgpu_bo *bo;
 
+		if (!blockable) {
+			amdgpu_mn_read_unlock(rmn);
+			return -EAGAIN;
+		}
+
 		node = container_of(it, struct amdgpu_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
 				amdgpu_amdkfd_evict_userptr(mem, mm);
 		}
 	}
+
+	return 0;
 }
 
 /**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..9cbff68f6b41 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
 	mo->attached = false;
 }
 
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 						       struct mm_struct *mm,
 						       unsigned long start,
-						       unsigned long end)
+						       unsigned long end,
+						       bool blockable)
 {
 	struct i915_mmu_notifier *mn =
 		container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	LIST_HEAD(cancelled);
 
 	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
-		return;
+		return 0;
 
 	/* interval ranges are inclusive, but invalidate range is exclusive */
 	end--;
@@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 	spin_lock(&mn->lock);
 	it = interval_tree_iter_first(&mn->objects, start, end);
 	while (it) {
+		if (!blockable) {
+			spin_unlock(&mn->lock);
+			return -EAGAIN;
+		}
 		/* The mmu_object is released late when destroying the
 		 * GEM object so it is entirely possible to gain a
 		 * reference on an object in the process of being freed
@@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
 
 	if (!list_empty(&cancelled))
 		flush_workqueue(mn->wq);
+
+	return 0;
 }
 
 static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..f8b35df44c60 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 					     struct mm_struct *mm,
 					     unsigned long start,
-					     unsigned long end)
+					     unsigned long end,
+					     bool blockable)
 {
 	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
 	struct ttm_operation_ctx ctx = { false, false };
 	struct interval_tree_node *it;
+	int ret = 0;
 
 	/* notification is exclusive, but interval is inclusive */
 	end -= 1;
 
-	mutex_lock(&rmn->lock);
+	/* TODO we should be able to split locking for interval tree and
+	 * the tear down.
+	 */
+	if (blockable)
+		mutex_lock(&rmn->lock);
+	else if (!mutex_trylock(&rmn->lock))
+		return -EAGAIN;
 
 	it = interval_tree_iter_first(&rmn->objects, start, end);
 	while (it) {
@@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		struct radeon_bo *bo;
 		long r;
 
+		if (!blockable) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
+
 		node = container_of(it, struct radeon_mn_node, it);
 		it = interval_tree_iter_next(it, start, end);
 
@@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
 		}
 	}
 	
+out_unlock:
 	mutex_unlock(&rmn->lock);
+
+	return ret;
 }
 
 static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..6ec748eccff7 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
 				      ULLONG_MAX,
 				      ib_umem_notifier_release_trampoline,
+				      true,
 				      NULL);
 	up_read(&context->umem_rwsem);
 }
@@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
 	return 0;
 }
 
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
+	int ret;
 
 	if (!context->invalidate_range)
-		return;
+		return 0;
+
+	if (blockable)
+		down_read(&context->umem_rwsem);
+	else if (!down_read_trylock(&context->umem_rwsem))
+		return -EAGAIN;
 
 	ib_ucontext_notifier_start_account(context);
-	down_read(&context->umem_rwsem);
-	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
+	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_start_trampoline, NULL);
+				      invalidate_range_start_trampoline,
+				      blockable, NULL);
 	up_read(&context->umem_rwsem);
+
+	return ret;
 }
 
 static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
@@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
 	if (!context->invalidate_range)
 		return;
 
+	/*
+	 * TODO: we currently bail out if there is any sleepable work to be done
+	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
+	 * here. But this is ugly and fragile.
+	 */
 	down_read(&context->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
 				      end,
-				      invalidate_range_end_trampoline, NULL);
+				      invalidate_range_end_trampoline, true, NULL);
 	up_read(&context->umem_rwsem);
 	ib_ucontext_notifier_end_account(context);
 }
@@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 last,
 				  umem_call_back cb,
+				  bool blockable,
 				  void *cookie)
 {
 	int ret_val = 0;
@@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 
 	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
 			node; node = next) {
+		/* TODO move the blockable decision up to the callback */
+		if (!blockable)
+			return -EAGAIN;
 		next = rbt_ib_umem_iter_next(node, start, last - 1);
 		umem = container_of(node, struct ib_umem_odp, interval_tree);
 		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..e1c7996c018e 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -67,9 +67,9 @@ struct mmu_rb_handler {
 
 static unsigned long mmu_node_start(struct mmu_rb_node *);
 static unsigned long mmu_node_last(struct mmu_rb_node *);
-static void mmu_notifier_range_start(struct mmu_notifier *,
+static int mmu_notifier_range_start(struct mmu_notifier *,
 				     struct mm_struct *,
-				     unsigned long, unsigned long);
+				     unsigned long, unsigned long, bool);
 static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
 					   unsigned long, unsigned long);
 static void do_remove(struct mmu_rb_handler *handler,
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
 	handler->ops->remove(handler->ops_arg, node);
 }
 
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start,
-				     unsigned long end)
+				     unsigned long end,
+				     bool blockable)
 {
 	struct mmu_rb_handler *handler =
 		container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
 
 	if (added)
 		queue_work(handler->wq, &handler->del_work);
+
+	return 0;
 }
 
 /*
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index f1a87a690a4c..d216e0d2921d 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
 
 	down_read(&ctx->umem_rwsem);
 	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
-				      mr_leaf_free, imr);
+				      mr_leaf_free, true, imr);
 	up_read(&ctx->umem_rwsem);
 
 	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..6369aeaa7056 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
 	schedule_work(&scif_info.misc_work);
 }
 
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						     struct mm_struct *mm,
 						     unsigned long start,
-						     unsigned long end)
+						     unsigned long end,
+						     bool blockable)
 {
 	struct scif_mmu_notif	*mmn;
 
 	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
 	scif_rma_destroy_tcw(mmn, start, end - start);
+
+	return 0;
 }
 
 static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
 /*
  * MMUOPS notifier callout functions
  */
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end)
+				       unsigned long start, unsigned long end,
+				       bool blockable)
 {
 	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
 						 ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
 	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
 		start, end, atomic_read(&gms->ms_range_active));
 	gru_flush_tlb_range(gms, start, end - start);
+
+	return 0;
 }
 
 static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..55b4f0e3f4d6 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
 
 /* ------------------------------------------------------------------ */
 
+static bool in_range(struct grant_map *map,
+			      unsigned long start, unsigned long end)
+{
+	if (!map->vma)
+		return false;
+	if (map->vma->vm_start >= end)
+		return false;
+	if (map->vma->vm_end <= start)
+		return false;
+
+	return true;
+}
+
 static void unmap_if_in_range(struct grant_map *map,
 			      unsigned long start, unsigned long end)
 {
 	unsigned long mstart, mend;
 	int err;
 
-	if (!map->vma)
-		return;
-	if (map->vma->vm_start >= end)
-		return;
-	if (map->vma->vm_end <= start)
-		return;
 	mstart = max(start, map->vma->vm_start);
 	mend   = min(end,   map->vma->vm_end);
 	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
@@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
 	WARN_ON(err);
 }
 
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
 				struct mm_struct *mm,
-				unsigned long start, unsigned long end)
+				unsigned long start, unsigned long end,
+				bool blockable)
 {
 	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
 	struct grant_map *map;
+	int ret = 0;
+
+	/* TODO do we really need a mutex here? */
+	if (blockable)
+		mutex_lock(&priv->lock);
+	else if (!mutex_trylock(&priv->lock))
+		return -EAGAIN;
 
-	mutex_lock(&priv->lock);
 	list_for_each_entry(map, &priv->maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
 	list_for_each_entry(map, &priv->freeable_maps, next) {
+		if (in_range(map, start, end)) {
+			ret = -EAGAIN;
+			goto out_unlock;
+		}
 		unmap_if_in_range(map, start, end);
 	}
+
+out_unlock:
 	mutex_unlock(&priv->lock);
+
+	return ret;
 }
 
 static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..148935085194 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
 }
 #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
 
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end);
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable);
 
 #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
 int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..2eb1a2d01759 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -151,13 +151,15 @@ struct mmu_notifier_ops {
 	 * address space but may still be referenced by sptes until
 	 * the last refcount is dropped.
 	 *
-	 * If both of these callbacks cannot block, and invalidate_range
-	 * cannot block, mmu_notifier_ops.flags should have
-	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
+	 * If blockable argument is set to false then the callback cannot
+	 * sleep and has to return with -EAGAIN. 0 should be returned
+	 * otherwise.
+	 *
 	 */
-	void (*invalidate_range_start)(struct mmu_notifier *mn,
+	int (*invalidate_range_start)(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
-				       unsigned long start, unsigned long end);
+				       unsigned long start, unsigned long end,
+				       bool blockable);
 	void (*invalidate_range_end)(struct mmu_notifier *mn,
 				     struct mm_struct *mm,
 				     unsigned long start, unsigned long end);
@@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 				     unsigned long address);
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
-extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end);
+extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable);
 extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end,
 				  bool only_end);
@@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
 	if (mm_has_notifiers(mm))
-		__mmu_notifier_invalidate_range_start(mm, start, end);
+		__mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	int ret = 0;
+	if (mm_has_notifiers(mm))
+		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+	return ret;
 }
 
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
@@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
 {
 }
 
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+				  unsigned long start, unsigned long end)
+{
+	return 0;
+}
+
 static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
 				  unsigned long start, unsigned long end)
 {
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac113e96d..92f70e4c6252 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-void __oom_reap_task_mm(struct mm_struct *mm);
+bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 6a17f856f841..381cdf5a9bd1 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
  */
 int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
 				  u64 start, u64 end,
-				  umem_call_back cb, void *cookie);
+				  umem_call_back cb,
+				  bool blockable, void *cookie);
 
 /*
  * Find first region intersecting with address range.
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	up_write(&hmm->mirrors_sem);
 }
 
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 				       struct mm_struct *mm,
 				       unsigned long start,
-				       unsigned long end)
+				       unsigned long end,
+				       bool blockable)
 {
 	struct hmm *hmm = mm->hmm;
 
 	VM_BUG_ON(!hmm);
 
 	atomic_inc(&hmm->sequence);
+
+	return 0;
 }
 
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87ef4b1a..336bee8c4e25 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
 		 * reliably test it.
 		 */
 		mutex_lock(&oom_lock);
-		__oom_reap_task_mm(mm);
+		(void)__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..103b2b450043 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
-				  unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+				  unsigned long start, unsigned long end,
+				  bool blockable)
 {
 	struct mmu_notifier *mn;
+	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
-		if (mn->ops->invalidate_range_start)
-			mn->ops->invalidate_range_start(mn, mm, start, end);
+		if (mn->ops->invalidate_range_start) {
+			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+			if (_ret) {
+				pr_info("%pS callback failed with %d in %sblockable context.\n",
+						mn->ops->invalidate_range_start, _ret,
+						!blockable ? "non-": "");
+				ret = _ret;
+			}
+		}
 	}
 	srcu_read_unlock(&srcu, id);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..5a936cf24d79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	bool ret = true;
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
 			struct mmu_gather tlb;
 
 			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
+			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+				ret = false;
+				continue;
+			}
 			unmap_page_range(&tlb, vma, start, end, NULL);
 			mmu_notifier_invalidate_range_end(mm, start, end);
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+
+	return ret;
 }
 
 static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * If the mm has invalidate_{start,end}() notifiers that could block,
-	 * sleep to give the oom victim some more time.
-	 * TODO: we really want to get rid of this ugly hack and make sure that
-	 * notifiers cannot block for unbounded amount of time
-	 */
-	if (mm_has_blockable_invalidate_notifiers(mm)) {
-		up_read(&mm->mmap_sem);
-		schedule_timeout_idle(HZ);
-		goto unlock_oom;
-	}
-
 	/*
 	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
 	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	__oom_reap_task_mm(mm);
+	/* failed to reap part of the address space. Try again later */
+	if (!__oom_reap_task_mm(mm)) {
+		up_read(&mm->mmap_sem);
+		ret = false;
+		goto unlock_oom;
+	}
 
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..16ce38f178d1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
 static unsigned long long kvm_createvm_count;
 static unsigned long long kvm_active_vms;
 
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
-		unsigned long start, unsigned long end)
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+		unsigned long start, unsigned long end, bool blockable)
 {
+	return 0;
 }
 
 bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
@@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 						    struct mm_struct *mm,
 						    unsigned long start,
-						    unsigned long end)
+						    unsigned long end,
+						    bool blockable)
 {
 	struct kvm *kvm = mmu_notifier_to_kvm(mn);
 	int need_tlb_flush = 0, idx;
+	int ret;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	spin_lock(&kvm->mmu_lock);
@@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 
 	spin_unlock(&kvm->mmu_lock);
 
-	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
 
 	srcu_read_unlock(&kvm->srcu, idx);
+
+	return ret;
 }
 
 static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
-- 
2.18.0

-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers (rev5)
  2018-06-22 15:02 [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers Michal Hocko
                   ` (15 preceding siblings ...)
  2018-06-27  7:44 ` Michal Hocko
@ 2018-06-27  9:05 ` Patchwork
  2018-07-11 10:57 ` ✗ Fi.CI.BAT: failure for mm, oom: distinguish blockable mode for mmu notifiers (rev6) Patchwork
  17 siblings, 0 replies; 125+ messages in thread
From: Patchwork @ 2018-06-27  9:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: intel-gfx

== Series Details ==

Series: mm, oom: distinguish blockable mode for mmu notifiers (rev5)
URL   : https://patchwork.freedesktop.org/series/45263/
State : failure

== Summary ==

Applying: mm, oom: distinguish blockable mode for mmu notifiers
Using index info to reconstruct a base tree...
M	arch/x86/kvm/x86.c
M	drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
M	mm/oom_kill.c
M	virt/kvm/kvm_main.c
Falling back to patching base and 3-way merge...
Auto-merging virt/kvm/kvm_main.c
Auto-merging mm/oom_kill.c
Auto-merging drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
CONFLICT (content): Merge conflict in drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
Auto-merging arch/x86/kvm/x86.c
error: Failed to merge in the changes.
Patch failed at 0001 mm, oom: distinguish blockable mode for mmu notifiers
Use 'git am --show-current-patch' to see the failed patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]   ` <20180627074421.GF32348-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-07-02  9:14     ` Christian König
  2018-07-09 12:29   ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02  9:14 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
>
> Any further feedback is highly appreciated of course.

That sounds like it should work and at least the amdgpu changes now look 
good to me on first glance.

Can you split that up further in the usual way? E.g. adding the 
blockable flag in one patch and fixing all implementations of the MMU 
notifier in follow up patches.

This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd 
changes.

Thanks,
Christian.

> ---
>  From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
>
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
>
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
>
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
>
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
>
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>    on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>    we bail out when we have an intersecting range for starter
> - note that a notifier failed to the log for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>    callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>    path so we have to back off early on overlapping ranges
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   arch/x86/kvm/x86.c                      |  7 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>   drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>   drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>   drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>   drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>   drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>   include/linux/kvm_host.h                |  4 +--
>   include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>   include/linux/oom.h                     |  2 +-
>   include/rdma/ib_umem_odp.h              |  3 +-
>   mm/hmm.c                                |  7 ++--
>   mm/mmap.c                               |  2 +-
>   mm/mmu_notifier.c                       | 19 ++++++++---
>   mm/oom_kill.c                           | 29 ++++++++--------
>   virt/kvm/kvm_main.c                     | 15 ++++++---
>   19 files changed, 225 insertions(+), 80 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   		struct amdgpu_bo *bo;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	spin_lock(&mn->lock);
>   	it = interval_tree_iter_first(&mn->objects, start, end);
>   	while (it) {
> +		if (!blockable) {
> +			spin_unlock(&mn->lock);
> +			return -EAGAIN;
> +		}
>   		/* The mmu_object is released late when destroying the
>   		 * GEM object so it is entirely possible to gain a
>   		 * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   
>   	if (!list_empty(&cancelled))
>   		flush_workqueue(mn->wq);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
>   	struct interval_tree_node *it;
> +	int ret = 0;
>   
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		struct radeon_bo *bo;
>   		long r;
>   
> +		if (!blockable) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
> +
>   		node = container_of(it, struct radeon_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		}
>   	}
>   	
> +out_unlock:
>   	mutex_unlock(&rmn->lock);
> +
> +	return ret;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>   				      ULLONG_MAX,
>   				      ib_umem_notifier_release_trampoline,
> +				      true,
>   				      NULL);
>   	up_read(&context->umem_rwsem);
>   }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +	int ret;
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
> -	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_start_trampoline, NULL);
> +				      invalidate_range_start_trampoline,
> +				      blockable, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return ret;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>   	if (!context->invalidate_range)
>   		return;
>   
> +	/*
> +	 * TODO: we currently bail out if there is any sleepable work to be done
> +	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +	 * here. But this is ugly and fragile.
> +	 */
>   	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_end_trampoline, NULL);
> +				      invalidate_range_end_trampoline, true, NULL);
>   	up_read(&context->umem_rwsem);
>   	ib_ucontext_notifier_end_account(context);
>   }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 last,
>   				  umem_call_back cb,
> +				  bool blockable,
>   				  void *cookie)
>   {
>   	int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   
>   	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>   			node; node = next) {
> +		/* TODO move the blockable decision up to the callback */
> +		if (!blockable)
> +			return -EAGAIN;
>   		next = rbt_ib_umem_iter_next(node, start, last - 1);
>   		umem = container_of(node, struct ib_umem_odp, interval_tree);
>   		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>   
>   static unsigned long mmu_node_start(struct mmu_rb_node *);
>   static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>   				     struct mm_struct *,
> -				     unsigned long, unsigned long);
> +				     unsigned long, unsigned long, bool);
>   static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>   					   unsigned long, unsigned long);
>   static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>   
>   	down_read(&ctx->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -				      mr_leaf_free, imr);
> +				      mr_leaf_free, true, imr);
>   	up_read(&ctx->umem_rwsem);
>   
>   	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>   
>   /* ------------------------------------------------------------------ */
>   
> +static bool in_range(struct grant_map *map,
> +			      unsigned long start, unsigned long end)
> +{
> +	if (!map->vma)
> +		return false;
> +	if (map->vma->vm_start >= end)
> +		return false;
> +	if (map->vma->vm_end <= start)
> +		return false;
> +
> +	return true;
> +}
> +
>   static void unmap_if_in_range(struct grant_map *map,
>   			      unsigned long start, unsigned long end)
>   {
>   	unsigned long mstart, mend;
>   	int err;
>   
> -	if (!map->vma)
> -		return;
> -	if (map->vma->vm_start >= end)
> -		return;
> -	if (map->vma->vm_end <= start)
> -		return;
>   	mstart = max(start, map->vma->vm_start);
>   	mend   = min(end,   map->vma->vm_end);
>   	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
> +	int ret = 0;
> +
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
>   
> -	mutex_lock(&priv->lock);
>   	list_for_each_entry(map, &priv->maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
>   	list_for_each_entry(map, &priv->freeable_maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
> +
> +out_unlock:
>   	mutex_unlock(&priv->lock);
> +
> +	return ret;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>   int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>   	 * address space but may still be referenced by sptes until
>   	 * the last refcount is dropped.
>   	 *
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
> +	 *
>   	 */
> -	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +	int (*invalidate_range_start)(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start, unsigned long end,
> +				       bool blockable);
>   	void (*invalidate_range_end)(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   				     unsigned long address);
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   {
>   }
>   
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>   	return 0;
>   }
>   
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>   
>   extern unsigned long oom_badness(struct task_struct *p,
>   		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>    */
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 end,
> -				  umem_call_back cb, void *cookie);
> +				  umem_call_back cb,
> +				  bool blockable, void *cookie);
>   
>   /*
>    * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>   		 * reliably test it.
>   		 */
>   		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>   		mutex_unlock(&oom_lock);
>   
>   		set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret) {
> +				pr_info("%pS callback failed with %d in %sblockable context.\n",
> +						mn->ops->invalidate_range_start, _ret,
> +						!blockable ? "non-": "");
> +				ret = _ret;
> +			}
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..16ce38f178d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable)
>   {
> +	return 0;
>   }
>   
>   bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> @@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02  9:14     ` Christian König
  0 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02  9:14 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: David (ChunMing) Zhou, Paolo Bonzini, Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
>
> Any further feedback is highly appreciated of course.

That sounds like it should work and at least the amdgpu changes now look 
good to me on first glance.

Can you split that up further in the usual way? E.g. adding the 
blockable flag in one patch and fixing all implementations of the MMU 
notifier in follow up patches.

This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd 
changes.

Thanks,
Christian.

> ---
>  From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
>
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
>
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
>
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
>
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
>
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>    on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>    we bail out when we have an intersecting range for starter
> - note that a notifier failed to the log for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>    callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>    path so we have to back off early on overlapping ranges
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   arch/x86/kvm/x86.c                      |  7 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>   drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>   drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>   drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>   drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>   drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>   include/linux/kvm_host.h                |  4 +--
>   include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>   include/linux/oom.h                     |  2 +-
>   include/rdma/ib_umem_odp.h              |  3 +-
>   mm/hmm.c                                |  7 ++--
>   mm/mmap.c                               |  2 +-
>   mm/mmu_notifier.c                       | 19 ++++++++---
>   mm/oom_kill.c                           | 29 ++++++++--------
>   virt/kvm/kvm_main.c                     | 15 ++++++---
>   19 files changed, 225 insertions(+), 80 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   		struct amdgpu_bo *bo;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	spin_lock(&mn->lock);
>   	it = interval_tree_iter_first(&mn->objects, start, end);
>   	while (it) {
> +		if (!blockable) {
> +			spin_unlock(&mn->lock);
> +			return -EAGAIN;
> +		}
>   		/* The mmu_object is released late when destroying the
>   		 * GEM object so it is entirely possible to gain a
>   		 * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   
>   	if (!list_empty(&cancelled))
>   		flush_workqueue(mn->wq);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
>   	struct interval_tree_node *it;
> +	int ret = 0;
>   
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		struct radeon_bo *bo;
>   		long r;
>   
> +		if (!blockable) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
> +
>   		node = container_of(it, struct radeon_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		}
>   	}
>   	
> +out_unlock:
>   	mutex_unlock(&rmn->lock);
> +
> +	return ret;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>   				      ULLONG_MAX,
>   				      ib_umem_notifier_release_trampoline,
> +				      true,
>   				      NULL);
>   	up_read(&context->umem_rwsem);
>   }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +	int ret;
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
> -	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_start_trampoline, NULL);
> +				      invalidate_range_start_trampoline,
> +				      blockable, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return ret;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>   	if (!context->invalidate_range)
>   		return;
>   
> +	/*
> +	 * TODO: we currently bail out if there is any sleepable work to be done
> +	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +	 * here. But this is ugly and fragile.
> +	 */
>   	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_end_trampoline, NULL);
> +				      invalidate_range_end_trampoline, true, NULL);
>   	up_read(&context->umem_rwsem);
>   	ib_ucontext_notifier_end_account(context);
>   }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 last,
>   				  umem_call_back cb,
> +				  bool blockable,
>   				  void *cookie)
>   {
>   	int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   
>   	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>   			node; node = next) {
> +		/* TODO move the blockable decision up to the callback */
> +		if (!blockable)
> +			return -EAGAIN;
>   		next = rbt_ib_umem_iter_next(node, start, last - 1);
>   		umem = container_of(node, struct ib_umem_odp, interval_tree);
>   		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>   
>   static unsigned long mmu_node_start(struct mmu_rb_node *);
>   static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>   				     struct mm_struct *,
> -				     unsigned long, unsigned long);
> +				     unsigned long, unsigned long, bool);
>   static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>   					   unsigned long, unsigned long);
>   static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>   
>   	down_read(&ctx->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -				      mr_leaf_free, imr);
> +				      mr_leaf_free, true, imr);
>   	up_read(&ctx->umem_rwsem);
>   
>   	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>   
>   /* ------------------------------------------------------------------ */
>   
> +static bool in_range(struct grant_map *map,
> +			      unsigned long start, unsigned long end)
> +{
> +	if (!map->vma)
> +		return false;
> +	if (map->vma->vm_start >= end)
> +		return false;
> +	if (map->vma->vm_end <= start)
> +		return false;
> +
> +	return true;
> +}
> +
>   static void unmap_if_in_range(struct grant_map *map,
>   			      unsigned long start, unsigned long end)
>   {
>   	unsigned long mstart, mend;
>   	int err;
>   
> -	if (!map->vma)
> -		return;
> -	if (map->vma->vm_start >= end)
> -		return;
> -	if (map->vma->vm_end <= start)
> -		return;
>   	mstart = max(start, map->vma->vm_start);
>   	mend   = min(end,   map->vma->vm_end);
>   	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
> +	int ret = 0;
> +
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
>   
> -	mutex_lock(&priv->lock);
>   	list_for_each_entry(map, &priv->maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
>   	list_for_each_entry(map, &priv->freeable_maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
> +
> +out_unlock:
>   	mutex_unlock(&priv->lock);
> +
> +	return ret;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>   int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>   	 * address space but may still be referenced by sptes until
>   	 * the last refcount is dropped.
>   	 *
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
> +	 *
>   	 */
> -	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +	int (*invalidate_range_start)(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start, unsigned long end,
> +				       bool blockable);
>   	void (*invalidate_range_end)(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   				     unsigned long address);
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   {
>   }
>   
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>   	return 0;
>   }
>   
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>   
>   extern unsigned long oom_badness(struct task_struct *p,
>   		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>    */
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 end,
> -				  umem_call_back cb, void *cookie);
> +				  umem_call_back cb,
> +				  bool blockable, void *cookie);
>   
>   /*
>    * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>   		 * reliably test it.
>   		 */
>   		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>   		mutex_unlock(&oom_lock);
>   
>   		set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret) {
> +				pr_info("%pS callback failed with %d in %sblockable context.\n",
> +						mn->ops->invalidate_range_start, _ret,
> +						!blockable ? "non-": "");
> +				ret = _ret;
> +			}
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..16ce38f178d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable)
>   {
> +	return 0;
>   }
>   
>   bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> @@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02  9:14     ` Christian König
  0 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02  9:14 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: David (ChunMing) Zhou, Paolo Bonzini, Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
>
> Any further feedback is highly appreciated of course.

That sounds like it should work and at least the amdgpu changes now look 
good to me on first glance.

Can you split that up further in the usual way? E.g. adding the 
blockable flag in one patch and fixing all implementations of the MMU 
notifier in follow up patches.

This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd 
changes.

Thanks,
Christian.

> ---
>  From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
>
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
>
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
>
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
>
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
>
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>    on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>    we bail out when we have an intersecting range for starter
> - note that a notifier failed to the log for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>    callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>    path so we have to back off early on overlapping ranges
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim KrA?mA!A?" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian KA?nig" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "JA(C)rA'me Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   arch/x86/kvm/x86.c                      |  7 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>   drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>   drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>   drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>   drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>   drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>   include/linux/kvm_host.h                |  4 +--
>   include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>   include/linux/oom.h                     |  2 +-
>   include/rdma/ib_umem_odp.h              |  3 +-
>   mm/hmm.c                                |  7 ++--
>   mm/mmap.c                               |  2 +-
>   mm/mmu_notifier.c                       | 19 ++++++++---
>   mm/oom_kill.c                           | 29 ++++++++--------
>   virt/kvm/kvm_main.c                     | 15 ++++++---
>   19 files changed, 225 insertions(+), 80 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   		struct amdgpu_bo *bo;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	spin_lock(&mn->lock);
>   	it = interval_tree_iter_first(&mn->objects, start, end);
>   	while (it) {
> +		if (!blockable) {
> +			spin_unlock(&mn->lock);
> +			return -EAGAIN;
> +		}
>   		/* The mmu_object is released late when destroying the
>   		 * GEM object so it is entirely possible to gain a
>   		 * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   
>   	if (!list_empty(&cancelled))
>   		flush_workqueue(mn->wq);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
>   	struct interval_tree_node *it;
> +	int ret = 0;
>   
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		struct radeon_bo *bo;
>   		long r;
>   
> +		if (!blockable) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
> +
>   		node = container_of(it, struct radeon_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		}
>   	}
>   	
> +out_unlock:
>   	mutex_unlock(&rmn->lock);
> +
> +	return ret;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>   				      ULLONG_MAX,
>   				      ib_umem_notifier_release_trampoline,
> +				      true,
>   				      NULL);
>   	up_read(&context->umem_rwsem);
>   }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +	int ret;
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
> -	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_start_trampoline, NULL);
> +				      invalidate_range_start_trampoline,
> +				      blockable, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return ret;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>   	if (!context->invalidate_range)
>   		return;
>   
> +	/*
> +	 * TODO: we currently bail out if there is any sleepable work to be done
> +	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +	 * here. But this is ugly and fragile.
> +	 */
>   	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_end_trampoline, NULL);
> +				      invalidate_range_end_trampoline, true, NULL);
>   	up_read(&context->umem_rwsem);
>   	ib_ucontext_notifier_end_account(context);
>   }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 last,
>   				  umem_call_back cb,
> +				  bool blockable,
>   				  void *cookie)
>   {
>   	int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   
>   	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>   			node; node = next) {
> +		/* TODO move the blockable decision up to the callback */
> +		if (!blockable)
> +			return -EAGAIN;
>   		next = rbt_ib_umem_iter_next(node, start, last - 1);
>   		umem = container_of(node, struct ib_umem_odp, interval_tree);
>   		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>   
>   static unsigned long mmu_node_start(struct mmu_rb_node *);
>   static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>   				     struct mm_struct *,
> -				     unsigned long, unsigned long);
> +				     unsigned long, unsigned long, bool);
>   static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>   					   unsigned long, unsigned long);
>   static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>   
>   	down_read(&ctx->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -				      mr_leaf_free, imr);
> +				      mr_leaf_free, true, imr);
>   	up_read(&ctx->umem_rwsem);
>   
>   	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>   
>   /* ------------------------------------------------------------------ */
>   
> +static bool in_range(struct grant_map *map,
> +			      unsigned long start, unsigned long end)
> +{
> +	if (!map->vma)
> +		return false;
> +	if (map->vma->vm_start >= end)
> +		return false;
> +	if (map->vma->vm_end <= start)
> +		return false;
> +
> +	return true;
> +}
> +
>   static void unmap_if_in_range(struct grant_map *map,
>   			      unsigned long start, unsigned long end)
>   {
>   	unsigned long mstart, mend;
>   	int err;
>   
> -	if (!map->vma)
> -		return;
> -	if (map->vma->vm_start >= end)
> -		return;
> -	if (map->vma->vm_end <= start)
> -		return;
>   	mstart = max(start, map->vma->vm_start);
>   	mend   = min(end,   map->vma->vm_end);
>   	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
> +	int ret = 0;
> +
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
>   
> -	mutex_lock(&priv->lock);
>   	list_for_each_entry(map, &priv->maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
>   	list_for_each_entry(map, &priv->freeable_maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
> +
> +out_unlock:
>   	mutex_unlock(&priv->lock);
> +
> +	return ret;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>   int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>   	 * address space but may still be referenced by sptes until
>   	 * the last refcount is dropped.
>   	 *
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
> +	 *
>   	 */
> -	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +	int (*invalidate_range_start)(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start, unsigned long end,
> +				       bool blockable);
>   	void (*invalidate_range_end)(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   				     unsigned long address);
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   {
>   }
>   
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>   	return 0;
>   }
>   
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>   
>   extern unsigned long oom_badness(struct task_struct *p,
>   		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>    */
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 end,
> -				  umem_call_back cb, void *cookie);
> +				  umem_call_back cb,
> +				  bool blockable, void *cookie);
>   
>   /*
>    * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>   		 * reliably test it.
>   		 */
>   		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>   		mutex_unlock(&oom_lock);
>   
>   		set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret) {
> +				pr_info("%pS callback failed with %d in %sblockable context.\n",
> +						mn->ops->invalidate_range_start, _ret,
> +						!blockable ? "non-": "");
> +				ret = _ret;
> +			}
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..16ce38f178d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable)
>   {
> +	return 0;
>   }
>   
>   bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> @@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-27  7:44 ` Michal Hocko
  2018-07-02  9:14     ` Christian König
       [not found]   ` <20180627074421.GF32348-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-07-02  9:14   ` Christian König
  2018-07-09 12:29   ` Michal Hocko
  2018-07-09 12:29   ` Michal Hocko
  4 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02  9:14 UTC (permalink / raw)
  To: Michal Hocko, LKML
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
>
> Any further feedback is highly appreciated of course.

That sounds like it should work and at least the amdgpu changes now look 
good to me on first glance.

Can you split that up further in the usual way? E.g. adding the 
blockable flag in one patch and fixing all implementations of the MMU 
notifier in follow up patches.

This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd 
changes.

Thanks,
Christian.

> ---
>  From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
>
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
>
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
>
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
>
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
>
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
>
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>    on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>    we bail out when we have an intersecting range for starter
> - note that a notifier failed to the log for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>    callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>    path so we have to back off early on overlapping ranges
>
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>   arch/x86/kvm/x86.c                      |  7 ++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>   drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>   drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>   drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>   drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>   drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>   drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>   drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>   drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>   include/linux/kvm_host.h                |  4 +--
>   include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>   include/linux/oom.h                     |  2 +-
>   include/rdma/ib_umem_odp.h              |  3 +-
>   mm/hmm.c                                |  7 ++--
>   mm/mmap.c                               |  2 +-
>   mm/mmu_notifier.c                       | 19 ++++++++---
>   mm/oom_kill.c                           | 29 ++++++++--------
>   virt/kvm/kvm_main.c                     | 15 ++++++---
>   19 files changed, 225 insertions(+), 80 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>   	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>   }
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>   {
>   	unsigned long apic_address;
>   
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>   	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>   	if (start <= apic_address && apic_address < end)
>   		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>   }
>   
>   void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    *
>    * Take the rmn read side lock.
>    */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>   {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>   	if (atomic_inc_return(&rmn->recursion) == 1)
>   		down_read_non_owner(&rmn->lock);
>   	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>   }
>   
>   /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
>   		amdgpu_mn_invalidate_node(node, start, end);
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>    * necessitates evicting all user-mode queues of the process. The BOs
>    * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>    */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   						 struct mm_struct *mm,
>   						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>   {
>   	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>   	struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
>   		struct amdgpu_mn_node *node;
>   		struct amdgpu_bo *bo;
>   
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>   		node = container_of(it, struct amdgpu_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>   				amdgpu_amdkfd_evict_userptr(mem, mm);
>   		}
>   	}
> +
> +	return 0;
>   }
>   
>   /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>   	mo->attached = false;
>   }
>   
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   						       struct mm_struct *mm,
>   						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>   {
>   	struct i915_mmu_notifier *mn =
>   		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	LIST_HEAD(cancelled);
>   
>   	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>   
>   	/* interval ranges are inclusive, but invalidate range is exclusive */
>   	end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   	spin_lock(&mn->lock);
>   	it = interval_tree_iter_first(&mn->objects, start, end);
>   	while (it) {
> +		if (!blockable) {
> +			spin_unlock(&mn->lock);
> +			return -EAGAIN;
> +		}
>   		/* The mmu_object is released late when destroying the
>   		 * GEM object so it is entirely possible to gain a
>   		 * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>   
>   	if (!list_empty(&cancelled))
>   		flush_workqueue(mn->wq);
> +
> +	return 0;
>   }
>   
>   static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   					     struct mm_struct *mm,
>   					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>   {
>   	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>   	struct ttm_operation_ctx ctx = { false, false };
>   	struct interval_tree_node *it;
> +	int ret = 0;
>   
>   	/* notification is exclusive, but interval is inclusive */
>   	end -= 1;
>   
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>   
>   	it = interval_tree_iter_first(&rmn->objects, start, end);
>   	while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		struct radeon_bo *bo;
>   		long r;
>   
> +		if (!blockable) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
> +
>   		node = container_of(it, struct radeon_mn_node, it);
>   		it = interval_tree_iter_next(it, start, end);
>   
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>   		}
>   	}
>   	
> +out_unlock:
>   	mutex_unlock(&rmn->lock);
> +
> +	return ret;
>   }
>   
>   static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>   				      ULLONG_MAX,
>   				      ib_umem_notifier_release_trampoline,
> +				      true,
>   				      NULL);
>   	up_read(&context->umem_rwsem);
>   }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>   	return 0;
>   }
>   
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +	int ret;
>   
>   	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>   
>   	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
> -	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_start_trampoline, NULL);
> +				      invalidate_range_start_trampoline,
> +				      blockable, NULL);
>   	up_read(&context->umem_rwsem);
> +
> +	return ret;
>   }
>   
>   static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>   	if (!context->invalidate_range)
>   		return;
>   
> +	/*
> +	 * TODO: we currently bail out if there is any sleepable work to be done
> +	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +	 * here. But this is ugly and fragile.
> +	 */
>   	down_read(&context->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>   				      end,
> -				      invalidate_range_end_trampoline, NULL);
> +				      invalidate_range_end_trampoline, true, NULL);
>   	up_read(&context->umem_rwsem);
>   	ib_ucontext_notifier_end_account(context);
>   }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 last,
>   				  umem_call_back cb,
> +				  bool blockable,
>   				  void *cookie)
>   {
>   	int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   
>   	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>   			node; node = next) {
> +		/* TODO move the blockable decision up to the callback */
> +		if (!blockable)
> +			return -EAGAIN;
>   		next = rbt_ib_umem_iter_next(node, start, last - 1);
>   		umem = container_of(node, struct ib_umem_odp, interval_tree);
>   		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>   
>   static unsigned long mmu_node_start(struct mmu_rb_node *);
>   static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>   				     struct mm_struct *,
> -				     unsigned long, unsigned long);
> +				     unsigned long, unsigned long, bool);
>   static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>   					   unsigned long, unsigned long);
>   static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>   	handler->ops->remove(handler->ops_arg, node);
>   }
>   
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>   {
>   	struct mmu_rb_handler *handler =
>   		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>   
>   	if (added)
>   		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>   }
>   
>   /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>   
>   	down_read(&ctx->umem_rwsem);
>   	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -				      mr_leaf_free, imr);
> +				      mr_leaf_free, true, imr);
>   	up_read(&ctx->umem_rwsem);
>   
>   	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>   	schedule_work(&scif_info.misc_work);
>   }
>   
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						     struct mm_struct *mm,
>   						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>   {
>   	struct scif_mmu_notif	*mmn;
>   
>   	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>   	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>   /*
>    * MMUOPS notifier callout functions
>    */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>   {
>   	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>   						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>   	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>   		start, end, atomic_read(&gms->ms_range_active));
>   	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>   }
>   
>   static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>   
>   /* ------------------------------------------------------------------ */
>   
> +static bool in_range(struct grant_map *map,
> +			      unsigned long start, unsigned long end)
> +{
> +	if (!map->vma)
> +		return false;
> +	if (map->vma->vm_start >= end)
> +		return false;
> +	if (map->vma->vm_end <= start)
> +		return false;
> +
> +	return true;
> +}
> +
>   static void unmap_if_in_range(struct grant_map *map,
>   			      unsigned long start, unsigned long end)
>   {
>   	unsigned long mstart, mend;
>   	int err;
>   
> -	if (!map->vma)
> -		return;
> -	if (map->vma->vm_start >= end)
> -		return;
> -	if (map->vma->vm_end <= start)
> -		return;
>   	mstart = max(start, map->vma->vm_start);
>   	mend   = min(end,   map->vma->vm_end);
>   	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>   	WARN_ON(err);
>   }
>   
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>   				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>   {
>   	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>   	struct grant_map *map;
> +	int ret = 0;
> +
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
>   
> -	mutex_lock(&priv->lock);
>   	list_for_each_entry(map, &priv->maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
>   	list_for_each_entry(map, &priv->freeable_maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>   		unmap_if_in_range(map, start, end);
>   	}
> +
> +out_unlock:
>   	mutex_unlock(&priv->lock);
> +
> +	return ret;
>   }
>   
>   static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>   }
>   #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>   
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable);
>   
>   #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>   int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>   	 * address space but may still be referenced by sptes until
>   	 * the last refcount is dropped.
>   	 *
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
> +	 *
>   	 */
> -	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +	int (*invalidate_range_start)(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start, unsigned long end,
> +				       bool blockable);
>   	void (*invalidate_range_end)(struct mmu_notifier *mn,
>   				     struct mm_struct *mm,
>   				     unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>   				     unsigned long address);
>   extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>   				      unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>   extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end,
>   				  bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
>   	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>   }
>   
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>   {
>   }
>   
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
>   static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>   				  unsigned long start, unsigned long end)
>   {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>   	return 0;
>   }
>   
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>   
>   extern unsigned long oom_badness(struct task_struct *p,
>   		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>    */
>   int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>   				  u64 start, u64 end,
> -				  umem_call_back cb, void *cookie);
> +				  umem_call_back cb,
> +				  bool blockable, void *cookie);
>   
>   /*
>    * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	up_write(&hmm->mirrors_sem);
>   }
>   
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   				       struct mm_struct *mm,
>   				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>   {
>   	struct hmm *hmm = mm->hmm;
>   
>   	VM_BUG_ON(!hmm);
>   
>   	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>   }
>   
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>   		 * reliably test it.
>   		 */
>   		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>   		mutex_unlock(&oom_lock);
>   
>   		set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>   	srcu_read_unlock(&srcu, id);
>   }
>   
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>   {
>   	struct mmu_notifier *mn;
> +	int ret = 0;
>   	int id;
>   
>   	id = srcu_read_lock(&srcu);
>   	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret) {
> +				pr_info("%pS callback failed with %d in %sblockable context.\n",
> +						mn->ops->invalidate_range_start, _ret,
> +						!blockable ? "non-": "");
> +				ret = _ret;
> +			}
> +		}
>   	}
>   	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>   }
>   EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>   
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>   static struct task_struct *oom_reaper_list;
>   static DEFINE_SPINLOCK(oom_reaper_lock);
>   
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>   {
>   	struct vm_area_struct *vma;
> +	bool ret = true;
>   
>   	/*
>   	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>   			struct mmu_gather tlb;
>   
>   			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>   			unmap_page_range(&tlb, vma, start, end, NULL);
>   			mmu_notifier_invalidate_range_end(mm, start, end);
>   			tlb_finish_mmu(&tlb, start, end);
>   		}
>   	}
> +
> +	return ret;
>   }
>   
>   static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   		goto unlock_oom;
>   	}
>   
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>   	/*
>   	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>   	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>   
>   	trace_start_task_reaping(tsk->pid);
>   
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}
>   
>   	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>   			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..16ce38f178d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>   static unsigned long long kvm_createvm_count;
>   static unsigned long long kvm_active_vms;
>   
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable)
>   {
> +	return 0;
>   }
>   
>   bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> @@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>   	srcu_read_unlock(&kvm->srcu, idx);
>   }
>   
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   						    struct mm_struct *mm,
>   						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>   {
>   	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>   	int need_tlb_flush = 0, idx;
> +	int ret;
>   
>   	idx = srcu_read_lock(&kvm->srcu);
>   	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>   
>   	spin_unlock(&kvm->mmu_lock);
>   
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>   
>   	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>   }
>   
>   static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,


_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02  9:14     ` Christian König
  (?)
@ 2018-07-02 11:54     ` Michal Hocko
  2018-07-02 12:13       ` Christian König
  2018-07-02 12:13       ` Christian König
  -1 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 11:54 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro, LKML

On Mon 02-07-18 11:14:58, Christian König wrote:
> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > This is the v2 of RFC based on the feedback I've received so far. The
> > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > because I have no idea how.
> > 
> > Any further feedback is highly appreciated of course.
> 
> That sounds like it should work and at least the amdgpu changes now look
> good to me on first glance.
> 
> Can you split that up further in the usual way? E.g. adding the blockable
> flag in one patch and fixing all implementations of the MMU notifier in
> follow up patches.

But such a code would be broken, no? Ignoring the blockable state will
simply lead to lockups until the fixup parts get applied.
Is the split up really worth it? I was thinking about that but had hard
times to end up with something that would be bisectable. Well, except
for returning -EBUSY until all notifiers are implemented. Which I found
confusing.

> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> changes.

If you are worried to give r-b only for those then this can be done even
for larger patches. Just make your Reviewd-by more specific
R-b: name # For BLA BLA
-- 
Michal Hocko
SUSE Labs
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 11:54     ` Michal Hocko
  2018-07-02 12:13       ` Christian König
  2018-07-02 12:13       ` Christian König
  0 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 11:54 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 11:14:58, Christian König wrote:
> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > This is the v2 of RFC based on the feedback I've received so far. The
> > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > because I have no idea how.
> > 
> > Any further feedback is highly appreciated of course.
> 
> That sounds like it should work and at least the amdgpu changes now look
> good to me on first glance.
> 
> Can you split that up further in the usual way? E.g. adding the blockable
> flag in one patch and fixing all implementations of the MMU notifier in
> follow up patches.

But such a code would be broken, no? Ignoring the blockable state will
simply lead to lockups until the fixup parts get applied.
Is the split up really worth it? I was thinking about that but had hard
times to end up with something that would be bisectable. Well, except
for returning -EBUSY until all notifiers are implemented. Which I found
confusing.

> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> changes.

If you are worried to give r-b only for those then this can be done even
for larger patches. Just make your Reviewd-by more specific
R-b: name # For BLA BLA
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 11:54     ` Michal Hocko
  2018-07-02 12:13       ` Christian König
  2018-07-02 12:13       ` Christian König
  0 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 11:54 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 11:14:58, Christian Konig wrote:
> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > This is the v2 of RFC based on the feedback I've received so far. The
> > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > because I have no idea how.
> > 
> > Any further feedback is highly appreciated of course.
> 
> That sounds like it should work and at least the amdgpu changes now look
> good to me on first glance.
> 
> Can you split that up further in the usual way? E.g. adding the blockable
> flag in one patch and fixing all implementations of the MMU notifier in
> follow up patches.

But such a code would be broken, no? Ignoring the blockable state will
simply lead to lockups until the fixup parts get applied.
Is the split up really worth it? I was thinking about that but had hard
times to end up with something that would be bisectable. Well, except
for returning -EBUSY until all notifiers are implemented. Which I found
confusing.

> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> changes.

If you are worried to give r-b only for those then this can be done even
for larger patches. Just make your Reviewd-by more specific
R-b: name # For BLA BLA
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02  9:14     ` Christian König
@ 2018-07-02 11:54     ` Michal Hocko
  -1 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 11:54 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 02-07-18 11:14:58, Christian König wrote:
> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > This is the v2 of RFC based on the feedback I've received so far. The
> > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > because I have no idea how.
> > 
> > Any further feedback is highly appreciated of course.
> 
> That sounds like it should work and at least the amdgpu changes now look
> good to me on first glance.
> 
> Can you split that up further in the usual way? E.g. adding the blockable
> flag in one patch and fixing all implementations of the MMU notifier in
> follow up patches.

But such a code would be broken, no? Ignoring the blockable state will
simply lead to lockups until the fixup parts get applied.
Is the split up really worth it? I was thinking about that but had hard
times to end up with something that would be bisectable. Well, except
for returning -EBUSY until all notifiers are implemented. Which I found
confusing.

> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> changes.

If you are worried to give r-b only for those then this can be done even
for larger patches. Just make your Reviewd-by more specific
R-b: name # For BLA BLA
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 11:54     ` Michal Hocko
  2018-07-02 12:13       ` Christian König
@ 2018-07-02 12:13       ` Christian König
  2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:20         ` Michal Hocko
  1 sibling, 2 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro, LKML

Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> On Mon 02-07-18 11:14:58, Christian König wrote:
>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>> This is the v2 of RFC based on the feedback I've received so far. The
>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>> because I have no idea how.
>>>
>>> Any further feedback is highly appreciated of course.
>> That sounds like it should work and at least the amdgpu changes now look
>> good to me on first glance.
>>
>> Can you split that up further in the usual way? E.g. adding the blockable
>> flag in one patch and fixing all implementations of the MMU notifier in
>> follow up patches.
> But such a code would be broken, no? Ignoring the blockable state will
> simply lead to lockups until the fixup parts get applied.

Well to still be bisect-able you only need to get the interface change 
in first with fixing the function signature of the implementations.

Then add all the new code to the implementations and last start to 
actually use the new interface.

That is a pattern we use regularly and I think it's good practice to do 
this.

> Is the split up really worth it? I was thinking about that but had hard
> times to end up with something that would be bisectable. Well, except
> for returning -EBUSY until all notifiers are implemented. Which I found
> confusing.

It at least makes reviewing changes much easier, cause as driver 
maintainer I can concentrate on the stuff only related to me.

Additional to that when you cause some unrelated side effect in a driver 
we can much easier pinpoint the actual change later on when the patch is 
smaller.

>
>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>> changes.
> If you are worried to give r-b only for those then this can be done even
> for larger patches. Just make your Reviewd-by more specific
> R-b: name # For BLA BLA

Yeah, possible alternative but more work for me when I review it :)

Regards,
Christian.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:13       ` Christian König
  2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:20         ` Michal Hocko
  0 siblings, 2 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> On Mon 02-07-18 11:14:58, Christian König wrote:
>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>> This is the v2 of RFC based on the feedback I've received so far. The
>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>> because I have no idea how.
>>>
>>> Any further feedback is highly appreciated of course.
>> That sounds like it should work and at least the amdgpu changes now look
>> good to me on first glance.
>>
>> Can you split that up further in the usual way? E.g. adding the blockable
>> flag in one patch and fixing all implementations of the MMU notifier in
>> follow up patches.
> But such a code would be broken, no? Ignoring the blockable state will
> simply lead to lockups until the fixup parts get applied.

Well to still be bisect-able you only need to get the interface change 
in first with fixing the function signature of the implementations.

Then add all the new code to the implementations and last start to 
actually use the new interface.

That is a pattern we use regularly and I think it's good practice to do 
this.

> Is the split up really worth it? I was thinking about that but had hard
> times to end up with something that would be bisectable. Well, except
> for returning -EBUSY until all notifiers are implemented. Which I found
> confusing.

It at least makes reviewing changes much easier, cause as driver 
maintainer I can concentrate on the stuff only related to me.

Additional to that when you cause some unrelated side effect in a driver 
we can much easier pinpoint the actual change later on when the patch is 
smaller.

>
>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>> changes.
> If you are worried to give r-b only for those then this can be done even
> for larger patches. Just make your Reviewd-by more specific
> R-b: name # For BLA BLA

Yeah, possible alternative but more work for me when I review it :)

Regards,
Christian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:13       ` Christian König
  2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:20         ` Michal Hocko
  0 siblings, 2 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> On Mon 02-07-18 11:14:58, Christian KA?nig wrote:
>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>> This is the v2 of RFC based on the feedback I've received so far. The
>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>> because I have no idea how.
>>>
>>> Any further feedback is highly appreciated of course.
>> That sounds like it should work and at least the amdgpu changes now look
>> good to me on first glance.
>>
>> Can you split that up further in the usual way? E.g. adding the blockable
>> flag in one patch and fixing all implementations of the MMU notifier in
>> follow up patches.
> But such a code would be broken, no? Ignoring the blockable state will
> simply lead to lockups until the fixup parts get applied.

Well to still be bisect-able you only need to get the interface change 
in first with fixing the function signature of the implementations.

Then add all the new code to the implementations and last start to 
actually use the new interface.

That is a pattern we use regularly and I think it's good practice to do 
this.

> Is the split up really worth it? I was thinking about that but had hard
> times to end up with something that would be bisectable. Well, except
> for returning -EBUSY until all notifiers are implemented. Which I found
> confusing.

It at least makes reviewing changes much easier, cause as driver 
maintainer I can concentrate on the stuff only related to me.

Additional to that when you cause some unrelated side effect in a driver 
we can much easier pinpoint the actual change later on when the patch is 
smaller.

>
>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>> changes.
> If you are worried to give r-b only for those then this can be done even
> for larger patches. Just make your Reviewd-by more specific
> R-b: name # For BLA BLA

Yeah, possible alternative but more work for me when I review it :)

Regards,
Christian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 11:54     ` Michal Hocko
@ 2018-07-02 12:13       ` Christian König
  2018-07-02 12:13       ` Christian König
  1 sibling, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> On Mon 02-07-18 11:14:58, Christian König wrote:
>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>> This is the v2 of RFC based on the feedback I've received so far. The
>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>> because I have no idea how.
>>>
>>> Any further feedback is highly appreciated of course.
>> That sounds like it should work and at least the amdgpu changes now look
>> good to me on first glance.
>>
>> Can you split that up further in the usual way? E.g. adding the blockable
>> flag in one patch and fixing all implementations of the MMU notifier in
>> follow up patches.
> But such a code would be broken, no? Ignoring the blockable state will
> simply lead to lockups until the fixup parts get applied.

Well to still be bisect-able you only need to get the interface change 
in first with fixing the function signature of the implementations.

Then add all the new code to the implementations and last start to 
actually use the new interface.

That is a pattern we use regularly and I think it's good practice to do 
this.

> Is the split up really worth it? I was thinking about that but had hard
> times to end up with something that would be bisectable. Well, except
> for returning -EBUSY until all notifiers are implemented. Which I found
> confusing.

It at least makes reviewing changes much easier, cause as driver 
maintainer I can concentrate on the stuff only related to me.

Additional to that when you cause some unrelated side effect in a driver 
we can much easier pinpoint the actual change later on when the patch is 
smaller.

>
>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>> changes.
> If you are worried to give r-b only for those then this can be done even
> for larger patches. Just make your Reviewd-by more specific
> R-b: name # For BLA BLA

Yeah, possible alternative but more work for me when I review it :)

Regards,
Christian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:13       ` Christian König
  2018-07-02 12:20         ` Michal Hocko
@ 2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:24           ` Christian König
  2018-07-02 12:24           ` Christian König
  1 sibling, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:20 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro, LKML

On Mon 02-07-18 14:13:42, Christian König wrote:
> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > > 
> > > > Any further feedback is highly appreciated of course.
> > > That sounds like it should work and at least the amdgpu changes now look
> > > good to me on first glance.
> > > 
> > > Can you split that up further in the usual way? E.g. adding the blockable
> > > flag in one patch and fixing all implementations of the MMU notifier in
> > > follow up patches.
> > But such a code would be broken, no? Ignoring the blockable state will
> > simply lead to lockups until the fixup parts get applied.
> 
> Well to still be bisect-able you only need to get the interface change in
> first with fixing the function signature of the implementations.

That would only work if those functions return -AGAIN unconditionally.
Otherwise they would pretend to not block while that would be obviously
incorrect. This doesn't sound correct to me.

> Then add all the new code to the implementations and last start to actually
> use the new interface.
> 
> That is a pattern we use regularly and I think it's good practice to do
> this.

But we do rely on the proper blockable handling.

> > Is the split up really worth it? I was thinking about that but had hard
> > times to end up with something that would be bisectable. Well, except
> > for returning -EBUSY until all notifiers are implemented. Which I found
> > confusing.
> 
> It at least makes reviewing changes much easier, cause as driver maintainer
> I can concentrate on the stuff only related to me.
> 
> Additional to that when you cause some unrelated side effect in a driver we
> can much easier pinpoint the actual change later on when the patch is
> smaller.
> 
> > 
> > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > changes.
> > If you are worried to give r-b only for those then this can be done even
> > for larger patches. Just make your Reviewd-by more specific
> > R-b: name # For BLA BLA
> 
> Yeah, possible alternative but more work for me when I review it :)

I definitely do not want to add more work to reviewers and I completely
see how massive "flag days" like these are not popular but I really
didn't find a reasonable way around that would be both correct and
wouldn't add much more churn on the way. So if you really insist then I
would really appreciate a hint on the way to achive the same without any
above downsides.
-- 
Michal Hocko
SUSE Labs
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:24           ` Christian König
  2018-07-02 12:24           ` Christian König
  0 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:20 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 14:13:42, Christian König wrote:
> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > > 
> > > > Any further feedback is highly appreciated of course.
> > > That sounds like it should work and at least the amdgpu changes now look
> > > good to me on first glance.
> > > 
> > > Can you split that up further in the usual way? E.g. adding the blockable
> > > flag in one patch and fixing all implementations of the MMU notifier in
> > > follow up patches.
> > But such a code would be broken, no? Ignoring the blockable state will
> > simply lead to lockups until the fixup parts get applied.
> 
> Well to still be bisect-able you only need to get the interface change in
> first with fixing the function signature of the implementations.

That would only work if those functions return -AGAIN unconditionally.
Otherwise they would pretend to not block while that would be obviously
incorrect. This doesn't sound correct to me.

> Then add all the new code to the implementations and last start to actually
> use the new interface.
> 
> That is a pattern we use regularly and I think it's good practice to do
> this.

But we do rely on the proper blockable handling.

> > Is the split up really worth it? I was thinking about that but had hard
> > times to end up with something that would be bisectable. Well, except
> > for returning -EBUSY until all notifiers are implemented. Which I found
> > confusing.
> 
> It at least makes reviewing changes much easier, cause as driver maintainer
> I can concentrate on the stuff only related to me.
> 
> Additional to that when you cause some unrelated side effect in a driver we
> can much easier pinpoint the actual change later on when the patch is
> smaller.
> 
> > 
> > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > changes.
> > If you are worried to give r-b only for those then this can be done even
> > for larger patches. Just make your Reviewd-by more specific
> > R-b: name # For BLA BLA
> 
> Yeah, possible alternative but more work for me when I review it :)

I definitely do not want to add more work to reviewers and I completely
see how massive "flag days" like these are not popular but I really
didn't find a reasonable way around that would be both correct and
wouldn't add much more churn on the way. So if you really insist then I
would really appreciate a hint on the way to achive the same without any
above downsides.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:24           ` Christian König
  2018-07-02 12:24           ` Christian König
  0 siblings, 2 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:20 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 14:13:42, Christian Konig wrote:
> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > On Mon 02-07-18 11:14:58, Christian Konig wrote:
> > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > > 
> > > > Any further feedback is highly appreciated of course.
> > > That sounds like it should work and at least the amdgpu changes now look
> > > good to me on first glance.
> > > 
> > > Can you split that up further in the usual way? E.g. adding the blockable
> > > flag in one patch and fixing all implementations of the MMU notifier in
> > > follow up patches.
> > But such a code would be broken, no? Ignoring the blockable state will
> > simply lead to lockups until the fixup parts get applied.
> 
> Well to still be bisect-able you only need to get the interface change in
> first with fixing the function signature of the implementations.

That would only work if those functions return -AGAIN unconditionally.
Otherwise they would pretend to not block while that would be obviously
incorrect. This doesn't sound correct to me.

> Then add all the new code to the implementations and last start to actually
> use the new interface.
> 
> That is a pattern we use regularly and I think it's good practice to do
> this.

But we do rely on the proper blockable handling.

> > Is the split up really worth it? I was thinking about that but had hard
> > times to end up with something that would be bisectable. Well, except
> > for returning -EBUSY until all notifiers are implemented. Which I found
> > confusing.
> 
> It at least makes reviewing changes much easier, cause as driver maintainer
> I can concentrate on the stuff only related to me.
> 
> Additional to that when you cause some unrelated side effect in a driver we
> can much easier pinpoint the actual change later on when the patch is
> smaller.
> 
> > 
> > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > changes.
> > If you are worried to give r-b only for those then this can be done even
> > for larger patches. Just make your Reviewd-by more specific
> > R-b: name # For BLA BLA
> 
> Yeah, possible alternative but more work for me when I review it :)

I definitely do not want to add more work to reviewers and I completely
see how massive "flag days" like these are not popular but I really
didn't find a reasonable way around that would be both correct and
wouldn't add much more churn on the way. So if you really insist then I
would really appreciate a hint on the way to achive the same without any
above downsides.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:13       ` Christian König
@ 2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:20         ` Michal Hocko
  1 sibling, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:20 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 02-07-18 14:13:42, Christian König wrote:
> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > because I have no idea how.
> > > > 
> > > > Any further feedback is highly appreciated of course.
> > > That sounds like it should work and at least the amdgpu changes now look
> > > good to me on first glance.
> > > 
> > > Can you split that up further in the usual way? E.g. adding the blockable
> > > flag in one patch and fixing all implementations of the MMU notifier in
> > > follow up patches.
> > But such a code would be broken, no? Ignoring the blockable state will
> > simply lead to lockups until the fixup parts get applied.
> 
> Well to still be bisect-able you only need to get the interface change in
> first with fixing the function signature of the implementations.

That would only work if those functions return -AGAIN unconditionally.
Otherwise they would pretend to not block while that would be obviously
incorrect. This doesn't sound correct to me.

> Then add all the new code to the implementations and last start to actually
> use the new interface.
> 
> That is a pattern we use regularly and I think it's good practice to do
> this.

But we do rely on the proper blockable handling.

> > Is the split up really worth it? I was thinking about that but had hard
> > times to end up with something that would be bisectable. Well, except
> > for returning -EBUSY until all notifiers are implemented. Which I found
> > confusing.
> 
> It at least makes reviewing changes much easier, cause as driver maintainer
> I can concentrate on the stuff only related to me.
> 
> Additional to that when you cause some unrelated side effect in a driver we
> can much easier pinpoint the actual change later on when the patch is
> smaller.
> 
> > 
> > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > changes.
> > If you are worried to give r-b only for those then this can be done even
> > for larger patches. Just make your Reviewd-by more specific
> > R-b: name # For BLA BLA
> 
> Yeah, possible alternative but more work for me when I review it :)

I definitely do not want to add more work to reviewers and I completely
see how massive "flag days" like these are not popular but I really
didn't find a reasonable way around that would be both correct and
wouldn't add much more churn on the way. So if you really insist then I
would really appreciate a hint on the way to achive the same without any
above downsides.
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:20         ` Michal Hocko
@ 2018-07-02 12:24           ` Christian König
  2018-07-02 12:35             ` Michal Hocko
                               ` (2 more replies)
  2018-07-02 12:24           ` Christian König
  1 sibling, 3 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, Radim Krčmář,
	David Airlie, Sudeep Dutt, dri-devel, linux-mm, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich, linux-rdma, amd-gfx,
	Jason Gunthorpe, Doug Ledford, David Rientjes, xen-devel,
	intel-gfx, Jérôme Glisse, Rodrigo Vivi,
	Boris Ostrovsky, Juergen Gross, Mike Marciniszyn,
	Dennis Dalessandro, LKML

Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> On Mon 02-07-18 14:13:42, Christian König wrote:
>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>> On Mon 02-07-18 11:14:58, Christian König wrote:
>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>> because I have no idea how.
>>>>>
>>>>> Any further feedback is highly appreciated of course.
>>>> That sounds like it should work and at least the amdgpu changes now look
>>>> good to me on first glance.
>>>>
>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>> follow up patches.
>>> But such a code would be broken, no? Ignoring the blockable state will
>>> simply lead to lockups until the fixup parts get applied.
>> Well to still be bisect-able you only need to get the interface change in
>> first with fixing the function signature of the implementations.
> That would only work if those functions return -AGAIN unconditionally.
> Otherwise they would pretend to not block while that would be obviously
> incorrect. This doesn't sound correct to me.
>
>> Then add all the new code to the implementations and last start to actually
>> use the new interface.
>>
>> That is a pattern we use regularly and I think it's good practice to do
>> this.
> But we do rely on the proper blockable handling.

Yeah, but you could add the handling only after you have all the 
implementations in place. Don't you?

>>> Is the split up really worth it? I was thinking about that but had hard
>>> times to end up with something that would be bisectable. Well, except
>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>> confusing.
>> It at least makes reviewing changes much easier, cause as driver maintainer
>> I can concentrate on the stuff only related to me.
>>
>> Additional to that when you cause some unrelated side effect in a driver we
>> can much easier pinpoint the actual change later on when the patch is
>> smaller.
>>
>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>> changes.
>>> If you are worried to give r-b only for those then this can be done even
>>> for larger patches. Just make your Reviewd-by more specific
>>> R-b: name # For BLA BLA
>> Yeah, possible alternative but more work for me when I review it :)
> I definitely do not want to add more work to reviewers and I completely
> see how massive "flag days" like these are not popular but I really
> didn't find a reasonable way around that would be both correct and
> wouldn't add much more churn on the way. So if you really insist then I
> would really appreciate a hint on the way to achive the same without any
> above downsides.

Well, I don't insist on this. It's just from my point of view that this 
patch doesn't needs to be one patch, but could be split up.

Could be that I just don't know the code or the consequences of adding 
that well enough to really judge.

Christian.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:24           ` Christian König
  2018-07-02 12:35             ` Michal Hocko
                               ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> On Mon 02-07-18 14:13:42, Christian König wrote:
>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>> On Mon 02-07-18 11:14:58, Christian König wrote:
>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>> because I have no idea how.
>>>>>
>>>>> Any further feedback is highly appreciated of course.
>>>> That sounds like it should work and at least the amdgpu changes now look
>>>> good to me on first glance.
>>>>
>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>> follow up patches.
>>> But such a code would be broken, no? Ignoring the blockable state will
>>> simply lead to lockups until the fixup parts get applied.
>> Well to still be bisect-able you only need to get the interface change in
>> first with fixing the function signature of the implementations.
> That would only work if those functions return -AGAIN unconditionally.
> Otherwise they would pretend to not block while that would be obviously
> incorrect. This doesn't sound correct to me.
>
>> Then add all the new code to the implementations and last start to actually
>> use the new interface.
>>
>> That is a pattern we use regularly and I think it's good practice to do
>> this.
> But we do rely on the proper blockable handling.

Yeah, but you could add the handling only after you have all the 
implementations in place. Don't you?

>>> Is the split up really worth it? I was thinking about that but had hard
>>> times to end up with something that would be bisectable. Well, except
>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>> confusing.
>> It at least makes reviewing changes much easier, cause as driver maintainer
>> I can concentrate on the stuff only related to me.
>>
>> Additional to that when you cause some unrelated side effect in a driver we
>> can much easier pinpoint the actual change later on when the patch is
>> smaller.
>>
>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>> changes.
>>> If you are worried to give r-b only for those then this can be done even
>>> for larger patches. Just make your Reviewd-by more specific
>>> R-b: name # For BLA BLA
>> Yeah, possible alternative but more work for me when I review it :)
> I definitely do not want to add more work to reviewers and I completely
> see how massive "flag days" like these are not popular but I really
> didn't find a reasonable way around that would be both correct and
> wouldn't add much more churn on the way. So if you really insist then I
> would really appreciate a hint on the way to achive the same without any
> above downsides.

Well, I don't insist on this. It's just from my point of view that this 
patch doesn't needs to be one patch, but could be split up.

Could be that I just don't know the code or the consequences of adding 
that well enough to really judge.

Christian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:24           ` Christian König
  2018-07-02 12:35             ` Michal Hocko
                               ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> On Mon 02-07-18 14:13:42, Christian KA?nig wrote:
>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>> On Mon 02-07-18 11:14:58, Christian KA?nig wrote:
>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>> because I have no idea how.
>>>>>
>>>>> Any further feedback is highly appreciated of course.
>>>> That sounds like it should work and at least the amdgpu changes now look
>>>> good to me on first glance.
>>>>
>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>> follow up patches.
>>> But such a code would be broken, no? Ignoring the blockable state will
>>> simply lead to lockups until the fixup parts get applied.
>> Well to still be bisect-able you only need to get the interface change in
>> first with fixing the function signature of the implementations.
> That would only work if those functions return -AGAIN unconditionally.
> Otherwise they would pretend to not block while that would be obviously
> incorrect. This doesn't sound correct to me.
>
>> Then add all the new code to the implementations and last start to actually
>> use the new interface.
>>
>> That is a pattern we use regularly and I think it's good practice to do
>> this.
> But we do rely on the proper blockable handling.

Yeah, but you could add the handling only after you have all the 
implementations in place. Don't you?

>>> Is the split up really worth it? I was thinking about that but had hard
>>> times to end up with something that would be bisectable. Well, except
>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>> confusing.
>> It at least makes reviewing changes much easier, cause as driver maintainer
>> I can concentrate on the stuff only related to me.
>>
>> Additional to that when you cause some unrelated side effect in a driver we
>> can much easier pinpoint the actual change later on when the patch is
>> smaller.
>>
>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>> changes.
>>> If you are worried to give r-b only for those then this can be done even
>>> for larger patches. Just make your Reviewd-by more specific
>>> R-b: name # For BLA BLA
>> Yeah, possible alternative but more work for me when I review it :)
> I definitely do not want to add more work to reviewers and I completely
> see how massive "flag days" like these are not popular but I really
> didn't find a reasonable way around that would be both correct and
> wouldn't add much more churn on the way. So if you really insist then I
> would really appreciate a hint on the way to achive the same without any
> above downsides.

Well, I don't insist on this. It's just from my point of view that this 
patch doesn't needs to be one patch, but could be split up.

Could be that I just don't know the code or the consequences of adding 
that well enough to really judge.

Christian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:20         ` Michal Hocko
  2018-07-02 12:24           ` Christian König
@ 2018-07-02 12:24           ` Christian König
  1 sibling, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> On Mon 02-07-18 14:13:42, Christian König wrote:
>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>> On Mon 02-07-18 11:14:58, Christian König wrote:
>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>> because I have no idea how.
>>>>>
>>>>> Any further feedback is highly appreciated of course.
>>>> That sounds like it should work and at least the amdgpu changes now look
>>>> good to me on first glance.
>>>>
>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>> follow up patches.
>>> But such a code would be broken, no? Ignoring the blockable state will
>>> simply lead to lockups until the fixup parts get applied.
>> Well to still be bisect-able you only need to get the interface change in
>> first with fixing the function signature of the implementations.
> That would only work if those functions return -AGAIN unconditionally.
> Otherwise they would pretend to not block while that would be obviously
> incorrect. This doesn't sound correct to me.
>
>> Then add all the new code to the implementations and last start to actually
>> use the new interface.
>>
>> That is a pattern we use regularly and I think it's good practice to do
>> this.
> But we do rely on the proper blockable handling.

Yeah, but you could add the handling only after you have all the 
implementations in place. Don't you?

>>> Is the split up really worth it? I was thinking about that but had hard
>>> times to end up with something that would be bisectable. Well, except
>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>> confusing.
>> It at least makes reviewing changes much easier, cause as driver maintainer
>> I can concentrate on the stuff only related to me.
>>
>> Additional to that when you cause some unrelated side effect in a driver we
>> can much easier pinpoint the actual change later on when the patch is
>> smaller.
>>
>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>> changes.
>>> If you are worried to give r-b only for those then this can be done even
>>> for larger patches. Just make your Reviewd-by more specific
>>> R-b: name # For BLA BLA
>> Yeah, possible alternative but more work for me when I review it :)
> I definitely do not want to add more work to reviewers and I completely
> see how massive "flag days" like these are not popular but I really
> didn't find a reasonable way around that would be both correct and
> wouldn't add much more churn on the way. So if you really insist then I
> would really appreciate a hint on the way to achive the same without any
> above downsides.

Well, I don't insist on this. It's just from my point of view that this 
patch doesn't needs to be one patch, but could be split up.

Could be that I just don't know the code or the consequences of adding 
that well enough to really judge.

Christian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:24           ` Christian König
  2018-07-02 12:35             ` Michal Hocko
       [not found]             ` <02d1d52c-f534-f899-a18c-a3169123ac7c-5C7GfCeVMHo@public.gmane.org>
@ 2018-07-02 12:35             ` Michal Hocko
  2018-07-02 12:39                 ` Christian König
                                 ` (2 more replies)
  2 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:35 UTC (permalink / raw)
  To: Christian König
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 02-07-18 14:24:29, Christian König wrote:
> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> > On Mon 02-07-18 14:13:42, Christian König wrote:
> > > Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > > > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > > > because I have no idea how.
> > > > > > 
> > > > > > Any further feedback is highly appreciated of course.
> > > > > That sounds like it should work and at least the amdgpu changes now look
> > > > > good to me on first glance.
> > > > > 
> > > > > Can you split that up further in the usual way? E.g. adding the blockable
> > > > > flag in one patch and fixing all implementations of the MMU notifier in
> > > > > follow up patches.
> > > > But such a code would be broken, no? Ignoring the blockable state will
> > > > simply lead to lockups until the fixup parts get applied.
> > > Well to still be bisect-able you only need to get the interface change in
> > > first with fixing the function signature of the implementations.
> > That would only work if those functions return -AGAIN unconditionally.
> > Otherwise they would pretend to not block while that would be obviously
> > incorrect. This doesn't sound correct to me.
> > 
> > > Then add all the new code to the implementations and last start to actually
> > > use the new interface.
> > > 
> > > That is a pattern we use regularly and I think it's good practice to do
> > > this.
> > But we do rely on the proper blockable handling.
> 
> Yeah, but you could add the handling only after you have all the
> implementations in place. Don't you?

Yeah, but then I would be adding a code with no user. And I really
prefer to no do so because then the code is harder to argue about.

> > > > Is the split up really worth it? I was thinking about that but had hard
> > > > times to end up with something that would be bisectable. Well, except
> > > > for returning -EBUSY until all notifiers are implemented. Which I found
> > > > confusing.
> > > It at least makes reviewing changes much easier, cause as driver maintainer
> > > I can concentrate on the stuff only related to me.
> > > 
> > > Additional to that when you cause some unrelated side effect in a driver we
> > > can much easier pinpoint the actual change later on when the patch is
> > > smaller.
> > > 
> > > > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > > > changes.
> > > > If you are worried to give r-b only for those then this can be done even
> > > > for larger patches. Just make your Reviewd-by more specific
> > > > R-b: name # For BLA BLA
> > > Yeah, possible alternative but more work for me when I review it :)
> > I definitely do not want to add more work to reviewers and I completely
> > see how massive "flag days" like these are not popular but I really
> > didn't find a reasonable way around that would be both correct and
> > wouldn't add much more churn on the way. So if you really insist then I
> > would really appreciate a hint on the way to achive the same without any
> > above downsides.
> 
> Well, I don't insist on this. It's just from my point of view that this
> patch doesn't needs to be one patch, but could be split up.

Well, if there are more people with the same concern I can try to do
that. But if your only concern is to focus on your particular part then
I guess it would be easier both for you and me to simply apply the patch
and use git show $files_for_your_subystem on your end. I have put the
patch to attempts/oom-vs-mmu-notifiers branch to my tree at
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
-- 
Michal Hocko
SUSE Labs
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:35             ` Michal Hocko
  2018-07-02 12:39                 ` Christian König
                                 ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:35 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 14:24:29, Christian König wrote:
> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> > On Mon 02-07-18 14:13:42, Christian König wrote:
> > > Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > > > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > > > because I have no idea how.
> > > > > > 
> > > > > > Any further feedback is highly appreciated of course.
> > > > > That sounds like it should work and at least the amdgpu changes now look
> > > > > good to me on first glance.
> > > > > 
> > > > > Can you split that up further in the usual way? E.g. adding the blockable
> > > > > flag in one patch and fixing all implementations of the MMU notifier in
> > > > > follow up patches.
> > > > But such a code would be broken, no? Ignoring the blockable state will
> > > > simply lead to lockups until the fixup parts get applied.
> > > Well to still be bisect-able you only need to get the interface change in
> > > first with fixing the function signature of the implementations.
> > That would only work if those functions return -AGAIN unconditionally.
> > Otherwise they would pretend to not block while that would be obviously
> > incorrect. This doesn't sound correct to me.
> > 
> > > Then add all the new code to the implementations and last start to actually
> > > use the new interface.
> > > 
> > > That is a pattern we use regularly and I think it's good practice to do
> > > this.
> > But we do rely on the proper blockable handling.
> 
> Yeah, but you could add the handling only after you have all the
> implementations in place. Don't you?

Yeah, but then I would be adding a code with no user. And I really
prefer to no do so because then the code is harder to argue about.

> > > > Is the split up really worth it? I was thinking about that but had hard
> > > > times to end up with something that would be bisectable. Well, except
> > > > for returning -EBUSY until all notifiers are implemented. Which I found
> > > > confusing.
> > > It at least makes reviewing changes much easier, cause as driver maintainer
> > > I can concentrate on the stuff only related to me.
> > > 
> > > Additional to that when you cause some unrelated side effect in a driver we
> > > can much easier pinpoint the actual change later on when the patch is
> > > smaller.
> > > 
> > > > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > > > changes.
> > > > If you are worried to give r-b only for those then this can be done even
> > > > for larger patches. Just make your Reviewd-by more specific
> > > > R-b: name # For BLA BLA
> > > Yeah, possible alternative but more work for me when I review it :)
> > I definitely do not want to add more work to reviewers and I completely
> > see how massive "flag days" like these are not popular but I really
> > didn't find a reasonable way around that would be both correct and
> > wouldn't add much more churn on the way. So if you really insist then I
> > would really appreciate a hint on the way to achive the same without any
> > above downsides.
> 
> Well, I don't insist on this. It's just from my point of view that this
> patch doesn't needs to be one patch, but could be split up.

Well, if there are more people with the same concern I can try to do
that. But if your only concern is to focus on your particular part then
I guess it would be easier both for you and me to simply apply the patch
and use git show $files_for_your_subystem on your end. I have put the
patch to attempts/oom-vs-mmu-notifiers branch to my tree at
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:35             ` Michal Hocko
  2018-07-02 12:39                 ` Christian König
                                 ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:35 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 14:24:29, Christian Konig wrote:
> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> > On Mon 02-07-18 14:13:42, Christian Konig wrote:
> > > Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > > > On Mon 02-07-18 11:14:58, Christian Konig wrote:
> > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > > > because I have no idea how.
> > > > > > 
> > > > > > Any further feedback is highly appreciated of course.
> > > > > That sounds like it should work and at least the amdgpu changes now look
> > > > > good to me on first glance.
> > > > > 
> > > > > Can you split that up further in the usual way? E.g. adding the blockable
> > > > > flag in one patch and fixing all implementations of the MMU notifier in
> > > > > follow up patches.
> > > > But such a code would be broken, no? Ignoring the blockable state will
> > > > simply lead to lockups until the fixup parts get applied.
> > > Well to still be bisect-able you only need to get the interface change in
> > > first with fixing the function signature of the implementations.
> > That would only work if those functions return -AGAIN unconditionally.
> > Otherwise they would pretend to not block while that would be obviously
> > incorrect. This doesn't sound correct to me.
> > 
> > > Then add all the new code to the implementations and last start to actually
> > > use the new interface.
> > > 
> > > That is a pattern we use regularly and I think it's good practice to do
> > > this.
> > But we do rely on the proper blockable handling.
> 
> Yeah, but you could add the handling only after you have all the
> implementations in place. Don't you?

Yeah, but then I would be adding a code with no user. And I really
prefer to no do so because then the code is harder to argue about.

> > > > Is the split up really worth it? I was thinking about that but had hard
> > > > times to end up with something that would be bisectable. Well, except
> > > > for returning -EBUSY until all notifiers are implemented. Which I found
> > > > confusing.
> > > It at least makes reviewing changes much easier, cause as driver maintainer
> > > I can concentrate on the stuff only related to me.
> > > 
> > > Additional to that when you cause some unrelated side effect in a driver we
> > > can much easier pinpoint the actual change later on when the patch is
> > > smaller.
> > > 
> > > > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > > > changes.
> > > > If you are worried to give r-b only for those then this can be done even
> > > > for larger patches. Just make your Reviewd-by more specific
> > > > R-b: name # For BLA BLA
> > > Yeah, possible alternative but more work for me when I review it :)
> > I definitely do not want to add more work to reviewers and I completely
> > see how massive "flag days" like these are not popular but I really
> > didn't find a reasonable way around that would be both correct and
> > wouldn't add much more churn on the way. So if you really insist then I
> > would really appreciate a hint on the way to achive the same without any
> > above downsides.
> 
> Well, I don't insist on this. It's just from my point of view that this
> patch doesn't needs to be one patch, but could be split up.

Well, if there are more people with the same concern I can try to do
that. But if your only concern is to focus on your particular part then
I guess it would be easier both for you and me to simply apply the patch
and use git show $files_for_your_subystem on your end. I have put the
patch to attempts/oom-vs-mmu-notifiers branch to my tree at
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:24           ` Christian König
@ 2018-07-02 12:35             ` Michal Hocko
       [not found]             ` <02d1d52c-f534-f899-a18c-a3169123ac7c-5C7GfCeVMHo@public.gmane.org>
  2018-07-02 12:35             ` Michal Hocko
  2 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:35 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 02-07-18 14:24:29, Christian König wrote:
> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
> > On Mon 02-07-18 14:13:42, Christian König wrote:
> > > Am 02.07.2018 um 13:54 schrieb Michal Hocko:
> > > > On Mon 02-07-18 11:14:58, Christian König wrote:
> > > > > Am 27.06.2018 um 09:44 schrieb Michal Hocko:
> > > > > > This is the v2 of RFC based on the feedback I've received so far. The
> > > > > > code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> > > > > > because I have no idea how.
> > > > > > 
> > > > > > Any further feedback is highly appreciated of course.
> > > > > That sounds like it should work and at least the amdgpu changes now look
> > > > > good to me on first glance.
> > > > > 
> > > > > Can you split that up further in the usual way? E.g. adding the blockable
> > > > > flag in one patch and fixing all implementations of the MMU notifier in
> > > > > follow up patches.
> > > > But such a code would be broken, no? Ignoring the blockable state will
> > > > simply lead to lockups until the fixup parts get applied.
> > > Well to still be bisect-able you only need to get the interface change in
> > > first with fixing the function signature of the implementations.
> > That would only work if those functions return -AGAIN unconditionally.
> > Otherwise they would pretend to not block while that would be obviously
> > incorrect. This doesn't sound correct to me.
> > 
> > > Then add all the new code to the implementations and last start to actually
> > > use the new interface.
> > > 
> > > That is a pattern we use regularly and I think it's good practice to do
> > > this.
> > But we do rely on the proper blockable handling.
> 
> Yeah, but you could add the handling only after you have all the
> implementations in place. Don't you?

Yeah, but then I would be adding a code with no user. And I really
prefer to no do so because then the code is harder to argue about.

> > > > Is the split up really worth it? I was thinking about that but had hard
> > > > times to end up with something that would be bisectable. Well, except
> > > > for returning -EBUSY until all notifiers are implemented. Which I found
> > > > confusing.
> > > It at least makes reviewing changes much easier, cause as driver maintainer
> > > I can concentrate on the stuff only related to me.
> > > 
> > > Additional to that when you cause some unrelated side effect in a driver we
> > > can much easier pinpoint the actual change later on when the patch is
> > > smaller.
> > > 
> > > > > This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
> > > > > changes.
> > > > If you are worried to give r-b only for those then this can be done even
> > > > for larger patches. Just make your Reviewd-by more specific
> > > > R-b: name # For BLA BLA
> > > Yeah, possible alternative but more work for me when I review it :)
> > I definitely do not want to add more work to reviewers and I completely
> > see how massive "flag days" like these are not popular but I really
> > didn't find a reasonable way around that would be both correct and
> > wouldn't add much more churn on the way. So if you really insist then I
> > would really appreciate a hint on the way to achive the same without any
> > above downsides.
> 
> Well, I don't insist on this. It's just from my point of view that this
> patch doesn't needs to be one patch, but could be split up.

Well, if there are more people with the same concern I can try to do
that. But if your only concern is to focus on your particular part then
I guess it would be easier both for you and me to simply apply the patch
and use git show $files_for_your_subystem on your end. I have put the
patch to attempts/oom-vs-mmu-notifiers branch to my tree at
git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]               ` <20180702123521.GO19043-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
@ 2018-07-02 12:39                 ` Christian König
  0 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

Am 02.07.2018 um 14:35 schrieb Michal Hocko:
> On Mon 02-07-18 14:24:29, Christian König wrote:
>> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
>>> On Mon 02-07-18 14:13:42, Christian König wrote:
>>>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>>>> On Mon 02-07-18 11:14:58, Christian König wrote:
>>>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>>>> because I have no idea how.
>>>>>>>
>>>>>>> Any further feedback is highly appreciated of course.
>>>>>> That sounds like it should work and at least the amdgpu changes now look
>>>>>> good to me on first glance.
>>>>>>
>>>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>>>> follow up patches.
>>>>> But such a code would be broken, no? Ignoring the blockable state will
>>>>> simply lead to lockups until the fixup parts get applied.
>>>> Well to still be bisect-able you only need to get the interface change in
>>>> first with fixing the function signature of the implementations.
>>> That would only work if those functions return -AGAIN unconditionally.
>>> Otherwise they would pretend to not block while that would be obviously
>>> incorrect. This doesn't sound correct to me.
>>>
>>>> Then add all the new code to the implementations and last start to actually
>>>> use the new interface.
>>>>
>>>> That is a pattern we use regularly and I think it's good practice to do
>>>> this.
>>> But we do rely on the proper blockable handling.
>> Yeah, but you could add the handling only after you have all the
>> implementations in place. Don't you?
> Yeah, but then I would be adding a code with no user. And I really
> prefer to no do so because then the code is harder to argue about.
>
>>>>> Is the split up really worth it? I was thinking about that but had hard
>>>>> times to end up with something that would be bisectable. Well, except
>>>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>>>> confusing.
>>>> It at least makes reviewing changes much easier, cause as driver maintainer
>>>> I can concentrate on the stuff only related to me.
>>>>
>>>> Additional to that when you cause some unrelated side effect in a driver we
>>>> can much easier pinpoint the actual change later on when the patch is
>>>> smaller.
>>>>
>>>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>>>> changes.
>>>>> If you are worried to give r-b only for those then this can be done even
>>>>> for larger patches. Just make your Reviewd-by more specific
>>>>> R-b: name # For BLA BLA
>>>> Yeah, possible alternative but more work for me when I review it :)
>>> I definitely do not want to add more work to reviewers and I completely
>>> see how massive "flag days" like these are not popular but I really
>>> didn't find a reasonable way around that would be both correct and
>>> wouldn't add much more churn on the way. So if you really insist then I
>>> would really appreciate a hint on the way to achive the same without any
>>> above downsides.
>> Well, I don't insist on this. It's just from my point of view that this
>> patch doesn't needs to be one patch, but could be split up.
> Well, if there are more people with the same concern I can try to do
> that. But if your only concern is to focus on your particular part then
> I guess it would be easier both for you and me to simply apply the patch
> and use git show $files_for_your_subystem on your end. I have put the
> patch to attempts/oom-vs-mmu-notifiers branch to my tree at
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Not wanting to block something as important as this, so feel free to add 
an Acked-by: Christian König <christian.koenig@amd.com> to the patch.

Let's rather face the next topic: Any idea how to runtime test this?

I mean I can rather easily provide a test which crashes an AMD GPU, 
which in turn then would mean that the MMU notifier would block forever 
without this patch.

But do you know a way to let the OOM killer kill a specific process?

Regards,
Christian.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:39                 ` Christian König
  0 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 02.07.2018 um 14:35 schrieb Michal Hocko:
> On Mon 02-07-18 14:24:29, Christian König wrote:
>> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
>>> On Mon 02-07-18 14:13:42, Christian König wrote:
>>>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>>>> On Mon 02-07-18 11:14:58, Christian König wrote:
>>>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>>>> because I have no idea how.
>>>>>>>
>>>>>>> Any further feedback is highly appreciated of course.
>>>>>> That sounds like it should work and at least the amdgpu changes now look
>>>>>> good to me on first glance.
>>>>>>
>>>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>>>> follow up patches.
>>>>> But such a code would be broken, no? Ignoring the blockable state will
>>>>> simply lead to lockups until the fixup parts get applied.
>>>> Well to still be bisect-able you only need to get the interface change in
>>>> first with fixing the function signature of the implementations.
>>> That would only work if those functions return -AGAIN unconditionally.
>>> Otherwise they would pretend to not block while that would be obviously
>>> incorrect. This doesn't sound correct to me.
>>>
>>>> Then add all the new code to the implementations and last start to actually
>>>> use the new interface.
>>>>
>>>> That is a pattern we use regularly and I think it's good practice to do
>>>> this.
>>> But we do rely on the proper blockable handling.
>> Yeah, but you could add the handling only after you have all the
>> implementations in place. Don't you?
> Yeah, but then I would be adding a code with no user. And I really
> prefer to no do so because then the code is harder to argue about.
>
>>>>> Is the split up really worth it? I was thinking about that but had hard
>>>>> times to end up with something that would be bisectable. Well, except
>>>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>>>> confusing.
>>>> It at least makes reviewing changes much easier, cause as driver maintainer
>>>> I can concentrate on the stuff only related to me.
>>>>
>>>> Additional to that when you cause some unrelated side effect in a driver we
>>>> can much easier pinpoint the actual change later on when the patch is
>>>> smaller.
>>>>
>>>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>>>> changes.
>>>>> If you are worried to give r-b only for those then this can be done even
>>>>> for larger patches. Just make your Reviewd-by more specific
>>>>> R-b: name # For BLA BLA
>>>> Yeah, possible alternative but more work for me when I review it :)
>>> I definitely do not want to add more work to reviewers and I completely
>>> see how massive "flag days" like these are not popular but I really
>>> didn't find a reasonable way around that would be both correct and
>>> wouldn't add much more churn on the way. So if you really insist then I
>>> would really appreciate a hint on the way to achive the same without any
>>> above downsides.
>> Well, I don't insist on this. It's just from my point of view that this
>> patch doesn't needs to be one patch, but could be split up.
> Well, if there are more people with the same concern I can try to do
> that. But if your only concern is to focus on your particular part then
> I guess it would be easier both for you and me to simply apply the patch
> and use git show $files_for_your_subystem on your end. I have put the
> patch to attempts/oom-vs-mmu-notifiers branch to my tree at
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Not wanting to block something as important as this, so feel free to add 
an Acked-by: Christian König <christian.koenig@amd.com> to the patch.

Let's rather face the next topic: Any idea how to runtime test this?

I mean I can rather easily provide a test which crashes an AMD GPU, 
which in turn then would mean that the MMU notifier would block forever 
without this patch.

But do you know a way to let the OOM killer kill a specific process?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:39                 ` Christian König
  0 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

Am 02.07.2018 um 14:35 schrieb Michal Hocko:
> On Mon 02-07-18 14:24:29, Christian KA?nig wrote:
>> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
>>> On Mon 02-07-18 14:13:42, Christian KA?nig wrote:
>>>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>>>> On Mon 02-07-18 11:14:58, Christian KA?nig wrote:
>>>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>>>> because I have no idea how.
>>>>>>>
>>>>>>> Any further feedback is highly appreciated of course.
>>>>>> That sounds like it should work and at least the amdgpu changes now look
>>>>>> good to me on first glance.
>>>>>>
>>>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>>>> follow up patches.
>>>>> But such a code would be broken, no? Ignoring the blockable state will
>>>>> simply lead to lockups until the fixup parts get applied.
>>>> Well to still be bisect-able you only need to get the interface change in
>>>> first with fixing the function signature of the implementations.
>>> That would only work if those functions return -AGAIN unconditionally.
>>> Otherwise they would pretend to not block while that would be obviously
>>> incorrect. This doesn't sound correct to me.
>>>
>>>> Then add all the new code to the implementations and last start to actually
>>>> use the new interface.
>>>>
>>>> That is a pattern we use regularly and I think it's good practice to do
>>>> this.
>>> But we do rely on the proper blockable handling.
>> Yeah, but you could add the handling only after you have all the
>> implementations in place. Don't you?
> Yeah, but then I would be adding a code with no user. And I really
> prefer to no do so because then the code is harder to argue about.
>
>>>>> Is the split up really worth it? I was thinking about that but had hard
>>>>> times to end up with something that would be bisectable. Well, except
>>>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>>>> confusing.
>>>> It at least makes reviewing changes much easier, cause as driver maintainer
>>>> I can concentrate on the stuff only related to me.
>>>>
>>>> Additional to that when you cause some unrelated side effect in a driver we
>>>> can much easier pinpoint the actual change later on when the patch is
>>>> smaller.
>>>>
>>>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>>>> changes.
>>>>> If you are worried to give r-b only for those then this can be done even
>>>>> for larger patches. Just make your Reviewd-by more specific
>>>>> R-b: name # For BLA BLA
>>>> Yeah, possible alternative but more work for me when I review it :)
>>> I definitely do not want to add more work to reviewers and I completely
>>> see how massive "flag days" like these are not popular but I really
>>> didn't find a reasonable way around that would be both correct and
>>> wouldn't add much more churn on the way. So if you really insist then I
>>> would really appreciate a hint on the way to achive the same without any
>>> above downsides.
>> Well, I don't insist on this. It's just from my point of view that this
>> patch doesn't needs to be one patch, but could be split up.
> Well, if there are more people with the same concern I can try to do
> that. But if your only concern is to focus on your particular part then
> I guess it would be easier both for you and me to simply apply the patch
> and use git show $files_for_your_subystem on your end. I have put the
> patch to attempts/oom-vs-mmu-notifiers branch to my tree at
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Not wanting to block something as important as this, so feel free to add 
an Acked-by: Christian KA?nig <christian.koenig@amd.com> to the patch.

Let's rather face the next topic: Any idea how to runtime test this?

I mean I can rather easily provide a test which crashes an AMD GPU, 
which in turn then would mean that the MMU notifier would block forever 
without this patch.

But do you know a way to let the OOM killer kill a specific process?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:35             ` Michal Hocko
  2018-07-02 12:39                 ` Christian König
@ 2018-07-02 12:39               ` Christian König
       [not found]               ` <20180702123521.GO19043-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
  2 siblings, 0 replies; 125+ messages in thread
From: Christian König @ 2018-07-02 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

Am 02.07.2018 um 14:35 schrieb Michal Hocko:
> On Mon 02-07-18 14:24:29, Christian König wrote:
>> Am 02.07.2018 um 14:20 schrieb Michal Hocko:
>>> On Mon 02-07-18 14:13:42, Christian König wrote:
>>>> Am 02.07.2018 um 13:54 schrieb Michal Hocko:
>>>>> On Mon 02-07-18 11:14:58, Christian König wrote:
>>>>>> Am 27.06.2018 um 09:44 schrieb Michal Hocko:
>>>>>>> This is the v2 of RFC based on the feedback I've received so far. The
>>>>>>> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
>>>>>>> because I have no idea how.
>>>>>>>
>>>>>>> Any further feedback is highly appreciated of course.
>>>>>> That sounds like it should work and at least the amdgpu changes now look
>>>>>> good to me on first glance.
>>>>>>
>>>>>> Can you split that up further in the usual way? E.g. adding the blockable
>>>>>> flag in one patch and fixing all implementations of the MMU notifier in
>>>>>> follow up patches.
>>>>> But such a code would be broken, no? Ignoring the blockable state will
>>>>> simply lead to lockups until the fixup parts get applied.
>>>> Well to still be bisect-able you only need to get the interface change in
>>>> first with fixing the function signature of the implementations.
>>> That would only work if those functions return -AGAIN unconditionally.
>>> Otherwise they would pretend to not block while that would be obviously
>>> incorrect. This doesn't sound correct to me.
>>>
>>>> Then add all the new code to the implementations and last start to actually
>>>> use the new interface.
>>>>
>>>> That is a pattern we use regularly and I think it's good practice to do
>>>> this.
>>> But we do rely on the proper blockable handling.
>> Yeah, but you could add the handling only after you have all the
>> implementations in place. Don't you?
> Yeah, but then I would be adding a code with no user. And I really
> prefer to no do so because then the code is harder to argue about.
>
>>>>> Is the split up really worth it? I was thinking about that but had hard
>>>>> times to end up with something that would be bisectable. Well, except
>>>>> for returning -EBUSY until all notifiers are implemented. Which I found
>>>>> confusing.
>>>> It at least makes reviewing changes much easier, cause as driver maintainer
>>>> I can concentrate on the stuff only related to me.
>>>>
>>>> Additional to that when you cause some unrelated side effect in a driver we
>>>> can much easier pinpoint the actual change later on when the patch is
>>>> smaller.
>>>>
>>>>>> This way I'm pretty sure Felix and I can give an rb on the amdgpu/amdkfd
>>>>>> changes.
>>>>> If you are worried to give r-b only for those then this can be done even
>>>>> for larger patches. Just make your Reviewd-by more specific
>>>>> R-b: name # For BLA BLA
>>>> Yeah, possible alternative but more work for me when I review it :)
>>> I definitely do not want to add more work to reviewers and I completely
>>> see how massive "flag days" like these are not popular but I really
>>> didn't find a reasonable way around that would be both correct and
>>> wouldn't add much more churn on the way. So if you really insist then I
>>> would really appreciate a hint on the way to achive the same without any
>>> above downsides.
>> Well, I don't insist on this. It's just from my point of view that this
>> patch doesn't needs to be one patch, but could be split up.
> Well, if there are more people with the same concern I can try to do
> that. But if your only concern is to focus on your particular part then
> I guess it would be easier both for you and me to simply apply the patch
> and use git show $files_for_your_subystem on your end. I have put the
> patch to attempts/oom-vs-mmu-notifiers branch to my tree at
> git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Not wanting to block something as important as this, so feel free to add 
an Acked-by: Christian König <christian.koenig@amd.com> to the patch.

Let's rather face the next topic: Any idea how to runtime test this?

I mean I can rather easily provide a test which crashes an AMD GPU, 
which in turn then would mean that the MMU notifier would block forever 
without this patch.

But do you know a way to let the OOM killer kill a specific process?

Regards,
Christian.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
       [not found]                 ` <91ad1106-6bd4-7d2c-4d40-7c5be945ba36-5C7GfCeVMHo@public.gmane.org>
@ 2018-07-02 12:56                   ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:56 UTC (permalink / raw)
  To: Christian König
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Andrea Arcangeli,
	David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 02-07-18 14:39:50, Christian König wrote:
[...]
> Not wanting to block something as important as this, so feel free to add an
> Acked-by: Christian König <christian.koenig@amd.com> to the patch.

Thanks a lot!

> Let's rather face the next topic: Any idea how to runtime test this?

This is a good question indeed. One way to do that would be triggering
the OOM killer from the context which uses each of these mmu notifiers
(one at the time) and see how that works. You would see the note in the
log whenever the notifier would block. The primary thing to test is how
often the oom reaper really had to back off completely.

> I mean I can rather easily provide a test which crashes an AMD GPU, which in
> turn then would mean that the MMU notifier would block forever without this
> patch.

Well, you do not really have to go that far. It should be sufficient to
do the above. The current code would simply back of without releasing
any memory. The patch should help to reclaim some memory.
 
> But do you know a way to let the OOM killer kill a specific process?

Yes, you can set its oom_score_adj to 1000 which means always select
that task.
-- 
Michal Hocko
SUSE Labs
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:56                   ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:56 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 14:39:50, Christian König wrote:
[...]
> Not wanting to block something as important as this, so feel free to add an
> Acked-by: Christian König <christian.koenig@amd.com> to the patch.

Thanks a lot!

> Let's rather face the next topic: Any idea how to runtime test this?

This is a good question indeed. One way to do that would be triggering
the OOM killer from the context which uses each of these mmu notifiers
(one at the time) and see how that works. You would see the note in the
log whenever the notifier would block. The primary thing to test is how
often the oom reaper really had to back off completely.

> I mean I can rather easily provide a test which crashes an AMD GPU, which in
> turn then would mean that the MMU notifier would block forever without this
> patch.

Well, you do not really have to go that far. It should be sufficient to
do the above. The current code would simply back of without releasing
any memory. The patch should help to reclaim some memory.
 
> But do you know a way to let the OOM killer kill a specific process?

Yes, you can set its oom_score_adj to 1000 which means always select
that task.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-02 12:56                   ` Michal Hocko
  0 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:56 UTC (permalink / raw)
  To: Christian König
  Cc: LKML, David (ChunMing) Zhou, Paolo Bonzini,
	Radim Krčmář,
	Alex Deucher, David Airlie, Jani Nikula, Joonas Lahtinen,
	Rodrigo Vivi, Doug Ledford, Jason Gunthorpe, Mike Marciniszyn,
	Dennis Dalessandro, Sudeep Dutt, Ashutosh Dixit,
	Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Mon 02-07-18 14:39:50, Christian Konig wrote:
[...]
> Not wanting to block something as important as this, so feel free to add an
> Acked-by: Christian Konig <christian.koenig@amd.com> to the patch.

Thanks a lot!

> Let's rather face the next topic: Any idea how to runtime test this?

This is a good question indeed. One way to do that would be triggering
the OOM killer from the context which uses each of these mmu notifiers
(one at the time) and see how that works. You would see the note in the
log whenever the notifier would block. The primary thing to test is how
often the oom reaper really had to back off completely.

> I mean I can rather easily provide a test which crashes an AMD GPU, which in
> turn then would mean that the MMU notifier would block forever without this
> patch.

Well, you do not really have to go that far. It should be sufficient to
do the above. The current code would simply back of without releasing
any memory. The patch should help to reclaim some memory.
 
> But do you know a way to let the OOM killer kill a specific process?

Yes, you can set its oom_score_adj to 1000 which means always select
that task.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-07-02 12:39                 ` Christian König
  (?)
  (?)
@ 2018-07-02 12:56                 ` Michal Hocko
  -1 siblings, 0 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-02 12:56 UTC (permalink / raw)
  To: Christian König
  Cc: kvm, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt, dri-devel, linux-mm,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma, amd-gfx, Jason Gunthorpe, Doug Ledford,
	David Rientjes, xen-devel, intel-gfx, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross, Mike Marciniszyn

On Mon 02-07-18 14:39:50, Christian König wrote:
[...]
> Not wanting to block something as important as this, so feel free to add an
> Acked-by: Christian König <christian.koenig@amd.com> to the patch.

Thanks a lot!

> Let's rather face the next topic: Any idea how to runtime test this?

This is a good question indeed. One way to do that would be triggering
the OOM killer from the context which uses each of these mmu notifiers
(one at the time) and see how that works. You would see the note in the
log whenever the notifier would block. The primary thing to test is how
often the oom reaper really had to back off completely.

> I mean I can rather easily provide a test which crashes an AMD GPU, which in
> turn then would mean that the MMU notifier would block forever without this
> patch.

Well, you do not really have to go that far. It should be sufficient to
do the above. The current code would simply back of without releasing
any memory. The patch should help to reclaim some memory.
 
> But do you know a way to let the OOM killer kill a specific process?

Yes, you can set its oom_score_adj to 1000 which means always select
that task.
-- 
Michal Hocko
SUSE Labs

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
  2018-06-27  7:44 ` Michal Hocko
                     ` (3 preceding siblings ...)
  2018-07-09 12:29   ` Michal Hocko
@ 2018-07-09 12:29   ` Michal Hocko
       [not found]     ` <20180709122908.GJ22049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                       ` (2 more replies)
  4 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-09 12:29 UTC (permalink / raw)
  To: LKML
  Cc: kvm-u79uwXL29TY76Z2rM5mHXA, Radim Krčmář,
	David Airlie, Joonas Lahtinen, Sudeep Dutt,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Felix Kuehling,
	Andrea Arcangeli, David (ChunMing) Zhou, Dimitri Sivanich,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jason Gunthorpe,
	Doug Ledford, David Rientjes,
	xen-devel-GuqFBffKawtpuQazS67q72D2FQJk+8+b,
	intel-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Jani Nikula,
	Jérôme Glisse, Rodrigo Vivi, Boris Ostrovsky,
	Juergen Gross

On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
> 
> Any further feedback is highly appreciated of course.

Any other feedback before I post this as non-RFC?

> ---
> From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
> 
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
> 
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
> 
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
> 
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
> 
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>   on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>   we bail out when we have an intersecting range for starter
> - note that a notifier failed to the log for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>   callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>   path so we have to back off early on overlapping ranges
> 
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/x86/kvm/x86.c                      |  7 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>  drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>  drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>  drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>  drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>  drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>  drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>  include/linux/kvm_host.h                |  4 +--
>  include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>  include/linux/oom.h                     |  2 +-
>  include/rdma/ib_umem_odp.h              |  3 +-
>  mm/hmm.c                                |  7 ++--
>  mm/mmap.c                               |  2 +-
>  mm/mmu_notifier.c                       | 19 ++++++++---
>  mm/oom_kill.c                           | 29 ++++++++--------
>  virt/kvm/kvm_main.c                     | 15 ++++++---
>  19 files changed, 225 insertions(+), 80 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>  	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>  }
>  
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>  {
>  	unsigned long apic_address;
>  
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>  	if (start <= apic_address && apic_address < end)
>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>  }
>  
>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * Take the rmn read side lock.
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>  {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>  	if (atomic_inc_return(&rmn->recursion) == 1)
>  		down_read_non_owner(&rmn->lock);
>  	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>  }
>  
>  /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>   * We block for all BOs between start and end to be idle and
>   * unmap them by move them into system domain again.
>   */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>  						 struct mm_struct *mm,
>  						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>  {
>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>  	struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>  	/* notification is exclusive, but interval is inclusive */
>  	end -= 1;
>  
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>  
>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>  	while (it) {
>  		struct amdgpu_mn_node *node;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);
>  
>  		amdgpu_mn_invalidate_node(node, start, end);
>  	}
> +
> +	return 0;
>  }
>  
>  /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   * necessitates evicting all user-mode queues of the process. The BOs
>   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>   */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  						 struct mm_struct *mm,
>  						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>  {
>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>  	struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  	/* notification is exclusive, but interval is inclusive */
>  	end -= 1;
>  
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>  
>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>  	while (it) {
>  		struct amdgpu_mn_node *node;
>  		struct amdgpu_bo *bo;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);
>  
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  				amdgpu_amdkfd_evict_userptr(mem, mm);
>  		}
>  	}
> +
> +	return 0;
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>  	mo->attached = false;
>  }
>  
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
>  						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>  {
>  	struct i915_mmu_notifier *mn =
>  		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  	LIST_HEAD(cancelled);
>  
>  	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>  
>  	/* interval ranges are inclusive, but invalidate range is exclusive */
>  	end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  	spin_lock(&mn->lock);
>  	it = interval_tree_iter_first(&mn->objects, start, end);
>  	while (it) {
> +		if (!blockable) {
> +			spin_unlock(&mn->lock);
> +			return -EAGAIN;
> +		}
>  		/* The mmu_object is released late when destroying the
>  		 * GEM object so it is entirely possible to gain a
>  		 * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  
>  	if (!list_empty(&cancelled))
>  		flush_workqueue(mn->wq);
> +
> +	return 0;
>  }
>  
>  static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>   * We block for all BOs between start and end to be idle and
>   * unmap them by move them into system domain again.
>   */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
>  					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>  {
>  	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>  	struct ttm_operation_ctx ctx = { false, false };
>  	struct interval_tree_node *it;
> +	int ret = 0;
>  
>  	/* notification is exclusive, but interval is inclusive */
>  	end -= 1;
>  
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>  
>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>  	while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  		struct radeon_bo *bo;
>  		long r;
>  
> +		if (!blockable) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
> +
>  		node = container_of(it, struct radeon_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);
>  
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  		}
>  	}
>  	
> +out_unlock:
>  	mutex_unlock(&rmn->lock);
> +
> +	return ret;
>  }
>  
>  static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>  	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>  				      ULLONG_MAX,
>  				      ib_umem_notifier_release_trampoline,
> +				      true,
>  				      NULL);
>  	up_read(&context->umem_rwsem);
>  }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>  	return 0;
>  }
>  
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>  {
>  	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +	int ret;
>  
>  	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>  
>  	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
> -	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>  				      end,
> -				      invalidate_range_start_trampoline, NULL);
> +				      invalidate_range_start_trampoline,
> +				      blockable, NULL);
>  	up_read(&context->umem_rwsem);
> +
> +	return ret;
>  }
>  
>  static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	if (!context->invalidate_range)
>  		return;
>  
> +	/*
> +	 * TODO: we currently bail out if there is any sleepable work to be done
> +	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +	 * here. But this is ugly and fragile.
> +	 */
>  	down_read(&context->umem_rwsem);
>  	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>  				      end,
> -				      invalidate_range_end_trampoline, NULL);
> +				      invalidate_range_end_trampoline, true, NULL);
>  	up_read(&context->umem_rwsem);
>  	ib_ucontext_notifier_end_account(context);
>  }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  				  u64 start, u64 last,
>  				  umem_call_back cb,
> +				  bool blockable,
>  				  void *cookie)
>  {
>  	int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  
>  	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>  			node; node = next) {
> +		/* TODO move the blockable decision up to the callback */
> +		if (!blockable)
> +			return -EAGAIN;
>  		next = rbt_ib_umem_iter_next(node, start, last - 1);
>  		umem = container_of(node, struct ib_umem_odp, interval_tree);
>  		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>  
>  static unsigned long mmu_node_start(struct mmu_rb_node *);
>  static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>  				     struct mm_struct *,
> -				     unsigned long, unsigned long);
> +				     unsigned long, unsigned long, bool);
>  static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>  					   unsigned long, unsigned long);
>  static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>  	handler->ops->remove(handler->ops_arg, node);
>  }
>  
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
>  				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>  {
>  	struct mmu_rb_handler *handler =
>  		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>  
>  	if (added)
>  		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>  }
>  
>  /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>  
>  	down_read(&ctx->umem_rwsem);
>  	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -				      mr_leaf_free, imr);
> +				      mr_leaf_free, true, imr);
>  	up_read(&ctx->umem_rwsem);
>  
>  	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>  	schedule_work(&scif_info.misc_work);
>  }
>  
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						     struct mm_struct *mm,
>  						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>  {
>  	struct scif_mmu_notif	*mmn;
>  
>  	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>  	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0;
>  }
>  
>  static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>  /*
>   * MMUOPS notifier callout functions
>   */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>  		start, end, atomic_read(&gms->ms_range_active));
>  	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>  }
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>  
>  /* ------------------------------------------------------------------ */
>  
> +static bool in_range(struct grant_map *map,
> +			      unsigned long start, unsigned long end)
> +{
> +	if (!map->vma)
> +		return false;
> +	if (map->vma->vm_start >= end)
> +		return false;
> +	if (map->vma->vm_end <= start)
> +		return false;
> +
> +	return true;
> +}
> +
>  static void unmap_if_in_range(struct grant_map *map,
>  			      unsigned long start, unsigned long end)
>  {
>  	unsigned long mstart, mend;
>  	int err;
>  
> -	if (!map->vma)
> -		return;
> -	if (map->vma->vm_start >= end)
> -		return;
> -	if (map->vma->vm_end <= start)
> -		return;
>  	mstart = max(start, map->vma->vm_start);
>  	mend   = min(end,   map->vma->vm_end);
>  	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>  	WARN_ON(err);
>  }
>  
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>  {
>  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>  	struct grant_map *map;
> +	int ret = 0;
> +
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
>  
> -	mutex_lock(&priv->lock);
>  	list_for_each_entry(map, &priv->maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>  		unmap_if_in_range(map, start, end);
>  	}
>  	list_for_each_entry(map, &priv->freeable_maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>  		unmap_if_in_range(map, start, end);
>  	}
> +
> +out_unlock:
>  	mutex_unlock(&priv->lock);
> +
> +	return ret;
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>  }
>  #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>  
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable);
>  
>  #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>  int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>  	 * address space but may still be referenced by sptes until
>  	 * the last refcount is dropped.
>  	 *
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
> +	 *
>  	 */
> -	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +	int (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start, unsigned long end,
> +				       bool blockable);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
>  				     unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end,
>  				  bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  {
>  }
>  
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end)
>  {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>  
>  extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>   */
>  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  				  u64 start, u64 end,
> -				  umem_call_back cb, void *cookie);
> +				  umem_call_back cb,
> +				  bool blockable, void *cookie);
>  
>  /*
>   * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	up_write(&hmm->mirrors_sem);
>  }
>  
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
>  				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>  {
>  	struct hmm *hmm = mm->hmm;
>  
>  	VM_BUG_ON(!hmm);
>  
>  	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>  }
>  
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>  		 * reliably test it.
>  		 */
>  		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>  		mutex_unlock(&oom_lock);
>  
>  		set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>  	srcu_read_unlock(&srcu, id);
>  }
>  
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>  {
>  	struct mmu_notifier *mn;
> +	int ret = 0;
>  	int id;
>  
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret) {
> +				pr_info("%pS callback failed with %d in %sblockable context.\n",
> +						mn->ops->invalidate_range_start, _ret,
> +						!blockable ? "non-": "");
> +				ret = _ret;
> +			}
> +		}
>  	}
>  	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(struct mm_struct *mm)
>  {
>  	struct vm_area_struct *vma;
> +	bool ret = true;
>  
>  	/*
>  	 * Tell all users of get_user/copy_from_user etc... that the content
> @@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
>  			struct mmu_gather tlb;
>  
>  			tlb_gather_mmu(&tlb, mm, start, end);
> -			mmu_notifier_invalidate_range_start(mm, start, end);
> +			if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
> +				ret = false;
> +				continue;
> +			}
>  			unmap_page_range(&tlb, vma, start, end, NULL);
>  			mmu_notifier_invalidate_range_end(mm, start, end);
>  			tlb_finish_mmu(&tlb, start, end);
>  		}
>  	}
> +
> +	return ret;
>  }
>  
>  static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
> @@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  		goto unlock_oom;
>  	}
>  
> -	/*
> -	 * If the mm has invalidate_{start,end}() notifiers that could block,
> -	 * sleep to give the oom victim some more time.
> -	 * TODO: we really want to get rid of this ugly hack and make sure that
> -	 * notifiers cannot block for unbounded amount of time
> -	 */
> -	if (mm_has_blockable_invalidate_notifiers(mm)) {
> -		up_read(&mm->mmap_sem);
> -		schedule_timeout_idle(HZ);
> -		goto unlock_oom;
> -	}
> -
>  	/*
>  	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
>  	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
> @@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
>  
>  	trace_start_task_reaping(tsk->pid);
>  
> -	__oom_reap_task_mm(mm);
> +	/* failed to reap part of the address space. Try again later */
> +	if (!__oom_reap_task_mm(mm)) {
> +		up_read(&mm->mmap_sem);
> +		ret = false;
> +		goto unlock_oom;
> +	}
>  
>  	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
>  			task_pid_nr(tsk), tsk->comm,
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ada21f47f22b..16ce38f178d1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -135,9 +135,10 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
>  static unsigned long long kvm_createvm_count;
>  static unsigned long long kvm_active_vms;
>  
> -__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable)
>  {
> +	return 0;
>  }
>  
>  bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> @@ -354,13 +355,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
>  	srcu_read_unlock(&kvm->srcu, idx);
>  }
>  
> -static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>  {
>  	struct kvm *kvm = mmu_notifier_to_kvm(mn);
>  	int need_tlb_flush = 0, idx;
> +	int ret;
>  
>  	idx = srcu_read_lock(&kvm->srcu);
>  	spin_lock(&kvm->mmu_lock);
> @@ -378,9 +381,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  
>  	spin_unlock(&kvm->mmu_lock);
>  
> -	kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
> +	ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
>  
>  	srcu_read_unlock(&kvm->srcu, idx);
> +
> +	return ret;
>  }
>  
>  static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> -- 
> 2.18.0
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [RFC PATCH] mm, oom: distinguish blockable mode for mmu notifiers
@ 2018-07-09 12:29   ` Michal Hocko
       [not found]     ` <20180709122908.GJ22049-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org>
                       ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Michal Hocko @ 2018-07-09 12:29 UTC (permalink / raw)
  To: LKML
  Cc: David (ChunMing) Zhou, Paolo Bonzini, Radim Krčmář,
	Alex Deucher, Christian König, David Airlie, Jani Nikula,
	Joonas Lahtinen, Rodrigo Vivi, Doug Ledford, Jason Gunthorpe,
	Mike Marciniszyn, Dennis Dalessandro, Sudeep Dutt,
	Ashutosh Dixit, Dimitri Sivanich, Boris Ostrovsky, Juergen Gross,
	Jérôme Glisse, Andrea Arcangeli, kvm, amd-gfx,
	dri-devel, intel-gfx, linux-rdma, xen-devel, linux-mm,
	David Rientjes, Felix Kuehling

On Wed 27-06-18 09:44:21, Michal Hocko wrote:
> This is the v2 of RFC based on the feedback I've received so far. The
> code even compiles as a bonus ;) I haven't runtime tested it yet, mostly
> because I have no idea how.
> 
> Any further feedback is highly appreciated of course.

Any other feedback before I post this as non-RFC?

> ---
> From ec9a7241bf422b908532c4c33953b0da2655ad05 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 20 Jun 2018 15:03:20 +0200
> Subject: [PATCH] mm, oom: distinguish blockable mode for mmu notifiers
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> There are several blockable mmu notifiers which might sleep in
> mmu_notifier_invalidate_range_start and that is a problem for the
> oom_reaper because it needs to guarantee a forward progress so it cannot
> depend on any sleepable locks.
> 
> Currently we simply back off and mark an oom victim with blockable mmu
> notifiers as done after a short sleep. That can result in selecting a
> new oom victim prematurely because the previous one still hasn't torn
> its memory down yet.
> 
> We can do much better though. Even if mmu notifiers use sleepable locks
> there is no reason to automatically assume those locks are held.
> Moreover majority of notifiers only care about a portion of the address
> space and there is absolutely zero reason to fail when we are unmapping an
> unrelated range. Many notifiers do really block and wait for HW which is
> harder to handle and we have to bail out though.
> 
> This patch handles the low hanging fruid. __mmu_notifier_invalidate_range_start
> gets a blockable flag and callbacks are not allowed to sleep if the
> flag is set to false. This is achieved by using trylock instead of the
> sleepable lock for most callbacks and continue as long as we do not
> block down the call chain.
> 
> I think we can improve that even further because there is a common
> pattern to do a range lookup first and then do something about that.
> The first part can be done without a sleeping lock in most cases AFAICS.
> 
> The oom_reaper end then simply retries if there is at least one notifier
> which couldn't make any progress in !blockable mode. A retry loop is
> already implemented to wait for the mmap_sem and this is basically the
> same thing.
> 
> Changes since rfc v1
> - gpu notifiers can sleep while waiting for HW (evict_process_queues_cpsch
>   on a lock and amdgpu_mn_invalidate_node on unbound timeout) make sure
>   we bail out when we have an intersecting range for starter
> - note that a notifier failed to the log for easier debugging
> - back off early in ib_umem_notifier_invalidate_range_start if the
>   callback is called
> - mn_invl_range_start waits for completion down the unmap_grant_pages
>   path so we have to back off early on overlapping ranges
> 
> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: "Radim Krčmář" <rkrcmar@redhat.com>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Airlie <airlied@linux.ie>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Doug Ledford <dledford@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Cc: Sudeep Dutt <sudeep.dutt@intel.com>
> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
> Cc: Dimitri Sivanich <sivanich@sgi.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: Juergen Gross <jgross@suse.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: kvm@vger.kernel.org (open list:KERNEL VIRTUAL MACHINE FOR X86 (KVM/x86))
> Cc: linux-kernel@vger.kernel.org (open list:X86 ARCHITECTURE (32-BIT AND 64-BIT))
> Cc: amd-gfx@lists.freedesktop.org (open list:RADEON and AMDGPU DRM DRIVERS)
> Cc: dri-devel@lists.freedesktop.org (open list:DRM DRIVERS)
> Cc: intel-gfx@lists.freedesktop.org (open list:INTEL DRM DRIVERS (excluding Poulsbo, Moorestow...)
> Cc: linux-rdma@vger.kernel.org (open list:INFINIBAND SUBSYSTEM)
> Cc: xen-devel@lists.xenproject.org (moderated list:XEN HYPERVISOR INTERFACE)
> Cc: linux-mm@kvack.org (open list:HMM - Heterogeneous Memory Management)
> Reported-by: David Rientjes <rientjes@google.com>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  arch/x86/kvm/x86.c                      |  7 ++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 43 +++++++++++++++++++-----
>  drivers/gpu/drm/i915/i915_gem_userptr.c | 13 ++++++--
>  drivers/gpu/drm/radeon/radeon_mn.c      | 22 +++++++++++--
>  drivers/infiniband/core/umem_odp.c      | 33 +++++++++++++++----
>  drivers/infiniband/hw/hfi1/mmu_rb.c     | 11 ++++---
>  drivers/infiniband/hw/mlx5/odp.c        |  2 +-
>  drivers/misc/mic/scif/scif_dma.c        |  7 ++--
>  drivers/misc/sgi-gru/grutlbpurge.c      |  7 ++--
>  drivers/xen/gntdev.c                    | 44 ++++++++++++++++++++-----
>  include/linux/kvm_host.h                |  4 +--
>  include/linux/mmu_notifier.h            | 35 +++++++++++++++-----
>  include/linux/oom.h                     |  2 +-
>  include/rdma/ib_umem_odp.h              |  3 +-
>  mm/hmm.c                                |  7 ++--
>  mm/mmap.c                               |  2 +-
>  mm/mmu_notifier.c                       | 19 ++++++++---
>  mm/oom_kill.c                           | 29 ++++++++--------
>  virt/kvm/kvm_main.c                     | 15 ++++++---
>  19 files changed, 225 insertions(+), 80 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
>  	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
>  }
>  
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end,
> +		bool blockable)
>  {
>  	unsigned long apic_address;
>  
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>  	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
>  	if (start <= apic_address && apic_address < end)
>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> +	return 0;
>  }
>  
>  void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 83e344fbb50a..3399a4a927fb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   *
>   * Take the rmn read side lock.
>   */
> -static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
> +static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
>  {
> -	mutex_lock(&rmn->read_lock);
> +	if (blockable)
> +		mutex_lock(&rmn->read_lock);
> +	else if (!mutex_trylock(&rmn->read_lock))
> +		return -EAGAIN;
> +
>  	if (atomic_inc_return(&rmn->recursion) == 1)
>  		down_read_non_owner(&rmn->lock);
>  	mutex_unlock(&rmn->read_lock);
> +
> +	return 0;
>  }
>  
>  /**
> @@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
>   * We block for all BOs between start and end to be idle and
>   * unmap them by move them into system domain again.
>   */
> -static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>  						 struct mm_struct *mm,
>  						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>  {
>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>  	struct interval_tree_node *it;
> @@ -208,17 +215,28 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>  	/* notification is exclusive, but interval is inclusive */
>  	end -= 1;
>  
> -	amdgpu_mn_read_lock(rmn);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * amdgpu_mn_invalidate_node
> +	 */
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>  
>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>  	while (it) {
>  		struct amdgpu_mn_node *node;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);
>  
>  		amdgpu_mn_invalidate_node(node, start, end);
>  	}
> +
> +	return 0;
>  }
>  
>  /**
> @@ -233,10 +251,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
>   * necessitates evicting all user-mode queues of the process. The BOs
>   * are restorted in amdgpu_mn_invalidate_range_end_hsa.
>   */
> -static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
> +static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  						 struct mm_struct *mm,
>  						 unsigned long start,
> -						 unsigned long end)
> +						 unsigned long end,
> +						 bool blockable)
>  {
>  	struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
>  	struct interval_tree_node *it;
> @@ -244,13 +263,19 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  	/* notification is exclusive, but interval is inclusive */
>  	end -= 1;
>  
> -	amdgpu_mn_read_lock(rmn);
> +	if (amdgpu_mn_read_lock(rmn, blockable))
> +		return -EAGAIN;
>  
>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>  	while (it) {
>  		struct amdgpu_mn_node *node;
>  		struct amdgpu_bo *bo;
>  
> +		if (!blockable) {
> +			amdgpu_mn_read_unlock(rmn);
> +			return -EAGAIN;
> +		}
> +
>  		node = container_of(it, struct amdgpu_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);
>  
> @@ -262,6 +287,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
>  				amdgpu_amdkfd_evict_userptr(mem, mm);
>  		}
>  	}
> +
> +	return 0;
>  }
>  
>  /**
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index 854bd51b9478..9cbff68f6b41 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
>  	mo->attached = false;
>  }
>  
> -static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
> +static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  						       struct mm_struct *mm,
>  						       unsigned long start,
> -						       unsigned long end)
> +						       unsigned long end,
> +						       bool blockable)
>  {
>  	struct i915_mmu_notifier *mn =
>  		container_of(_mn, struct i915_mmu_notifier, mn);
> @@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  	LIST_HEAD(cancelled);
>  
>  	if (RB_EMPTY_ROOT(&mn->objects.rb_root))
> -		return;
> +		return 0;
>  
>  	/* interval ranges are inclusive, but invalidate range is exclusive */
>  	end--;
> @@ -132,6 +133,10 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  	spin_lock(&mn->lock);
>  	it = interval_tree_iter_first(&mn->objects, start, end);
>  	while (it) {
> +		if (!blockable) {
> +			spin_unlock(&mn->lock);
> +			return -EAGAIN;
> +		}
>  		/* The mmu_object is released late when destroying the
>  		 * GEM object so it is entirely possible to gain a
>  		 * reference on an object in the process of being freed
> @@ -154,6 +159,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
>  
>  	if (!list_empty(&cancelled))
>  		flush_workqueue(mn->wq);
> +
> +	return 0;
>  }
>  
>  static const struct mmu_notifier_ops i915_gem_userptr_notifier = {
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index abd24975c9b1..f8b35df44c60 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -118,19 +118,27 @@ static void radeon_mn_release(struct mmu_notifier *mn,
>   * We block for all BOs between start and end to be idle and
>   * unmap them by move them into system domain again.
>   */
> -static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> +static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  					     struct mm_struct *mm,
>  					     unsigned long start,
> -					     unsigned long end)
> +					     unsigned long end,
> +					     bool blockable)
>  {
>  	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
>  	struct ttm_operation_ctx ctx = { false, false };
>  	struct interval_tree_node *it;
> +	int ret = 0;
>  
>  	/* notification is exclusive, but interval is inclusive */
>  	end -= 1;
>  
> -	mutex_lock(&rmn->lock);
> +	/* TODO we should be able to split locking for interval tree and
> +	 * the tear down.
> +	 */
> +	if (blockable)
> +		mutex_lock(&rmn->lock);
> +	else if (!mutex_trylock(&rmn->lock))
> +		return -EAGAIN;
>  
>  	it = interval_tree_iter_first(&rmn->objects, start, end);
>  	while (it) {
> @@ -138,6 +146,11 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  		struct radeon_bo *bo;
>  		long r;
>  
> +		if (!blockable) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
> +
>  		node = container_of(it, struct radeon_mn_node, it);
>  		it = interval_tree_iter_next(it, start, end);
>  
> @@ -166,7 +179,10 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
>  		}
>  	}
>  	
> +out_unlock:
>  	mutex_unlock(&rmn->lock);
> +
> +	return ret;
>  }
>  
>  static const struct mmu_notifier_ops radeon_mn_ops = {
> diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
> index 182436b92ba9..6ec748eccff7 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -186,6 +186,7 @@ static void ib_umem_notifier_release(struct mmu_notifier *mn,
>  	rbt_ib_umem_for_each_in_range(&context->umem_tree, 0,
>  				      ULLONG_MAX,
>  				      ib_umem_notifier_release_trampoline,
> +				      true,
>  				      NULL);
>  	up_read(&context->umem_rwsem);
>  }
> @@ -207,22 +208,31 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
>  	return 0;
>  }
>  
> -static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						    struct mm_struct *mm,
>  						    unsigned long start,
> -						    unsigned long end)
> +						    unsigned long end,
> +						    bool blockable)
>  {
>  	struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
> +	int ret;
>  
>  	if (!context->invalidate_range)
> -		return;
> +		return 0;
> +
> +	if (blockable)
> +		down_read(&context->umem_rwsem);
> +	else if (!down_read_trylock(&context->umem_rwsem))
> +		return -EAGAIN;
>  
>  	ib_ucontext_notifier_start_account(context);
> -	down_read(&context->umem_rwsem);
> -	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
> +	ret = rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>  				      end,
> -				      invalidate_range_start_trampoline, NULL);
> +				      invalidate_range_start_trampoline,
> +				      blockable, NULL);
>  	up_read(&context->umem_rwsem);
> +
> +	return ret;
>  }
>  
>  static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
> @@ -242,10 +252,15 @@ static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>  	if (!context->invalidate_range)
>  		return;
>  
> +	/*
> +	 * TODO: we currently bail out if there is any sleepable work to be done
> +	 * in ib_umem_notifier_invalidate_range_start so we shouldn't really block
> +	 * here. But this is ugly and fragile.
> +	 */
>  	down_read(&context->umem_rwsem);
>  	rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
>  				      end,
> -				      invalidate_range_end_trampoline, NULL);
> +				      invalidate_range_end_trampoline, true, NULL);
>  	up_read(&context->umem_rwsem);
>  	ib_ucontext_notifier_end_account(context);
>  }
> @@ -798,6 +813,7 @@ EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
>  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  				  u64 start, u64 last,
>  				  umem_call_back cb,
> +				  bool blockable,
>  				  void *cookie)
>  {
>  	int ret_val = 0;
> @@ -809,6 +825,9 @@ int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  
>  	for (node = rbt_ib_umem_iter_first(root, start, last - 1);
>  			node; node = next) {
> +		/* TODO move the blockable decision up to the callback */
> +		if (!blockable)
> +			return -EAGAIN;
>  		next = rbt_ib_umem_iter_next(node, start, last - 1);
>  		umem = container_of(node, struct ib_umem_odp, interval_tree);
>  		ret_val = cb(umem->umem, start, last, cookie) || ret_val;
> diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
> index 70aceefe14d5..e1c7996c018e 100644
> --- a/drivers/infiniband/hw/hfi1/mmu_rb.c
> +++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
> @@ -67,9 +67,9 @@ struct mmu_rb_handler {
>  
>  static unsigned long mmu_node_start(struct mmu_rb_node *);
>  static unsigned long mmu_node_last(struct mmu_rb_node *);
> -static void mmu_notifier_range_start(struct mmu_notifier *,
> +static int mmu_notifier_range_start(struct mmu_notifier *,
>  				     struct mm_struct *,
> -				     unsigned long, unsigned long);
> +				     unsigned long, unsigned long, bool);
>  static struct mmu_rb_node *__mmu_rb_search(struct mmu_rb_handler *,
>  					   unsigned long, unsigned long);
>  static void do_remove(struct mmu_rb_handler *handler,
> @@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
>  	handler->ops->remove(handler->ops_arg, node);
>  }
>  
> -static void mmu_notifier_range_start(struct mmu_notifier *mn,
> +static int mmu_notifier_range_start(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
>  				     unsigned long start,
> -				     unsigned long end)
> +				     unsigned long end,
> +				     bool blockable)
>  {
>  	struct mmu_rb_handler *handler =
>  		container_of(mn, struct mmu_rb_handler, mn);
> @@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
>  
>  	if (added)
>  		queue_work(handler->wq, &handler->del_work);
> +
> +	return 0;
>  }
>  
>  /*
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index f1a87a690a4c..d216e0d2921d 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -488,7 +488,7 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
>  
>  	down_read(&ctx->umem_rwsem);
>  	rbt_ib_umem_for_each_in_range(&ctx->umem_tree, 0, ULLONG_MAX,
> -				      mr_leaf_free, imr);
> +				      mr_leaf_free, true, imr);
>  	up_read(&ctx->umem_rwsem);
>  
>  	wait_event(imr->q_leaf_free, !atomic_read(&imr->num_leaf_free));
> diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
> index 63d6246d6dff..6369aeaa7056 100644
> --- a/drivers/misc/mic/scif/scif_dma.c
> +++ b/drivers/misc/mic/scif/scif_dma.c
> @@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
>  	schedule_work(&scif_info.misc_work);
>  }
>  
> -static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> +static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>  						     struct mm_struct *mm,
>  						     unsigned long start,
> -						     unsigned long end)
> +						     unsigned long end,
> +						     bool blockable)
>  {
>  	struct scif_mmu_notif	*mmn;
>  
>  	mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
>  	scif_rma_destroy_tcw(mmn, start, end - start);
> +
> +	return 0;
>  }
>  
>  static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
> index a3454eb56fbf..be28f05bfafa 100644
> --- a/drivers/misc/sgi-gru/grutlbpurge.c
> +++ b/drivers/misc/sgi-gru/grutlbpurge.c
> @@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
>  /*
>   * MMUOPS notifier callout functions
>   */
> -static void gru_invalidate_range_start(struct mmu_notifier *mn,
> +static int gru_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end)
> +				       unsigned long start, unsigned long end,
> +				       bool blockable)
>  {
>  	struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
>  						 ms_notifier);
> @@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
>  	gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
>  		start, end, atomic_read(&gms->ms_range_active));
>  	gru_flush_tlb_range(gms, start, end - start);
> +
> +	return 0;
>  }
>  
>  static void gru_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index bd56653b9bbc..55b4f0e3f4d6 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -441,18 +441,25 @@ static const struct vm_operations_struct gntdev_vmops = {
>  
>  /* ------------------------------------------------------------------ */
>  
> +static bool in_range(struct grant_map *map,
> +			      unsigned long start, unsigned long end)
> +{
> +	if (!map->vma)
> +		return false;
> +	if (map->vma->vm_start >= end)
> +		return false;
> +	if (map->vma->vm_end <= start)
> +		return false;
> +
> +	return true;
> +}
> +
>  static void unmap_if_in_range(struct grant_map *map,
>  			      unsigned long start, unsigned long end)
>  {
>  	unsigned long mstart, mend;
>  	int err;
>  
> -	if (!map->vma)
> -		return;
> -	if (map->vma->vm_start >= end)
> -		return;
> -	if (map->vma->vm_end <= start)
> -		return;
>  	mstart = max(start, map->vma->vm_start);
>  	mend   = min(end,   map->vma->vm_end);
>  	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
> @@ -465,21 +472,40 @@ static void unmap_if_in_range(struct grant_map *map,
>  	WARN_ON(err);
>  }
>  
> -static void mn_invl_range_start(struct mmu_notifier *mn,
> +static int mn_invl_range_start(struct mmu_notifier *mn,
>  				struct mm_struct *mm,
> -				unsigned long start, unsigned long end)
> +				unsigned long start, unsigned long end,
> +				bool blockable)
>  {
>  	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
>  	struct grant_map *map;
> +	int ret = 0;
> +
> +	/* TODO do we really need a mutex here? */
> +	if (blockable)
> +		mutex_lock(&priv->lock);
> +	else if (!mutex_trylock(&priv->lock))
> +		return -EAGAIN;
>  
> -	mutex_lock(&priv->lock);
>  	list_for_each_entry(map, &priv->maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>  		unmap_if_in_range(map, start, end);
>  	}
>  	list_for_each_entry(map, &priv->freeable_maps, next) {
> +		if (in_range(map, start, end)) {
> +			ret = -EAGAIN;
> +			goto out_unlock;
> +		}
>  		unmap_if_in_range(map, start, end);
>  	}
> +
> +out_unlock:
>  	mutex_unlock(&priv->lock);
> +
> +	return ret;
>  }
>  
>  static void mn_release(struct mmu_notifier *mn,
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 4ee7bc548a83..148935085194 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1275,8 +1275,8 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
>  }
>  #endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
>  
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> -		unsigned long start, unsigned long end);
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> +		unsigned long start, unsigned long end, bool blockable);
>  
>  #ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
>  int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu);
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 392e6af82701..2eb1a2d01759 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -151,13 +151,15 @@ struct mmu_notifier_ops {
>  	 * address space but may still be referenced by sptes until
>  	 * the last refcount is dropped.
>  	 *
> -	 * If both of these callbacks cannot block, and invalidate_range
> -	 * cannot block, mmu_notifier_ops.flags should have
> -	 * MMU_INVALIDATE_DOES_NOT_BLOCK set.
> +	 * If blockable argument is set to false then the callback cannot
> +	 * sleep and has to return with -EAGAIN. 0 should be returned
> +	 * otherwise.
> +	 *
>  	 */
> -	void (*invalidate_range_start)(struct mmu_notifier *mn,
> +	int (*invalidate_range_start)(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
> -				       unsigned long start, unsigned long end);
> +				       unsigned long start, unsigned long end,
> +				       bool blockable);
>  	void (*invalidate_range_end)(struct mmu_notifier *mn,
>  				     struct mm_struct *mm,
>  				     unsigned long start, unsigned long end);
> @@ -229,8 +231,9 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
>  				     unsigned long address);
>  extern void __mmu_notifier_change_pte(struct mm_struct *mm,
>  				      unsigned long address, pte_t pte);
> -extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end);
> +extern int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable);
>  extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end,
>  				  bool only_end);
> @@ -281,7 +284,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end)
>  {
>  	if (mm_has_notifiers(mm))
> -		__mmu_notifier_invalidate_range_start(mm, start, end);
> +		__mmu_notifier_invalidate_range_start(mm, start, end, true);
> +}
> +
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	int ret = 0;
> +	if (mm_has_notifiers(mm))
> +		ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
> +
> +	return ret;
>  }
>  
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
> @@ -461,6 +474,12 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
>  {
>  }
>  
> +static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end)
> +{
> +	return 0;
> +}
> +
>  static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
>  				  unsigned long start, unsigned long end)
>  {
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 6adac113e96d..92f70e4c6252 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -95,7 +95,7 @@ static inline int check_stable_address_space(struct mm_struct *mm)
>  	return 0;
>  }
>  
> -void __oom_reap_task_mm(struct mm_struct *mm);
> +bool __oom_reap_task_mm(struct mm_struct *mm);
>  
>  extern unsigned long oom_badness(struct task_struct *p,
>  		struct mem_cgroup *memcg, const nodemask_t *nodemask,
> diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
> index 6a17f856f841..381cdf5a9bd1 100644
> --- a/include/rdma/ib_umem_odp.h
> +++ b/include/rdma/ib_umem_odp.h
> @@ -119,7 +119,8 @@ typedef int (*umem_call_back)(struct ib_umem *item, u64 start, u64 end,
>   */
>  int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
>  				  u64 start, u64 end,
> -				  umem_call_back cb, void *cookie);
> +				  umem_call_back cb,
> +				  bool blockable, void *cookie);
>  
>  /*
>   * Find first region intersecting with address range.
> diff --git a/mm/hmm.c b/mm/hmm.c
> index de7b6bf77201..81fd57bd2634 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	up_write(&hmm->mirrors_sem);
>  }
>  
> -static void hmm_invalidate_range_start(struct mmu_notifier *mn,
> +static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  				       struct mm_struct *mm,
>  				       unsigned long start,
> -				       unsigned long end)
> +				       unsigned long end,
> +				       bool blockable)
>  {
>  	struct hmm *hmm = mm->hmm;
>  
>  	VM_BUG_ON(!hmm);
>  
>  	atomic_inc(&hmm->sequence);
> +
> +	return 0;
>  }
>  
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d1eb87ef4b1a..336bee8c4e25 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3074,7 +3074,7 @@ void exit_mmap(struct mm_struct *mm)
>  		 * reliably test it.
>  		 */
>  		mutex_lock(&oom_lock);
> -		__oom_reap_task_mm(mm);
> +		(void)__oom_reap_task_mm(mm);
>  		mutex_unlock(&oom_lock);
>  
>  		set_bit(MMF_OOM_SKIP, &mm->flags);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index eff6b88a993f..103b2b450043 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -174,18 +174,29 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
>  	srcu_read_unlock(&srcu, id);
>  }
>  
> -void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> -				  unsigned long start, unsigned long end)
> +int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
> +				  unsigned long start, unsigned long end,
> +				  bool blockable)
>  {
>  	struct mmu_notifier *mn;
> +	int ret = 0;
>  	int id;
>  
>  	id = srcu_read_lock(&srcu);
>  	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
> -		if (mn->ops->invalidate_range_start)
> -			mn->ops->invalidate_range_start(mn, mm, start, end);
> +		if (mn->ops->invalidate_range_start) {
> +			int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
> +			if (_ret) {
> +				pr_info("%pS callback failed with %d in %sblockable context.\n",
> +						mn->ops->invalidate_range_start, _ret,
> +						!blockable ? "non-": "");
> +				ret = _ret;
> +			}
> +		}
>  	}
>  	srcu_read_unlock(&srcu, id);
> +
> +	return ret;
>  }
>  EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
>  
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 84081e77bc51..5a936cf24d79 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
>  static struct task_struct *oom_reaper_list;
>  static DEFINE_SPINLOCK(oom_reaper_lock);
>  
> -void __oom_reap_task_mm(struct mm_struct *mm)
> +bool __oom_reap_task_mm(stru