Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations
@ 2019-08-14 20:20 Daniel Vetter
  2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
                   ` (4 more replies)
  0 siblings, 5 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-14 20:20 UTC (permalink / raw)
  To: LKML; +Cc: linux-mm, DRI Development, Intel Graphics Development, Daniel Vetter

Hi all (but I guess mostly Jason),

Finally gotten around to rebasing the previous version, fixing the rebase
fail in there, update the commit message a bit and give this a spin with
some tests. Nicely caught a lockdep splat that we're now discussing in
i915, and seems to not have misfired anywhere else (including a few oom).

Review, comments and everything very much appreciated. Plus I'd really
like to land this, there's more hmm_mirror users in-flight, and I've seen
some that get things wrong which this patchset should catch.

Thanks, Daniel

Daniel Vetter (5):
  mm: Check if mmu notifier callbacks are allowed to fail
  kernel.h: Add non_block_start/end()
  mm, notifier: Catch sleeping/blocking for !blockable
  mm, notifier: Add a lockdep map for invalidate_range_start
  mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors

 include/linux/kernel.h       | 10 +++++++++-
 include/linux/mmu_notifier.h |  6 ++++++
 include/linux/sched.h        |  4 ++++
 kernel/sched/core.c          | 19 ++++++++++++++-----
 mm/hmm.c                     |  3 +++
 mm/mmu_notifier.c            | 17 ++++++++++++++++-
 6 files changed, 52 insertions(+), 7 deletions(-)

-- 
2.22.0



^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail
  2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
@ 2019-08-14 20:20 ` Daniel Vetter
  2019-08-14 22:14   ` Andrew Morton
  2019-08-16 17:19   ` Jason Gunthorpe
  2019-08-14 20:20 ` [PATCH 2/5] kernel.h: Add non_block_start/end() Daniel Vetter
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-14 20:20 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, DRI Development, Intel Graphics Development,
	Daniel Vetter, Andrew Morton, Michal Hocko, Christian König,
	David Rientjes, Jérôme Glisse, Paolo Bonzini,
	Jason Gunthorpe, Daniel Vetter

Just a bit of paranoia, since if we start pushing this deep into
callchains it's hard to spot all places where an mmu notifier
implementation might fail when it's not allowed to.

Inspired by some confusion we had discussing i915 mmu notifiers and
whether we could use the newly-introduced return value to handle some
corner cases. Until we realized that these are only for when a task
has been killed by the oom reaper.

An alternative approach would be to split the callback into two
versions, one with the int return value, and the other with void
return value like in older kernels. But that's a lot more churn for
fairly little gain I think.

Summary from the m-l discussion on why we want something at warning
level: This allows automated tooling in CI to catch bugs without
humans having to look at everything. If we just upgrade the existing
pr_info to a pr_warn, then we'll have false positives. And as-is, no
one will ever spot the problem since it's lost in the massive amounts
of overall dmesg noise.

v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
the problematic case (Michal Hocko).

v3: Rebase on top of Glisse's arg rework.

v4: More rebase on top of Glisse reworking everything.

v5: Fixup rebase damage and also catch failures != EAGAIN for
!blockable (Jason). Also go back to WARN_ON as requested by Jason, so
automatic checkers can easily catch bugs by setting panic_on_warn.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index b5670620aea0..16f1cbc775d0 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,
 					!mmu_notifier_range_blockable(range) ? "non-" : "");
+				WARN_ON(mmu_notifier_range_blockable(range) ||
+					ret != -EAGAIN);
 				ret = _ret;
 			}
 		}
-- 
2.22.0



^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
  2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
@ 2019-08-14 20:20 ` Daniel Vetter
  2019-08-14 20:45   ` Andrew Morton
  2019-08-14 23:58   ` Jason Gunthorpe
  2019-08-14 20:20 ` [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable Daniel Vetter
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-14 20:20 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, DRI Development, Intel Graphics Development,
	Daniel Vetter, Jason Gunthorpe, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Michal Hocko, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

In some special cases we must not block, but there's not a
spinlock, preempt-off, irqs-off or similar critical section already
that arms the might_sleep() debug checks. Add a non_block_start/end()
pair to annotate these.

This will be used in the oom paths of mmu-notifiers, where blocking is
not allowed to make sure there's forward progress. Quoting Michal:

"The notifier is called from quite a restricted context - oom_reaper -
which shouldn't depend on any locks or sleepable conditionals. The code
should be swift as well but we mostly do care about it to make a forward
progress. Checking for sleepable context is the best thing we could come
up with that would describe these demands at least partially."

Peter also asked whether we want to catch spinlocks on top, but Michal
said those are less of a problem because spinlocks can't have an
indirect dependency upon the page allocator and hence close the loop
with the oom reaper.

Suggested by Michal Hocko.

v2:
- Improve commit message (Michal)
- Also check in schedule, not just might_sleep (Peter)

v3: It works better when I actually squash in the fixup I had lying
around :-/

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Wei Wang <wvw@google.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jann Horn <jannh@google.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: linux-kernel@vger.kernel.org
Acked-by: Christian König <christian.koenig@amd.com> (v1)
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/kernel.h | 10 +++++++++-
 include/linux/sched.h  |  4 ++++
 kernel/sched/core.c    | 19 ++++++++++++++-----
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 4fa360a13c1e..915fd9888afb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -217,7 +217,9 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
  * might_sleep - annotation for functions that can sleep
  *
  * this macro will print a stack trace if it is executed in an atomic
- * context (spinlock, irq-handler, ...).
+ * context (spinlock, irq-handler, ...). Additional sections where blocking is
+ * not allowed can be annotated with non_block_start() and non_block_end()
+ * pairs.
  *
  * This is a useful debugging help to be able to catch problems early and not
  * be bitten later when the calling function happens to sleep when it is not
@@ -233,6 +235,10 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define cant_sleep() \
 	do { __cant_sleep(__FILE__, __LINE__, 0); } while (0)
 # define sched_annotate_sleep()	(current->task_state_change = 0)
+# define non_block_start() \
+	do { current->non_block_count++; } while (0)
+# define non_block_end() \
+	do { WARN_ON(current->non_block_count-- == 0); } while (0)
 #else
   static inline void ___might_sleep(const char *file, int line,
 				   int preempt_offset) { }
@@ -241,6 +247,8 @@ extern void __cant_sleep(const char *file, int line, int preempt_offset);
 # define might_sleep() do { might_resched(); } while (0)
 # define cant_sleep() do { } while (0)
 # define sched_annotate_sleep() do { } while (0)
+# define non_block_start() do { } while (0)
+# define non_block_end() do { } while (0)
 #endif
 
 #define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9f51932bd543..c5630f3dca1f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -974,6 +974,10 @@ struct task_struct {
 	struct mutex_waiter		*blocked_on;
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	int				non_block_count;
+#endif
+
 #ifdef CONFIG_TRACE_IRQFLAGS
 	unsigned int			irq_events;
 	unsigned long			hardirq_enable_ip;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..57245770d6cc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3700,13 +3700,22 @@ static noinline void __schedule_bug(struct task_struct *prev)
 /*
  * Various schedule()-time debugging checks and statistics:
  */
-static inline void schedule_debug(struct task_struct *prev)
+static inline void schedule_debug(struct task_struct *prev, bool preempt)
 {
 #ifdef CONFIG_SCHED_STACK_END_CHECK
 	if (task_stack_end_corrupted(prev))
 		panic("corrupted stack end detected inside scheduler\n");
 #endif
 
+#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
+	if (!preempt && prev->state && prev->non_block_count) {
+		printk(KERN_ERR "BUG: scheduling in a non-blocking section: %s/%d/%i\n",
+			prev->comm, prev->pid, prev->non_block_count);
+		dump_stack();
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
+	}
+#endif
+
 	if (unlikely(in_atomic_preempt_off())) {
 		__schedule_bug(prev);
 		preempt_count_set(PREEMPT_DISABLED);
@@ -3813,7 +3822,7 @@ static void __sched notrace __schedule(bool preempt)
 	rq = cpu_rq(cpu);
 	prev = rq->curr;
 
-	schedule_debug(prev);
+	schedule_debug(prev, preempt);
 
 	if (sched_feat(HRTICK))
 		hrtick_clear(rq);
@@ -6570,7 +6579,7 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
 	rcu_sleep_check();
 
 	if ((preempt_count_equals(preempt_offset) && !irqs_disabled() &&
-	     !is_idle_task(current)) ||
+	     !is_idle_task(current) && !current->non_block_count) ||
 	    system_state == SYSTEM_BOOTING || system_state > SYSTEM_RUNNING ||
 	    oops_in_progress)
 		return;
@@ -6586,8 +6595,8 @@ void ___might_sleep(const char *file, int line, int preempt_offset)
 		"BUG: sleeping function called from invalid context at %s:%d\n",
 			file, line);
 	printk(KERN_ERR
-		"in_atomic(): %d, irqs_disabled(): %d, pid: %d, name: %s\n",
-			in_atomic(), irqs_disabled(),
+		"in_atomic(): %d, irqs_disabled(): %d, non_block: %d, pid: %d, name: %s\n",
+			in_atomic(), irqs_disabled(), current->non_block_count,
 			current->pid, current->comm);
 
 	if (task_stack_end_corrupted(current))
-- 
2.22.0



^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable
  2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
  2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
  2019-08-14 20:20 ` [PATCH 2/5] kernel.h: Add non_block_start/end() Daniel Vetter
@ 2019-08-14 20:20 ` Daniel Vetter
  2019-08-15  0:00   ` Jason Gunthorpe
  2019-08-14 20:20 ` [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start Daniel Vetter
  2019-08-14 20:20 ` [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors Daniel Vetter
  4 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-14 20:20 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, DRI Development, Intel Graphics Development,
	Daniel Vetter, Jason Gunthorpe, Andrew Morton, Michal Hocko,
	David Rientjes, Christian König, Jérôme Glisse,
	Daniel Vetter

We need to make sure implementations don't cheat and don't have a
possible schedule/blocking point deeply burried where review can't
catch it.

I'm not sure whether this is the best way to make sure all the
might_sleep() callsites trigger, and it's a bit ugly in the code flow.
But it gets the job done.

Inspired by an i915 patch series which did exactly that, because the
rules haven't been entirely clear to us.

v2: Use the shiny new non_block_start/end annotations instead of
abusing preempt_disable/enable.

v3: Rebase on top of Glisse's arg rework.

v4: Rebase on top of more Glisse rework.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/mmu_notifier.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 16f1cbc775d0..43a76d030164 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,7 +174,13 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	id = srcu_read_lock(&srcu);
 	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
-			int _ret = mn->ops->invalidate_range_start(mn, range);
+			int _ret;
+
+			if (!mmu_notifier_range_blockable(range))
+				non_block_start();
+			_ret = mn->ops->invalidate_range_start(mn, range);
+			if (!mmu_notifier_range_blockable(range))
+				non_block_end();
 			if (_ret) {
 				pr_info("%pS callback failed with %d in %sblockable context.\n",
 					mn->ops->invalidate_range_start, _ret,
-- 
2.22.0



^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start
  2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
                   ` (2 preceding siblings ...)
  2019-08-14 20:20 ` [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable Daniel Vetter
@ 2019-08-14 20:20 ` Daniel Vetter
  2019-08-15  0:09   ` Jason Gunthorpe
  2019-08-14 20:20 ` [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors Daniel Vetter
  4 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-14 20:20 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, DRI Development, Intel Graphics Development,
	Daniel Vetter, Jason Gunthorpe, Chris Wilson, Andrew Morton,
	David Rientjes, Jérôme Glisse, Michal Hocko,
	Christian König, Greg Kroah-Hartman, Mike Rapoport,
	Daniel Vetter

This is a similar idea to the fs_reclaim fake lockdep lock. It's
fairly easy to provoke a specific notifier to be run on a specific
range: Just prep it, and then munmap() it.

A bit harder, but still doable, is to provoke the mmu notifiers for
all the various callchains that might lead to them. But both at the
same time is really hard to reliable hit, especially when you want to
exercise paths like direct reclaim or compaction, where it's not
easy to control what exactly will be unmapped.

By introducing a lockdep map to tie them all together we allow lockdep
to see a lot more dependencies, without having to actually hit them
in a single challchain while testing.

Aside: Since I typed this to test i915 mmu notifiers I've only rolled
this out for the invaliate_range_start callback. If there's
interest, we should probably roll this out to all of them. But my
undestanding of core mm is seriously lacking, and I'm not clear on
whether we need a lockdep map for each callback, or whether some can
be shared.

v2: Use lock_map_acquire/release() like fs_reclaim, to avoid confusion
with this being a real mutex (Chris Wilson).

v3: Rebase on top of Glisse's arg rework.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: linux-mm@kvack.org
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 include/linux/mmu_notifier.h | 6 ++++++
 mm/mmu_notifier.c            | 7 +++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b6c004bd9f6a..9dd38c32fc53 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -42,6 +42,10 @@ enum mmu_notifier_event {
 
 #ifdef CONFIG_MMU_NOTIFIER
 
+#ifdef CONFIG_LOCKDEP
+extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
+#endif
+
 /*
  * The mmu notifier_mm structure is allocated and installed in
  * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
@@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
 static inline void
 mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 {
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	if (mm_has_notifiers(range->mm)) {
 		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
 		__mmu_notifier_invalidate_range_start(range);
 	}
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
 static inline int
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 43a76d030164..331e43ce6f3c 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -21,6 +21,13 @@
 /* global SRCU for all MMs */
 DEFINE_STATIC_SRCU(srcu);
 
+#ifdef CONFIG_LOCKDEP
+struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
+	.name = "mmu_notifier_invalidate_range_start"
+};
+EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start_map);
+#endif
+
 /*
  * This function allows mmu_notifier::release callback to delay a call to
  * a function that will free appropriate resources. The function must be
-- 
2.22.0



^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors
  2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
                   ` (3 preceding siblings ...)
  2019-08-14 20:20 ` [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start Daniel Vetter
@ 2019-08-14 20:20 ` Daniel Vetter
  2019-08-15  0:11   ` Jason Gunthorpe
  4 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-14 20:20 UTC (permalink / raw)
  To: LKML
  Cc: linux-mm, DRI Development, Intel Graphics Development,
	Daniel Vetter, Jason Gunthorpe, Ralph Campbell, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox, Arnd Bergmann,
	Balbir Singh, Ira Weiny, Souptick Joarder, Andrew Morton,
	Jérôme Glisse, Daniel Vetter

Similar to the warning in the mmu notifer, warning if an hmm mirror
callback gets it's blocking vs. nonblocking handling wrong, or if it
fails with anything else than -EAGAIN.

Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: linux-mm@kvack.org
Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
---
 mm/hmm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/hmm.c b/mm/hmm.c
index 16b6731a34db..52ac59384268 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			ret = -EAGAIN;
 			break;
 		}
+		WARN(ret, "%pS callback failed with %d in %sblockable context\n",
+		     mirror->ops->sync_cpu_device_pagetables, ret,
+		     update.blockable ? "" : "non-");
 	}
 	up_read(&hmm->mirrors_sem);
 
-- 
2.22.0



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-14 20:20 ` [PATCH 2/5] kernel.h: Add non_block_start/end() Daniel Vetter
@ 2019-08-14 20:45   ` Andrew Morton
  2019-08-15  6:52     ` Daniel Vetter
  2019-08-15  8:44     ` Michal Hocko
  2019-08-14 23:58   ` Jason Gunthorpe
  1 sibling, 2 replies; 69+ messages in thread
From: Andrew Morton @ 2019-08-14 20:45 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Michal Hocko,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."
> 
> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

I continue to struggle with this.  It introduces a new kernel state
"running preemptibly but must not call schedule()".  How does this make
any sense?

Perhaps a much, much more detailed description of the oom_reaper
situation would help out.



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail
  2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
@ 2019-08-14 22:14   ` Andrew Morton
  2019-08-14 23:22     ` Jason Gunthorpe
  2019-08-14 23:34     ` Ralph Campbell
  2019-08-16 17:19   ` Jason Gunthorpe
  1 sibling, 2 replies; 69+ messages in thread
From: Andrew Morton @ 2019-08-14 22:14 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Michal Hocko, Christian König, David Rientjes,
	Jérôme Glisse, Paolo Bonzini, Jason Gunthorpe,
	Daniel Vetter

On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:

> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
> 
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
> 
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
> 
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
> 
> ...
>
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>  				pr_info("%pS callback failed with %d in %sblockable context.\n",
>  					mn->ops->invalidate_range_start, _ret,
>  					!mmu_notifier_range_blockable(range) ? "non-" : "");
> +				WARN_ON(mmu_notifier_range_blockable(range) ||
> +					ret != -EAGAIN);
>  				ret = _ret;
>  			}
>  		}

A problem with WARN_ON(a || b) is that if it triggers, we don't know
whether it was because of a or because of b.  Or both.  So I'd suggest

	WARN_ON(a);
	WARN_ON(b);



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail
  2019-08-14 22:14   ` Andrew Morton
@ 2019-08-14 23:22     ` Jason Gunthorpe
  2019-08-14 23:34     ` Ralph Campbell
  1 sibling, 0 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-14 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Michal Hocko, Christian König,
	David Rientjes, Jérôme Glisse, Paolo Bonzini,
	Daniel Vetter

On Wed, Aug 14, 2019 at 03:14:47PM -0700, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > Just a bit of paranoia, since if we start pushing this deep into
> > callchains it's hard to spot all places where an mmu notifier
> > implementation might fail when it's not allowed to.
> > 
> > Inspired by some confusion we had discussing i915 mmu notifiers and
> > whether we could use the newly-introduced return value to handle some
> > corner cases. Until we realized that these are only for when a task
> > has been killed by the oom reaper.
> > 
> > An alternative approach would be to split the callback into two
> > versions, one with the int return value, and the other with void
> > return value like in older kernels. But that's a lot more churn for
> > fairly little gain I think.
> > 
> > Summary from the m-l discussion on why we want something at warning
> > level: This allows automated tooling in CI to catch bugs without
> > humans having to look at everything. If we just upgrade the existing
> > pr_info to a pr_warn, then we'll have false positives. And as-is, no
> > one will ever spot the problem since it's lost in the massive amounts
> > of overall dmesg noise.
> > 
> > ...
> >
> > +++ b/mm/mmu_notifier.c
> > @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> >  				pr_info("%pS callback failed with %d in %sblockable context.\n",
> >  					mn->ops->invalidate_range_start, _ret,
> >  					!mmu_notifier_range_blockable(range) ? "non-" : "");
> > +				WARN_ON(mmu_notifier_range_blockable(range) ||
> > +					ret != -EAGAIN);
> >  				ret = _ret;
> >  			}
> >  		}
> 
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b.  Or both.  So I'd suggest
> 
> 	WARN_ON(a);
> 	WARN_ON(b);
> 

Well, we did just make a pr_info right above with the value of
blockable, that seems enough to tell the cases apart?

But you are generally right, the full logic:

    if (_ret) {
      if (WARN_ON(mmu_notifier_range_blockable(range)))
            continue;
      WARN_ON(_ret != -EAGAIN);
      ret = -EAGAIN;
      break;
    }

would force correct API contract up the call chain once we detect a
broken driver..

But at some point it does feel like a bit much debugging logic to have
in a production code path, as this should never happen and is just to
discourage wrong driver behaviors during driver development.

If we like this version then:

Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>

Also - I have a bunch of other patches to mmu notifiers for hmm.git,
so when everyone agrees I can grab this to avoid conflicts.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail
  2019-08-14 22:14   ` Andrew Morton
  2019-08-14 23:22     ` Jason Gunthorpe
@ 2019-08-14 23:34     ` Ralph Campbell
  1 sibling, 0 replies; 69+ messages in thread
From: Ralph Campbell @ 2019-08-14 23:34 UTC (permalink / raw)
  To: Andrew Morton, Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Michal Hocko, Christian König, David Rientjes,
	Jérôme Glisse, Paolo Bonzini, Jason Gunthorpe,
	Daniel Vetter


On 8/14/19 3:14 PM, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:23 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
>> Just a bit of paranoia, since if we start pushing this deep into
>> callchains it's hard to spot all places where an mmu notifier
>> implementation might fail when it's not allowed to.
>>
>> Inspired by some confusion we had discussing i915 mmu notifiers and
>> whether we could use the newly-introduced return value to handle some
>> corner cases. Until we realized that these are only for when a task
>> has been killed by the oom reaper.
>>
>> An alternative approach would be to split the callback into two
>> versions, one with the int return value, and the other with void
>> return value like in older kernels. But that's a lot more churn for
>> fairly little gain I think.
>>
>> Summary from the m-l discussion on why we want something at warning
>> level: This allows automated tooling in CI to catch bugs without
>> humans having to look at everything. If we just upgrade the existing
>> pr_info to a pr_warn, then we'll have false positives. And as-is, no
>> one will ever spot the problem since it's lost in the massive amounts
>> of overall dmesg noise.
>>
>> ...
>>
>> --- a/mm/mmu_notifier.c
>> +++ b/mm/mmu_notifier.c
>> @@ -179,6 +179,8 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>>   				pr_info("%pS callback failed with %d in %sblockable context.\n",
>>   					mn->ops->invalidate_range_start, _ret,
>>   					!mmu_notifier_range_blockable(range) ? "non-" : "");
>> +				WARN_ON(mmu_notifier_range_blockable(range) ||
>> +					ret != -EAGAIN);
>>   				ret = _ret;
>>   			}
>>   		}
> 
> A problem with WARN_ON(a || b) is that if it triggers, we don't know
> whether it was because of a or because of b.  Or both.  So I'd suggest
> 
> 	WARN_ON(a);
> 	WARN_ON(b);
> 

This won't quite work. It is OK to have 
mmu_notifier_range_blockable(range) be true or false.
sync_cpu_device_pagetables() shouldn't return
-EAGAIN unless blockable is true.


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-14 20:20 ` [PATCH 2/5] kernel.h: Add non_block_start/end() Daniel Vetter
  2019-08-14 20:45   ` Andrew Morton
@ 2019-08-14 23:58   ` Jason Gunthorpe
  2019-08-15  6:58     ` Daniel Vetter
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-14 23:58 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> In some special cases we must not block, but there's not a
> spinlock, preempt-off, irqs-off or similar critical section already
> that arms the might_sleep() debug checks. Add a non_block_start/end()
> pair to annotate these.
> 
> This will be used in the oom paths of mmu-notifiers, where blocking is
> not allowed to make sure there's forward progress. Quoting Michal:
> 
> "The notifier is called from quite a restricted context - oom_reaper -
> which shouldn't depend on any locks or sleepable conditionals. The code
> should be swift as well but we mostly do care about it to make a forward
> progress. Checking for sleepable context is the best thing we could come
> up with that would describe these demands at least partially."

But this describes fs_reclaim_acquire() - is there some reason we are
conflating fs_reclaim with non-sleeping?

ie is there some fundamental difference between the block stack
sleeping during reclaim while it waits for a driver to write out a
page and a GPU driver sleeping during OOM while it waits for it's HW
to fence DMA on a page?

Fundamentally we have invalidate_range_start() vs invalidate_range()
as the start() version is able to sleep. If drivers can do their work
without sleeping then they should be using invalidare_range() instead.

Thus, it doesn't seem to make any sense to ask a driver that requires a
sleeping API not to sleep.

AFAICT what is really going on here is that drivers care about only a
subset of the VA space, and we want to query the driver if it cares
about the range proposed to be OOM'd, so we can OOM ranges that are
do not have SPTEs.

ie if you look pretty much all drivers do exactly as
userptr_mn_invalidate_range_start() does, and bail once they detect
the VA range is of interest.

So, I'm working on a patch to lift the interval tree into the notifier
core and then do the VA test OOM needs without bothering the
driver. Drivers can retain the blocking API they require and OOM can
work on VA's that don't have SPTEs.

This approach also solves the critical bug in this path:
  https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/

And solves a bunch of other bugs in the drivers.

> Peter also asked whether we want to catch spinlocks on top, but Michal
> said those are less of a problem because spinlocks can't have an
> indirect dependency upon the page allocator and hence close the loop
> with the oom reaper.

Again, this entirely sounds like fs_reclaim - isn't that exactly what
it is for?

I have had on my list a second and very related possible bug. I ran
into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
advance") which says that mapping->i_mmap_mutex is under fs_reclaim().

We do hold i_mmap_rwsem while calling invalidate_range_start():

 unmap_mapping_pages
  i_mmap_lock_write(mapping); // ie i_mmap_rwsem
  unmap_mapping_range_tree
   unmap_mapping_range_vma
    zap_page_range_single
     mmu_notifier_invalidate_range_start

So, if it is still true that i_mmap_rwsem is under fs_reclaim then
invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
not know)

Thus we should use lockdep to force this and fix all the drivers.

.. and if we force fs_reclaim always, do we care about blockable
anymore??

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable
  2019-08-14 20:20 ` [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable Daniel Vetter
@ 2019-08-15  0:00   ` Jason Gunthorpe
  2019-08-15  7:02     ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15  0:00 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Andrew Morton, Michal Hocko, David Rientjes,
	Christian König, Jérôme Glisse, Daniel Vetter

On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote:
> We need to make sure implementations don't cheat and don't have a
> possible schedule/blocking point deeply burried where review can't
> catch it.
> 
> I'm not sure whether this is the best way to make sure all the
> might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> But it gets the job done.
> 
> Inspired by an i915 patch series which did exactly that, because the
> rules haven't been entirely clear to us.

I thought lockdep already was able to detect:

 spin_lock()
 might_sleep();
 spin_unlock()

Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep
spinlock?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start
  2019-08-14 20:20 ` [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start Daniel Vetter
@ 2019-08-15  0:09   ` Jason Gunthorpe
  2019-08-15  7:10     ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15  0:09 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Chris Wilson, Andrew Morton, David Rientjes,
	Jérôme Glisse, Michal Hocko, Christian König,
	Greg Kroah-Hartman, Mike Rapoport, Daniel Vetter

On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote:
> This is a similar idea to the fs_reclaim fake lockdep lock. It's
> fairly easy to provoke a specific notifier to be run on a specific
> range: Just prep it, and then munmap() it.
> 
> A bit harder, but still doable, is to provoke the mmu notifiers for
> all the various callchains that might lead to them. But both at the
> same time is really hard to reliable hit, especially when you want to
> exercise paths like direct reclaim or compaction, where it's not
> easy to control what exactly will be unmapped.
> 
> By introducing a lockdep map to tie them all together we allow lockdep
> to see a lot more dependencies, without having to actually hit them
> in a single challchain while testing.
> 
> Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> this out for the invaliate_range_start callback. If there's
> interest, we should probably roll this out to all of them. But my
> undestanding of core mm is seriously lacking, and I'm not clear on
> whether we need a lockdep map for each callback, or whether some can
> be shared.

I was thinking about doing something like this..

IMHO only range_end needs annotation, the other ops are either already
non-sleeping or only used by KVM.

BTW, I have found it strange that i915 only uses
invalidate_range_start. Not really sure how it is able to do
that. Would love to know the answer :)

> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>  include/linux/mmu_notifier.h | 6 ++++++
>  mm/mmu_notifier.c            | 7 +++++++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index b6c004bd9f6a..9dd38c32fc53 100644
> +++ b/include/linux/mmu_notifier.h
> @@ -42,6 +42,10 @@ enum mmu_notifier_event {
>  
>  #ifdef CONFIG_MMU_NOTIFIER
>  
> +#ifdef CONFIG_LOCKDEP
> +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> +#endif

I wonder what the trade off is having a global map vs a map in each
mmu_notifier_mm ?

>  /*
>   * The mmu notifier_mm structure is allocated and installed in
>   * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> @@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
>  static inline void
>  mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>  {
> +	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>  	if (mm_has_notifiers(range->mm)) {
>  		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
>  		__mmu_notifier_invalidate_range_start(range);
>  	}
> +	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
>  }

Also range_end should have this too - it has all the same
constraints. I think it can share the map. So 'range_start_map' is
probably not the right name.

It may also make some sense to do a dummy acquire/release under the
mm_take_all_locks() to forcibly increase map coverage and reduce the
scenario complexity required to hit bugs.

And if we do decide on the reclaim thing in my other email then the
reclaim dependency can be reliably injected by doing:

 fs_reclaim_acquire();
 lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 fs_reclaim_release();

If I understand lockdep properly..

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors
  2019-08-14 20:20 ` [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors Daniel Vetter
@ 2019-08-15  0:11   ` Jason Gunthorpe
  2019-08-15  7:14     ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15  0:11 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Ralph Campbell, John Hubbard, Dan Williams, Dan Carpenter,
	Matthew Wilcox, Arnd Bergmann, Balbir Singh, Ira Weiny,
	Souptick Joarder, Andrew Morton, Jérôme Glisse,
	Daniel Vetter

On Wed, Aug 14, 2019 at 10:20:27PM +0200, Daniel Vetter wrote:
> Similar to the warning in the mmu notifer, warning if an hmm mirror
> callback gets it's blocking vs. nonblocking handling wrong, or if it
> fails with anything else than -EAGAIN.
> 
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dan Carpenter <dan.carpenter@oracle.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Balbir Singh <bsingharora@gmail.com>
> Cc: Ira Weiny <ira.weiny@intel.com>
> Cc: Souptick Joarder <jrdr.linux@gmail.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
>  mm/hmm.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 16b6731a34db..52ac59384268 100644
> +++ b/mm/hmm.c
> @@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  			ret = -EAGAIN;
>  			break;
>  		}
> +		WARN(ret, "%pS callback failed with %d in %sblockable context\n",
> +		     mirror->ops->sync_cpu_device_pagetables, ret,
> +		     update.blockable ? "" : "non-");
>  	}
>  	up_read(&hmm->mirrors_sem);

Didn't I beat you to this?

	list_for_each_entry(mirror, &hmm->mirrors, list) {
		int rc;

		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
		if (rc) {
			if (WARN_ON(update.blockable || rc != -EAGAIN))
				continue;
			ret = -EAGAIN;
			break;
		}
	}

Thanks,
Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-14 20:45   ` Andrew Morton
@ 2019-08-15  6:52     ` Daniel Vetter
  2019-08-15  8:44     ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15  6:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Jason Gunthorpe, Peter Zijlstra,
	Ingo Molnar, Michal Hocko, David Rientjes, Christian König,
	Jérôme Glisse, Masahiro Yamada, Wei Wang,
	Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Wed, Aug 14, 2019 at 01:45:58PM -0700, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.

I agree on both points, but I guess I'm not the expert to explain why we
have this. All I'm trying to do is that drivers hold up their side. If you
want to have better documentation for why the oom case needs this special
new mode, you're looking at the wrong guy for that :-)

Of course if you folks all decide that you just don't want to be
remembered about that I guess we can drop this one here, but you're just
shooting the messenger I think.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-14 23:58   ` Jason Gunthorpe
@ 2019-08-15  6:58     ` Daniel Vetter
  2019-08-15 12:23       ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15  6:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, Michal Hocko, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> 
> But this describes fs_reclaim_acquire() - is there some reason we are
> conflating fs_reclaim with non-sleeping?

No idea why you tie this into fs_reclaim. We can definitly sleep in there,
and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
event supposed to I thought. To make sure we can get at the last bit of
memory by flushing all the queues and waiting for everything to be cleaned
out.

> ie is there some fundamental difference between the block stack
> sleeping during reclaim while it waits for a driver to write out a
> page and a GPU driver sleeping during OOM while it waits for it's HW
> to fence DMA on a page?
> 
> Fundamentally we have invalidate_range_start() vs invalidate_range()
> as the start() version is able to sleep. If drivers can do their work
> without sleeping then they should be using invalidare_range() instead.
> 
> Thus, it doesn't seem to make any sense to ask a driver that requires a
> sleeping API not to sleep.
> 
> AFAICT what is really going on here is that drivers care about only a
> subset of the VA space, and we want to query the driver if it cares
> about the range proposed to be OOM'd, so we can OOM ranges that are
> do not have SPTEs.
> 
> ie if you look pretty much all drivers do exactly as
> userptr_mn_invalidate_range_start() does, and bail once they detect
> the VA range is of interest.
> 
> So, I'm working on a patch to lift the interval tree into the notifier
> core and then do the VA test OOM needs without bothering the
> driver. Drivers can retain the blocking API they require and OOM can
> work on VA's that don't have SPTEs.

Hm I figured the point of hmm_mirror is to have that interval tree for
everyone (among other things). But yeah lifting to mmu_notifier sounds
like a clean solution for this, but I really have not much clue about why
we even have this for special mode in the oom case. I'm just trying to
increase the odds that drivers hold up their end of the bargain.

> This approach also solves the critical bug in this path:
>   https://lore.kernel.org/linux-mm/20190807191627.GA3008@ziepe.ca/
> 
> And solves a bunch of other bugs in the drivers.
> 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> Again, this entirely sounds like fs_reclaim - isn't that exactly what
> it is for?
> 
> I have had on my list a second and very related possible bug. I ran
> into commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in
> advance") which says that mapping->i_mmap_mutex is under fs_reclaim().
> 
> We do hold i_mmap_rwsem while calling invalidate_range_start():
> 
>  unmap_mapping_pages
>   i_mmap_lock_write(mapping); // ie i_mmap_rwsem
>   unmap_mapping_range_tree
>    unmap_mapping_range_vma
>     zap_page_range_single
>      mmu_notifier_invalidate_range_start
> 
> So, if it is still true that i_mmap_rwsem is under fs_reclaim then
> invalidate_range_start is *always* under fs_reclaim anyhow! (this I do
> not know)
> 
> Thus we should use lockdep to force this and fix all the drivers.
> 
> .. and if we force fs_reclaim always, do we care about blockable
> anymore??

Still not sure what fs_reclaim matters here. One of the later patches
steals the fs_reclaim idea for mmu_notifiers, but that ties together
completely different paths.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable
  2019-08-15  0:00   ` Jason Gunthorpe
@ 2019-08-15  7:02     ` Daniel Vetter
       [not found]       ` <20190815123556.GB21596@ziepe.ca>
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15  7:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Andrew Morton, Michal Hocko,
	David Rientjes, Christian König, Jérôme Glisse,
	Daniel Vetter

On Wed, Aug 14, 2019 at 09:00:29PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote:
> > We need to make sure implementations don't cheat and don't have a
> > possible schedule/blocking point deeply burried where review can't
> > catch it.
> > 
> > I'm not sure whether this is the best way to make sure all the
> > might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> > But it gets the job done.
> > 
> > Inspired by an i915 patch series which did exactly that, because the
> > rules haven't been entirely clear to us.
> 
> I thought lockdep already was able to detect:
> 
>  spin_lock()
>  might_sleep();
>  spin_unlock()
> 
> Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep
> spinlock?

Hm ... assuming I didn't get lost in the maze I think might_sleep (well
___might_sleep) doesn't do any lockdep checking at all. And we want
might_sleep, since that catches a lot more than lockdep.

Maybe you mixed it up with the hard/softirq context stuff that lockdep
tracks and complains about if you get it wrong?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start
  2019-08-15  0:09   ` Jason Gunthorpe
@ 2019-08-15  7:10     ` Daniel Vetter
  2019-08-15 12:53       ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15  7:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Chris Wilson, Andrew Morton,
	David Rientjes, Jérôme Glisse, Michal Hocko,
	Christian König, Greg Kroah-Hartman, Mike Rapoport,
	Daniel Vetter

On Wed, Aug 14, 2019 at 09:09:59PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote:
> > This is a similar idea to the fs_reclaim fake lockdep lock. It's
> > fairly easy to provoke a specific notifier to be run on a specific
> > range: Just prep it, and then munmap() it.
> > 
> > A bit harder, but still doable, is to provoke the mmu notifiers for
> > all the various callchains that might lead to them. But both at the
> > same time is really hard to reliable hit, especially when you want to
> > exercise paths like direct reclaim or compaction, where it's not
> > easy to control what exactly will be unmapped.
> > 
> > By introducing a lockdep map to tie them all together we allow lockdep
> > to see a lot more dependencies, without having to actually hit them
> > in a single challchain while testing.
> > 
> > Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> > this out for the invaliate_range_start callback. If there's
> > interest, we should probably roll this out to all of them. But my
> > undestanding of core mm is seriously lacking, and I'm not clear on
> > whether we need a lockdep map for each callback, or whether some can
> > be shared.
> 
> I was thinking about doing something like this..
> 
> IMHO only range_end needs annotation, the other ops are either already
> non-sleeping or only used by KVM.

This isnt' about sleeping, this is about locking loops. And the biggest
risk for that is from driver code, and at least hmm_mirror only has the
driver code callback on invalidate_range_start. Once thing I discovered
using this (and it would be really hard to spot, it's deeply neested) is
that i915 userptr.

Even if i915 userptr would use hmm_mirror (to fix the issue you mention
below), if we then switch the annotation to invalidate_range_end nothing
interesting would ever come from this. Well, the only thing it'd catch is
issues in hmm_mirror, but I think core mm review will catch that before it
reaches us :-)

> BTW, I have found it strange that i915 only uses
> invalidate_range_start. Not really sure how it is able to do
> that. Would love to know the answer :)

I suspect it's broken :-/ Our userptr is ... not the best. Part of the
motivation here.

> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >  include/linux/mmu_notifier.h | 6 ++++++
> >  mm/mmu_notifier.c            | 7 +++++++
> >  2 files changed, 13 insertions(+)
> > 
> > diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> > index b6c004bd9f6a..9dd38c32fc53 100644
> > +++ b/include/linux/mmu_notifier.h
> > @@ -42,6 +42,10 @@ enum mmu_notifier_event {
> >  
> >  #ifdef CONFIG_MMU_NOTIFIER
> >  
> > +#ifdef CONFIG_LOCKDEP
> > +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> > +#endif
> 
> I wonder what the trade off is having a global map vs a map in each
> mmu_notifier_mm ?

Less reports, specifically no reports involving multiple different mmu
notifiers to build the entire chain. But I'm assuming it's possible to
combine them in one mm (kvm+gpu+infiniband in one process sounds like
something someone could reasonably do), and it will help to make sure
everyone follows the same rules.
> 
> >  /*
> >   * The mmu notifier_mm structure is allocated and installed in
> >   * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> > @@ -310,10 +314,12 @@ static inline void mmu_notifier_change_pte(struct mm_struct *mm,
> >  static inline void
> >  mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> >  {
> > +	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> >  	if (mm_has_notifiers(range->mm)) {
> >  		range->flags |= MMU_NOTIFIER_RANGE_BLOCKABLE;
> >  		__mmu_notifier_invalidate_range_start(range);
> >  	}
> > +	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> >  }
> 
> Also range_end should have this too - it has all the same
> constraints. I think it can share the map. So 'range_start_map' is
> probably not the right name.
> 
> It may also make some sense to do a dummy acquire/release under the
> mm_take_all_locks() to forcibly increase map coverage and reduce the
> scenario complexity required to hit bugs.
> 
> And if we do decide on the reclaim thing in my other email then the
> reclaim dependency can be reliably injected by doing:
> 
>  fs_reclaim_acquire();
>  lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>  lock_map_release(&__mmu_notifier_invalidate_range_start_map);
>  fs_reclaim_release();
> 
> If I understand lockdep properly..

Ime fs_reclaim injects the mmu_notifier map here reliably as soon as
you've thrown out the first pagecache mmap on any process. That "make sure
we inject it quickly" is why the lockdep is _outside_ of the
mm_has_notifiers() check. So no further injection needed imo.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors
  2019-08-15  0:11   ` Jason Gunthorpe
@ 2019-08-15  7:14     ` Daniel Vetter
  0 siblings, 0 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15  7:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Ralph Campbell, John Hubbard,
	Dan Williams, Dan Carpenter, Matthew Wilcox, Arnd Bergmann,
	Balbir Singh, Ira Weiny, Souptick Joarder, Andrew Morton,
	Jérôme Glisse, Daniel Vetter

On Wed, Aug 14, 2019 at 09:11:37PM -0300, Jason Gunthorpe wrote:
> On Wed, Aug 14, 2019 at 10:20:27PM +0200, Daniel Vetter wrote:
> > Similar to the warning in the mmu notifer, warning if an hmm mirror
> > callback gets it's blocking vs. nonblocking handling wrong, or if it
> > fails with anything else than -EAGAIN.
> > 
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: Ralph Campbell <rcampbell@nvidia.com>
> > Cc: John Hubbard <jhubbard@nvidia.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Dan Carpenter <dan.carpenter@oracle.com>
> > Cc: Matthew Wilcox <willy@infradead.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Balbir Singh <bsingharora@gmail.com>
> > Cc: Ira Weiny <ira.weiny@intel.com>
> > Cc: Souptick Joarder <jrdr.linux@gmail.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: "Jérôme Glisse" <jglisse@redhat.com>
> > Cc: linux-mm@kvack.org
> > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> >  mm/hmm.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 16b6731a34db..52ac59384268 100644
> > +++ b/mm/hmm.c
> > @@ -205,6 +205,9 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
> >  			ret = -EAGAIN;
> >  			break;
> >  		}
> > +		WARN(ret, "%pS callback failed with %d in %sblockable context\n",
> > +		     mirror->ops->sync_cpu_device_pagetables, ret,
> > +		     update.blockable ? "" : "non-");
> >  	}
> >  	up_read(&hmm->mirrors_sem);
> 
> Didn't I beat you to this?

Very much possible, I think I didn't rebase this to linux-next before
resending ... have an

Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

in case you need.

Cheers, Daniel

> 
> 	list_for_each_entry(mirror, &hmm->mirrors, list) {
> 		int rc;
> 
> 		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> 		if (rc) {
> 			if (WARN_ON(update.blockable || rc != -EAGAIN))
> 				continue;
> 			ret = -EAGAIN;
> 			break;
> 		}
> 	}
> 
> Thanks,
> Jason

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-14 20:45   ` Andrew Morton
  2019-08-15  6:52     ` Daniel Vetter
@ 2019-08-15  8:44     ` Michal Hocko
  2019-08-15 13:04       ` Jason Gunthorpe
  2019-08-15 22:15       ` Andrew Morton
  1 sibling, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-15  8:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Jason Gunthorpe, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Jérôme Glisse, Masahiro Yamada, Wei Wang,
	Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Wed 14-08-19 13:45:58, Andrew Morton wrote:
> On Wed, 14 Aug 2019 22:20:24 +0200 Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> 
> > In some special cases we must not block, but there's not a
> > spinlock, preempt-off, irqs-off or similar critical section already
> > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > pair to annotate these.
> > 
> > This will be used in the oom paths of mmu-notifiers, where blocking is
> > not allowed to make sure there's forward progress. Quoting Michal:
> > 
> > "The notifier is called from quite a restricted context - oom_reaper -
> > which shouldn't depend on any locks or sleepable conditionals. The code
> > should be swift as well but we mostly do care about it to make a forward
> > progress. Checking for sleepable context is the best thing we could come
> > up with that would describe these demands at least partially."
> > 
> > Peter also asked whether we want to catch spinlocks on top, but Michal
> > said those are less of a problem because spinlocks can't have an
> > indirect dependency upon the page allocator and hence close the loop
> > with the oom reaper.
> 
> I continue to struggle with this.  It introduces a new kernel state
> "running preemptibly but must not call schedule()".  How does this make
> any sense?
> 
> Perhaps a much, much more detailed description of the oom_reaper
> situation would help out.
 
The primary point here is that there is a demand of non blockable mmu
notifiers to be called when the oom reaper tears down the address space.
As the oom reaper is the primary guarantee of the oom handling forward
progress it cannot be blocked on anything that might depend on blockable
memory allocations. These are not really easy to track because they
might be indirect - e.g. notifier blocks on a lock which other context
holds while allocating memory or waiting for a flusher that needs memory
to perform its work. If such a blocking state happens that we can end up
in a silent hang with an unusable machine.
Now we hope for reasonable implementations of mmu notifiers (strong
words I know ;) and this should be relatively simple and effective catch
all tool to detect something suspicious is going on.

Does that make the situation more clear?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15  6:58     ` Daniel Vetter
@ 2019-08-15 12:23       ` Jason Gunthorpe
  2019-08-15 13:21         ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 12:23 UTC (permalink / raw)
  To: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, Michal Hocko,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > In some special cases we must not block, but there's not a
> > > spinlock, preempt-off, irqs-off or similar critical section already
> > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > pair to annotate these.
> > > 
> > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > not allowed to make sure there's forward progress. Quoting Michal:
> > > 
> > > "The notifier is called from quite a restricted context - oom_reaper -
> > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > should be swift as well but we mostly do care about it to make a forward
> > > progress. Checking for sleepable context is the best thing we could come
> > > up with that would describe these demands at least partially."
> > 
> > But this describes fs_reclaim_acquire() - is there some reason we are
> > conflating fs_reclaim with non-sleeping?
> 
> No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> event supposed to I thought. To make sure we can get at the last bit of
> memory by flushing all the queues and waiting for everything to be cleaned
> out.

AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
the page allocator" ie a justification that was given this !blockable
stuff.

For instance:

  fs_reclaim_acquire()
  kmalloc(GFP_KERNEL) <- lock dep assertion

And further, Michal's concern about indirectness through locks is also
handled by lockdep:

       CPU0                                 CPU1
                                        mutex_lock()
                                        kmalloc(GFP_KERNEL)
                                        mutex_unlock()
  fs_reclaim_acquire()
  mutex_lock() <- lock dep assertion

In other words, to prevent recursion into the page allocator you use
fs_reclaim_acquire(), and lockdep verfies it in its usual robust way.

I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
explained that it means you can't call the allocator functions in a
way that would recurse into reclaim (ie instead use instead GFP_ATOMIC, or
tolerate allocation failure, or various other things).

So, the reason I bring it up is half the justifications you posted for
blockable had to do with not recursing into reclaim and deadlocking,
and didn't seem to have much to do with blocking.

I'm asking if *non-blocking* is really the requirement or if this is
just the usual 'do not deadlock on the allocator' thing reclaim paths
alread have?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start
  2019-08-15  7:10     ` Daniel Vetter
@ 2019-08-15 12:53       ` Jason Gunthorpe
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 12:53 UTC (permalink / raw)
  To: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Chris Wilson, Andrew Morton, David Rientjes,
	Jérôme Glisse, Michal Hocko, Christian König,
	Greg Kroah-Hartman, Mike Rapoport, Daniel Vetter

On Thu, Aug 15, 2019 at 09:10:14AM +0200, Daniel Vetter wrote:
> On Wed, Aug 14, 2019 at 09:09:59PM -0300, Jason Gunthorpe wrote:
> > On Wed, Aug 14, 2019 at 10:20:26PM +0200, Daniel Vetter wrote:
> > > This is a similar idea to the fs_reclaim fake lockdep lock. It's
> > > fairly easy to provoke a specific notifier to be run on a specific
> > > range: Just prep it, and then munmap() it.
> > > 
> > > A bit harder, but still doable, is to provoke the mmu notifiers for
> > > all the various callchains that might lead to them. But both at the
> > > same time is really hard to reliable hit, especially when you want to
> > > exercise paths like direct reclaim or compaction, where it's not
> > > easy to control what exactly will be unmapped.
> > > 
> > > By introducing a lockdep map to tie them all together we allow lockdep
> > > to see a lot more dependencies, without having to actually hit them
> > > in a single challchain while testing.
> > > 
> > > Aside: Since I typed this to test i915 mmu notifiers I've only rolled
> > > this out for the invaliate_range_start callback. If there's
> > > interest, we should probably roll this out to all of them. But my
> > > undestanding of core mm is seriously lacking, and I'm not clear on
> > > whether we need a lockdep map for each callback, or whether some can
> > > be shared.
> > 
> > I was thinking about doing something like this..
> > 
> > IMHO only range_end needs annotation, the other ops are either already
> > non-sleeping or only used by KVM.
>
> This isnt' about sleeping, this is about locking loops. And the biggest
> risk for that is from driver code, and at least hmm_mirror only has the
> driver code callback on invalidate_range_start. Once thing I discovered
> using this (and it would be really hard to spot, it's deeply neested) is
> that i915 userptr.

Sorry, that came out wrong, what I ment is that only range_end and
range_start really need annotation.

The other places are only used by KVM and are called in non-sleeping
contexts, so they already can't recurse back onto the popular sleeping
locks like mmap_sem or what not, can't do allocations, etc.  I don't
see alot of return in investing in them.

> > BTW, I have found it strange that i915 only uses
> > invalidate_range_start. Not really sure how it is able to do
> > that. Would love to know the answer :)
> 
> I suspect it's broken :-/ Our userptr is ... not the best. Part of the
> motivation here.

I was wondering if it is what we call in RDMA a 'registration cache'
ie you assume that userspace is well behaved while DMA is ongoing and
just use the notifer to invalidate cached dma mappings.

The hallmark of this pattern is that it holds the page pin the entire
time DMA is active, which is why it isn't a bug, it is just best
described as loosely coherent.

But, in RDMA the best-practice is to do this in userspace with
userfaultfd as it is very expensive to take a syscall on command
submission to have the kernel figure it out.

> > And if we do decide on the reclaim thing in my other email then the
> > reclaim dependency can be reliably injected by doing:
> > 
> >  fs_reclaim_acquire();
> >  lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> >  lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> >  fs_reclaim_release();
> > 
> > If I understand lockdep properly..
> 
> Ime fs_reclaim injects the mmu_notifier map here reliably as soon as
> you've thrown out the first pagecache mmap on any process. That "make sure
> we inject it quickly" is why the lockdep is _outside_ of the
> mm_has_notifiers() check. So no further injection needed imo.

I suspect a lot of our automated testing, like syzkaller in restricted
kvms, probably does not reliably trigger a fs_reclaim, so I would very
much prefer to inject it 100% of the time directly if we are sure this
is a reclaim context because of the i_mmap_rwsem I mentioned before.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15  8:44     ` Michal Hocko
@ 2019-08-15 13:04       ` Jason Gunthorpe
  2019-08-15 13:12         ` Daniel Vetter
  2019-08-15 13:24         ` Michal Hocko
  2019-08-15 22:15       ` Andrew Morton
  1 sibling, 2 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 13:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:

> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work.

But lockdep *does* track all this and fs_reclaim_acquire() was created
to solve exactly this problem.

fs_reclaim is a lock and it flows through all the usual lockdep
schemes like any other lock. Any time the page allocator wants to do
something the would deadlock with reclaim it takes the lock.

Failure is expressed by a deadlock cycle in the lockdep map, and
lockdep can handle arbitary complexity through layers of locks, work
queues, threads, etc.

What is missing?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 13:04       ` Jason Gunthorpe
@ 2019-08-15 13:12         ` Daniel Vetter
  2019-08-15 14:37           ` Jason Gunthorpe
  2019-08-15 13:24         ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15 13:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
>
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
>
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
>
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something the would deadlock with reclaim it takes the lock.
>
> Failure is expressed by a deadlock cycle in the lockdep map, and
> lockdep can handle arbitary complexity through layers of locks, work
> queues, threads, etc.
>
> What is missing?

Lockdep doens't seen everything by far. E.g. a wait_event will be
caught by the annotations here, but not by lockdep. Plus lockdep does
not see through the wait_event, and doesn't realize if e.g. that event
will never signal because the worker is part of the deadlock loop.
cross-release was supposed to fix that, but seems like that will never
land.

And since we're talking about mmu notifiers here and gpus/dma engines.
We have dma_fence_wait, which can wait for any hw/driver in the system
that takes part in shared/zero-copy buffer processing. Which at least
on the graphics side is everything. This pulls in enormous amounts of
deadlock potential that lockdep simply is blind about and will never
see.

Arming might_sleep catches them all.

Cheers, Daniel

PS: Don't ask me about why we need these semantics for oom_reaper,
like I said just trying to follow the rules.
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 12:23       ` Jason Gunthorpe
@ 2019-08-15 13:21         ` Michal Hocko
  2019-08-15 14:12           ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2019-08-15 13:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > In some special cases we must not block, but there's not a
> > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > pair to annotate these.
> > > > 
> > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > 
> > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > should be swift as well but we mostly do care about it to make a forward
> > > > progress. Checking for sleepable context is the best thing we could come
> > > > up with that would describe these demands at least partially."
> > > 
> > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > conflating fs_reclaim with non-sleeping?
> > 
> > No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > event supposed to I thought. To make sure we can get at the last bit of
> > memory by flushing all the queues and waiting for everything to be cleaned
> > out.
> 
> AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> the page allocator" ie a justification that was given this !blockable
> stuff.
> 
> For instance:
> 
>   fs_reclaim_acquire()
>   kmalloc(GFP_KERNEL) <- lock dep assertion
> 
> And further, Michal's concern about indirectness through locks is also
> handled by lockdep:
> 
>        CPU0                                 CPU1
>                                         mutex_lock()
>                                         kmalloc(GFP_KERNEL)
>                                         mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- lock dep assertion
> 
> In other words, to prevent recursion into the page allocator you use
> fs_reclaim_acquire(), and lockdep verfies it in its usual robust way.

fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
any !GFP_NOWAIT allocation context here and any {in}direct dependency on
it. Whether fs_reclaim_acquire can be reused for that I do not know
because I am not familiar with the lockdep machinery enough
 
> I asked Tejun about this once in regards to WQ_MEM_RECLAIM and he
> explained that it means you can't call the allocator functions in a
> way that would recurse into reclaim (ie instead use instead GFP_ATOMIC, or
> tolerate allocation failure, or various other things).
> 
> So, the reason I bring it up is half the justifications you posted for
> blockable had to do with not recursing into reclaim and deadlocking,
> and didn't seem to have much to do with blocking.
> 
> I'm asking if *non-blocking* is really the requirement or if this is
> just the usual 'do not deadlock on the allocator' thing reclaim paths
> alread have?

No, non-blocking is a very coarse approximation of what we really need.
But it should give us even a stronger condition. Essentially any sleep
other than a preemption shouldn't be allowed in that context.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 13:04       ` Jason Gunthorpe
  2019-08-15 13:12         ` Daniel Vetter
@ 2019-08-15 13:24         ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-15 13:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Andrew Morton, Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 10:04:15, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> 
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work.
> 
> But lockdep *does* track all this and fs_reclaim_acquire() was created
> to solve exactly this problem.
> 
> fs_reclaim is a lock and it flows through all the usual lockdep
> schemes like any other lock. Any time the page allocator wants to do
> something the would deadlock with reclaim it takes the lock.

Our emails have crossed. Even if fs_reclaim can be re-purposed for other
than FS/IO reclaim contexts which I am not sure about I think that
lockdep is too heavy weight for the purpose of this debugging aid in the
first place. The non block mode is really something as simple as "hey
mmu notifier you are only allowed to do a lightweight processing, if you
cannot guarantee that then just back off". The annotation will give us a
warning if the notifier gets out of this promise.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 13:21         ` Michal Hocko
@ 2019-08-15 14:12           ` Jason Gunthorpe
  2019-08-15 16:00             ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 14:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > In some special cases we must not block, but there's not a
> > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > pair to annotate these.
> > > > > 
> > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > 
> > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > > should be swift as well but we mostly do care about it to make a forward
> > > > > progress. Checking for sleepable context is the best thing we could come
> > > > > up with that would describe these demands at least partially."
> > > > 
> > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > conflating fs_reclaim with non-sleeping?
> > > 
> > > No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > event supposed to I thought. To make sure we can get at the last bit of
> > > memory by flushing all the queues and waiting for everything to be cleaned
> > > out.
> > 
> > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > the page allocator" ie a justification that was given this !blockable
> > stuff.
> > 
> > For instance:
> > 
> >   fs_reclaim_acquire()
> >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > 
> > And further, Michal's concern about indirectness through locks is also
> > handled by lockdep:
> > 
> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- lock dep assertion
> > 
> > In other words, to prevent recursion into the page allocator you use
> > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way.
> 
> fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> it. 

AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
__GFP_DIRECT_RECLAIM..

This matches the existing test in __need_fs_reclaim() - so if you are
OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
allocations during OOM, then I think fs_reclaim already matches what
you described?

> Whether fs_reclaim_acquire can be reused for that I do not know
> because I am not familiar with the lockdep machinery enough

Well, if fs_reclaim is not already testing the flags you want, then we
could add another lockdep map that does. The basic principle is the
same, if you want to detect and prevent recursion into the allocator
under certain GFP flags then then AFAIK lockdep is the best tool we
have.

> No, non-blocking is a very coarse approximation of what we really need.
> But it should give us even a stronger condition. Essentially any sleep
> other than a preemption shouldn't be allowed in that context.

But it is a nonsense API to give the driver invalidate_range_start,
the blocking alternative to the non-blocking invalidate_range and then
demand it to be non-blocking.

Inspecting the code, no drivers are actually able to progress their
side in non-blocking mode.

The best we got was drivers tested the VA range and returned success
if they had no interest. Which is a big win to be sure, but it looks
like getting any more is not really posssible.

However, we could (probably even should) make the drivers fs_reclaim
safe.

If that is enough to guarantee progress of OOM, then lets consider
something like using current_gfp_context() to force PF_MEMALLOC_NOFS
allocation behavior on the driver callback and lockdep to try and keep
pushing on the the debugging, and dropping !blocking.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 13:12         ` Daniel Vetter
@ 2019-08-15 14:37           ` Jason Gunthorpe
  2019-08-15 14:43             ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 14:37 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> >
> > > As the oom reaper is the primary guarantee of the oom handling forward
> > > progress it cannot be blocked on anything that might depend on blockable
> > > memory allocations. These are not really easy to track because they
> > > might be indirect - e.g. notifier blocks on a lock which other context
> > > holds while allocating memory or waiting for a flusher that needs memory
> > > to perform its work.
> >
> > But lockdep *does* track all this and fs_reclaim_acquire() was created
> > to solve exactly this problem.
> >
> > fs_reclaim is a lock and it flows through all the usual lockdep
> > schemes like any other lock. Any time the page allocator wants to do
> > something the would deadlock with reclaim it takes the lock.
> >
> > Failure is expressed by a deadlock cycle in the lockdep map, and
> > lockdep can handle arbitary complexity through layers of locks, work
> > queues, threads, etc.
> >
> > What is missing?
> 
> Lockdep doens't seen everything by far. E.g. a wait_event will be
> caught by the annotations here, but not by lockdep. 

Sure, but the wait_event might be OK if its progress isn't contingent
on fs_reclaim, ie triggered from interrupt, so why ban it?

> And since we're talking about mmu notifiers here and gpus/dma engines.
> We have dma_fence_wait, which can wait for any hw/driver in the system
> that takes part in shared/zero-copy buffer processing. Which at least
> on the graphics side is everything. This pulls in enormous amounts of
> deadlock potential that lockdep simply is blind about and will never
> see.

It seems very risky to entagle a notifier widely like that.

It looks pretty sure that notifiers are fs_reclaim, so at a minimum
that wait_event can't be contingent on anything that is doing
GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep
safe.

Avoiding an uncertain wait_event under notifiers would seem to be the
only reasonable design here..

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 14:37           ` Jason Gunthorpe
@ 2019-08-15 14:43             ` Daniel Vetter
  2019-08-15 15:10               ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15 14:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 03:12:11PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 15, 2019 at 3:04 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Thu, Aug 15, 2019 at 10:44:29AM +0200, Michal Hocko wrote:
> > >
> > > > As the oom reaper is the primary guarantee of the oom handling forward
> > > > progress it cannot be blocked on anything that might depend on blockable
> > > > memory allocations. These are not really easy to track because they
> > > > might be indirect - e.g. notifier blocks on a lock which other context
> > > > holds while allocating memory or waiting for a flusher that needs memory
> > > > to perform its work.
> > >
> > > But lockdep *does* track all this and fs_reclaim_acquire() was created
> > > to solve exactly this problem.
> > >
> > > fs_reclaim is a lock and it flows through all the usual lockdep
> > > schemes like any other lock. Any time the page allocator wants to do
> > > something the would deadlock with reclaim it takes the lock.
> > >
> > > Failure is expressed by a deadlock cycle in the lockdep map, and
> > > lockdep can handle arbitary complexity through layers of locks, work
> > > queues, threads, etc.
> > >
> > > What is missing?
> >
> > Lockdep doens't seen everything by far. E.g. a wait_event will be
> > caught by the annotations here, but not by lockdep.
>
> Sure, but the wait_event might be OK if its progress isn't contingent
> on fs_reclaim, ie triggered from interrupt, so why ban it?

For normal notifiers sure (but lockdep won't help you proof that at
all). For oom_reaper apparently not, but that's really Michal's thing,
I have not much idea about that.

> > And since we're talking about mmu notifiers here and gpus/dma engines.
> > We have dma_fence_wait, which can wait for any hw/driver in the system
> > that takes part in shared/zero-copy buffer processing. Which at least
> > on the graphics side is everything. This pulls in enormous amounts of
> > deadlock potential that lockdep simply is blind about and will never
> > see.
>
> It seems very risky to entagle a notifier widely like that.

That's why I've looked into all possible ways to annotate them with
debug infrastructure.

> It looks pretty sure that notifiers are fs_reclaim, so at a minimum
> that wait_event can't be contingent on anything that is doing
> GFP_KERNEL or it will deadlock - and blockable doesn't make that sleep
> safe.
>
> Avoiding an uncertain wait_event under notifiers would seem to be the
> only reasonable design here..

You have to wait for the gpu to finnish current processing in
invalidate_range_start. Otherwise there's no point to any of this
really. So the wait_event/dma_fence_wait are unavoidable really.

That's also why I'm throwing in the lockdep annotation on top, and why
it would be really nice if we somehow can get the cross-release work
landed. But it looks like all the people who could make it happen are
busy with smeltdown :-/
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 14:43             ` Daniel Vetter
@ 2019-08-15 15:10               ` Jason Gunthorpe
  2019-08-15 16:25                 ` Daniel Vetter
  2019-08-15 16:32                 ` Jerome Glisse
  0 siblings, 2 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 15:10 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:

> You have to wait for the gpu to finnish current processing in
> invalidate_range_start. Otherwise there's no point to any of this
> really. So the wait_event/dma_fence_wait are unavoidable really.

I don't envy your task :|

But, what you describe sure sounds like a 'registration cache' model,
not the 'shadow pte' model of coherency.

The key difference is that a regirstationcache is allowed to become
incoherent with the VMA's because it holds page pins. It is a
programming bug in userspace to change VA mappings via mmap/munmap/etc
while the device is working on that VA, but it does not harm system
integrity because of the page pin.

The cache ensures that each initiated operation sees a DMA setup that
matches the current VA map when the operation is initiated and allows
expensive device DMA setups to be re-used.

A 'shadow pte' model (ie hmm) *really* needs device support to
directly block DMA access - ie trigger 'device page fault'. ie the
invalidate_start should inform the device to enter a fault mode and
that is it.  If the device can't do that, then the driver probably
shouldn't persue this level of coherency. The driver would quickly get
into the messy locking problems like dma_fence_wait from a notifier.

It is important to identify what model you are going for as defining a
'registration cache' coherence expectation allows the driver to skip
blocking in invalidate_range_start. All it does is invalidate the
cache so that future operations pick up the new VA mapping.

Intel's HFI RDMA driver uses this model extensively, and I think it is
well proven, within some limitations of course.

At least, 'registration cache' is the only use model I know of where
it is acceptable to skip invalidate_range_end.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 14:12           ` Jason Gunthorpe
@ 2019-08-15 16:00             ` Michal Hocko
  2019-08-15 16:56               ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2019-08-15 16:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 11:12:19, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 03:21:27PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 09:23:44, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 08:58:29AM +0200, Daniel Vetter wrote:
> > > > On Wed, Aug 14, 2019 at 08:58:05PM -0300, Jason Gunthorpe wrote:
> > > > > On Wed, Aug 14, 2019 at 10:20:24PM +0200, Daniel Vetter wrote:
> > > > > > In some special cases we must not block, but there's not a
> > > > > > spinlock, preempt-off, irqs-off or similar critical section already
> > > > > > that arms the might_sleep() debug checks. Add a non_block_start/end()
> > > > > > pair to annotate these.
> > > > > > 
> > > > > > This will be used in the oom paths of mmu-notifiers, where blocking is
> > > > > > not allowed to make sure there's forward progress. Quoting Michal:
> > > > > > 
> > > > > > "The notifier is called from quite a restricted context - oom_reaper -
> > > > > > which shouldn't depend on any locks or sleepable conditionals. The code
> > > > > > should be swift as well but we mostly do care about it to make a forward
> > > > > > progress. Checking for sleepable context is the best thing we could come
> > > > > > up with that would describe these demands at least partially."
> > > > > 
> > > > > But this describes fs_reclaim_acquire() - is there some reason we are
> > > > > conflating fs_reclaim with non-sleeping?
> > > > 
> > > > No idea why you tie this into fs_reclaim. We can definitly sleep in there,
> > > > and for e.g. kswapd (which also wraps everything in fs_reclaim) we're
> > > > event supposed to I thought. To make sure we can get at the last bit of
> > > > memory by flushing all the queues and waiting for everything to be cleaned
> > > > out.
> > > 
> > > AFAIK the point of fs_reclaim is to prevent "indirect dependency upon
> > > the page allocator" ie a justification that was given this !blockable
> > > stuff.
> > > 
> > > For instance:
> > > 
> > >   fs_reclaim_acquire()
> > >   kmalloc(GFP_KERNEL) <- lock dep assertion
> > > 
> > > And further, Michal's concern about indirectness through locks is also
> > > handled by lockdep:
> > > 
> > >        CPU0                                 CPU1
> > >                                         mutex_lock()
> > >                                         kmalloc(GFP_KERNEL)
> > >                                         mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- lock dep assertion
> > > 
> > > In other words, to prevent recursion into the page allocator you use
> > > fs_reclaim_acquire(), and lockdep verfies it in its usual robust way.
> > 
> > fs_reclaim_acquire is about FS/IO recursions IIUC. We are talking about
> > any !GFP_NOWAIT allocation context here and any {in}direct dependency on
> > it. 
> 
> AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> __GFP_DIRECT_RECLAIM..
>
> This matches the existing test in __need_fs_reclaim() - so if you are
> OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> allocations during OOM, then I think fs_reclaim already matches what
> you described?

No GFP_NOFS is equally bad. Please read my other email explaining what
the oom_reaper actually requires. In short no blocking on direct or
indirect dependecy on memory allocation that might sleep. If you can
express that in the existing lockdep machinery. All fine. But then
consider deployments where lockdep is no-no because of the overhead.

> > No, non-blocking is a very coarse approximation of what we really need.
> > But it should give us even a stronger condition. Essentially any sleep
> > other than a preemption shouldn't be allowed in that context.
> 
> But it is a nonsense API to give the driver invalidate_range_start,
> the blocking alternative to the non-blocking invalidate_range and then
> demand it to be non-blocking.
> 
> Inspecting the code, no drivers are actually able to progress their
> side in non-blocking mode.
> 
> The best we got was drivers tested the VA range and returned success
> if they had no interest. Which is a big win to be sure, but it looks
> like getting any more is not really posssible.

And that is already a great win! Because many notifiers only do care
about particular mappings. Please note that backing off unconditioanlly
will simply cause that the oom reaper will have to back off not doing
any tear down anything.

> However, we could (probably even should) make the drivers fs_reclaim
> safe.
> 
> If that is enough to guarantee progress of OOM, then lets consider
> something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> allocation behavior on the driver callback and lockdep to try and keep
> pushing on the the debugging, and dropping !blocking.

How are you going to enforce indirect dependency? E.g. a lock that is
also used in other context which depend on sleepable memory allocation
to move forward.

Really, this was aimed to give a very simple debugging aid. If it is
considered to be so controversial, even though I really do not see why,
then let's just drop it on the floor.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 15:10               ` Jason Gunthorpe
@ 2019-08-15 16:25                 ` Daniel Vetter
  2019-08-15 17:35                   ` Jason Gunthorpe
  2019-08-15 16:32                 ` Jerome Glisse
  1 sibling, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15 16:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 5:10 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
>
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
>
> I don't envy your task :|
>
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
>
> The key difference is that a regirstationcache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
>
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
>
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't persue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.
>
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
>
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
>
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

I'm not really well versed in the details of our userptr, but both
amdgpu and i915 wait for the gpu to complete from
invalidate_range_start. Jerome has at least looked a lot at the amdgpu
one, so maybe he can explain what exactly it is we're doing ...
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 15:10               ` Jason Gunthorpe
  2019-08-15 16:25                 ` Daniel Vetter
@ 2019-08-15 16:32                 ` Jerome Glisse
  2019-08-15 17:16                   ` Jason Gunthorpe
  1 sibling, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-08-15 16:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> 
> > You have to wait for the gpu to finnish current processing in
> > invalidate_range_start. Otherwise there's no point to any of this
> > really. So the wait_event/dma_fence_wait are unavoidable really.
> 
> I don't envy your task :|
> 
> But, what you describe sure sounds like a 'registration cache' model,
> not the 'shadow pte' model of coherency.
> 
> The key difference is that a regirstationcache is allowed to become
> incoherent with the VMA's because it holds page pins. It is a
> programming bug in userspace to change VA mappings via mmap/munmap/etc
> while the device is working on that VA, but it does not harm system
> integrity because of the page pin.
> 
> The cache ensures that each initiated operation sees a DMA setup that
> matches the current VA map when the operation is initiated and allows
> expensive device DMA setups to be re-used.
> 
> A 'shadow pte' model (ie hmm) *really* needs device support to
> directly block DMA access - ie trigger 'device page fault'. ie the
> invalidate_start should inform the device to enter a fault mode and
> that is it.  If the device can't do that, then the driver probably
> shouldn't persue this level of coherency. The driver would quickly get
> into the messy locking problems like dma_fence_wait from a notifier.

I think here we do not agree on the hardware requirement. For GPU
we will always need to be able to wait for some GPU fence from inside
the notifier callback, there is just no way around that for many of
the GPUs today (i do not see any indication of that changing).

Driver should avoid lock complexity by using wait queue so that the
driver notifier callback can wait without having to hold some driver
lock. However there will be at least one lock needed to update the
internal driver state for the range being invalidated. That lock is
just the device driver page table lock for the GPU page table
associated with the mm_struct. In all GPU driver so far it is a short
lived lock and nothing blocking is done while holding it (it is just
about updating page table directory really wether it is filling it or
clearing it).


> 
> It is important to identify what model you are going for as defining a
> 'registration cache' coherence expectation allows the driver to skip
> blocking in invalidate_range_start. All it does is invalidate the
> cache so that future operations pick up the new VA mapping.
> 
> Intel's HFI RDMA driver uses this model extensively, and I think it is
> well proven, within some limitations of course.
> 
> At least, 'registration cache' is the only use model I know of where
> it is acceptable to skip invalidate_range_end.

Here GPU are not in the registration cache model, i know it might looks
like it because of GUP but GUP was use just because hmm did not exist
at the time.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 16:00             ` Michal Hocko
@ 2019-08-15 16:56               ` Jason Gunthorpe
  2019-08-15 17:11                 ` Jerome Glisse
  2019-08-15 17:42                 ` Michal Hocko
  0 siblings, 2 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 16:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:

> > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > __GFP_DIRECT_RECLAIM..
> >
> > This matches the existing test in __need_fs_reclaim() - so if you are
> > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > allocations during OOM, then I think fs_reclaim already matches what
> > you described?
> 
> No GFP_NOFS is equally bad. Please read my other email explaining what
> the oom_reaper actually requires. In short no blocking on direct or
> indirect dependecy on memory allocation that might sleep.

It is much easier to follow with some hints on code, so the true
requirement is that the OOM repear not block on GFP_FS and GFP_IO
allocations, great, that constraint is now clear.

> If you can express that in the existing lockdep machinery. All
> fine. But then consider deployments where lockdep is no-no because
> of the overhead.

This is all for driver debugging. The point of lockdep is to find all
these paths without have to hit them as actual races, using debug
kernels.

I don't think we need this kind of debugging on production kernels?

> > The best we got was drivers tested the VA range and returned success
> > if they had no interest. Which is a big win to be sure, but it looks
> > like getting any more is not really posssible.
> 
> And that is already a great win! Because many notifiers only do care
> about particular mappings. Please note that backing off unconditioanlly
> will simply cause that the oom reaper will have to back off not doing
> any tear down anything.

Well, I'm working to propose that we do the VA range test under core
mmu notifier code that cannot block and then we simply remove the idea
of blockable from drivers using this new 'range notifier'. 

I think this pretty much solves the concern?

> > However, we could (probably even should) make the drivers fs_reclaim
> > safe.
> > 
> > If that is enough to guarantee progress of OOM, then lets consider
> > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > allocation behavior on the driver callback and lockdep to try and keep
> > pushing on the the debugging, and dropping !blocking.
> 
> How are you going to enforce indirect dependency? E.g. a lock that is
> also used in other context which depend on sleepable memory allocation
> to move forward.

You mean like this:

       CPU0                                 CPU1
                                        mutex_lock()
                                        kmalloc(GFP_KERNEL)
                                        mutex_unlock()
  fs_reclaim_acquire()
  mutex_lock() <- illegal: lock dep assertion

?

lockdep handles this - that is what it does, it builds a graph of all
lock dependencies and requires the graph to be acyclic. So mutex_lock
depends on fs_reclaim_lock on CPU1 while on CPU0, fs_reclaim_lock
depends on mutex_lock. This is an ABBA locking cycle and lockdep will
reliably trigger.

So, if we wanted to do this, I'd probably suggest we retool
fs_reclaim's interface be more like

  prevent_gfp_flags(__GFP_IO | __GFP_FS);
  [..]
  restore_gfp_flags(__GFP_IO | __GFP_FS);

Which is lockdep based and follows the indirect lock dependencies you
talked about.

Then OOM and reclaim can specify the different flags they want
blocked.  We could probably use the same API with WQ_MEM_RECLAIM as I
was chatting with Tejun about:

https://www.spinics.net/lists/linux-rdma/msg77362.html

IMHO this stuff is *super complicated* for those of us outside the mm
subsystem, so having some really clear annotations like the above
would go a long way to help understand these special constraints.

I'm pretty sure there are lots of driver bugs related to using the
wrong GFP flags in the kernel.

> Really, this was aimed to give a very simple debugging aid. If it is
> considered to be so controversial, even though I really do not see why,
> then let's just drop it on the floor.

My concern is that the requirement was very unclear. I'm trying to
understand all the bits of how these notifiers work and the exact
semantic of this OOM path have been vauge if it is really some GFP
flag restriction or truely related to not sleeping.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 16:56               ` Jason Gunthorpe
@ 2019-08-15 17:11                 ` Jerome Glisse
  2019-08-15 17:17                   ` Jason Gunthorpe
  2019-08-15 17:42                 ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-08-15 17:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM repear not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.
> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?
> 
> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

I am not sure i follow what you propose here ? Like i pointed out in
another email for GPU we do need to be able to sleep (we might get
lucky and not need too but this is runtime thing) within notifier
range_start callback. This has been something allow by notifier since
it has been introduced in the kernel.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 16:32                 ` Jerome Glisse
@ 2019-08-15 17:16                   ` Jason Gunthorpe
  2019-08-15 17:21                     ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 17:16 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > 
> > > You have to wait for the gpu to finnish current processing in
> > > invalidate_range_start. Otherwise there's no point to any of this
> > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > 
> > I don't envy your task :|
> > 
> > But, what you describe sure sounds like a 'registration cache' model,
> > not the 'shadow pte' model of coherency.
> > 
> > The key difference is that a regirstationcache is allowed to become
> > incoherent with the VMA's because it holds page pins. It is a
> > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > while the device is working on that VA, but it does not harm system
> > integrity because of the page pin.
> > 
> > The cache ensures that each initiated operation sees a DMA setup that
> > matches the current VA map when the operation is initiated and allows
> > expensive device DMA setups to be re-used.
> > 
> > A 'shadow pte' model (ie hmm) *really* needs device support to
> > directly block DMA access - ie trigger 'device page fault'. ie the
> > invalidate_start should inform the device to enter a fault mode and
> > that is it.  If the device can't do that, then the driver probably
> > shouldn't persue this level of coherency. The driver would quickly get
> > into the messy locking problems like dma_fence_wait from a notifier.
> 
> I think here we do not agree on the hardware requirement. For GPU
> we will always need to be able to wait for some GPU fence from inside
> the notifier callback, there is just no way around that for many of
> the GPUs today (i do not see any indication of that changing).

I didn't say you couldn't wait, I was trying to say that the wait
should only be contigent on the HW itself. Ie you can wait on a GPU
page table lock, and you can wait on a GPU page table flush completion
via IRQ.

What is troubling is to wait till some other thread gets a GPU command
completion and decr's a kref on the DMA buffer - which kinda looks
like what this dma_fence() stuff is all about. A driver like that
would have to be super careful to ensure consistent forward progress
toward dma ref == 0 when the system is under reclaim. 

ie by running it's entire IRQ flow under fs_reclaim locking.

> associated with the mm_struct. In all GPU driver so far it is a short
> lived lock and nothing blocking is done while holding it (it is just
> about updating page table directory really wether it is filling it or
> clearing it).

The main blocking I expect in a shadow PTE flow is waiting for the HW
to complete invalidations of its PTE cache.

> > It is important to identify what model you are going for as defining a
> > 'registration cache' coherence expectation allows the driver to skip
> > blocking in invalidate_range_start. All it does is invalidate the
> > cache so that future operations pick up the new VA mapping.
> > 
> > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > well proven, within some limitations of course.
> > 
> > At least, 'registration cache' is the only use model I know of where
> > it is acceptable to skip invalidate_range_end.
> 
> Here GPU are not in the registration cache model, i know it might looks
> like it because of GUP but GUP was use just because hmm did not exist
> at the time.

It is not because of GUP, it is because of the lack of
invalidate_range_end. A driver cannot correctly implement the SPTE
model without invalidate_range_end, even if it holds the page pins via
GUP.

So, I've been assuming the few drivers without invalidate_range_end
are trying to do registration caching, rather than assuming they are
broken.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:11                 ` Jerome Glisse
@ 2019-08-15 17:17                   ` Jason Gunthorpe
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 17:17 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Michal Hocko, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 01:11:56PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 01:56:31PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> > 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> I am not sure i follow what you propose here ? Like i pointed out in
> another email for GPU we do need to be able to sleep (we might get
> lucky and not need too but this is runtime thing) within notifier
> range_start callback. This has been something allow by notifier since
> it has been introduced in the kernel.

Sorry, I mean remove the idea of the blockable flag from the
drivers. Drivers will always be able to block, within the existing
limitation of fs_reclaim

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:16                   ` Jason Gunthorpe
@ 2019-08-15 17:21                     ` Daniel Vetter
  2019-08-15 17:35                       ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15 17:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > >
> > > > You have to wait for the gpu to finnish current processing in
> > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > >
> > > I don't envy your task :|
> > >
> > > But, what you describe sure sounds like a 'registration cache' model,
> > > not the 'shadow pte' model of coherency.
> > >
> > > The key difference is that a regirstationcache is allowed to become
> > > incoherent with the VMA's because it holds page pins. It is a
> > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > while the device is working on that VA, but it does not harm system
> > > integrity because of the page pin.
> > >
> > > The cache ensures that each initiated operation sees a DMA setup that
> > > matches the current VA map when the operation is initiated and allows
> > > expensive device DMA setups to be re-used.
> > >
> > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > invalidate_start should inform the device to enter a fault mode and
> > > that is it.  If the device can't do that, then the driver probably
> > > shouldn't persue this level of coherency. The driver would quickly get
> > > into the messy locking problems like dma_fence_wait from a notifier.
> >
> > I think here we do not agree on the hardware requirement. For GPU
> > we will always need to be able to wait for some GPU fence from inside
> > the notifier callback, there is just no way around that for many of
> > the GPUs today (i do not see any indication of that changing).
>
> I didn't say you couldn't wait, I was trying to say that the wait
> should only be contigent on the HW itself. Ie you can wait on a GPU
> page table lock, and you can wait on a GPU page table flush completion
> via IRQ.
>
> What is troubling is to wait till some other thread gets a GPU command
> completion and decr's a kref on the DMA buffer - which kinda looks
> like what this dma_fence() stuff is all about. A driver like that
> would have to be super careful to ensure consistent forward progress
> toward dma ref == 0 when the system is under reclaim.
>
> ie by running it's entire IRQ flow under fs_reclaim locking.

This is correct. At least for i915 it's already a required due to our
shrinker also having to do the same. I think amdgpu isn't bothering
with that since they have vram for most of the stuff, and just limit
system memory usage to half of all and forgo the shrinker. Probably
not the nicest approach. Anyway, both do the same mmu_notifier dance,
just want to explain that we've been living with this for longer
already.

So yeah writing a gpu driver is not easy.

> > associated with the mm_struct. In all GPU driver so far it is a short
> > lived lock and nothing blocking is done while holding it (it is just
> > about updating page table directory really wether it is filling it or
> > clearing it).
>
> The main blocking I expect in a shadow PTE flow is waiting for the HW
> to complete invalidations of its PTE cache.
>
> > > It is important to identify what model you are going for as defining a
> > > 'registration cache' coherence expectation allows the driver to skip
> > > blocking in invalidate_range_start. All it does is invalidate the
> > > cache so that future operations pick up the new VA mapping.
> > >
> > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > well proven, within some limitations of course.
> > >
> > > At least, 'registration cache' is the only use model I know of where
> > > it is acceptable to skip invalidate_range_end.
> >
> > Here GPU are not in the registration cache model, i know it might looks
> > like it because of GUP but GUP was use just because hmm did not exist
> > at the time.
>
> It is not because of GUP, it is because of the lack of
> invalidate_range_end. A driver cannot correctly implement the SPTE
> model without invalidate_range_end, even if it holds the page pins via
> GUP.
>
> So, I've been assuming the few drivers without invalidate_range_end
> are trying to do registration caching, rather than assuming they are
> broken.

I915 might just be broken. amdgpu does the full thing, using
hmm_mirror. But still with dma_fence_wait.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:21                     ` Daniel Vetter
@ 2019-08-15 17:35                       ` Jerome Glisse
  0 siblings, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-08-15 17:35 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jason Gunthorpe, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 07:21:47PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 7:16 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Thu, Aug 15, 2019 at 12:32:38PM -0400, Jerome Glisse wrote:
> > > On Thu, Aug 15, 2019 at 12:10:28PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Aug 15, 2019 at 04:43:38PM +0200, Daniel Vetter wrote:
> > > >
> > > > > You have to wait for the gpu to finnish current processing in
> > > > > invalidate_range_start. Otherwise there's no point to any of this
> > > > > really. So the wait_event/dma_fence_wait are unavoidable really.
> > > >
> > > > I don't envy your task :|
> > > >
> > > > But, what you describe sure sounds like a 'registration cache' model,
> > > > not the 'shadow pte' model of coherency.
> > > >
> > > > The key difference is that a regirstationcache is allowed to become
> > > > incoherent with the VMA's because it holds page pins. It is a
> > > > programming bug in userspace to change VA mappings via mmap/munmap/etc
> > > > while the device is working on that VA, but it does not harm system
> > > > integrity because of the page pin.
> > > >
> > > > The cache ensures that each initiated operation sees a DMA setup that
> > > > matches the current VA map when the operation is initiated and allows
> > > > expensive device DMA setups to be re-used.
> > > >
> > > > A 'shadow pte' model (ie hmm) *really* needs device support to
> > > > directly block DMA access - ie trigger 'device page fault'. ie the
> > > > invalidate_start should inform the device to enter a fault mode and
> > > > that is it.  If the device can't do that, then the driver probably
> > > > shouldn't persue this level of coherency. The driver would quickly get
> > > > into the messy locking problems like dma_fence_wait from a notifier.
> > >
> > > I think here we do not agree on the hardware requirement. For GPU
> > > we will always need to be able to wait for some GPU fence from inside
> > > the notifier callback, there is just no way around that for many of
> > > the GPUs today (i do not see any indication of that changing).
> >
> > I didn't say you couldn't wait, I was trying to say that the wait
> > should only be contigent on the HW itself. Ie you can wait on a GPU
> > page table lock, and you can wait on a GPU page table flush completion
> > via IRQ.
> >
> > What is troubling is to wait till some other thread gets a GPU command
> > completion and decr's a kref on the DMA buffer - which kinda looks
> > like what this dma_fence() stuff is all about. A driver like that
> > would have to be super careful to ensure consistent forward progress
> > toward dma ref == 0 when the system is under reclaim.
> >
> > ie by running it's entire IRQ flow under fs_reclaim locking.
> 
> This is correct. At least for i915 it's already a required due to our
> shrinker also having to do the same. I think amdgpu isn't bothering
> with that since they have vram for most of the stuff, and just limit
> system memory usage to half of all and forgo the shrinker. Probably
> not the nicest approach. Anyway, both do the same mmu_notifier dance,
> just want to explain that we've been living with this for longer
> already.
> 
> So yeah writing a gpu driver is not easy.
> 
> > > associated with the mm_struct. In all GPU driver so far it is a short
> > > lived lock and nothing blocking is done while holding it (it is just
> > > about updating page table directory really wether it is filling it or
> > > clearing it).
> >
> > The main blocking I expect in a shadow PTE flow is waiting for the HW
> > to complete invalidations of its PTE cache.
> >
> > > > It is important to identify what model you are going for as defining a
> > > > 'registration cache' coherence expectation allows the driver to skip
> > > > blocking in invalidate_range_start. All it does is invalidate the
> > > > cache so that future operations pick up the new VA mapping.
> > > >
> > > > Intel's HFI RDMA driver uses this model extensively, and I think it is
> > > > well proven, within some limitations of course.
> > > >
> > > > At least, 'registration cache' is the only use model I know of where
> > > > it is acceptable to skip invalidate_range_end.
> > >
> > > Here GPU are not in the registration cache model, i know it might looks
> > > like it because of GUP but GUP was use just because hmm did not exist
> > > at the time.
> >
> > It is not because of GUP, it is because of the lack of
> > invalidate_range_end. A driver cannot correctly implement the SPTE
> > model without invalidate_range_end, even if it holds the page pins via
> > GUP.
> >
> > So, I've been assuming the few drivers without invalidate_range_end
> > are trying to do registration caching, rather than assuming they are
> > broken.
> 
> I915 might just be broken. amdgpu does the full thing, using
> hmm_mirror. But still with dma_fence_wait.

Yeah i915 is broken but it never hurted anyone ;) I posted patch
a long time ago to convert it to hmm but i delayed that to until
i can get through making something of GUPfast that can also be
use for HMM/ODP user.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 16:25                 ` Daniel Vetter
@ 2019-08-15 17:35                   ` Jason Gunthorpe
  2019-08-15 17:39                     ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 17:35 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Andrew Morton, LKML, Linux MM, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	David Rientjes, Christian König, Jérôme Glisse,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:

> I'm not really well versed in the details of our userptr, but both
> amdgpu and i915 wait for the gpu to complete from
> invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> one, so maybe he can explain what exactly it is we're doing ...

amdgpu is (wrongly) using hmm for something, I can't really tell what
it is trying to do. The calls to dma_fence under the
invalidate_range_start do not give me a good feeling.

However, i915 shows all the signs of trying to follow the registration
cache model, it even has a nice comment in
i915_gem_userptr_get_pages() explaining that the races it has don't
matter because it is a user space bug to change the VA mapping in the
first place. That just screams registration cache to me.

So it is fine to run HW that way, but if you do, there is no reason to
fence inside the invalidate_range end. Just orphan the DMA buffer and
clean it up & release the page pins when all DMA buffer refs go to
zero. The next access to that VA should get a new DMA buffer with the
right mapping.

In other words the invalidation should be very simple without
complicated locking, or wait_event's. Look at hfi1 for example.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:35                   ` Jason Gunthorpe
@ 2019-08-15 17:39                     ` Jerome Glisse
  2019-08-15 18:01                       ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-08-15 17:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> 
> > I'm not really well versed in the details of our userptr, but both
> > amdgpu and i915 wait for the gpu to complete from
> > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > one, so maybe he can explain what exactly it is we're doing ...
> 
> amdgpu is (wrongly) using hmm for something, I can't really tell what
> it is trying to do. The calls to dma_fence under the
> invalidate_range_start do not give me a good feeling.
> 
> However, i915 shows all the signs of trying to follow the registration
> cache model, it even has a nice comment in
> i915_gem_userptr_get_pages() explaining that the races it has don't
> matter because it is a user space bug to change the VA mapping in the
> first place. That just screams registration cache to me.
> 
> So it is fine to run HW that way, but if you do, there is no reason to
> fence inside the invalidate_range end. Just orphan the DMA buffer and
> clean it up & release the page pins when all DMA buffer refs go to
> zero. The next access to that VA should get a new DMA buffer with the
> right mapping.
> 
> In other words the invalidation should be very simple without
> complicated locking, or wait_event's. Look at hfi1 for example.

This would break the today usage model of uptr and it will
break userspace expectation ie if GPU is writting to that
memory and that memory then the userspace want to make sure
that it will see what the GPU write.

Yes i915 is broken in respect to not having a end notifier
and tracking active invalidation for a range but the GUP
side of thing kind of hide this bug and it shrinks the window
for bad to happen to something so small that i doubt anyone
could ever hit it (still a bug thought).

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 16:56               ` Jason Gunthorpe
  2019-08-15 17:11                 ` Jerome Glisse
@ 2019-08-15 17:42                 ` Michal Hocko
  2019-08-15 17:57                   ` Jerome Glisse
  2019-08-15 18:24                   ` Jason Gunthorpe
  1 sibling, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-15 17:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> 
> > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > __GFP_DIRECT_RECLAIM..
> > >
> > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > allocations during OOM, then I think fs_reclaim already matches what
> > > you described?
> > 
> > No GFP_NOFS is equally bad. Please read my other email explaining what
> > the oom_reaper actually requires. In short no blocking on direct or
> > indirect dependecy on memory allocation that might sleep.
> 
> It is much easier to follow with some hints on code, so the true
> requirement is that the OOM repear not block on GFP_FS and GFP_IO
> allocations, great, that constraint is now clear.

I still do not get why do you put FS/IO into the picture. This is really
about __GFP_DIRECT_RECLAIM.

> 
> > If you can express that in the existing lockdep machinery. All
> > fine. But then consider deployments where lockdep is no-no because
> > of the overhead.
> 
> This is all for driver debugging. The point of lockdep is to find all
> these paths without have to hit them as actual races, using debug
> kernels.
> 
> I don't think we need this kind of debugging on production kernels?

Again, the primary motivation was a simple debugging aid that could be
used without worrying about overhead. So lockdep is very often out of
the question.

> > > The best we got was drivers tested the VA range and returned success
> > > if they had no interest. Which is a big win to be sure, but it looks
> > > like getting any more is not really posssible.
> > 
> > And that is already a great win! Because many notifiers only do care
> > about particular mappings. Please note that backing off unconditioanlly
> > will simply cause that the oom reaper will have to back off not doing
> > any tear down anything.
> 
> Well, I'm working to propose that we do the VA range test under core
> mmu notifier code that cannot block and then we simply remove the idea
> of blockable from drivers using this new 'range notifier'. 
> 
> I think this pretty much solves the concern?

Well, my idea was that a range check and early bail out was a first step
and then each specific notifier would be able to do a more specific
check. I was not able to do the second step because that requires a deep
understanding of the respective subsystem.

Really all I do care about is to reclaim as much memory from the
oom_reaper context as possible. And that cannot really be an unbounded
process. Quite contrary it should be as swift as possible. From my
cursory look some notifiers are able to achieve their task without
blocking or depending on memory just fine. So bailing out
unconditionally on the range of interest would just put us back.

> > > However, we could (probably even should) make the drivers fs_reclaim
> > > safe.
> > > 
> > > If that is enough to guarantee progress of OOM, then lets consider
> > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > allocation behavior on the driver callback and lockdep to try and keep
> > > pushing on the the debugging, and dropping !blocking.
> > 
> > How are you going to enforce indirect dependency? E.g. a lock that is
> > also used in other context which depend on sleepable memory allocation
> > to move forward.
> 
> You mean like this:
> 
>        CPU0                                 CPU1
>                                         mutex_lock()
>                                         kmalloc(GFP_KERNEL)

no I mean __GFP_DIRECT_RECLAIM here.

>                                         mutex_unlock()
>   fs_reclaim_acquire()
>   mutex_lock() <- illegal: lock dep assertion

I cannot really comment on how that is achieveable by lockdep. I managed
to forget details about FS/IO reclaim recursion tracking already and I
do not have time to learn it again. It was quite a hack. Anyway, let me
repeat that the primary motivation was a simple aid. Not something as
poverful as lockdep.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:42                 ` Michal Hocko
@ 2019-08-15 17:57                   ` Jerome Glisse
  2019-08-15 18:24                   ` Jason Gunthorpe
  1 sibling, 0 replies; 69+ messages in thread
From: Jerome Glisse @ 2019-08-15 17:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jason Gunthorpe, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Peter Zijlstra, Ingo Molnar,
	Andrew Morton, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why do you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.
> 
> > 
> > > If you can express that in the existing lockdep machinery. All
> > > fine. But then consider deployments where lockdep is no-no because
> > > of the overhead.
> > 
> > This is all for driver debugging. The point of lockdep is to find all
> > these paths without have to hit them as actual races, using debug
> > kernels.
> > 
> > I don't think we need this kind of debugging on production kernels?
> 
> Again, the primary motivation was a simple debugging aid that could be
> used without worrying about overhead. So lockdep is very often out of
> the question.
> 
> > > > The best we got was drivers tested the VA range and returned success
> > > > if they had no interest. Which is a big win to be sure, but it looks
> > > > like getting any more is not really posssible.
> > > 
> > > And that is already a great win! Because many notifiers only do care
> > > about particular mappings. Please note that backing off unconditioanlly
> > > will simply cause that the oom reaper will have to back off not doing
> > > any tear down anything.
> > 
> > Well, I'm working to propose that we do the VA range test under core
> > mmu notifier code that cannot block and then we simply remove the idea
> > of blockable from drivers using this new 'range notifier'. 
> > 
> > I think this pretty much solves the concern?
> 
> Well, my idea was that a range check and early bail out was a first step
> and then each specific notifier would be able to do a more specific
> check. I was not able to do the second step because that requires a deep
> understanding of the respective subsystem.
> 
> Really all I do care about is to reclaim as much memory from the
> oom_reaper context as possible. And that cannot really be an unbounded
> process. Quite contrary it should be as swift as possible. From my
> cursory look some notifiers are able to achieve their task without
> blocking or depending on memory just fine. So bailing out
> unconditionally on the range of interest would just put us back.

Agree, OOM just asking the question: can i unmap that page quickly ?
so that me (OOM) can swap it out. In many cases the driver need to
lookup something to see if at the time the memory is just not in use
and can be reclaim right away. So driver should have a path to
optimistically update its state to allow quick reclaim.


> > > > However, we could (probably even should) make the drivers fs_reclaim
> > > > safe.
> > > > 
> > > > If that is enough to guarantee progress of OOM, then lets consider
> > > > something like using current_gfp_context() to force PF_MEMALLOC_NOFS
> > > > allocation behavior on the driver callback and lockdep to try and keep
> > > > pushing on the the debugging, and dropping !blocking.
> > > 
> > > How are you going to enforce indirect dependency? E.g. a lock that is
> > > also used in other context which depend on sleepable memory allocation
> > > to move forward.
> > 
> > You mean like this:
> > 
> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achieveable by lockdep. I managed
> to forget details about FS/IO reclaim recursion tracking already and I
> do not have time to learn it again. It was quite a hack. Anyway, let me
> repeat that the primary motivation was a simple aid. Not something as
> poverful as lockdep.

I feel that the fs_reclaim_acquire() is just too heavy weight here. I
do think that Daniel patches improve the debugging situation without
burdening anything so i am in favor or merging that.

I do not think we should devote too much time into fs_reclaim(), our
time would be better spend in improving the driver shrinker for instance
after all OOM is all about trying to free-up memory.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:39                     ` Jerome Glisse
@ 2019-08-15 18:01                       ` Jason Gunthorpe
  2019-08-15 18:27                         ` Jerome Glisse
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 18:01 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > 
> > > I'm not really well versed in the details of our userptr, but both
> > > amdgpu and i915 wait for the gpu to complete from
> > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > one, so maybe he can explain what exactly it is we're doing ...
> > 
> > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > it is trying to do. The calls to dma_fence under the
> > invalidate_range_start do not give me a good feeling.
> > 
> > However, i915 shows all the signs of trying to follow the registration
> > cache model, it even has a nice comment in
> > i915_gem_userptr_get_pages() explaining that the races it has don't
> > matter because it is a user space bug to change the VA mapping in the
> > first place. That just screams registration cache to me.
> > 
> > So it is fine to run HW that way, but if you do, there is no reason to
> > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > clean it up & release the page pins when all DMA buffer refs go to
> > zero. The next access to that VA should get a new DMA buffer with the
> > right mapping.
> > 
> > In other words the invalidation should be very simple without
> > complicated locking, or wait_event's. Look at hfi1 for example.
> 
> This would break the today usage model of uptr and it will
> break userspace expectation ie if GPU is writting to that
> memory and that memory then the userspace want to make sure
> that it will see what the GPU write.

How exactly? This is holding the page pin, so the only way the VA
mapping can be changed is via explicit user action.

ie:

   gpu_write_something(va, size)
   mmap(.., va, size, MMAP_FIXED);
   gpu_wait_done()

This is racy and indeterminate with both models.

Based on the comment in i915 it appears to be going on the model that
changes to the mmap by userspace when the GPU is working on it is a
programming bug. This is reasonable, lots of systems use this kind of
consistency model.

Since the driver seems to rely on a dma_fence to block DMA access, it
looks to me like the kernel has full visibility to the
'gpu_write_something()' and 'gpu_wait_done()' markers.

I think trying to use hmm_range_fault on HW that can't do HW page
faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
is what amd gpu is trying to do.

I haven't yet seen anything that looks like 'TLB shootdown' in i915??

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 17:42                 ` Michal Hocko
  2019-08-15 17:57                   ` Jerome Glisse
@ 2019-08-15 18:24                   ` Jason Gunthorpe
  2019-08-15 19:05                     ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 18:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > 
> > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > __GFP_DIRECT_RECLAIM..
> > > >
> > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > you described?
> > > 
> > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > the oom_reaper actually requires. In short no blocking on direct or
> > > indirect dependecy on memory allocation that might sleep.
> > 
> > It is much easier to follow with some hints on code, so the true
> > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > allocations, great, that constraint is now clear.
> 
> I still do not get why do you put FS/IO into the picture. This is really
> about __GFP_DIRECT_RECLAIM.

Like I said this is complicated, translating "no blocking on direct or
indirect dependecy on memory allocation that might sleep" into GFP
flags is hard for us outside the mm community.

So the contraint here is no __GFP_DIRECT_RECLAIM?

I bring up FS/IO because that is what Tejun mentioned when I asked him
about reclaim restrictions, and is what fs_reclaim_acquire() is
already sensitive too. It is pretty confusing if we have places using
the word 'reclaim' with different restrictions. :(

> >        CPU0                                 CPU1
> >                                         mutex_lock()
> >                                         kmalloc(GFP_KERNEL)
> 
> no I mean __GFP_DIRECT_RECLAIM here.
> 
> >                                         mutex_unlock()
> >   fs_reclaim_acquire()
> >   mutex_lock() <- illegal: lock dep assertion
> 
> I cannot really comment on how that is achieveable by lockdep.

??? I am trying to explain this is already done and working today. The
above example will already fault with lockdep enabled.

This is existing debugging we can use and improve upon rather that
invent new debugging.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 18:01                       ` Jason Gunthorpe
@ 2019-08-15 18:27                         ` Jerome Glisse
  2019-08-15 18:57                           ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Jerome Glisse @ 2019-08-15 18:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 03:01:59PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 01:39:22PM -0400, Jerome Glisse wrote:
> > On Thu, Aug 15, 2019 at 02:35:57PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:25:16PM +0200, Daniel Vetter wrote:
> > > 
> > > > I'm not really well versed in the details of our userptr, but both
> > > > amdgpu and i915 wait for the gpu to complete from
> > > > invalidate_range_start. Jerome has at least looked a lot at the amdgpu
> > > > one, so maybe he can explain what exactly it is we're doing ...
> > > 
> > > amdgpu is (wrongly) using hmm for something, I can't really tell what
> > > it is trying to do. The calls to dma_fence under the
> > > invalidate_range_start do not give me a good feeling.
> > > 
> > > However, i915 shows all the signs of trying to follow the registration
> > > cache model, it even has a nice comment in
> > > i915_gem_userptr_get_pages() explaining that the races it has don't
> > > matter because it is a user space bug to change the VA mapping in the
> > > first place. That just screams registration cache to me.
> > > 
> > > So it is fine to run HW that way, but if you do, there is no reason to
> > > fence inside the invalidate_range end. Just orphan the DMA buffer and
> > > clean it up & release the page pins when all DMA buffer refs go to
> > > zero. The next access to that VA should get a new DMA buffer with the
> > > right mapping.
> > > 
> > > In other words the invalidation should be very simple without
> > > complicated locking, or wait_event's. Look at hfi1 for example.
> > 
> > This would break the today usage model of uptr and it will
> > break userspace expectation ie if GPU is writting to that
> > memory and that memory then the userspace want to make sure
> > that it will see what the GPU write.
> 
> How exactly? This is holding the page pin, so the only way the VA
> mapping can be changed is via explicit user action.
> 
> ie:
> 
>    gpu_write_something(va, size)
>    mmap(.., va, size, MMAP_FIXED);
>    gpu_wait_done()
> 
> This is racy and indeterminate with both models.
> 
> Based on the comment in i915 it appears to be going on the model that
> changes to the mmap by userspace when the GPU is working on it is a
> programming bug. This is reasonable, lots of systems use this kind of
> consistency model.

Well userspace process doing munmap(), mremap(), fork() and things like
that are a bug from the i915 kernel and userspace contract point of view.

But things like migration or reclaim are not cover under that contract
and for those the expectation is that CPU access to the same virtual address
should allow to get what was last written to it either by the GPU or the
CPU.

> 
> Since the driver seems to rely on a dma_fence to block DMA access, it
> looks to me like the kernel has full visibility to the
> 'gpu_write_something()' and 'gpu_wait_done()' markers.

So let's only consider the case where GPU wants to write to the memory
(the read only case is obviously right and not need any notifier in
fact) and like above the only thing we care about is reclaim or migration
(for instance because of some numa compaction) as the rest is cover by
i915 userspace contract.

So in the write case we do GUPfast(write=true) which will be "safe" from
any concurrent CPU page table update ie if GUPfast get a reference for
the page then any racing CPU page table update will not be able to migrate
or reclaim the page and thus the virtual address to page association will
be preserve.

This is only true because of GUPfast(), now if GUPfast() fails it will
fallback to the slow GUP case which make the same thing safe by taking
the page table lock.

Because of the reference on the page the i915 driver can forego the mmu
notifier end callback. The thing here is that taking a page reference
is pointless if we have better synchronization and tracking of mmu
notifier. Hence converting to hmm mirror allows to avoid taking a ref
on the page while still keeping the same functionality as of today.


> I think trying to use hmm_range_fault on HW that can't do HW page
> faulting and HW 'TLB shootdown' is a very, very bad idea. I fear that
> is what amd gpu is trying to do.
> 
> I haven't yet seen anything that looks like 'TLB shootdown' in i915??

GPU driver have complex usage pattern the tlb shootdown is implicit
once the GEM object associated with the uptr is invalidated it means
next time userspace submit command against that GEM object it will
have to re-validate it which means re-program the GPU page table to
point to the proper address (and re-call GUP).

So the whole GPU page table update is all hidden behind GEM function
like i915_gem_object_unbind() (or ttm invalidate for amd/radeon).

Technicaly those GPU do not support page faulting but because of the
way GPU works you know that you have frequent checkpoint where you
can unbind things. This is what is happening here.

Also not all GPU engines can handle page fault, this is true of all
GPU with page fault AFAIK (AMD and NVidia so far). This is why
uptr is a limited form of SVM (share virtual memory) that can be
use on all GPUs engine including the dma engine. When using the
full SVM power (like hmm mirror with nouveau) this is only use in
GPU engine that supports page fault (but updating the page table
still require locking and waiting on acknowledge).

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 18:27                         ` Jerome Glisse
@ 2019-08-15 18:57                           ` Jason Gunthorpe
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 18:57 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Daniel Vetter, Michal Hocko, Andrew Morton, LKML, Linux MM,
	DRI Development, Intel Graphics Development, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Masahiro Yamada, Wei Wang, Andy Shevchenko, Thomas Gleixner,
	Jann Horn, Feng Tang, Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 02:27:19PM -0400, Jerome Glisse wrote:
> > How exactly? This is holding the page pin, so the only way the VA
> > mapping can be changed is via explicit user action.
> > 
> > ie:
> > 
> >    gpu_write_something(va, size)
> >    mmap(.., va, size, MMAP_FIXED);
> >    gpu_wait_done()
> > 
> > This is racy and indeterminate with both models.
> > 
> > Based on the comment in i915 it appears to be going on the model that
> > changes to the mmap by userspace when the GPU is working on it is a
> > programming bug. This is reasonable, lots of systems use this kind of
> > consistency model.
> 
> Well userspace process doing munmap(), mremap(), fork() and things like
> that are a bug from the i915 kernel and userspace contract point of view.
> 
> But things like migration or reclaim are not cover under that contract
> and for those the expectation is that CPU access to the same virtual address
> should allow to get what was last written to it either by the GPU or the
> CPU.

Okay, this is a more reasonable point - I agree the i915 registration
cache model precludes using migration and thus DEVICE_PRIVATE. This is
a strong motivation to use the hmm approach

But we started out this converstation asking if i915 is correct, and I
still say a registration cache model is a functionally correct way to
use notifiers.

> Because of the reference on the page the i915 driver can forego the mmu
> notifier end callback. The thing here is that taking a page reference
> is pointless if we have better synchronization and tracking of mmu
> notifier. Hence converting to hmm mirror allows to avoid taking a ref
> on the page while still keeping the same functionality as of today.

However, there is a huge trade off here. Drivers like this are going
to have a very complicated locking inside invalidate_range_start as
they must sleep waiting for dma buffer references to go to zero.

> GPU driver have complex usage pattern the tlb shootdown is implicit
> once the GEM object associated with the uptr is invalidated it means
> next time userspace submit command against that GEM object it will
> have to re-validate it which means re-program the GPU page table to
> point to the proper address (and re-call GUP).

I think it is a mistake to try and cram the very different approaches
to notifiers into the same API. SW ref counting of DMA buffers vs HW
async page faulting have totally different requirements and locking
schemes.

This explains why AMDGPU gets away with not using the hmm API
properly, it is probably relying on its DMA refcount, not the hmm
valid, to keep things in order?

I think the API approach in hmm_mirror is reasonable for page faulting
HW, but does not serve refcounting HW well at all.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 18:24                   ` Jason Gunthorpe
@ 2019-08-15 19:05                     ` Michal Hocko
  2019-08-15 19:18                       ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2019-08-15 19:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 15:24:48, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 07:42:07PM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 13:56:31, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 06:00:41PM +0200, Michal Hocko wrote:
> > > 
> > > > > AFAIK 'GFP_NOWAIT' is characterized by the lack of __GFP_FS and
> > > > > __GFP_DIRECT_RECLAIM..
> > > > >
> > > > > This matches the existing test in __need_fs_reclaim() - so if you are
> > > > > OK with GFP_NOFS, aka __GFP_IO which triggers try_to_compact_pages(),
> > > > > allocations during OOM, then I think fs_reclaim already matches what
> > > > > you described?
> > > > 
> > > > No GFP_NOFS is equally bad. Please read my other email explaining what
> > > > the oom_reaper actually requires. In short no blocking on direct or
> > > > indirect dependecy on memory allocation that might sleep.
> > > 
> > > It is much easier to follow with some hints on code, so the true
> > > requirement is that the OOM repear not block on GFP_FS and GFP_IO
> > > allocations, great, that constraint is now clear.
> > 
> > I still do not get why do you put FS/IO into the picture. This is really
> > about __GFP_DIRECT_RECLAIM.
> 
> Like I said this is complicated, translating "no blocking on direct or
> indirect dependecy on memory allocation that might sleep" into GFP
> flags is hard for us outside the mm community.
> 
> So the contraint here is no __GFP_DIRECT_RECLAIM?

OK, I am obviously failing to explain that. Sorry about that. You are
right that this is not simple. Let me try again.

The context we are calling !blockable notifiers from has to finish in a
_finite_ amount of time (and swift is hugely appreciated by users of
otherwise non-responsive system that is under OOM). We are out of memory
so we cannot be blocked waiting for memory. Directly or indirectly (via
a lock, waiting for an event that needs memory to finish in general). So
you need to track dependency over more complicated contexts than the
direct call path (think of workqueue for example).

> I bring up FS/IO because that is what Tejun mentioned when I asked him
> about reclaim restrictions, and is what fs_reclaim_acquire() is
> already sensitive too. It is pretty confusing if we have places using
> the word 'reclaim' with different restrictions. :(

fs_reclaim has been invented to catch potential deadlocks when a
GFP_NO{FS/IO} allocation hits into fs/io reclaim. This is a subset of
the reclaim that we have. The oom context is more restricted and it
cannot depend on _any_ memory reclaim because by the time we have got to
this context all the reclaim has already failed and wait some more will
simply not help.

> > >        CPU0                                 CPU1
> > >                                         mutex_lock()
> > >                                         kmalloc(GFP_KERNEL)
> > 
> > no I mean __GFP_DIRECT_RECLAIM here.
> > 
> > >                                         mutex_unlock()
> > >   fs_reclaim_acquire()
> > >   mutex_lock() <- illegal: lock dep assertion
> > 
> > I cannot really comment on how that is achieveable by lockdep.
> 
> ??? I am trying to explain this is already done and working today. The
> above example will already fault with lockdep enabled.
> 
> This is existing debugging we can use and improve upon rather that
> invent new debugging.

This is what you claim and I am saying that fs_reclaim is about a
restricted reclaim context and it is an ugly hack. It has proven to
report false positives. Maybe it can be extended to a generic reclaim.
I haven't tried that. Do not aim to try it.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 19:05                     ` Michal Hocko
@ 2019-08-15 19:18                       ` Jason Gunthorpe
  2019-08-15 19:35                         ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 19:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:

> This is what you claim and I am saying that fs_reclaim is about a
> restricted reclaim context and it is an ugly hack. It has proven to
> report false positives. Maybe it can be extended to a generic reclaim.
> I haven't tried that. Do not aim to try it.

Okay, great, I think this has been very helpful, at least for me,
thanks. I did not know fs_reclaim was so problematic, or the special
cases about OOM 'reclaim'. 

On this patch, I have no general objection to enforcing drivers to be
non-blocking, I'd just like to see it done with the existing lockdep
can't sleep detection rather than inventing some new debugging for it.

I understand this means the debugging requires lockdep enabled and
will not run in production, but I'm of the view that is OK and in line
with general kernel practice.

The last detail is I'm still unclear what a GFP flags a blockable
invalidate_range_start() should use. Is GFP_KERNEL OK? Lockdep has
complained on that in past due to fs_reclaim - how do you know if it
is a false positive?

Thanks again,
Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 19:18                       ` Jason Gunthorpe
@ 2019-08-15 19:35                         ` Michal Hocko
  2019-08-15 20:13                           ` Jason Gunthorpe
  2019-08-15 20:16                           ` [Intel-gfx] " Daniel Vetter
  0 siblings, 2 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-15 19:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> 
> > This is what you claim and I am saying that fs_reclaim is about a
> > restricted reclaim context and it is an ugly hack. It has proven to
> > report false positives. Maybe it can be extended to a generic reclaim.
> > I haven't tried that. Do not aim to try it.
> 
> Okay, great, I think this has been very helpful, at least for me,
> thanks. I did not know fs_reclaim was so problematic, or the special
> cases about OOM 'reclaim'. 

I am happy that this is more clear now.

> On this patch, I have no general objection to enforcing drivers to be
> non-blocking, I'd just like to see it done with the existing lockdep
> can't sleep detection rather than inventing some new debugging for it.
> 
> I understand this means the debugging requires lockdep enabled and
> will not run in production, but I'm of the view that is OK and in line
> with general kernel practice.

Yes and I do agree with this in general.

> The last detail is I'm still unclear what a GFP flags a blockable
> invalidate_range_start() should use. Is GFP_KERNEL OK?

I hope I will not make this muddy again ;)
invalidate_range_start in the blockable mode can use/depend on any sleepable
allocation allowed in the context it is called from. So in other words
it is no different from any other function in the kernel that calls into
allocator. As the API is missing gfp context then I hope it is not
called from any restricted contexts (except from the oom which we have
!blockable for).

> Lockdep has
> complained on that in past due to fs_reclaim - how do you know if it
> is a false positive?

I would have to see the specific lockdep splat.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 19:35                         ` Michal Hocko
@ 2019-08-15 20:13                           ` Jason Gunthorpe
  2019-08-16  8:10                             ` Michal Hocko
  2019-08-15 20:16                           ` [Intel-gfx] " Daniel Vetter
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 20:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:

> > The last detail is I'm still unclear what a GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
> 
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from. 

'in the context is is called from' is the magic phrase, as
invalidate_range_start is called while holding several different mm
related locks. I know at least write mmap_sem and i_mmap_rwsem
(write?)

Can GFP_KERNEL be called while holding those locks?

This is the question of indirect dependency on reclaim via locks you
raised earlier.

> So in other words it is no different from any other function in the
> kernel that calls into allocator. As the API is missing gfp context
> then I hope it is not called from any restricted contexts (except
> from the oom which we have !blockable for).

Yes, the callers are exactly my concern.
 
> > Lockdep has
> > complained on that in past due to fs_reclaim - how do you know if it
> > is a false positive?
> 
> I would have to see the specific lockdep splat.

See below. I found it when trying to understand why the registration
of the mmu notififer was so oddly coded.

The situation was:

  down_write(&mm->mmap_sem);
  mm_take_all_locks(mm);
  kmalloc(GFP_KERNEL);  <--- lockdep warning

I understood Daniel said he saw this directly on a recent kernel when
working with his lockdep patch?

Checking myself, on todays kernel I see a call chain:

shrink_all_memory
  fs_reclaim_acquire(sc.gfp_mask);
  [..]
  do_try_to_free_pages
   shrink_zones
    shrink_node
     shrink_node_memcg
      shrink_list
       shrink_active_list
        page_referenced
         rmap_walk
          rmap_walk_file
           i_mmap_lock_read
            down_read(i_mmap_rwsem)

So it is possible that the down_read() above will block on
i_mmap_rwsem being held in the caller of invalidate_range_start which
is doing kmalloc(GPF_KERNEL).

Is this OK? The lockdep annotation says no..

Jason

commit 35cfa2b0b491c37e23527822bf365610dbb188e5
Author: Gavin Shan <shangw@linux.vnet.ibm.com>
Date:   Thu Oct 25 13:38:01 2012 -0700

    mm/mmu_notifier: allocate mmu_notifier in advance
    
    While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
    to work in case of tight available memory.  Eventually, that would lead to
    a deadlock while the swap deamon swaps anonymous pages.  It was caused by
    commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").
    
      =================================
      [ INFO: inconsistent lock state ]
      3.7.0-rc1+ #518 Not tainted
      ---------------------------------
      inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
      {RECLAIM_FS-ON-W} state was registered at:
         mark_held_locks+0x86/0x150
         lockdep_trace_alloc+0x67/0xc0
         kmem_cache_alloc_trace+0x33/0x230
         do_mmu_notifier_register+0x87/0x180
         mmu_notifier_register+0x13/0x20
         kvm_dev_ioctl+0x428/0x510
         do_vfs_ioctl+0x98/0x570
         sys_ioctl+0x91/0xb0
         system_call_fastpath+0x16/0x1b


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 19:35                         ` Michal Hocko
  2019-08-15 20:13                           ` Jason Gunthorpe
@ 2019-08-15 20:16                           ` " Daniel Vetter
  2019-08-15 20:27                             ` Jason Gunthorpe
  2019-08-16  8:27                             ` Michal Hocko
  1 sibling, 2 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15 20:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jason Gunthorpe, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 15-08-19 16:18:10, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:05:25PM +0200, Michal Hocko wrote:
> >
> > > This is what you claim and I am saying that fs_reclaim is about a
> > > restricted reclaim context and it is an ugly hack. It has proven to
> > > report false positives. Maybe it can be extended to a generic reclaim.
> > > I haven't tried that. Do not aim to try it.
> >
> > Okay, great, I think this has been very helpful, at least for me,
> > thanks. I did not know fs_reclaim was so problematic, or the special
> > cases about OOM 'reclaim'.
>
> I am happy that this is more clear now.
>
> > On this patch, I have no general objection to enforcing drivers to be
> > non-blocking, I'd just like to see it done with the existing lockdep
> > can't sleep detection rather than inventing some new debugging for it.
> >
> > I understand this means the debugging requires lockdep enabled and
> > will not run in production, but I'm of the view that is OK and in line
> > with general kernel practice.
>
> Yes and I do agree with this in general.

So if someone can explain to me how that works with lockdep I can of
course implement it. But afaics that doesn't exist (I tried to explain
that somewhere else already), and I'm no really looking forward to
hacking also on lockdep for this little series.

> > The last detail is I'm still unclear what a GFP flags a blockable
> > invalidate_range_start() should use. Is GFP_KERNEL OK?
>
> I hope I will not make this muddy again ;)
> invalidate_range_start in the blockable mode can use/depend on any sleepable
> allocation allowed in the context it is called from. So in other words
> it is no different from any other function in the kernel that calls into
> allocator. As the API is missing gfp context then I hope it is not
> called from any restricted contexts (except from the oom which we have
> !blockable for).

Hm, that's new to me. I thought mmu notifiers very much can be called
from direct reclaim paths, so you have to be extremely careful with
getting back into that one. At least the lockdep splats I remember
also tend to have fs_reclaim in there, that's where all the fun comes
from.

> > Lockdep has
> > complained on that in past due to fs_reclaim - how do you know if it
> > is a false positive?
>
> I would have to see the specific lockdep splat.

I guess the lockdep annotation for invalidate_range_start carries the
same risks as the fs_reclaim annotation. Still feels like worth it.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 20:16                           ` [Intel-gfx] " Daniel Vetter
@ 2019-08-15 20:27                             ` Jason Gunthorpe
  2019-08-15 20:49                               ` Daniel Vetter
  2019-08-16  8:27                             ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-15 20:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:

> So if someone can explain to me how that works with lockdep I can of
> course implement it. But afaics that doesn't exist (I tried to explain
> that somewhere else already), and I'm no really looking forward to
> hacking also on lockdep for this little series.

Hmm, kind of looks like it is done by calling preempt_disable()

Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 20:27                             ` Jason Gunthorpe
@ 2019-08-15 20:49                               ` Daniel Vetter
  2019-08-16  1:00                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-15 20:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > So if someone can explain to me how that works with lockdep I can of
> > course implement it. But afaics that doesn't exist (I tried to explain
> > that somewhere else already), and I'm no really looking forward to
> > hacking also on lockdep for this little series.
>
> Hmm, kind of looks like it is done by calling preempt_disable()

Yup. That was v1, then came the suggestion that disabling preemption
is maybe not the best thing (the oom reaper could still run for a long
time comparatively, if it's cleaning out gigabytes of process memory
or what not, hence this dedicated debug infrastructure).

> Probably the debug option is CONFIG_DEBUG_PREEMPT, not lockdep?

CONFIG_DEBUG_ATOMIC_SLEEP. Like in my patch.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15  8:44     ` Michal Hocko
  2019-08-15 13:04       ` Jason Gunthorpe
@ 2019-08-15 22:15       ` Andrew Morton
  2019-08-16  8:24         ` Michal Hocko
  1 sibling, 1 reply; 69+ messages in thread
From: Andrew Morton @ 2019-08-15 22:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Jason Gunthorpe, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Jérôme Glisse, Masahiro Yamada, Wei Wang,
	Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> > I continue to struggle with this.  It introduces a new kernel state
> > "running preemptibly but must not call schedule()".  How does this make
> > any sense?
> > 
> > Perhaps a much, much more detailed description of the oom_reaper
> > situation would help out.
>  
> The primary point here is that there is a demand of non blockable mmu
> notifiers to be called when the oom reaper tears down the address space.
> As the oom reaper is the primary guarantee of the oom handling forward
> progress it cannot be blocked on anything that might depend on blockable
> memory allocations. These are not really easy to track because they
> might be indirect - e.g. notifier blocks on a lock which other context
> holds while allocating memory or waiting for a flusher that needs memory
> to perform its work. If such a blocking state happens that we can end up
> in a silent hang with an unusable machine.
> Now we hope for reasonable implementations of mmu notifiers (strong
> words I know ;) and this should be relatively simple and effective catch
> all tool to detect something suspicious is going on.
> 
> Does that make the situation more clear?

Yes, thanks, much.  Maybe a code comment along the lines of

  This is on behalf of the oom reaper, specifically when it is
  calling the mmu notifiers.  The problem is that if the notifier were
  to block on, for example, mutex_lock() and if the process which holds
  that mutex were to perform a sleeping memory allocation, the oom
  reaper is now blocked on completion of that memory allocation.

btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
thread could set a new PF_NONBLOCK (which would be more general than
PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).



^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 20:49                               ` Daniel Vetter
@ 2019-08-16  1:00                                 ` Jason Gunthorpe
  2019-08-16  6:20                                   ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-16  1:00 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > So if someone can explain to me how that works with lockdep I can of
> > > course implement it. But afaics that doesn't exist (I tried to explain
> > > that somewhere else already), and I'm no really looking forward to
> > > hacking also on lockdep for this little series.
> >
> > Hmm, kind of looks like it is done by calling preempt_disable()
> 
> Yup. That was v1, then came the suggestion that disabling preemption
> is maybe not the best thing (the oom reaper could still run for a long
> time comparatively, if it's cleaning out gigabytes of process memory
> or what not, hence this dedicated debug infrastructure).

Oh, I'm coming in late, sorry

Anyhow, I was thinking since we agreed this can trigger on some
CONFIG_DEBUG flag, something like

    /* This is a sleepable region, but use preempt_disable to get debugging
     * for calls that are not allowed to block for OOM [.. insert
     * Michal's explanation.. ] */
    if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
	preempt_disable();
    ops->invalidate_range_start();

And I have also been idly mulling doing something like

   if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) && 
       rand &&
       mmu_notifier_range_blockable(range)) {
     range->flags = 0
     if (!ops->invalidate_range_start(range))
	continue

     // Failed, try again as blockable
     range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
   }
   ops->invalidate_range_start(range);

Which would give coverage for this corner case without forcing OOM.

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16  1:00                                 ` Jason Gunthorpe
@ 2019-08-16  6:20                                   ` Daniel Vetter
  2019-08-16 12:12                                     ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-16  6:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > So if someone can explain to me how that works with lockdep I can of
> > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > that somewhere else already), and I'm no really looking forward to
> > > > hacking also on lockdep for this little series.
> > >
> > > Hmm, kind of looks like it is done by calling preempt_disable()
> >
> > Yup. That was v1, then came the suggestion that disabling preemption
> > is maybe not the best thing (the oom reaper could still run for a long
> > time comparatively, if it's cleaning out gigabytes of process memory
> > or what not, hence this dedicated debug infrastructure).
>
> Oh, I'm coming in late, sorry
>
> Anyhow, I was thinking since we agreed this can trigger on some
> CONFIG_DEBUG flag, something like
>
>     /* This is a sleepable region, but use preempt_disable to get debugging
>      * for calls that are not allowed to block for OOM [.. insert
>      * Michal's explanation.. ] */
>     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
>         preempt_disable();
>     ops->invalidate_range_start();

I think we also discussed that, and some expressed concerns it would
change behaviour/timing too much for testing. Since this does does
disable preemption for real, not just for might_sleep.

> And I have also been idly mulling doing something like
>
>    if (IS_ENABLED(CONFIG_DEBUG_NOTIFIERS) &&
>        rand &&
>        mmu_notifier_range_blockable(range)) {
>      range->flags = 0
>      if (!ops->invalidate_range_start(range))
>         continue
>
>      // Failed, try again as blockable
>      range->flags = MMU_NOTIFIER_RANGE_BLOCKABLE
>    }
>    ops->invalidate_range_start(range);
>
> Which would give coverage for this corner case without forcing OOM.

Hm, this sounds like a neat idea to slap on top. The rand is going to
be a bit tricky though, but I guess for this we could stuff another
counter into task_struct and just e.g. do this every 1000th or so
invalidate (well need to pick a prime so we cycle through notifiers in
case there's multiple). I like.

Michal, thoughts?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 20:13                           ` Jason Gunthorpe
@ 2019-08-16  8:10                             ` Michal Hocko
  2019-08-16 12:19                               ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Michal Hocko @ 2019-08-16  8:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> 
> > > The last detail is I'm still unclear what a GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > 
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. 
> 
> 'in the context is is called from' is the magic phrase, as
> invalidate_range_start is called while holding several different mm
> related locks. I know at least write mmap_sem and i_mmap_rwsem
> (write?)
> 
> Can GFP_KERNEL be called while holding those locks?

i_mmap_rwsem would be problematic because it is taken during the
reclaim.

> This is the question of indirect dependency on reclaim via locks you
> raised earlier.
> 
> > So in other words it is no different from any other function in the
> > kernel that calls into allocator. As the API is missing gfp context
> > then I hope it is not called from any restricted contexts (except
> > from the oom which we have !blockable for).
> 
> Yes, the callers are exactly my concern.
>  
> > > Lockdep has
> > > complained on that in past due to fs_reclaim - how do you know if it
> > > is a false positive?
> > 
> > I would have to see the specific lockdep splat.
> 
> See below. I found it when trying to understand why the registration
> of the mmu notififer was so oddly coded.
> 
> The situation was:
> 
>   down_write(&mm->mmap_sem);
>   mm_take_all_locks(mm);
>   kmalloc(GFP_KERNEL);  <--- lockdep warning

Ugh. mm_take_all_locks :/

> I understood Daniel said he saw this directly on a recent kernel when
> working with his lockdep patch?
> 
> Checking myself, on todays kernel I see a call chain:
> 
> shrink_all_memory
>   fs_reclaim_acquire(sc.gfp_mask);
>   [..]
>   do_try_to_free_pages
>    shrink_zones
>     shrink_node
>      shrink_node_memcg
>       shrink_list
>        shrink_active_list
>         page_referenced
>          rmap_walk
>           rmap_walk_file
>            i_mmap_lock_read
>             down_read(i_mmap_rwsem)
> 
> So it is possible that the down_read() above will block on
> i_mmap_rwsem being held in the caller of invalidate_range_start which
> is doing kmalloc(GPF_KERNEL).
> 
> Is this OK? The lockdep annotation says no..

It's not as per the above code patch which is easily possible because
mm_take_all_locks will lock all file vmas.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 22:15       ` Andrew Morton
@ 2019-08-16  8:24         ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-16  8:24 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Daniel Vetter, LKML, linux-mm, DRI Development,
	Intel Graphics Development, Jason Gunthorpe, Peter Zijlstra,
	Ingo Molnar, David Rientjes, Christian König,
	Jérôme Glisse, Masahiro Yamada, Wei Wang,
	Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Thu 15-08-19 15:15:09, Andrew Morton wrote:
> On Thu, 15 Aug 2019 10:44:29 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > > I continue to struggle with this.  It introduces a new kernel state
> > > "running preemptibly but must not call schedule()".  How does this make
> > > any sense?
> > > 
> > > Perhaps a much, much more detailed description of the oom_reaper
> > > situation would help out.
> >  
> > The primary point here is that there is a demand of non blockable mmu
> > notifiers to be called when the oom reaper tears down the address space.
> > As the oom reaper is the primary guarantee of the oom handling forward
> > progress it cannot be blocked on anything that might depend on blockable
> > memory allocations. These are not really easy to track because they
> > might be indirect - e.g. notifier blocks on a lock which other context
> > holds while allocating memory or waiting for a flusher that needs memory
> > to perform its work. If such a blocking state happens that we can end up
> > in a silent hang with an unusable machine.
> > Now we hope for reasonable implementations of mmu notifiers (strong
> > words I know ;) and this should be relatively simple and effective catch
> > all tool to detect something suspicious is going on.
> > 
> > Does that make the situation more clear?
> 
> Yes, thanks, much.  Maybe a code comment along the lines of
> 
>   This is on behalf of the oom reaper, specifically when it is
>   calling the mmu notifiers.  The problem is that if the notifier were
>   to block on, for example, mutex_lock() and if the process which holds
>   that mutex were to perform a sleeping memory allocation, the oom
>   reaper is now blocked on completion of that memory allocation.

    reaper is now blocked on completion of that memory allocation
    and cannot provide the guarantee of the OOM forward progress.

OK. 
 
> btw, do we need task_struct.non_block_count?  Perhaps the oom reaper
> thread could set a new PF_NONBLOCK (which would be more general than
> PF_OOM_REAPER).  If we run out of PF_ flags, use (current == oom_reaper_th).

Well, I do not have a strong opinion here. A simple check for the value
seems to be trivial. There are quite some holes in task_struct to hide
this counter without increasing the size.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-15 20:16                           ` [Intel-gfx] " Daniel Vetter
  2019-08-15 20:27                             ` Jason Gunthorpe
@ 2019-08-16  8:27                             ` Michal Hocko
  1 sibling, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-16  8:27 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Jason Gunthorpe, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Thu 15-08-19 22:16:43, Daniel Vetter wrote:
> On Thu, Aug 15, 2019 at 9:35 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > The last detail is I'm still unclear what a GFP flags a blockable
> > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> >
> > I hope I will not make this muddy again ;)
> > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > allocation allowed in the context it is called from. So in other words
> > it is no different from any other function in the kernel that calls into
> > allocator. As the API is missing gfp context then I hope it is not
> > called from any restricted contexts (except from the oom which we have
> > !blockable for).
> 
> Hm, that's new to me. I thought mmu notifiers very much can be called
> from direct reclaim paths, so you have to be extremely careful with
> getting back into that one.

Correct, I should have added that notifier callbacks ideally do not
allocate any memory. They can block and even that is quite a pain to be
honest.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16  6:20                                   ` Daniel Vetter
@ 2019-08-16 12:12                                     ` Jason Gunthorpe
  2019-08-16 14:11                                       ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-16 12:12 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > that somewhere else already), and I'm no really looking forward to
> > > > > hacking also on lockdep for this little series.
> > > >
> > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > >
> > > Yup. That was v1, then came the suggestion that disabling preemption
> > > is maybe not the best thing (the oom reaper could still run for a long
> > > time comparatively, if it's cleaning out gigabytes of process memory
> > > or what not, hence this dedicated debug infrastructure).
> >
> > Oh, I'm coming in late, sorry
> >
> > Anyhow, I was thinking since we agreed this can trigger on some
> > CONFIG_DEBUG flag, something like
> >
> >     /* This is a sleepable region, but use preempt_disable to get debugging
> >      * for calls that are not allowed to block for OOM [.. insert
> >      * Michal's explanation.. ] */
> >     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
> >         preempt_disable();
> >     ops->invalidate_range_start();
> 
> I think we also discussed that, and some expressed concerns it would
> change behaviour/timing too much for testing. Since this does does
> disable preemption for real, not just for might_sleep.

I don't follow, this is a debug kernel, it will have widly different
timing. 

Further the point of this debugging on atomic_sleep is to be as
timing-independent as possible since functions with rare sleeps should
be guarded by might_sleep() in their common paths.

I guess I don't get the push to have some low overhead debugging for
this? Is there something special you are looking for?

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16  8:10                             ` Michal Hocko
@ 2019-08-16 12:19                               ` Jason Gunthorpe
  2019-08-16 12:26                                 ` Michal Hocko
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-16 12:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > 
> > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > 
> > > I hope I will not make this muddy again ;)
> > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > allocation allowed in the context it is called from. 
> > 
> > 'in the context is is called from' is the magic phrase, as
> > invalidate_range_start is called while holding several different mm
> > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > (write?)
> > 
> > Can GFP_KERNEL be called while holding those locks?
> 
> i_mmap_rwsem would be problematic because it is taken during the
> reclaim.

Okay.. So the fs_reclaim debugging does catch errors. Do you have any
reference for what a false positive looks like? 

I would like to inject it into the notifier path as this is very
difficult for driver authors to discover and know about, but I'm
worried about your false positive remark.

I think I understand we can use only GFP_ATOMIC in the notifiers, but
we need a strategy to handle OOM to guarentee forward progress.

This is just more bugs to fix :(

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16 12:19                               ` Jason Gunthorpe
@ 2019-08-16 12:26                                 ` Michal Hocko
  0 siblings, 0 replies; 69+ messages in thread
From: Michal Hocko @ 2019-08-16 12:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Peter Zijlstra, Ingo Molnar, Andrew Morton, David Rientjes,
	Christian König, Jérôme Glisse, Masahiro Yamada,
	Wei Wang, Andy Shevchenko, Thomas Gleixner, Jann Horn, Feng Tang,
	Kees Cook, Randy Dunlap, Daniel Vetter

On Fri 16-08-19 09:19:06, Jason Gunthorpe wrote:
> On Fri, Aug 16, 2019 at 10:10:29AM +0200, Michal Hocko wrote:
> > On Thu 15-08-19 17:13:23, Jason Gunthorpe wrote:
> > > On Thu, Aug 15, 2019 at 09:35:26PM +0200, Michal Hocko wrote:
> > > 
> > > > > The last detail is I'm still unclear what a GFP flags a blockable
> > > > > invalidate_range_start() should use. Is GFP_KERNEL OK?
> > > > 
> > > > I hope I will not make this muddy again ;)
> > > > invalidate_range_start in the blockable mode can use/depend on any sleepable
> > > > allocation allowed in the context it is called from. 
> > > 
> > > 'in the context is is called from' is the magic phrase, as
> > > invalidate_range_start is called while holding several different mm
> > > related locks. I know at least write mmap_sem and i_mmap_rwsem
> > > (write?)
> > > 
> > > Can GFP_KERNEL be called while holding those locks?
> > 
> > i_mmap_rwsem would be problematic because it is taken during the
> > reclaim.
> 
> Okay.. So the fs_reclaim debugging does catch errors.

I do not think fs_reclaim is the udnerlying mechanism to catch this
deadlock. It is a simple AA deadlock. You take i_mmap_rwsem and then
go down the allocation path, direct reclaim and take the lock again.
Nothing really surprising. fs_reclaim is really to catch GFP_NOFS
context calling into a less restricted (e.g. GFP_KERNEL allocation
context).

> Do you have any
> reference for what a false positive looks like? 

I believe I have given some examples when introducing __GFP_NOLOCKDEP.
 
> I would like to inject it into the notifier path as this is very
> difficult for driver authors to discover and know about, but I'm
> worried about your false positive remark.
> 
> I think I understand we can use only GFP_ATOMIC in the notifiers, but
> we need a strategy to handle OOM to guarentee forward progress.

Your example is from the notifier registration IIUC. Can you
pre-allocate before taking locks? Could you point me to some examples
when the allocation is necessary in the range notifier callback?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16 12:12                                     ` Jason Gunthorpe
@ 2019-08-16 14:11                                       ` Daniel Vetter
  2019-08-16 14:38                                         ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-16 14:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Fri, Aug 16, 2019 at 2:12 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Aug 16, 2019 at 08:20:55AM +0200, Daniel Vetter wrote:
> > On Fri, Aug 16, 2019 at 3:00 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Thu, Aug 15, 2019 at 10:49:31PM +0200, Daniel Vetter wrote:
> > > > On Thu, Aug 15, 2019 at 10:27 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > > > On Thu, Aug 15, 2019 at 10:16:43PM +0200, Daniel Vetter wrote:
> > > > > > So if someone can explain to me how that works with lockdep I can of
> > > > > > course implement it. But afaics that doesn't exist (I tried to explain
> > > > > > that somewhere else already), and I'm no really looking forward to
> > > > > > hacking also on lockdep for this little series.
> > > > >
> > > > > Hmm, kind of looks like it is done by calling preempt_disable()
> > > >
> > > > Yup. That was v1, then came the suggestion that disabling preemption
> > > > is maybe not the best thing (the oom reaper could still run for a long
> > > > time comparatively, if it's cleaning out gigabytes of process memory
> > > > or what not, hence this dedicated debug infrastructure).
> > >
> > > Oh, I'm coming in late, sorry
> > >
> > > Anyhow, I was thinking since we agreed this can trigger on some
> > > CONFIG_DEBUG flag, something like
> > >
> > >     /* This is a sleepable region, but use preempt_disable to get debugging
> > >      * for calls that are not allowed to block for OOM [.. insert
> > >      * Michal's explanation.. ] */
> > >     if (IS_ENABLED(CONFIG_DEBUG_ATOMIC_SLEEP) && !mmu_notifier_range_blockable(range))
> > >         preempt_disable();
> > >     ops->invalidate_range_start();
> >
> > I think we also discussed that, and some expressed concerns it would
> > change behaviour/timing too much for testing. Since this does does
> > disable preemption for real, not just for might_sleep.
>
> I don't follow, this is a debug kernel, it will have widly different
> timing.
>
> Further the point of this debugging on atomic_sleep is to be as
> timing-independent as possible since functions with rare sleeps should
> be guarded by might_sleep() in their common paths.
>
> I guess I don't get the push to have some low overhead debugging for
> this? Is there something special you are looking for?

Don't ask me, I'm just trying to get _some_ debugging for this in. I
don't care one bit how much overhead it has because in our CI our
debug build has lockdep and everything and it crawls anyway. I started
out with the preempt_disable/enable thing like you suggested, and a
few rounds later we're here. We can go back to v1 on this one, but I'd
prefer to not do the lap too often.

Also, aside from this patch (which is prep for the next) and some
simple reordering conflicts they're all independent. So if there's no
way to paint this bikeshed here (technicolor perhaps?) then I'd like
to get at least the others considered.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16 14:11                                       ` Daniel Vetter
@ 2019-08-16 14:38                                         ` Jason Gunthorpe
  2019-08-16 16:36                                           ` Daniel Vetter
  0 siblings, 1 reply; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-16 14:38 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> Also, aside from this patch (which is prep for the next) and some
> simple reordering conflicts they're all independent. So if there's no
> way to paint this bikeshed here (technicolor perhaps?) then I'd like
> to get at least the others considered.

Sure, I think for conflict avoidance reasons I'm probably taking
mmu_notifier stuff via hmm.git, so:

- Andrew had a minor remark on #1, I am ambivalent and would take it
  as-is. Your decision if you want to respin.
- #2/#3 is this issue, I would stand by the preempt_disable/etc path
  Our situation matches yours, debug tests run lockdep/etc.
- #4 I like a lot, except the map should enclose range_end too,
  this can be done after the mm_has_notifiers inside the
  __mmu_notifier function
  Can you respin?
  I will propose preloading the map in another patch
- #5 is already applied in -rc

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16 14:38                                         ` Jason Gunthorpe
@ 2019-08-16 16:36                                           ` Daniel Vetter
  2019-08-16 16:54                                             ` Jason Gunthorpe
  0 siblings, 1 reply; 69+ messages in thread
From: Daniel Vetter @ 2019-08-16 16:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> > Also, aside from this patch (which is prep for the next) and some
> > simple reordering conflicts they're all independent. So if there's no
> > way to paint this bikeshed here (technicolor perhaps?) then I'd like
> > to get at least the others considered.
>
> Sure, I think for conflict avoidance reasons I'm probably taking
> mmu_notifier stuff via hmm.git, so:
>
> - Andrew had a minor remark on #1, I am ambivalent and would take it
>   as-is. Your decision if you want to respin.

I like mine better, see also the reply from Ralph Campbell.

> - #2/#3 is this issue, I would stand by the preempt_disable/etc path
>   Our situation matches yours, debug tests run lockdep/etc.

Since Michal requested the current flavour I think we need spin a bit
more on these here. I guess I'll just rebase them to the end so
they're not holding up the others.

> - #4 I like a lot, except the map should enclose range_end too,
>   this can be done after the mm_has_notifiers inside the
>   __mmu_notifier function

To make sure I get this right: The same lockdep context, but also
wrapped around invalidate_range_end? From my understanding of pte
zapping that makes sense, but I'm definitely not well-versed enough
for that.

>   Can you respin?

Will do.

>   I will propose preloading the map in another patch
> - #5 is already applied in -rc

Yup, I'll drop that one.

Thanks, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Intel-gfx] [PATCH 2/5] kernel.h: Add non_block_start/end()
  2019-08-16 16:36                                           ` Daniel Vetter
@ 2019-08-16 16:54                                             ` Jason Gunthorpe
  0 siblings, 0 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-16 16:54 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Michal Hocko, Feng Tang, Randy Dunlap, Kees Cook,
	Masahiro Yamada, Peter Zijlstra, Intel Graphics Development,
	Jann Horn, LKML, DRI Development, Linux MM,
	Jérôme Glisse, Ingo Molnar, Thomas Gleixner,
	David Rientjes, Wei Wang, Daniel Vetter, Andrew Morton,
	Andy Shevchenko, Christian König

On Fri, Aug 16, 2019 at 06:36:52PM +0200, Daniel Vetter wrote:
> On Fri, Aug 16, 2019 at 4:38 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Fri, Aug 16, 2019 at 04:11:34PM +0200, Daniel Vetter wrote:
> > > Also, aside from this patch (which is prep for the next) and some
> > > simple reordering conflicts they're all independent. So if there's no
> > > way to paint this bikeshed here (technicolor perhaps?) then I'd like
> > > to get at least the others considered.
> >
> > Sure, I think for conflict avoidance reasons I'm probably taking
> > mmu_notifier stuff via hmm.git, so:
> >
> > - Andrew had a minor remark on #1, I am ambivalent and would take it
> >   as-is. Your decision if you want to respin.
> 
> I like mine better, see also the reply from Ralph Campbell.

Sure

> > - #2/#3 is this issue, I would stand by the preempt_disable/etc path
> >   Our situation matches yours, debug tests run lockdep/etc.
>
> Since Michal requested the current flavour I think we need spin a bit
> more on these here. I guess I'll just rebase them to the end so
> they're not holding up the others.
> 
> > - #4 I like a lot, except the map should enclose range_end too,
> >   this can be done after the mm_has_notifiers inside the
> >   __mmu_notifier function
> 
> To make sure I get this right: The same lockdep context, but also
> wrapped around invalidate_range_end? 

Yes, the locking context of _range_start and _range_end should be
identical, last time I checked callers this was the case.

So, just add it to __mmu_notifier_invalidate_range_end() outside the
SRCU as there is no reason to burden debug kernel callers twice when
mmu notifiers are not enabled

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail
  2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
  2019-08-14 22:14   ` Andrew Morton
@ 2019-08-16 17:19   ` Jason Gunthorpe
  1 sibling, 0 replies; 69+ messages in thread
From: Jason Gunthorpe @ 2019-08-16 17:19 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: LKML, linux-mm, DRI Development, Intel Graphics Development,
	Andrew Morton, Michal Hocko, Christian König,
	David Rientjes, Jérôme Glisse, Paolo Bonzini,
	Daniel Vetter

On Wed, Aug 14, 2019 at 10:20:23PM +0200, Daniel Vetter wrote:
> Just a bit of paranoia, since if we start pushing this deep into
> callchains it's hard to spot all places where an mmu notifier
> implementation might fail when it's not allowed to.
> 
> Inspired by some confusion we had discussing i915 mmu notifiers and
> whether we could use the newly-introduced return value to handle some
> corner cases. Until we realized that these are only for when a task
> has been killed by the oom reaper.
> 
> An alternative approach would be to split the callback into two
> versions, one with the int return value, and the other with void
> return value like in older kernels. But that's a lot more churn for
> fairly little gain I think.
> 
> Summary from the m-l discussion on why we want something at warning
> level: This allows automated tooling in CI to catch bugs without
> humans having to look at everything. If we just upgrade the existing
> pr_info to a pr_warn, then we'll have false positives. And as-is, no
> one will ever spot the problem since it's lost in the massive amounts
> of overall dmesg noise.
> 
> v2: Drop the full WARN_ON backtrace in favour of just a pr_warn for
> the problematic case (Michal Hocko).
> 
> v3: Rebase on top of Glisse's arg rework.
> 
> v4: More rebase on top of Glisse reworking everything.
> 
> v5: Fixup rebase damage and also catch failures != EAGAIN for
> !blockable (Jason). Also go back to WARN_ON as requested by Jason, so
> automatic checkers can easily catch bugs by setting panic_on_warn.
> 
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: "Christian König" <christian.koenig@amd.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> ---
>  mm/mmu_notifier.c | 2 ++
>  1 file changed, 2 insertions(+)

Applied to hmm.git, thanks

Jason


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable
       [not found]       ` <20190815123556.GB21596@ziepe.ca>
@ 2019-08-17 16:09         ` Daniel Vetter
  0 siblings, 0 replies; 69+ messages in thread
From: Daniel Vetter @ 2019-08-17 16:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: LKML, Linux MM, DRI Development, Intel Graphics Development,
	Andrew Morton, Michal Hocko, David Rientjes,
	Christian König, Jérôme Glisse, Daniel Vetter

On Sat, Aug 17, 2019 at 5:26 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Thu, Aug 15, 2019 at 09:02:49AM +0200, Daniel Vetter wrote:
> > On Wed, Aug 14, 2019 at 09:00:29PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Aug 14, 2019 at 10:20:25PM +0200, Daniel Vetter wrote:
> > > > We need to make sure implementations don't cheat and don't have a
> > > > possible schedule/blocking point deeply burried where review can't
> > > > catch it.
> > > >
> > > > I'm not sure whether this is the best way to make sure all the
> > > > might_sleep() callsites trigger, and it's a bit ugly in the code flow.
> > > > But it gets the job done.
> > > >
> > > > Inspired by an i915 patch series which did exactly that, because the
> > > > rules haven't been entirely clear to us.
> > >
> > > I thought lockdep already was able to detect:
> > >
> > >  spin_lock()
> > >  might_sleep();
> > >  spin_unlock()
> > >
> > > Am I mistaken? If yes, couldn't this patch just inject a dummy lockdep
> > > spinlock?
> >
> > Hm ... assuming I didn't get lost in the maze I think might_sleep (well
> > ___might_sleep) doesn't do any lockdep checking at all. And we want
> > might_sleep, since that catches a lot more than lockdep.
>
> Don't know how it works, but it sure looks like it does:
>
> This:
>         spin_lock(&file->uobjects_lock);
>         down_read(&file->hw_destroy_rwsem);
>         up_read(&file->hw_destroy_rwsem);
>         spin_unlock(&file->uobjects_lock);
>
> Causes:
>
> [   33.324729] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:1444
> [   33.325599] in_atomic(): 1, irqs_disabled(): 0, pid: 247, name: ibv_devinfo
> [   33.326115] 3 locks held by ibv_devinfo/247:
> [   33.326556]  #0: 000000009edf8379 (&uverbs_dev->disassociate_srcu){....}, at: ib_uverbs_open+0xff/0x5f0 [ib_uverbs]
> [   33.327657]  #1: 000000005e0eddf1 (&uverbs_dev->lists_mutex){+.+.}, at: ib_uverbs_open+0x16c/0x5f0 [ib_uverbs]
> [   33.328682]  #2: 00000000505f509e (&(&file->uobjects_lock)->rlock){+.+.}, at: ib_uverbs_open+0x31a/0x5f0 [ib_uverbs]
>
> And this:
>
>         spin_lock(&file->uobjects_lock);
>         might_sleep();
>         spin_unlock(&file->uobjects_lock);
>
> Causes:
>
> [   16.867211] BUG: sleeping function called from invalid context at drivers/infiniband/core/uverbs_main.c:1095
> [   16.867776] in_atomic(): 1, irqs_disabled(): 0, pid: 245, name: ibv_devinfo
> [   16.868098] 3 locks held by ibv_devinfo/245:
> [   16.868383]  #0: 000000004c5954ff (&uverbs_dev->disassociate_srcu){....}, at: ib_uverbs_open+0xf8/0x600 [ib_uverbs]
> [   16.868938]  #1: 0000000020a6fae2 (&uverbs_dev->lists_mutex){+.+.}, at: ib_uverbs_open+0x16c/0x600 [ib_uverbs]
> [   16.869568]  #2: 00000000036e6a97 (&(&file->uobjects_lock)->rlock){+.+.}, at: ib_uverbs_open+0x317/0x600 [ib_uverbs]
>
> I think this is done in some very expensive way, so it probably only
> works when lockdep is enabled..

This is the might_sleep debug infrastructure (both of them), not
lockdep. Disable CONFIG_PROVE_LOCKING and you should still get these.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, back to index

Thread overview: 69+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-14 20:20 [PATCH 0/5] hmm & mmu_notifier debug/lockdep annotations Daniel Vetter
2019-08-14 20:20 ` [PATCH 1/5] mm: Check if mmu notifier callbacks are allowed to fail Daniel Vetter
2019-08-14 22:14   ` Andrew Morton
2019-08-14 23:22     ` Jason Gunthorpe
2019-08-14 23:34     ` Ralph Campbell
2019-08-16 17:19   ` Jason Gunthorpe
2019-08-14 20:20 ` [PATCH 2/5] kernel.h: Add non_block_start/end() Daniel Vetter
2019-08-14 20:45   ` Andrew Morton
2019-08-15  6:52     ` Daniel Vetter
2019-08-15  8:44     ` Michal Hocko
2019-08-15 13:04       ` Jason Gunthorpe
2019-08-15 13:12         ` Daniel Vetter
2019-08-15 14:37           ` Jason Gunthorpe
2019-08-15 14:43             ` Daniel Vetter
2019-08-15 15:10               ` Jason Gunthorpe
2019-08-15 16:25                 ` Daniel Vetter
2019-08-15 17:35                   ` Jason Gunthorpe
2019-08-15 17:39                     ` Jerome Glisse
2019-08-15 18:01                       ` Jason Gunthorpe
2019-08-15 18:27                         ` Jerome Glisse
2019-08-15 18:57                           ` Jason Gunthorpe
2019-08-15 16:32                 ` Jerome Glisse
2019-08-15 17:16                   ` Jason Gunthorpe
2019-08-15 17:21                     ` Daniel Vetter
2019-08-15 17:35                       ` Jerome Glisse
2019-08-15 13:24         ` Michal Hocko
2019-08-15 22:15       ` Andrew Morton
2019-08-16  8:24         ` Michal Hocko
2019-08-14 23:58   ` Jason Gunthorpe
2019-08-15  6:58     ` Daniel Vetter
2019-08-15 12:23       ` Jason Gunthorpe
2019-08-15 13:21         ` Michal Hocko
2019-08-15 14:12           ` Jason Gunthorpe
2019-08-15 16:00             ` Michal Hocko
2019-08-15 16:56               ` Jason Gunthorpe
2019-08-15 17:11                 ` Jerome Glisse
2019-08-15 17:17                   ` Jason Gunthorpe
2019-08-15 17:42                 ` Michal Hocko
2019-08-15 17:57                   ` Jerome Glisse
2019-08-15 18:24                   ` Jason Gunthorpe
2019-08-15 19:05                     ` Michal Hocko
2019-08-15 19:18                       ` Jason Gunthorpe
2019-08-15 19:35                         ` Michal Hocko
2019-08-15 20:13                           ` Jason Gunthorpe
2019-08-16  8:10                             ` Michal Hocko
2019-08-16 12:19                               ` Jason Gunthorpe
2019-08-16 12:26                                 ` Michal Hocko
2019-08-15 20:16                           ` [Intel-gfx] " Daniel Vetter
2019-08-15 20:27                             ` Jason Gunthorpe
2019-08-15 20:49                               ` Daniel Vetter
2019-08-16  1:00                                 ` Jason Gunthorpe
2019-08-16  6:20                                   ` Daniel Vetter
2019-08-16 12:12                                     ` Jason Gunthorpe
2019-08-16 14:11                                       ` Daniel Vetter
2019-08-16 14:38                                         ` Jason Gunthorpe
2019-08-16 16:36                                           ` Daniel Vetter
2019-08-16 16:54                                             ` Jason Gunthorpe
2019-08-16  8:27                             ` Michal Hocko
2019-08-14 20:20 ` [PATCH 3/5] mm, notifier: Catch sleeping/blocking for !blockable Daniel Vetter
2019-08-15  0:00   ` Jason Gunthorpe
2019-08-15  7:02     ` Daniel Vetter
     [not found]       ` <20190815123556.GB21596@ziepe.ca>
2019-08-17 16:09         ` Daniel Vetter
2019-08-14 20:20 ` [PATCH 4/5] mm, notifier: Add a lockdep map for invalidate_range_start Daniel Vetter
2019-08-15  0:09   ` Jason Gunthorpe
2019-08-15  7:10     ` Daniel Vetter
2019-08-15 12:53       ` Jason Gunthorpe
2019-08-14 20:20 ` [PATCH 5/5] mm/hmm: WARN on illegal ->sync_cpu_device_pagetables errors Daniel Vetter
2019-08-15  0:11   ` Jason Gunthorpe
2019-08-15  7:14     ` Daniel Vetter

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org linux-mm@archiver.kernel.org
	public-inbox-index linux-mm


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/ public-inbox