LKML Archive on lore.kernel.org
* bcachefs status update (it's done cooking; let's get this sucker merged)
@ 2019-06-10 19:14 Kent Overstreet
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
                   ` (13 more replies)
  0 siblings, 14 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache
  Cc: Kent Overstreet, Dave Chinner, Darrick J. Wong,
	Zach Brown, Peter Zijlstra, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Linus Torvalds, Tejun Heo

Last status update: https://lkml.org/lkml/2018/12/2/46

Current status - I'm pretty much running out of things to polish and excuses to
keep tinkering. The core featureset is _done_ and the list of known outstanding
bugs is getting to be short and unexciting. The next big things on my todo list
are finishing erasure coding and reflink, but there's no reason for merging to
wait on those.

So. Here's my bcachefs-for-review branch - this has the minimal set of patches
outside of fs/bcachefs/. My master branch has some performance optimizations for
the core buffered IO paths, but those are fairly tricky and invasive so I want
to hold off on those for now - this branch is intended to be more or less
suitable for merging as is.

https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review

The list of non bcachefs patches is:

closures: fix a race on wakeup from closure_sync
closures: closure_wait_event()
bcache: move closures to lib/
bcache: optimize continue_at_nobarrier()
block: Add some exports for bcachefs
Propagate gfp_t when allocating pte entries from __vmalloc
fs: factor out d_mark_tmpfile()
fs: insert_inode_locked2()
mm: export find_get_pages()
mm: pagecache add lock
locking: SIX locks (shared/intent/exclusive)
Compiler Attributes: add __flatten

Most of the patches are pretty small, of the ones that aren't:

 - SIX locks have already been discussed, and seem to be pretty uncontroversial.

 - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
   page cache inconsistencies with dio and other operations, in particular
   racing vs. page faults. Honestly, it's criminal that we still don't have a
   mechanism in the kernel to address this; other filesystems are susceptible to
   these kinds of bugs too.

   My patch is intentionally ugly in the hopes that someone else will come up
   with a magically elegant solution, but in the meantime it's an "it's ugly but
   it works" sort of thing. I suspect that in real world scenarios it's going to
   beat any kind of range locking performance wise, which is the only
   alternative I've heard discussed.
   
 - Propagate gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
   GFP_NOFS, that's all it is.

 - and, moving closures out of drivers/md/bcache to lib/. 

The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
be mailing out all of that as patches, but if any code reviewers have
suggestions on what would make reviewing it easier, go ahead and speak up. The last
time I was mailing things out for review the main thing that came up was ioctls,
but the ioctl interface hasn't really changed since then. I'm pretty confident
in the on disk format stuff, which was the other thing that was mentioned.

----------

This has been a monumental effort over a lot of years, and I'm _really_ happy
with how it's turned out. I'm excited to finally unleash this upon the world.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 01/12] Compiler Attributes: add __flatten
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-12 17:16   ` Greg KH
  2019-06-10 19:14 ` [PATCH 02/12] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/compiler_attributes.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index 6b318efd8a..48b2c6ae6f 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -253,4 +253,9 @@
  */
 #define __weak                          __attribute__((__weak__))
 
+/*
+ *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-flatten-function-attribute
+ */
+#define __flatten __attribute__((flatten))
+
 #endif /* __LINUX_COMPILER_ATTRIBUTES_H */
-- 
2.20.1
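For illustration, here's roughly how the new attribute is meant to be used. The `sum3`/`add` functions are made-up examples, not from the patch; flatten only affects codegen (inlining every call inside the marked function where possible), so behavior is unchanged:

```c
/* Hypothetical usage example for the __flatten attribute added above. */
#define __flatten __attribute__((flatten))

static int add(int a, int b)
{
	return a + b;
}

__flatten static int sum3(int a, int b, int c)
{
	/* with flatten, both calls to add() are inlined into sum3() */
	return add(add(a, b), c);
}
```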



* [PATCH 02/12] locking: SIX locks (shared/intent/exclusive)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 03/12] mm: pagecache add lock Kent Overstreet
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

New lock for bcachefs, like read/write locks but with a third state,
intent.

Intent locks conflict with each other, but not with read locks; taking a
write lock requires first holding an intent lock.
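The compatibility rules above can be sketched as a tiny userspace model - hypothetical illustration only, the `six_model` type and helpers below are not part of the patch: reads conflict only with write, intent conflicts with other intent, and write (which requires intent) conflicts with readers.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of SIX lock compatibility (illustrative, not the kernel code). */
struct six_model {
	int	readers;
	bool	intent;
	bool	write;
};

static bool model_trylock_read(struct six_model *l)
{
	if (l->write)		/* read conflicts only with write */
		return false;
	l->readers++;
	return true;
}

static bool model_trylock_intent(struct six_model *l)
{
	if (l->intent)		/* intent conflicts with other intent holders */
		return false;
	l->intent = true;
	return true;
}

static bool model_trylock_write(struct six_model *l)
{
	assert(l->intent);	/* write requires intent held first */
	if (l->readers)		/* write conflicts with readers */
		return false;
	l->write = true;
	return true;
}
```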

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/six.h     | 192 +++++++++++++++
 kernel/Kconfig.locks    |   3 +
 kernel/locking/Makefile |   1 +
 kernel/locking/six.c    | 512 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 708 insertions(+)
 create mode 100644 include/linux/six.h
 create mode 100644 kernel/locking/six.c

diff --git a/include/linux/six.h b/include/linux/six.h
new file mode 100644
index 0000000000..0fb1b2f493
--- /dev/null
+++ b/include/linux/six.h
@@ -0,0 +1,192 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_SIX_H
+#define _LINUX_SIX_H
+
+/*
+ * Shared/intent/exclusive locks: sleepable read/write locks, much like rw
+ * semaphores, except with a third intermediate state, intent. Basic operations
+ * are:
+ *
+ * six_lock_read(&foo->lock);
+ * six_unlock_read(&foo->lock);
+ *
+ * six_lock_intent(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ *
+ * Intent locks block other intent locks, but do not block read locks, and you
+ * must have an intent lock held before taking a write lock, like so:
+ *
+ * six_lock_intent(&foo->lock);
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * Other operations:
+ *
+ *   six_trylock_read()
+ *   six_trylock_intent()
+ *   six_trylock_write()
+ *
+ *   six_lock_downgrade():	convert from intent to read
+ *   six_lock_tryupgrade():	attempt to convert from read to intent
+ *
+ * Locks also embed a sequence number, which is incremented when the lock is
+ * locked or unlocked for write. The current sequence number can be grabbed
+ * while a lock is held from lock->state.seq; then, if you drop the lock you can
+ * use six_relock_(read|intent|write)(lock, seq) to attempt to retake the lock
+ * iff it hasn't been locked for write in the meantime.
+ *
+ * There are also operations that take the lock type as a parameter, where the
+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
+ *
+ *   six_lock_type(lock, type)
+ *   six_unlock_type(lock, type)
+ *   six_relock(lock, type, seq)
+ *   six_trylock_type(lock, type)
+ *   six_trylock_convert(lock, from, to)
+ *
+ * A lock may be held multiple times by the same thread (for read or intent,
+ * not write). However, the six locks code does _not_ implement the actual
+ * recursive checks itself - rather, if your code (e.g. btree iterator
+ * code) knows that the current thread already has a lock held, and for the
+ * correct type, six_lock_increment() may be used to bump up the counter for
+ * that type - the only effect is that one more call to unlock will be required
+ * before the lock is unlocked.
+ */
+
+#include <linux/lockdep.h>
+#include <linux/osq_lock.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+#define SIX_LOCK_SEPARATE_LOCKFNS
+
+union six_lock_state {
+	struct {
+		atomic64_t	counter;
+	};
+
+	struct {
+		u64		v;
+	};
+
+	struct {
+		/* for waitlist_bitnr() */
+		unsigned long	l;
+	};
+
+	struct {
+		unsigned	read_lock:28;
+		unsigned	intent_lock:1;
+		unsigned	waiters:3;
+		/*
+		 * seq works much like in seqlocks: it's incremented every time
+		 * we lock and unlock for write.
+		 *
+		 * If it's odd, a write lock is held; if even, it's unlocked.
+		 *
+		 * Thus readers can unlock, and then lock again later iff it
+		 * hasn't been modified in the meantime.
+		 */
+		u32		seq;
+	};
+};
+
+enum six_lock_type {
+	SIX_LOCK_read,
+	SIX_LOCK_intent,
+	SIX_LOCK_write,
+};
+
+struct six_lock {
+	union six_lock_state	state;
+	unsigned		intent_lock_recurse;
+	struct task_struct	*owner;
+	struct optimistic_spin_queue osq;
+
+	raw_spinlock_t		wait_lock;
+	struct list_head	wait_list[2];
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+static __always_inline void __six_lock_init(struct six_lock *lock,
+					    const char *name,
+					    struct lock_class_key *key)
+{
+	atomic64_set(&lock->state.counter, 0);
+	raw_spin_lock_init(&lock->wait_lock);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *) lock, sizeof(*lock));
+	lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+}
+
+#define six_lock_init(lock)						\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__six_lock_init((lock), #lock, &__key);				\
+} while (0)
+
+#define __SIX_VAL(field, _v)	(((union six_lock_state) { .field = _v }).v)
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *);				\
+bool six_relock_##type(struct six_lock *, u32);				\
+void six_lock_##type(struct six_lock *);				\
+void six_unlock_##type(struct six_lock *);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#define SIX_LOCK_DISPATCH(type, fn, ...)			\
+	switch (type) {						\
+	case SIX_LOCK_read:					\
+		return fn##_read(__VA_ARGS__);			\
+	case SIX_LOCK_intent:					\
+		return fn##_intent(__VA_ARGS__);		\
+	case SIX_LOCK_write:					\
+		return fn##_write(__VA_ARGS__);			\
+	default:						\
+		BUG();						\
+	}
+
+static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_trylock, lock);
+}
+
+static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+		     unsigned seq)
+{
+	SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
+}
+
+static inline void six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_lock, lock);
+}
+
+static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_unlock, lock);
+}
+
+void six_lock_downgrade(struct six_lock *);
+bool six_lock_tryupgrade(struct six_lock *);
+bool six_trylock_convert(struct six_lock *, enum six_lock_type,
+			 enum six_lock_type);
+
+void six_lock_increment(struct six_lock *, enum six_lock_type);
+
+#endif /* _LINUX_SIX_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index fbba478ae5..ff3e6121ae 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -251,3 +251,6 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
 	def_bool y if ARCH_USE_QUEUED_RWLOCKS
 	depends on SMP
+
+config SIXLOCKS
+	bool
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 392c7f23af..9a73c8564e 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem-xadd.o
 obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
 obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
 obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o
+obj-$(CONFIG_SIXLOCKS) += six.o
diff --git a/kernel/locking/six.c b/kernel/locking/six.c
new file mode 100644
index 0000000000..9fa58b6fad
--- /dev/null
+++ b/kernel/locking/six.c
@@ -0,0 +1,512 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/export.h>
+#include <linux/log2.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/six.h>
+
+#ifdef DEBUG
+#define EBUG_ON(cond)		BUG_ON(cond)
+#else
+#define EBUG_ON(cond)		do {} while (0)
+#endif
+
+#define six_acquire(l, t)	lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l)		lock_release(l, 0, _RET_IP_)
+
+struct six_lock_vals {
+	/* Value we add to the lock in order to take the lock: */
+	u64			lock_val;
+
+	/* If the lock has this value (used as a mask), taking the lock fails: */
+	u64			lock_fail;
+
+	/* Value we add to the lock in order to release the lock: */
+	u64			unlock_val;
+
+	/* Mask that indicates lock is held for this type: */
+	u64			held_mask;
+
+	/* Waitlist we wakeup when releasing the lock: */
+	enum six_lock_type	unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read	__SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent	__SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write	__SIX_VAL(seq, 1)
+
+#define LOCK_VALS {							\
+	[SIX_LOCK_read] = {						\
+		.lock_val	= __SIX_VAL(read_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_write,		\
+		.unlock_val	= -__SIX_VAL(read_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_read,			\
+		.unlock_wakeup	= SIX_LOCK_write,			\
+	},								\
+	[SIX_LOCK_intent] = {						\
+		.lock_val	= __SIX_VAL(intent_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_intent,		\
+		.unlock_val	= -__SIX_VAL(intent_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_intent,		\
+		.unlock_wakeup	= SIX_LOCK_intent,			\
+	},								\
+	[SIX_LOCK_write] = {						\
+		.lock_val	= __SIX_VAL(seq, 1),			\
+		.lock_fail	= __SIX_LOCK_HELD_read,			\
+		.unlock_val	= __SIX_VAL(seq, 1),			\
+		.held_mask	= __SIX_LOCK_HELD_write,		\
+		.unlock_wakeup	= SIX_LOCK_read,			\
+	},								\
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
+				 union six_lock_state old)
+{
+	if (type != SIX_LOCK_intent)
+		return;
+
+	if (!old.intent_lock) {
+		EBUG_ON(lock->owner);
+		lock->owner = current;
+	} else {
+		EBUG_ON(lock->owner != current);
+	}
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+						enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v = READ_ONCE(lock->state.v);
+
+	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+
+	do {
+		old.v = v;
+
+		EBUG_ON(type == SIX_LOCK_write &&
+			((old.v & __SIX_LOCK_HELD_write) ||
+			 !(old.v & __SIX_LOCK_HELD_intent)));
+
+		if (old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	if (!do_six_trylock_type(lock, type))
+		return false;
+
+	six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
+			      unsigned seq)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		old.v = v;
+
+		if (old.seq != seq || old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+struct six_lock_waiter {
+	struct list_head	list;
+	struct task_struct	*task;
+};
+
+/* This is probably up there with the more evil things I've done */
+#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+static inline int six_can_spin_on_owner(struct six_lock *lock)
+{
+	struct task_struct *owner;
+	int retval = 1;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = READ_ONCE(lock->owner);
+	if (owner)
+		retval = owner->on_cpu;
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the lock owner may have just acquired
+	 * it and not set the owner yet, or the lock may have been released.
+	 */
+	return retval;
+}
+
+static inline bool six_spin_on_owner(struct six_lock *lock,
+				     struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (lock->owner == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	struct task_struct *task = current;
+
+	if (type == SIX_LOCK_write)
+		return false;
+
+	preempt_disable();
+	if (!six_can_spin_on_owner(lock))
+		goto fail;
+
+	if (!osq_lock(&lock->osq))
+		goto fail;
+
+	while (1) {
+		struct task_struct *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = READ_ONCE(lock->owner);
+		if (owner && !six_spin_on_owner(lock, owner))
+			break;
+
+		if (do_six_trylock_type(lock, type)) {
+			osq_unlock(&lock->osq);
+			preempt_enable();
+			return true;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+
+	osq_unlock(&lock->osq);
+fail:
+	preempt_enable();
+
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock again. This avoids getting
+	 * scheduled out right after we obtained the lock.
+	 */
+	if (need_resched())
+		schedule();
+
+	return false;
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	return false;
+}
+
+#endif
+
+noinline
+static void __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	struct six_lock_waiter wait;
+	u64 v;
+
+	if (six_optimistic_spin(lock, type))
+		return;
+
+	lock_contended(&lock->dep_map, _RET_IP_);
+
+	INIT_LIST_HEAD(&wait.list);
+	wait.task = current;
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (type == SIX_LOCK_write)
+			EBUG_ON(lock->owner != current);
+		else if (list_empty_careful(&wait.list)) {
+			raw_spin_lock(&lock->wait_lock);
+			list_add_tail(&wait.list, &lock->wait_list[type]);
+			raw_spin_unlock(&lock->wait_lock);
+		}
+
+		v = READ_ONCE(lock->state.v);
+		do {
+			new.v = old.v = v;
+
+			if (!(old.v & l[type].lock_fail))
+				new.v += l[type].lock_val;
+			else if (!(new.waiters & (1 << type)))
+				new.waiters |= 1 << type;
+			else
+				break; /* waiting bit already set */
+		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+					old.v, new.v)) != old.v);
+
+		if (!(old.v & l[type].lock_fail))
+			break;
+
+		schedule();
+	}
+
+	six_set_owner(lock, type, old);
+
+	__set_current_state(TASK_RUNNING);
+
+	if (!list_empty_careful(&wait.list)) {
+		raw_spin_lock(&lock->wait_lock);
+		list_del_init(&wait.list);
+		raw_spin_unlock(&lock->wait_lock);
+	}
+}
+
+__always_inline
+static void __six_lock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	six_acquire(&lock->dep_map, 0);
+
+	if (!do_six_trylock_type(lock, type))
+		__six_lock_type_slowpath(lock, type);
+
+	lock_acquired(&lock->dep_map, _RET_IP_);
+}
+
+static inline void six_lock_wakeup(struct six_lock *lock,
+				   union six_lock_state state,
+				   unsigned waitlist_id)
+{
+	struct list_head *wait_list = &lock->wait_list[waitlist_id];
+	struct six_lock_waiter *w, *next;
+
+	if (waitlist_id == SIX_LOCK_write && state.read_lock)
+		return;
+
+	if (!(state.waiters & (1 << waitlist_id)))
+		return;
+
+	clear_bit(waitlist_bitnr(waitlist_id),
+		  (unsigned long *) &lock->state.v);
+
+	if (waitlist_id == SIX_LOCK_write) {
+		struct task_struct *p = READ_ONCE(lock->owner);
+
+		if (p)
+			wake_up_process(p);
+		return;
+	}
+
+	raw_spin_lock(&lock->wait_lock);
+
+	list_for_each_entry_safe(w, next, wait_list, list) {
+		list_del_init(&w->list);
+
+		if (wake_up_process(w->task) &&
+		    waitlist_id != SIX_LOCK_read) {
+			if (!list_empty(wait_list))
+				set_bit(waitlist_bitnr(waitlist_id),
+					(unsigned long *) &lock->state.v);
+			break;
+		}
+	}
+
+	raw_spin_unlock(&lock->wait_lock);
+}
+
+__always_inline __flatten
+static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state state;
+
+	EBUG_ON(!(lock->state.v & l[type].held_mask));
+	EBUG_ON(type == SIX_LOCK_write &&
+		!(lock->state.v & __SIX_LOCK_HELD_intent));
+
+	six_release(&lock->dep_map);
+
+	if (type == SIX_LOCK_intent) {
+		EBUG_ON(lock->owner != current);
+
+		if (lock->intent_lock_recurse) {
+			--lock->intent_lock_recurse;
+			return;
+		}
+
+		lock->owner = NULL;
+	}
+
+	state.v = atomic64_add_return_release(l[type].unlock_val,
+					      &lock->state.counter);
+	six_lock_wakeup(lock, state, l[type].unlock_wakeup);
+}
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *lock)				\
+{									\
+	return __six_trylock_type(lock, SIX_LOCK_##type);		\
+}									\
+EXPORT_SYMBOL_GPL(six_trylock_##type);					\
+									\
+bool six_relock_##type(struct six_lock *lock, u32 seq)			\
+{									\
+	return __six_relock_type(lock, SIX_LOCK_##type, seq);		\
+}									\
+EXPORT_SYMBOL_GPL(six_relock_##type);					\
+									\
+void six_lock_##type(struct six_lock *lock)				\
+{									\
+	__six_lock_type(lock, SIX_LOCK_##type);				\
+}									\
+EXPORT_SYMBOL_GPL(six_lock_##type);					\
+									\
+void six_unlock_##type(struct six_lock *lock)				\
+{									\
+	__six_unlock_type(lock, SIX_LOCK_##type);			\
+}									\
+EXPORT_SYMBOL_GPL(six_unlock_##type);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+
+#undef __SIX_LOCK
+
+/* Convert from intent to read: */
+void six_lock_downgrade(struct six_lock *lock)
+{
+	six_lock_increment(lock, SIX_LOCK_read);
+	six_unlock_intent(lock);
+}
+EXPORT_SYMBOL_GPL(six_lock_downgrade);
+
+bool six_lock_tryupgrade(struct six_lock *lock)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		new.v = old.v = v;
+
+		EBUG_ON(!(old.v & l[SIX_LOCK_read].held_mask));
+
+		new.v += l[SIX_LOCK_read].unlock_val;
+
+		if (new.v & l[SIX_LOCK_intent].lock_fail)
+			return false;
+
+		new.v += l[SIX_LOCK_intent].lock_val;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v, new.v)) != old.v);
+
+	six_set_owner(lock, SIX_LOCK_intent, old);
+	six_lock_wakeup(lock, new, l[SIX_LOCK_read].unlock_wakeup);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(six_lock_tryupgrade);
+
+bool six_trylock_convert(struct six_lock *lock,
+			 enum six_lock_type from,
+			 enum six_lock_type to)
+{
+	EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
+
+	if (to == from)
+		return true;
+
+	if (to == SIX_LOCK_read) {
+		six_lock_downgrade(lock);
+		return true;
+	} else {
+		return six_lock_tryupgrade(lock);
+	}
+}
+EXPORT_SYMBOL_GPL(six_trylock_convert);
+
+/*
+ * Increment read/intent lock count, assuming we already have it read or intent
+ * locked:
+ */
+void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+
+	EBUG_ON(type == SIX_LOCK_write);
+	six_acquire(&lock->dep_map, 0);
+
+	/* XXX: assert already locked, and that we don't overflow: */
+
+	switch (type) {
+	case SIX_LOCK_read:
+		atomic64_add(l[type].lock_val, &lock->state.counter);
+		break;
+	case SIX_LOCK_intent:
+		lock->intent_lock_recurse++;
+		break;
+	case SIX_LOCK_write:
+		BUG();
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(six_lock_increment);
-- 
2.20.1
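As an aside, the sequence-number scheme six.h describes - seq is bumped on every write lock and unlock, so it's odd while write-locked and even otherwise, and a reader can drop the lock and later relock iff seq is unchanged - can be modeled in userspace like so (`seq_model` and its helpers are an illustrative sketch, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of the seq/relock idea (illustrative only). */
struct seq_model {
	uint32_t	seq;
	int		readers;
};

static void model_write_lock(struct seq_model *l)   { l->seq++; }	/* seq now odd */
static void model_write_unlock(struct seq_model *l) { l->seq++; }	/* seq even again */

static bool model_relock_read(struct seq_model *l, uint32_t seq)
{
	if (l->seq != seq)	/* lock was write-locked since we dropped it */
		return false;
	l->readers++;
	return true;
}
```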



* [PATCH 03/12] mm: pagecache add lock
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
  2019-06-10 19:14 ` [PATCH 02/12] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 04/12] mm: export find_get_pages() Kent Overstreet
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Add a per address space lock around adding pages to the pagecache - making it
possible for fallocate INSERT_RANGE/COLLAPSE_RANGE to work correctly, and also
hopefully making truncate and dio a bit saner.
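The core idea of the two-state lock in this patch - a positive count for "add" holders, a negative count for "block" holders, each state shared with itself but excluding the other - can be modeled in userspace like so (`pc_lock_model` is a hypothetical sketch, not the patch's code):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the two-state pagecache lock (illustrative only):
 * v > 0 means "add" holders, v < 0 means "block" holders. */
struct pc_lock_model {
	long v;
};

static bool model_tryget(struct pc_lock_model *l, long i)
{
	/* taking +1 fails while blockers (v < 0) hold it, and vice versa */
	if (i > 0 ? l->v < 0 : l->v > 0)
		return false;
	l->v += i;
	return true;
}
```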

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/inode.c            |  1 +
 include/linux/fs.h    | 24 +++++++++++++
 include/linux/sched.h |  4 +++
 init/init_task.c      |  1 +
 mm/filemap.c          | 81 +++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 108 insertions(+), 3 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9a453f3637..8881dc551f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -350,6 +350,7 @@ EXPORT_SYMBOL(inc_nlink);
 static void __address_space_init_once(struct address_space *mapping)
 {
 	xa_init_flags(&mapping->i_pages, XA_FLAGS_LOCK_IRQ);
+	pagecache_lock_init(&mapping->add_lock);
 	init_rwsem(&mapping->i_mmap_rwsem);
 	INIT_LIST_HEAD(&mapping->private_list);
 	spin_lock_init(&mapping->private_lock);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dd28e76790..a88d994751 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -418,6 +418,28 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
 
+/*
+ * Two-state lock - can be taken for add or block - both states are shared,
+ * like read side of rwsem, but conflict with other state:
+ */
+struct pagecache_lock {
+	atomic_long_t		v;
+	wait_queue_head_t	wait;
+};
+
+static inline void pagecache_lock_init(struct pagecache_lock *lock)
+{
+	atomic_long_set(&lock->v, 0);
+	init_waitqueue_head(&lock->wait);
+}
+
+void pagecache_add_put(struct pagecache_lock *);
+void pagecache_add_get(struct pagecache_lock *);
+void __pagecache_block_put(struct pagecache_lock *);
+void __pagecache_block_get(struct pagecache_lock *);
+void pagecache_block_put(struct pagecache_lock *);
+void pagecache_block_get(struct pagecache_lock *);
+
 /**
  * struct address_space - Contents of a cacheable, mappable object.
  * @host: Owner, either the inode or the block_device.
@@ -452,6 +474,8 @@ struct address_space {
 	spinlock_t		private_lock;
 	struct list_head	private_list;
 	void			*private_data;
+	struct pagecache_lock	add_lock
+		____cacheline_aligned_in_smp;	/* protects adding new pages */
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1549584a15..a46baade99 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -43,6 +43,7 @@ struct io_context;
 struct mempolicy;
 struct nameidata;
 struct nsproxy;
+struct pagecache_lock;
 struct perf_event_context;
 struct pid_namespace;
 struct pipe_inode_info;
@@ -935,6 +936,9 @@ struct task_struct {
 	unsigned int			in_ubsan;
 #endif
 
+	/* currently held lock, for avoiding recursing in fault path: */
+	struct pagecache_lock *pagecache_lock;
+
 	/* Journalling filesystem info: */
 	void				*journal_info;
 
diff --git a/init/init_task.c b/init/init_task.c
index c70ef656d0..92bbb6e909 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -115,6 +115,7 @@ struct task_struct init_task
 	},
 	.blocked	= {{0}},
 	.alloc_lock	= __SPIN_LOCK_UNLOCKED(init_task.alloc_lock),
+	.pagecache_lock = NULL,
 	.journal_info	= NULL,
 	INIT_CPU_TIMERS(init_task)
 	.pi_lock	= __RAW_SPIN_LOCK_UNLOCKED(init_task.pi_lock),
diff --git a/mm/filemap.c b/mm/filemap.c
index d78f577bae..93d7e0e686 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -113,6 +113,73 @@
  *   ->tasklist_lock            (memory_failure, collect_procs_ao)
  */
 
+static void __pagecache_lock_put(struct pagecache_lock *lock, long i)
+{
+	BUG_ON(atomic_long_read(&lock->v) == 0);
+
+	if (atomic_long_sub_return_release(i, &lock->v) == 0)
+		wake_up_all(&lock->wait);
+}
+
+static bool __pagecache_lock_tryget(struct pagecache_lock *lock, long i)
+{
+	long v = atomic_long_read(&lock->v), old;
+
+	do {
+		old = v;
+
+		if (i > 0 ? v < 0 : v > 0)
+			return false;
+	} while ((v = atomic_long_cmpxchg_acquire(&lock->v,
+					old, old + i)) != old);
+	return true;
+}
+
+static void __pagecache_lock_get(struct pagecache_lock *lock, long i)
+{
+	wait_event(lock->wait, __pagecache_lock_tryget(lock, i));
+}
+
+void pagecache_add_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_put);
+
+void pagecache_add_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, 1);
+}
+EXPORT_SYMBOL(pagecache_add_get);
+
+void __pagecache_block_put(struct pagecache_lock *lock)
+{
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_put);
+
+void __pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+}
+EXPORT_SYMBOL(__pagecache_block_get);
+
+void pagecache_block_put(struct pagecache_lock *lock)
+{
+	BUG_ON(current->pagecache_lock != lock);
+	current->pagecache_lock = NULL;
+	__pagecache_lock_put(lock, -1);
+}
+EXPORT_SYMBOL(pagecache_block_put);
+
+void pagecache_block_get(struct pagecache_lock *lock)
+{
+	__pagecache_lock_get(lock, -1);
+	BUG_ON(current->pagecache_lock);
+	current->pagecache_lock = lock;
+}
+EXPORT_SYMBOL(pagecache_block_get);
+
 static void page_cache_delete(struct address_space *mapping,
 				   struct page *page, void *shadow)
 {
@@ -829,11 +896,14 @@ static int __add_to_page_cache_locked(struct page *page,
 	VM_BUG_ON_PAGE(PageSwapBacked(page), page);
 	mapping_set_update(&xas, mapping);
 
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_get(&mapping->add_lock);
+
 	if (!huge) {
 		error = mem_cgroup_try_charge(page, current->mm,
 					      gfp_mask, &memcg, false);
 		if (error)
-			return error;
+			goto out;
 	}
 
 	get_page(page);
@@ -869,14 +939,19 @@ static int __add_to_page_cache_locked(struct page *page,
 	if (!huge)
 		mem_cgroup_commit_charge(page, memcg, false, false);
 	trace_mm_filemap_add_to_page_cache(page);
-	return 0;
+	error = 0;
+out:
+	if (current->pagecache_lock != &mapping->add_lock)
+		pagecache_add_put(&mapping->add_lock);
+	return error;
 error:
 	page->mapping = NULL;
 	/* Leave page->index set: truncation relies upon it */
 	if (!huge)
 		mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
-	return xas_error(&xas);
+	error = xas_error(&xas);
+	goto out;
 }
 
 /**
-- 
2.20.1


^ permalink raw reply	[flat|nested] 63+ messages in thread
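The pagecache add-lock in the patch above is a signed counter: positive values mean "add" holders, negative values mean "block" holders, and a taker in one direction must wait while the other direction holds it. A userspace model of that counting scheme (hypothetical names; a sketch only — the kernel version sleeps in wait_event() instead of returning failure):

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace model of the patch's pagecache_lock counting scheme.
 * v > 0: held by "add" takers; v < 0: held by "block" takers;
 * v == 0: unlocked. Both directions are shared among themselves. */
struct pagecache_lock {
	atomic_long v;
};

/* Try to take the lock in direction i (+1 = add, -1 = block).
 * Fails if the lock is held in the opposite direction. */
static int pagecache_lock_tryget(struct pagecache_lock *lock, long i)
{
	long old = atomic_load(&lock->v);

	do {
		if (i > 0 ? old < 0 : old > 0)
			return 0; /* held the other way */
	} while (!atomic_compare_exchange_weak(&lock->v, &old, old + i));

	return 1;
}

static void pagecache_lock_put(struct pagecache_lock *lock, long i)
{
	atomic_fetch_sub(&lock->v, i);
}
```

As in the patch's cmpxchg loop, a failed compare-exchange reloads the observed value and re-checks the sign before retrying.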

* [PATCH 04/12] mm: export find_get_pages()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (2 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 03/12] mm: pagecache add lock Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 05/12] fs: insert_inode_locked2() Kent Overstreet
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Needed for bcachefs

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 mm/filemap.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 93d7e0e686..617168474e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1899,6 +1899,7 @@ unsigned find_get_pages_range(struct address_space *mapping, pgoff_t *start,
 
 	return ret;
 }
+EXPORT_SYMBOL(find_get_pages_range);
 
 /**
  * find_get_pages_contig - gang contiguous pagecache lookup
-- 
2.20.1


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 05/12] fs: insert_inode_locked2()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (3 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 04/12] mm: export find_get_pages() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 06/12] fs: factor out d_mark_tmpfile() Kent Overstreet
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

New helper for bcachefs, so that when we race inserting an inode we can
atomically grab a ref to the inode already in the inode cache.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/inode.c         | 40 ++++++++++++++++++++++++++++++++++++++++
 include/linux/fs.h |  1 +
 2 files changed, 41 insertions(+)

diff --git a/fs/inode.c b/fs/inode.c
index 8881dc551f..cc44f345e0 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1479,6 +1479,46 @@ int insert_inode_locked(struct inode *inode)
 }
 EXPORT_SYMBOL(insert_inode_locked);
 
+struct inode *insert_inode_locked2(struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+	ino_t ino = inode->i_ino;
+	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+
+	while (1) {
+		struct inode *old = NULL;
+		spin_lock(&inode_hash_lock);
+		hlist_for_each_entry(old, head, i_hash) {
+			if (old->i_ino != ino)
+				continue;
+			if (old->i_sb != sb)
+				continue;
+			spin_lock(&old->i_lock);
+			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&old->i_lock);
+				continue;
+			}
+			break;
+		}
+		if (likely(!old)) {
+			spin_lock(&inode->i_lock);
+			inode->i_state |= I_NEW | I_CREATING;
+			hlist_add_head(&inode->i_hash, head);
+			spin_unlock(&inode->i_lock);
+			spin_unlock(&inode_hash_lock);
+			return NULL;
+		}
+		__iget(old);
+		spin_unlock(&old->i_lock);
+		spin_unlock(&inode_hash_lock);
+		wait_on_inode(old);
+		if (unlikely(!inode_unhashed(old)))
+			return old;
+		iput(old);
+	}
+}
+EXPORT_SYMBOL(insert_inode_locked2);
+
 int insert_inode_locked4(struct inode *inode, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a88d994751..d5d12d6981 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3010,6 +3010,7 @@ extern struct inode *find_inode_nowait(struct super_block *,
 				       void *data);
 extern int insert_inode_locked4(struct inode *, unsigned long, int (*test)(struct inode *, void *), void *);
 extern int insert_inode_locked(struct inode *);
+extern struct inode *insert_inode_locked2(struct inode *);
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 extern void lockdep_annotate_inode_mutex_key(struct inode *inode);
 #else
-- 
2.20.1


^ permalink raw reply	[flat|nested] 63+ messages in thread
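The intended calling convention for the helper above: insert_inode_locked2() returns NULL when our inode won the insert (and is now hashed with I_NEW|I_CREATING set), or a referenced inode when we lost the race. A hedged sketch of a hypothetical caller — this function is not from the patch:

```c
/* Sketch (hypothetical caller, not from the patch): on a lost race,
 * drop our half-built inode and use the one already in the cache. */
static struct inode *my_fs_insert_inode(struct inode *inode)
{
	struct inode *old = insert_inode_locked2(inode);

	if (old) {
		/* lost the race: @old is hashed and referenced */
		iput(inode);
		return old;
	}
	/* won the race: @inode is hashed with I_NEW|I_CREATING;
	 * finish setup, then call unlock_new_inode(inode) */
	return inode;
}
```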

* [PATCH 06/12] fs: factor out d_mark_tmpfile()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (4 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 05/12] fs: insert_inode_locked2() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc Kent Overstreet
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile() does; it handles i_nlink
on its own, atomically with other btree updates.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 fs/dcache.c            | 10 ++++++++--
 include/linux/dcache.h |  1 +
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index aac41adf47..18edb4e5bc 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3042,9 +3042,8 @@ void d_genocide(struct dentry *parent)
 
 EXPORT_SYMBOL(d_genocide);
 
-void d_tmpfile(struct dentry *dentry, struct inode *inode)
+void d_mark_tmpfile(struct dentry *dentry, struct inode *inode)
 {
-	inode_dec_link_count(inode);
 	BUG_ON(dentry->d_name.name != dentry->d_iname ||
 		!hlist_unhashed(&dentry->d_u.d_alias) ||
 		!d_unlinked(dentry));
@@ -3054,6 +3053,13 @@ void d_tmpfile(struct dentry *dentry, struct inode *inode)
 				(unsigned long long)inode->i_ino);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct dentry *dentry, struct inode *inode)
+{
+	inode_dec_link_count(inode);
+	d_mark_tmpfile(dentry, inode);
 	d_instantiate(dentry, inode);
 }
 EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 60996e64c5..e0fe330162 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -255,6 +255,7 @@ extern struct dentry * d_make_root(struct inode *);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
 
+extern void d_mark_tmpfile(struct dentry *, struct inode *);
 extern void d_tmpfile(struct dentry *, struct inode *);
 
 extern struct dentry *d_find_alias(struct inode *);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 63+ messages in thread
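With the split above, a filesystem that accounts i_nlink inside its own transaction (as bcachefs does) can perform the dcache-side steps of d_tmpfile() without the link-count decrement. A hypothetical ->tmpfile finish helper, as a sketch only:

```c
/* Sketch (not from the patch): equivalent of d_tmpfile() for a
 * filesystem that already dropped i_nlink atomically with its own
 * btree update, so inode_dec_link_count() must not run again here. */
static void my_fs_finish_tmpfile(struct dentry *dentry, struct inode *inode)
{
	d_mark_tmpfile(dentry, inode);
	d_instantiate(dentry, inode);
}
```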

* [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (5 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 06/12] fs: factor out d_mark_tmpfile() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 08/12] block: Add some exports for bcachefs Kent Overstreet
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

This fixes a lockdep recursion when using __vmalloc() from contexts
that aren't GFP_KERNEL-safe.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 arch/alpha/include/asm/pgalloc.h             | 11 ++---
 arch/arc/include/asm/pgalloc.h               |  9 +---
 arch/arm/include/asm/pgalloc.h               | 11 +++--
 arch/arm/mm/idmap.c                          |  2 +-
 arch/arm/mm/mmu.c                            |  5 +-
 arch/arm/mm/pgd.c                            |  8 +--
 arch/arm64/include/asm/pgalloc.h             | 17 ++++---
 arch/arm64/mm/hugetlbpage.c                  |  8 +--
 arch/csky/include/asm/pgalloc.h              |  4 +-
 arch/hexagon/include/asm/pgalloc.h           |  5 +-
 arch/ia64/include/asm/pgalloc.h              | 14 +++---
 arch/ia64/mm/hugetlbpage.c                   |  4 +-
 arch/ia64/mm/init.c                          |  6 +--
 arch/m68k/include/asm/mcf_pgalloc.h          | 12 ++---
 arch/m68k/include/asm/motorola_pgalloc.h     |  7 +--
 arch/m68k/include/asm/sun3_pgalloc.h         | 12 ++---
 arch/m68k/mm/kmap.c                          |  5 +-
 arch/m68k/sun3x/dvma.c                       |  6 ++-
 arch/microblaze/include/asm/pgalloc.h        |  6 +--
 arch/microblaze/mm/pgtable.c                 |  6 +--
 arch/mips/include/asm/pgalloc.h              | 14 +++---
 arch/mips/mm/hugetlbpage.c                   |  4 +-
 arch/mips/mm/ioremap.c                       |  6 +--
 arch/nds32/include/asm/pgalloc.h             | 14 ++----
 arch/nds32/kernel/dma.c                      |  4 +-
 arch/nios2/include/asm/pgalloc.h             |  8 +--
 arch/nios2/mm/ioremap.c                      |  6 +--
 arch/openrisc/include/asm/pgalloc.h          |  2 +-
 arch/openrisc/mm/ioremap.c                   |  4 +-
 arch/parisc/include/asm/pgalloc.h            | 16 +++---
 arch/parisc/kernel/pci-dma.c                 |  6 +--
 arch/parisc/mm/hugetlbpage.c                 |  4 +-
 arch/powerpc/include/asm/book3s/32/pgalloc.h |  4 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 20 ++++----
 arch/powerpc/include/asm/nohash/32/pgalloc.h |  6 +--
 arch/powerpc/include/asm/nohash/64/pgalloc.h | 14 +++---
 arch/powerpc/kvm/book3s_64_mmu_radix.c       |  2 +-
 arch/powerpc/mm/hugetlbpage.c                |  8 +--
 arch/powerpc/mm/pgtable-book3e.c             |  6 +--
 arch/powerpc/mm/pgtable-book3s64.c           | 14 +++---
 arch/powerpc/mm/pgtable-hash64.c             |  6 +--
 arch/powerpc/mm/pgtable-radix.c              | 12 ++---
 arch/powerpc/mm/pgtable_32.c                 |  6 +--
 arch/riscv/include/asm/pgalloc.h             | 11 ++---
 arch/s390/include/asm/pgalloc.h              | 25 +++++-----
 arch/s390/mm/hugetlbpage.c                   |  6 +--
 arch/s390/mm/pgalloc.c                       | 10 ++--
 arch/s390/mm/pgtable.c                       |  6 +--
 arch/s390/mm/vmem.c                          |  2 +-
 arch/sh/include/asm/pgalloc.h                |  7 +--
 arch/sh/mm/hugetlbpage.c                     |  4 +-
 arch/sh/mm/init.c                            |  4 +-
 arch/sh/mm/pgtable.c                         |  8 ++-
 arch/sparc/include/asm/pgalloc_32.h          |  6 +--
 arch/sparc/include/asm/pgalloc_64.h          | 12 +++--
 arch/sparc/mm/hugetlbpage.c                  |  4 +-
 arch/sparc/mm/init_64.c                      | 10 +---
 arch/sparc/mm/srmmu.c                        |  2 +-
 arch/um/include/asm/pgalloc.h                |  2 +-
 arch/um/include/asm/pgtable-3level.h         |  3 +-
 arch/um/kernel/mem.c                         | 17 ++-----
 arch/um/kernel/skas/mmu.c                    |  4 +-
 arch/unicore32/include/asm/pgalloc.h         |  8 ++-
 arch/unicore32/mm/pgd.c                      |  2 +-
 arch/x86/include/asm/pgalloc.h               | 30 ++++++------
 arch/x86/kernel/espfix_64.c                  |  2 +-
 arch/x86/kernel/tboot.c                      |  6 +--
 arch/x86/mm/pgtable.c                        |  4 +-
 arch/x86/platform/efi/efi_64.c               |  9 ++--
 arch/xtensa/include/asm/pgalloc.h            |  4 +-
 drivers/staging/media/ipu3/ipu3-dmamap.c     |  2 +-
 include/asm-generic/4level-fixup.h           |  6 +--
 include/asm-generic/5level-fixup.h           |  6 +--
 include/asm-generic/pgtable-nop4d-hack.h     |  2 +-
 include/asm-generic/pgtable-nop4d.h          |  2 +-
 include/asm-generic/pgtable-nopmd.h          |  2 +-
 include/asm-generic/pgtable-nopud.h          |  2 +-
 include/linux/mm.h                           | 40 ++++++++-------
 include/linux/vmalloc.h                      |  2 +-
 lib/ioremap.c                                |  8 +--
 mm/hugetlb.c                                 | 11 +++--
 mm/kasan/init.c                              |  8 +--
 mm/memory.c                                  | 51 +++++++++++---------
 mm/migrate.c                                 |  6 +--
 mm/mremap.c                                  |  6 +--
 mm/userfaultfd.c                             |  6 +--
 mm/vmalloc.c                                 | 49 +++++++++++--------
 mm/zsmalloc.c                                |  2 +-
 virt/kvm/arm/mmu.c                           |  6 +--
 89 files changed, 377 insertions(+), 392 deletions(-)

diff --git a/arch/alpha/include/asm/pgalloc.h b/arch/alpha/include/asm/pgalloc.h
index 02f9f91bb4..6b8336865e 100644
--- a/arch/alpha/include/asm/pgalloc.h
+++ b/arch/alpha/include/asm/pgalloc.h
@@ -39,9 +39,9 @@ pgd_free(struct mm_struct *mm, pgd_t *pgd)
 }
 
 static inline pmd_t *
-pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+pmd_alloc_one(struct mm_struct *mm, unsigned long address, gfp_t gfp)
 {
-	pmd_t *ret = (pmd_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
+	pmd_t *ret = (pmd_t *)get_zeroed_page(gfp);
 	return ret;
 }
 
@@ -52,10 +52,9 @@ pmd_free(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
-	return pte;
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline void
@@ -67,7 +66,7 @@ pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 static inline pgtable_t
 pte_alloc_one(struct mm_struct *mm)
 {
-	pte_t *pte = pte_alloc_one_kernel(mm);
+	pte_t *pte = pte_alloc_one_kernel(mm, GFP_KERNEL);
 	struct page *page;
 
 	if (!pte)
diff --git a/arch/arc/include/asm/pgalloc.h b/arch/arc/include/asm/pgalloc.h
index 9c9b5a5ebf..491535bb2b 100644
--- a/arch/arc/include/asm/pgalloc.h
+++ b/arch/arc/include/asm/pgalloc.h
@@ -90,14 +90,9 @@ static inline int __get_order_pte(void)
 	return get_order(PTRS_PER_PTE * sizeof(pte_t));
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
-					 __get_order_pte());
-
-	return pte;
+	return (pte_t *) __get_free_pages(gfp|__GFP_ZERO, __get_order_pte());
 }
 
 static inline pgtable_t
diff --git a/arch/arm/include/asm/pgalloc.h b/arch/arm/include/asm/pgalloc.h
index 17ab72f0cc..f21ba862f6 100644
--- a/arch/arm/include/asm/pgalloc.h
+++ b/arch/arm/include/asm/pgalloc.h
@@ -27,9 +27,10 @@
 
 #ifdef CONFIG_ARM_LPAE
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pmd_t *)get_zeroed_page(GFP_KERNEL);
+	return (pmd_t *)get_zeroed_page(gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -48,7 +49,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 /*
  * Since we have only two-level page tables, these are trivial
  */
-#define pmd_alloc_one(mm,addr)		({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm,addr,gfp)	({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, pmd)		do { } while (0)
 #define pud_populate(mm,pmd,pte)	BUG()
 
@@ -81,11 +82,11 @@ static inline void clean_pte_table(pte_t *pte)
  *  +------------+
  */
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(PGALLOC_GFP);
+	pte = (pte_t *)get_zeroed_page(gfp);
 	if (pte)
 		clean_pte_table(pte);
 
diff --git a/arch/arm/mm/idmap.c b/arch/arm/mm/idmap.c
index a033f6134a..b90d2deedc 100644
--- a/arch/arm/mm/idmap.c
+++ b/arch/arm/mm/idmap.c
@@ -28,7 +28,7 @@ static void idmap_add_pmd(pud_t *pud, unsigned long addr, unsigned long end,
 	unsigned long next;
 
 	if (pud_none_or_clear_bad(pud) || (pud_val(*pud) & L_PGD_SWAPPER)) {
-		pmd = pmd_alloc_one(&init_mm, addr);
+		pmd = pmd_alloc_one(&init_mm, addr, GFP_KERNEL);
 		if (!pmd) {
 			pr_warn("Failed to allocate identity pmd.\n");
 			return;
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index f3ce34113f..7cc18e5174 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -979,10 +979,11 @@ void __init create_mapping_late(struct mm_struct *mm, struct map_desc *md,
 				bool ng)
 {
 #ifdef CONFIG_ARM_LPAE
-	pud_t *pud = pud_alloc(mm, pgd_offset(mm, md->virtual), md->virtual);
+	pud_t *pud = pud_alloc(mm, pgd_offset(mm, md->virtual), md->virtual,
+			       GFP_KERNEL);
 	if (WARN_ON(!pud))
 		return;
-	pmd_alloc(mm, pud, 0);
+	pmd_alloc(mm, pud, 0, GFP_KERNEL);
 #endif
 	__create_mapping(mm, md, late_alloc, ng);
 }
diff --git a/arch/arm/mm/pgd.c b/arch/arm/mm/pgd.c
index a1606d9502..6c3a640672 100644
--- a/arch/arm/mm/pgd.c
+++ b/arch/arm/mm/pgd.c
@@ -57,11 +57,11 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 	 * Allocate PMD table for modules and pkmap mappings.
 	 */
 	new_pud = pud_alloc(mm, new_pgd + pgd_index(MODULES_VADDR),
-			    MODULES_VADDR);
+			    MODULES_VADDR, GFP_KERNEL);
 	if (!new_pud)
 		goto no_pud;
 
-	new_pmd = pmd_alloc(mm, new_pud, 0);
+	new_pmd = pmd_alloc(mm, new_pud, 0, GFP_KERNEL);
 	if (!new_pmd)
 		goto no_pmd;
 #endif
@@ -72,11 +72,11 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
 		 * contains the machine vectors. The vectors are always high
 		 * with LPAE.
 		 */
-		new_pud = pud_alloc(mm, new_pgd, 0);
+		new_pud = pud_alloc(mm, new_pgd, 0, GFP_KERNEL);
 		if (!new_pud)
 			goto no_pud;
 
-		new_pmd = pmd_alloc(mm, new_pud, 0);
+		new_pmd = pmd_alloc(mm, new_pud, 0, GFP_KERNEL);
 		if (!new_pmd)
 			goto no_pmd;
 
diff --git a/arch/arm64/include/asm/pgalloc.h b/arch/arm64/include/asm/pgalloc.h
index 52fa47c73b..54199d52ea 100644
--- a/arch/arm64/include/asm/pgalloc.h
+++ b/arch/arm64/include/asm/pgalloc.h
@@ -26,14 +26,14 @@
 
 #define check_pgt_cache()		do { } while (0)
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_ZERO)
 #define PGD_SIZE	(PTRS_PER_PGD * sizeof(pgd_t))
 
 #if CONFIG_PGTABLE_LEVELS > 2
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pmd_t *)__get_free_page(PGALLOC_GFP);
+	return (pmd_t *)get_zeroed_page(gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmdp)
@@ -60,9 +60,10 @@ static inline void __pud_populate(pud_t *pudp, phys_addr_t pmdp, pudval_t prot)
 
 #if CONFIG_PGTABLE_LEVELS > 3
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pud_t *)__get_free_page(PGALLOC_GFP);
+	return (pud_t *)get_zeroed_page(gfp);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pudp)
@@ -91,9 +92,9 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgdp);
 
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP);
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline pgtable_t
@@ -101,7 +102,7 @@ pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
-	pte = alloc_pages(PGALLOC_GFP, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (!pte)
 		return NULL;
 	if (!pgtable_page_ctor(pte)) {
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 6b4a47b3ad..0a17776894 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -230,14 +230,14 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pte_t *ptep = NULL;
 
 	pgdp = pgd_offset(mm, addr);
-	pudp = pud_alloc(mm, pgdp, addr);
+	pudp = pud_alloc(mm, pgdp, addr, GFP_KERNEL);
 	if (!pudp)
 		return NULL;
 
 	if (sz == PUD_SIZE) {
 		ptep = (pte_t *)pudp;
 	} else if (sz == (PAGE_SIZE * CONT_PTES)) {
-		pmdp = pmd_alloc(mm, pudp, addr);
+		pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 
 		WARN_ON(addr & (sz - 1));
 		/*
@@ -253,9 +253,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 		    pud_none(READ_ONCE(*pudp)))
 			ptep = huge_pmd_share(mm, addr, pudp);
 		else
-			ptep = (pte_t *)pmd_alloc(mm, pudp, addr);
+			ptep = (pte_t *)pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 	} else if (sz == (PMD_SIZE * CONT_PMDS)) {
-		pmdp = pmd_alloc(mm, pudp, addr);
+		pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 		WARN_ON(addr & (sz - 1));
 		return (pte_t *)pmdp;
 	}
diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index d213bb47b7..1611a84be5 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -24,12 +24,12 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 extern void pgd_init(unsigned long *p);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 	unsigned long i;
 
-	pte = (pte_t *) __get_free_page(GFP_KERNEL);
+	pte = (pte_t *) __get_free_page(gfp);
 	if (!pte)
 		return NULL;
 
diff --git a/arch/hexagon/include/asm/pgalloc.h b/arch/hexagon/include/asm/pgalloc.h
index d36183887b..2c42f912f4 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -74,10 +74,9 @@ static inline struct page *pte_alloc_one(struct mm_struct *mm)
 }
 
 /* _kernel variant gets to use a different allocator */
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	gfp_t flags =  GFP_KERNEL | __GFP_ZERO;
-	return (pte_t *) __get_free_page(flags);
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 static inline void pte_free(struct mm_struct *mm, struct page *pte)
diff --git a/arch/ia64/include/asm/pgalloc.h b/arch/ia64/include/asm/pgalloc.h
index c9e481023c..dd99d58a89 100644
--- a/arch/ia64/include/asm/pgalloc.h
+++ b/arch/ia64/include/asm/pgalloc.h
@@ -40,9 +40,10 @@ pgd_populate(struct mm_struct *mm, pgd_t * pgd_entry, pud_t * pud)
 	pgd_val(*pgd_entry) = __pa(pud);
 }
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return quicklist_alloc(0, GFP_KERNEL, NULL);
+	return quicklist_alloc(0, gfp, NULL);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -58,9 +59,10 @@ pud_populate(struct mm_struct *mm, pud_t * pud_entry, pmd_t * pmd)
 	pud_val(*pud_entry) = __pa(pmd);
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return quicklist_alloc(0, GFP_KERNEL, NULL);
+	return quicklist_alloc(0, gfp, NULL);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -99,9 +101,9 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 	return page;
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return quicklist_alloc(0, GFP_KERNEL, NULL);
+	return quicklist_alloc(0, gfp, NULL);
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pte)
diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index d16e419fd7..01e08edc9d 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -35,9 +35,9 @@ huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, taddr);
-	pud = pud_alloc(mm, pgd, taddr);
+	pud = pud_alloc(mm, pgd, taddr, GFP_KERNEL);
 	if (pud) {
-		pmd = pmd_alloc(mm, pud, taddr);
+		pmd = pmd_alloc(mm, pud, taddr, GFP_KERNEL);
 		if (pmd)
 			pte = pte_alloc_map(mm, pmd, taddr);
 	}
diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c
index e49200e317..a420c0d04f 100644
--- a/arch/ia64/mm/init.c
+++ b/arch/ia64/mm/init.c
@@ -216,13 +216,13 @@ put_kernel_page (struct page *page, unsigned long address, pgprot_t pgprot)
 	pgd = pgd_offset_k(address);		/* note: this is NOT pgd_offset()! */
 
 	{
-		pud = pud_alloc(&init_mm, pgd, address);
+		pud = pud_alloc(&init_mm, pgd, address, GFP_KERNEL);
 		if (!pud)
 			goto out;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, address, GFP_KERNEL);
 		if (!pmd)
 			goto out;
-		pte = pte_alloc_kernel(pmd, address);
+		pte = pte_alloc_kernel(pmd, address, GFP_KERNEL);
 		if (!pte)
 			goto out;
 		if (!pte_none(*pte))
diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index 4399d712f6..95384360cf 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -12,15 +12,9 @@ extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 extern const char bad_pmd_string[];
 
-extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	unsigned long page = __get_free_page(GFP_DMA);
-
-	if (!page)
-		return NULL;
-
-	memset((void *)page, 0, PAGE_SIZE);
-	return (pte_t *) (page);
+	return (pte_t *) get_zeroed_page(gfp|GFP_DMA);
 }
 
 extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
@@ -29,7 +23,7 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
 }
 
 #define pmd_alloc_one_fast(mm, address) ({ BUG(); ((pmd_t *)1); })
-#define pmd_alloc_one(mm, address)      ({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, address, gfp) ({ BUG(); ((pmd_t *)2); })
 
 #define pmd_populate(mm, pmd, page) (pmd_val(*pmd) = \
 	(unsigned long)(page_address(page)))
diff --git a/arch/m68k/include/asm/motorola_pgalloc.h b/arch/m68k/include/asm/motorola_pgalloc.h
index d04d9ba9b9..e9b598f96b 100644
--- a/arch/m68k/include/asm/motorola_pgalloc.h
+++ b/arch/m68k/include/asm/motorola_pgalloc.h
@@ -8,11 +8,11 @@
 extern pmd_t *get_pointer_table(void);
 extern int free_pointer_table(pmd_t *);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
+	pte = (pte_t *)get_zeroed_page(gfp);
 	if (pte) {
 		__flush_page_to_ram(pte);
 		flush_tlb_kernel_page(pte);
@@ -67,7 +67,8 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t page,
 }
 
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
 	return get_pointer_table();
 }
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h
index 1456c5eecb..18324d4a33 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -15,7 +15,7 @@
 
 extern const char bad_pmd_string[];
 
-#define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm,address,gfp)       ({ BUG(); ((pmd_t *)2); })
 
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
@@ -35,15 +35,9 @@ do {							\
 	tlb_remove_page((tlb), pte);			\
 } while (0)
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	unsigned long page = __get_free_page(GFP_KERNEL);
-
-	if (!page)
-		return NULL;
-
-	memset((void *)page, 0, PAGE_SIZE);
-	return (pte_t *) (page);
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/m68k/mm/kmap.c b/arch/m68k/mm/kmap.c
index 40a3b327da..8de716049a 100644
--- a/arch/m68k/mm/kmap.c
+++ b/arch/m68k/mm/kmap.c
@@ -196,7 +196,7 @@ void __iomem *__ioremap(unsigned long physaddr, unsigned long size, int cachefla
 			printk ("\npa=%#lx va=%#lx ", physaddr, virtaddr);
 #endif
 		pgd_dir = pgd_offset_k(virtaddr);
-		pmd_dir = pmd_alloc(&init_mm, pgd_dir, virtaddr);
+		pmd_dir = pmd_alloc(&init_mm, pgd_dir, virtaddr, GFP_KERNEL);
 		if (!pmd_dir) {
 			printk("ioremap: no mem for pmd_dir\n");
 			return NULL;
@@ -208,7 +208,8 @@ void __iomem *__ioremap(unsigned long physaddr, unsigned long size, int cachefla
 			virtaddr += PTRTREESIZE;
 			size -= PTRTREESIZE;
 		} else {
-			pte_dir = pte_alloc_kernel(pmd_dir, virtaddr);
+			pte_dir = pte_alloc_kernel(pmd_dir, virtaddr,
+						   GFP_KERNEL);
 			if (!pte_dir) {
 				printk("ioremap: no mem for pte_dir\n");
 				return NULL;
diff --git a/arch/m68k/sun3x/dvma.c b/arch/m68k/sun3x/dvma.c
index 89e630e665..86ffbe2785 100644
--- a/arch/m68k/sun3x/dvma.c
+++ b/arch/m68k/sun3x/dvma.c
@@ -95,7 +95,8 @@ inline int dvma_map_cpu(unsigned long kaddr,
 		pmd_t *pmd;
 		unsigned long end2;
 
-		if((pmd = pmd_alloc(&init_mm, pgd, vaddr)) == NULL) {
+		pmd = pmd_alloc(&init_mm, pgd, vaddr, GFP_KERNEL);
+		if (!pmd) {
 			ret = -ENOMEM;
 			goto out;
 		}
@@ -109,7 +110,8 @@ inline int dvma_map_cpu(unsigned long kaddr,
 			pte_t *pte;
 			unsigned long end3;
 
-			if((pte = pte_alloc_kernel(pmd, vaddr)) == NULL) {
+			pte = pte_alloc_kernel(pmd, vaddr, GFP_KERNEL);
+			if (!pte) {
 				ret = -ENOMEM;
 				goto out;
 			}
diff --git a/arch/microblaze/include/asm/pgalloc.h b/arch/microblaze/include/asm/pgalloc.h
index f4cc9ffc44..240e0bcd14 100644
--- a/arch/microblaze/include/asm/pgalloc.h
+++ b/arch/microblaze/include/asm/pgalloc.h
@@ -106,9 +106,9 @@ static inline void free_pgd_slow(pgd_t *pgd)
  * the pgd will always be present..
  */
 #define pmd_alloc_one_fast(mm, address)	({ BUG(); ((pmd_t *)1); })
-#define pmd_alloc_one(mm, address)	({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, address, gfp)	({ BUG(); ((pmd_t *)2); })
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
@@ -166,7 +166,7 @@ static inline void pte_free(struct mm_struct *mm, struct page *ptepage)
  * We don't have any real pmd's, and this code never triggers because
  * the pgd will always be present..
  */
-#define pmd_alloc_one(mm, address)	({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, address, gfp)	({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, x)			do { } while (0)
 #define __pmd_free_tlb(tlb, x, addr)	pmd_free((tlb)->mm, x)
 #define pgd_populate(mm, pmd, pte)	BUG()
diff --git a/arch/microblaze/mm/pgtable.c b/arch/microblaze/mm/pgtable.c
index c2ce1e42b8..796c422af7 100644
--- a/arch/microblaze/mm/pgtable.c
+++ b/arch/microblaze/mm/pgtable.c
@@ -144,7 +144,7 @@ int map_page(unsigned long va, phys_addr_t pa, int flags)
 	/* Use upper 10 bits of VA to index the first level map */
 	pd = pmd_offset(pgd_offset_k(va), va);
 	/* Use middle 10 bits of VA to index the second-level map */
-	pg = pte_alloc_kernel(pd, va); /* from powerpc - pgtable.c */
+	pg = pte_alloc_kernel(pd, va, GFP_KERNEL); /* from powerpc - pgtable.c */
 	/* pg = pte_alloc_kernel(&init_mm, pd, va); */
 
 	if (pg != NULL) {
@@ -235,11 +235,11 @@ unsigned long iopa(unsigned long addr)
 	return pa;
 }
 
-__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 	if (mem_init_done) {
-		pte = (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+		pte = (pte_t *)get_zeroed_page(gfp);
 	} else {
 		pte = (pte_t *)early_get_page();
 		if (pte)
diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index 27808d9461..7e832f978a 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -50,9 +50,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_pages((unsigned long)pgd, PGD_ORDER);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, PTE_ORDER);
+	return (pte_t *)__get_free_pages(gfp | __GFP_ZERO, PTE_ORDER);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
@@ -89,11 +89,12 @@ do {							\
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
 	pmd_t *pmd;
 
-	pmd = (pmd_t *) __get_free_pages(GFP_KERNEL, PMD_ORDER);
+	pmd = (pmd_t *) __get_free_pages(gfp, PMD_ORDER);
 	if (pmd)
 		pmd_init((unsigned long)pmd, (unsigned long)invalid_pte_table);
 	return pmd;
@@ -110,11 +111,12 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 
 #ifndef __PAGETABLE_PUD_FOLDED
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
 	pud_t *pud;
 
-	pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_ORDER);
+	pud = (pud_t *) __get_free_pages(gfp, PUD_ORDER);
 	if (pud)
 		pud_init((unsigned long)pud, (unsigned long)invalid_pmd_table);
 	return pud;
diff --git a/arch/mips/mm/hugetlbpage.c b/arch/mips/mm/hugetlbpage.c
index cef1522343..27843e10f6 100644
--- a/arch/mips/mm/hugetlbpage.c
+++ b/arch/mips/mm/hugetlbpage.c
@@ -29,9 +29,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr,
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (pud)
-		pte = (pte_t *)pmd_alloc(mm, pud, addr);
+		pte = (pte_t *)pmd_alloc(mm, pud, addr, GFP_KERNEL);
 
 	return pte;
 }
diff --git a/arch/mips/mm/ioremap.c b/arch/mips/mm/ioremap.c
index 1601d90b08..40da8f0ba7 100644
--- a/arch/mips/mm/ioremap.c
+++ b/arch/mips/mm/ioremap.c
@@ -56,7 +56,7 @@ static inline int remap_area_pmd(pmd_t * pmd, unsigned long address,
 	phys_addr -= address;
 	BUG_ON(address >= end);
 	do {
-		pte_t * pte = pte_alloc_kernel(pmd, address);
+		pte_t *pte = pte_alloc_kernel(pmd, address, GFP_KERNEL);
 		if (!pte)
 			return -ENOMEM;
 		remap_area_pte(pte, address, end - address, address + phys_addr, flags);
@@ -82,10 +82,10 @@ static int remap_area_pages(unsigned long address, phys_addr_t phys_addr,
 		pmd_t *pmd;
 
 		error = -ENOMEM;
-		pud = pud_alloc(&init_mm, dir, address);
+		pud = pud_alloc(&init_mm, dir, address, GFP_KERNEL);
 		if (!pud)
 			break;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, address, GFP_KERNEL);
 		if (!pmd)
 			break;
 		if (remap_area_pmd(pmd, address, end - address,
diff --git a/arch/nds32/include/asm/pgalloc.h b/arch/nds32/include/asm/pgalloc.h
index 3c5fee5b57..b187a2f127 100644
--- a/arch/nds32/include/asm/pgalloc.h
+++ b/arch/nds32/include/asm/pgalloc.h
@@ -12,8 +12,8 @@
 /*
  * Since we have only two-level page tables, these are trivial
  */
-#define pmd_alloc_one(mm, addr)		({ BUG(); ((pmd_t *)2); })
-#define pmd_free(mm, pmd)			do { } while (0)
+#define pmd_alloc_one(mm, addr, gfp)	({ BUG(); ((pmd_t *)2); })
+#define pmd_free(mm, pmd)		do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
@@ -22,15 +22,9 @@ extern void pgd_free(struct mm_struct *mm, pgd_t * pgd);
 
 #define check_pgt_cache()		do { } while (0)
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte =
-	    (pte_t *) __get_free_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL |
-				      __GFP_ZERO);
-
-	return pte;
+	return (pte_t *) get_zeroed_page(gfp | __GFP_RETRY_MAYFAIL);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/nds32/kernel/dma.c b/arch/nds32/kernel/dma.c
index d0dbd4fe96..920a003762 100644
--- a/arch/nds32/kernel/dma.c
+++ b/arch/nds32/kernel/dma.c
@@ -300,7 +300,7 @@ static int __init consistent_init(void)
 
 	do {
 		pgd = pgd_offset(&init_mm, CONSISTENT_BASE);
-		pmd = pmd_alloc(&init_mm, pgd, CONSISTENT_BASE);
+		pmd = pmd_alloc(&init_mm, pgd, CONSISTENT_BASE, GFP_KERNEL);
 		if (!pmd) {
 			pr_err("%s: no pmd tables\n", __func__);
 			ret = -ENOMEM;
@@ -310,7 +310,7 @@ static int __init consistent_init(void)
 		 * It's not necessary to warn here. */
 		/* WARN_ON(!pmd_none(*pmd)); */
 
-		pte = pte_alloc_kernel(pmd, CONSISTENT_BASE);
+		pte = pte_alloc_kernel(pmd, CONSISTENT_BASE, GFP_KERNEL);
 		if (!pte) {
 			ret = -ENOMEM;
 			break;
diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index 3a149ead12..2ce9bd5399 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -37,13 +37,9 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_pages((unsigned long)pgd, PGD_ORDER);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte = (pte_t *) __get_free_pages(GFP_KERNEL|__GFP_ZERO, PTE_ORDER);
-
-	return pte;
+	return (pte_t *)__get_free_pages(gfp | __GFP_ZERO, PTE_ORDER);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/nios2/mm/ioremap.c b/arch/nios2/mm/ioremap.c
index 3a28177a01..50c38da029 100644
--- a/arch/nios2/mm/ioremap.c
+++ b/arch/nios2/mm/ioremap.c
@@ -61,7 +61,7 @@ static inline int remap_area_pmd(pmd_t *pmd, unsigned long address,
 	if (address >= end)
 		BUG();
 	do {
-		pte_t *pte = pte_alloc_kernel(pmd, address);
+		pte_t *pte = pte_alloc_kernel(pmd, address, GFP_KERNEL);
 
 		if (!pte)
 			return -ENOMEM;
@@ -90,10 +90,10 @@ static int remap_area_pages(unsigned long address, unsigned long phys_addr,
 		pmd_t *pmd;
 
 		error = -ENOMEM;
-		pud = pud_alloc(&init_mm, dir, address);
+		pud = pud_alloc(&init_mm, dir, address, GFP_KERNEL);
 		if (!pud)
 			break;
-		pmd = pmd_alloc(&init_mm, pud, address);
+		pmd = pmd_alloc(&init_mm, pud, address, GFP_KERNEL);
 		if (!pmd)
 			break;
 		if (remap_area_pmd(pmd, address, end - address,
diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h
index 149c82ee4b..f33f2a4504 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -70,7 +70,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
 {
diff --git a/arch/openrisc/mm/ioremap.c b/arch/openrisc/mm/ioremap.c
index a8509950db..93d295d26a 100644
--- a/arch/openrisc/mm/ioremap.c
+++ b/arch/openrisc/mm/ioremap.c
@@ -118,12 +118,12 @@ EXPORT_SYMBOL(iounmap);
  * the memblock infrastructure.
  */
 
-pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t __ref *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
 	if (likely(mem_init_done)) {
-		pte = (pte_t *)get_zeroed_page(GFP_KERNEL);
+		pte = (pte_t *)get_zeroed_page(gfp);
 	} else {
 		pte = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
 		if (!pte)
diff --git a/arch/parisc/include/asm/pgalloc.h b/arch/parisc/include/asm/pgalloc.h
index d05c678c77..705f5fffbd 100644
--- a/arch/parisc/include/asm/pgalloc.h
+++ b/arch/parisc/include/asm/pgalloc.h
@@ -62,12 +62,10 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmd)
 		        (__u32)(__pa((unsigned long)pmd) >> PxD_VALUE_SHIFT));
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
-	pmd_t *pmd = (pmd_t *)__get_free_pages(GFP_KERNEL, PMD_ORDER);
-	if (pmd)
-		memset(pmd, 0, PAGE_SIZE<<PMD_ORDER);
-	return pmd;
+	return (pmd_t *)__get_free_pages(gfp | __GFP_ZERO, PMD_ORDER);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -94,7 +92,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
  * inside the pgd, so has no extra memory associated with it.
  */
 
-#define pmd_alloc_one(mm, addr)		({ BUG(); ((pmd_t *)2); })
+#define pmd_alloc_one(mm, addr, gfp)	({ BUG(); ((pmd_t *)2); })
 #define pmd_free(mm, x)			do { } while (0)
 #define pgd_populate(mm, pmd, pte)	BUG()
 
@@ -134,11 +132,9 @@ pte_alloc_one(struct mm_struct *mm)
 	return page;
 }
 
-static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
-	return pte;
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 239162355b..285688e735 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -113,7 +113,7 @@ static inline int map_pmd_uncached(pmd_t * pmd, unsigned long vaddr,
 	if (end > PGDIR_SIZE)
 		end = PGDIR_SIZE;
 	do {
-		pte_t * pte = pte_alloc_kernel(pmd, vaddr);
+		pte_t *pte = pte_alloc_kernel(pmd, vaddr, GFP_KERNEL);
 		if (!pte)
 			return -ENOMEM;
 		if (map_pte_uncached(pte, orig_vaddr, end - vaddr, paddr_ptr))
@@ -134,8 +134,8 @@ static inline int map_uncached_pages(unsigned long vaddr, unsigned long size,
 	dir = pgd_offset_k(vaddr);
 	do {
 		pmd_t *pmd;
-		
-		pmd = pmd_alloc(NULL, dir, vaddr);
+
+		pmd = pmd_alloc(NULL, dir, vaddr, GFP_KERNEL);
 		if (!pmd)
 			return -ENOMEM;
 		if (map_pmd_uncached(pmd, vaddr, end - vaddr, &paddr))
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d77479ae3a..6351549539 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -61,9 +61,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	addr &= HPAGE_MASK;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (pud) {
-		pmd = pmd_alloc(mm, pud, addr);
+		pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 		if (pmd)
 			pte = pte_alloc_map(mm, pmd, addr);
 	}
diff --git a/arch/powerpc/include/asm/book3s/32/pgalloc.h b/arch/powerpc/include/asm/book3s/32/pgalloc.h
index 3633502e10..9032660c0e 100644
--- a/arch/powerpc/include/asm/book3s/32/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/32/pgalloc.h
@@ -42,7 +42,7 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
  * We don't have any real pmd's, and this code never triggers because
  * the pgd will always be present..
  */
-/* #define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); }) */
+/* #define pmd_alloc_one(mm,address,gfp) ({ BUG(); ((pmd_t *)2); }) */
 #define pmd_free(mm, x) 		do { } while (0)
 #define __pmd_free_tlb(tlb,x,a)		do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)      BUG() */
@@ -61,7 +61,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 
 #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd))
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_frag_destroy(void *pte_frag);
 pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel);
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 138bc2ecc0..c2199361cf 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -39,8 +39,8 @@ extern struct vmemmap_backing *vmemmap_list;
 extern struct kmem_cache *pgtable_cache[];
 #define PGT_CACHE(shift) pgtable_cache[shift]
 
-extern pte_t *pte_fragment_alloc(struct mm_struct *, int);
-extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long);
+extern pte_t *pte_fragment_alloc(struct mm_struct *, int, gfp_t);
+extern pmd_t *pmd_fragment_alloc(struct mm_struct *, unsigned long, gfp_t);
 extern void pte_fragment_free(unsigned long *, int);
 extern void pmd_fragment_free(unsigned long *);
 extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
@@ -114,12 +114,13 @@ static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, pud_t *pud)
 	*pgd =  __pgd(__pgtable_ptr_val(pud) | PGD_VAL_BITS);
 }
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	pud_t *pud;
 
 	pud = kmem_cache_alloc(PGT_CACHE(PUD_CACHE_INDEX),
-			       pgtable_gfp_flags(mm, GFP_KERNEL));
+			       pgtable_gfp_flags(mm, gfp));
 	/*
 	 * Tell kmemleak to ignore the PUD, that means don't scan it for
 	 * pointers and don't consider it a leak. PUDs are typically only
@@ -152,9 +153,10 @@ static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pud,
 	pgtable_free_tlb(tlb, pud, PUD_INDEX);
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return pmd_fragment_alloc(mm, addr);
+	return pmd_fragment_alloc(mm, addr, gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -190,14 +192,14 @@ static inline pgtable_t pmd_pgtable(pmd_t pmd)
 	return (pgtable_t)pmd_page_vaddr(pmd);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)pte_fragment_alloc(mm, 1);
+	return (pte_t *)pte_fragment_alloc(mm, 1, gfp);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-	return (pgtable_t)pte_fragment_alloc(mm, 0);
+	return (pgtable_t)pte_fragment_alloc(mm, 0, GFP_KERNEL);
 }
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
diff --git a/arch/powerpc/include/asm/nohash/32/pgalloc.h b/arch/powerpc/include/asm/nohash/32/pgalloc.h
index bd186e85b4..8a5a944251 100644
--- a/arch/powerpc/include/asm/nohash/32/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/32/pgalloc.h
@@ -42,8 +42,8 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
  * We don't have any real pmd's, and this code never triggers because
  * the pgd will always be present..
  */
-/* #define pmd_alloc_one(mm,address)       ({ BUG(); ((pmd_t *)2); }) */
-#define pmd_free(mm, x) 		do { } while (0)
+/* #define pmd_alloc_one(mm,address,gfp) ({ BUG(); ((pmd_t *)2); }) */
+#define pmd_free(mm, x)			do { } while (0)
 #define __pmd_free_tlb(tlb,x,a)		do { } while (0)
 /* #define pgd_populate(mm, pmd, pte)      BUG() */
 
@@ -79,7 +79,7 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmdp,
 #define pmd_pgtable(pmd) ((pgtable_t)pmd_page_vaddr(pmd))
 #endif
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_frag_destroy(void *pte_frag);
 pte_t *pte_fragment_alloc(struct mm_struct *mm, int kernel);
diff --git a/arch/powerpc/include/asm/nohash/64/pgalloc.h b/arch/powerpc/include/asm/nohash/64/pgalloc.h
index 66d086f85b..e30f21916a 100644
--- a/arch/powerpc/include/asm/nohash/64/pgalloc.h
+++ b/arch/powerpc/include/asm/nohash/64/pgalloc.h
@@ -51,10 +51,11 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 #define pgd_populate(MM, PGD, PUD)	pgd_set(PGD, (unsigned long)PUD)
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
-			pgtable_gfp_flags(mm, GFP_KERNEL));
+			pgtable_gfp_flags(mm, gfp));
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -81,10 +82,11 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	return kmem_cache_alloc(PGT_CACHE(PMD_CACHE_INDEX),
-			pgtable_gfp_flags(mm, GFP_KERNEL));
+			pgtable_gfp_flags(mm, gfp));
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -93,9 +95,9 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 }
 
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index f55ef07188..d9a9856029 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -585,7 +585,7 @@ int kvmppc_create_pte(struct kvm *kvm, pgd_t *pgtable, pte_t pte,
 	if (pgd_present(*pgd))
 		pud = pud_offset(pgd, gpa);
 	else
-		new_pud = pud_alloc_one(kvm->mm, gpa);
+		new_pud = pud_alloc_one(kvm->mm, gpa, GFP_KERNEL);
 
 	pmd = NULL;
 	if (pud && pud_present(*pud) && !pud_huge(*pud))
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index 9e732bb2c8..f66c42c933 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -153,7 +153,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 		hpdp = (hugepd_t *)pg;
 	} else {
 		pdshift = PUD_SHIFT;
-		pu = pud_alloc(mm, pg, addr);
+		pu = pud_alloc(mm, pg, addr, GFP_KERNEL);
 		if (pshift == PUD_SHIFT)
 			return (pte_t *)pu;
 		else if (pshift > PMD_SHIFT) {
@@ -161,7 +161,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
-			pm = pmd_alloc(mm, pu, addr);
+			pm = pmd_alloc(mm, pu, addr, GFP_KERNEL);
 			if (pshift == PMD_SHIFT)
 				/* 16MB hugepage */
 				return (pte_t *)pm;
@@ -177,13 +177,13 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz
 		hpdp = (hugepd_t *)pg;
 	} else {
 		pdshift = PUD_SHIFT;
-		pu = pud_alloc(mm, pg, addr);
+		pu = pud_alloc(mm, pg, addr, GFP_KERNEL);
 		if (pshift >= PUD_SHIFT) {
 			ptl = pud_lockptr(mm, pu);
 			hpdp = (hugepd_t *)pu;
 		} else {
 			pdshift = PMD_SHIFT;
-			pm = pmd_alloc(mm, pu, addr);
+			pm = pmd_alloc(mm, pu, addr, GFP_KERNEL);
 			ptl = pmd_lockptr(mm, pm);
 			hpdp = (hugepd_t *)pm;
 		}
diff --git a/arch/powerpc/mm/pgtable-book3e.c b/arch/powerpc/mm/pgtable-book3e.c
index 1032ef7aaf..43bcc3bc8a 100644
--- a/arch/powerpc/mm/pgtable-book3e.c
+++ b/arch/powerpc/mm/pgtable-book3e.c
@@ -84,13 +84,13 @@ int map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot)
 	BUILD_BUG_ON(TASK_SIZE_USER64 > PGTABLE_RANGE);
 	if (slab_is_available()) {
 		pgdp = pgd_offset_k(ea);
-		pudp = pud_alloc(&init_mm, pgdp, ea);
+		pudp = pud_alloc(&init_mm, pgdp, ea, GFP_KERNEL);
 		if (!pudp)
 			return -ENOMEM;
-		pmdp = pmd_alloc(&init_mm, pudp, ea);
+		pmdp = pmd_alloc(&init_mm, pudp, ea, GFP_KERNEL);
 		if (!pmdp)
 			return -ENOMEM;
-		ptep = pte_alloc_kernel(pmdp, ea);
+		ptep = pte_alloc_kernel(pmdp, ea, GFP_KERNEL);
 		if (!ptep)
 			return -ENOMEM;
 	} else {
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index a4341aba0a..cfb417ce6a 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -262,15 +262,14 @@ static pmd_t *get_pmd_from_cache(struct mm_struct *mm)
 	return (pmd_t *)ret;
 }
 
-static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
+static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm, gfp_t gfp)
 {
 	void *ret = NULL;
 	struct page *page;
-	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_page(gfp);
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
+	page = alloc_page(gfp | __GFP_ZERO);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -303,7 +302,8 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 	return (pmd_t *)ret;
 }
 
-pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
+pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr,
+			  gfp_t gfp)
 {
 	pmd_t *pmd;
 
@@ -311,7 +311,7 @@ pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
 	if (pmd)
 		return pmd;
 
-	return __alloc_for_pmdcache(mm);
+	return __alloc_for_pmdcache(mm, gfp);
 }
 
 void pmd_fragment_free(unsigned long *pmd)
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index c08d49046a..d90deb67d8 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -152,13 +152,13 @@ int hash__map_kernel_page(unsigned long ea, unsigned long pa, pgprot_t prot)
 	BUILD_BUG_ON(TASK_SIZE_USER64 > H_PGTABLE_RANGE);
 	if (slab_is_available()) {
 		pgdp = pgd_offset_k(ea);
-		pudp = pud_alloc(&init_mm, pgdp, ea);
+		pudp = pud_alloc(&init_mm, pgdp, ea, GFP_KERNEL);
 		if (!pudp)
 			return -ENOMEM;
-		pmdp = pmd_alloc(&init_mm, pudp, ea);
+		pmdp = pmd_alloc(&init_mm, pudp, ea, GFP_KERNEL);
 		if (!pmdp)
 			return -ENOMEM;
-		ptep = pte_alloc_kernel(pmdp, ea);
+		ptep = pte_alloc_kernel(pmdp, ea, GFP_KERNEL);
 		if (!ptep)
 			return -ENOMEM;
 		set_pte_at(&init_mm, ea, ptep, pfn_pte(pa >> PAGE_SHIFT, prot));
diff --git a/arch/powerpc/mm/pgtable-radix.c b/arch/powerpc/mm/pgtable-radix.c
index 154472a28c..0fbc67a090 100644
--- a/arch/powerpc/mm/pgtable-radix.c
+++ b/arch/powerpc/mm/pgtable-radix.c
@@ -145,21 +145,21 @@ static int __map_kernel_page(unsigned long ea, unsigned long pa,
 	 * boot.
 	 */
 	pgdp = pgd_offset_k(ea);
-	pudp = pud_alloc(&init_mm, pgdp, ea);
+	pudp = pud_alloc(&init_mm, pgdp, ea, GFP_KERNEL);
 	if (!pudp)
 		return -ENOMEM;
 	if (map_page_size == PUD_SIZE) {
 		ptep = (pte_t *)pudp;
 		goto set_the_pte;
 	}
-	pmdp = pmd_alloc(&init_mm, pudp, ea);
+	pmdp = pmd_alloc(&init_mm, pudp, ea, GFP_KERNEL);
 	if (!pmdp)
 		return -ENOMEM;
 	if (map_page_size == PMD_SIZE) {
 		ptep = pmdp_ptep(pmdp);
 		goto set_the_pte;
 	}
-	ptep = pte_alloc_kernel(pmdp, ea);
+	ptep = pte_alloc_kernel(pmdp, ea, GFP_KERNEL);
 	if (!ptep)
 		return -ENOMEM;
 
@@ -194,21 +194,21 @@ void radix__change_memory_range(unsigned long start, unsigned long end,
 
 	for (idx = start; idx < end; idx += PAGE_SIZE) {
 		pgdp = pgd_offset_k(idx);
-		pudp = pud_alloc(&init_mm, pgdp, idx);
+		pudp = pud_alloc(&init_mm, pgdp, idx, GFP_KERNEL);
 		if (!pudp)
 			continue;
 		if (pud_huge(*pudp)) {
 			ptep = (pte_t *)pudp;
 			goto update_the_pte;
 		}
-		pmdp = pmd_alloc(&init_mm, pudp, idx);
+		pmdp = pmd_alloc(&init_mm, pudp, idx, GFP_KERNEL);
 		if (!pmdp)
 			continue;
 		if (pmd_huge(*pmdp)) {
 			ptep = pmdp_ptep(pmdp);
 			goto update_the_pte;
 		}
-		ptep = pte_alloc_kernel(pmdp, idx);
+		ptep = pte_alloc_kernel(pmdp, idx, GFP_KERNEL);
 		if (!ptep)
 			continue;
 update_the_pte:
diff --git a/arch/powerpc/mm/pgtable_32.c b/arch/powerpc/mm/pgtable_32.c
index 6e56a6240b..eb474a99f0 100644
--- a/arch/powerpc/mm/pgtable_32.c
+++ b/arch/powerpc/mm/pgtable_32.c
@@ -43,12 +43,12 @@ EXPORT_SYMBOL(ioremap_bot);	/* aka VMALLOC_END */
 
 extern char etext[], _stext[], _sinittext[], _einittext[];
 
-__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+__ref pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	if (!slab_is_available())
 		return memblock_alloc(PTE_FRAG_SIZE, PTE_FRAG_SIZE);
 
-	return (pte_t *)pte_fragment_alloc(mm, 1);
+	return (pte_t *)pte_fragment_alloc(mm, 1, gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -214,7 +214,7 @@ int map_kernel_page(unsigned long va, phys_addr_t pa, pgprot_t prot)
 	/* Use upper 10 bits of VA to index the first level map */
 	pd = pmd_offset(pud_offset(pgd_offset_k(va), va), va);
 	/* Use middle 10 bits of VA to index the second-level map */
-	pg = pte_alloc_kernel(pd, va);
+	pg = pte_alloc_kernel(pd, va, GFP_KERNEL);
 	if (pg != 0) {
 		err = 0;
 		/* The PTE should never be already set nor present in the
diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 94043cf83c..991c8d268e 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -67,10 +67,10 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 
 #ifndef __PAGETABLE_PMD_FOLDED
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return (pmd_t *)__get_free_page(
-		GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
+	return (pmd_t *)get_zeroed_page(gfp | __GFP_RETRY_MAYFAIL);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -82,10 +82,9 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(
-		GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_ZERO);
+	return (pte_t *)get_zeroed_page(gfp | __GFP_RETRY_MAYFAIL);
 }
 
 static inline struct page *pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index bccb8f4a63..49fb025627 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -19,10 +19,10 @@
 
 #define CRST_ALLOC_ORDER 2
 
-unsigned long *crst_table_alloc(struct mm_struct *);
+unsigned long *crst_table_alloc(struct mm_struct *, gfp_t);
 void crst_table_free(struct mm_struct *, unsigned long *);
 
-unsigned long *page_table_alloc(struct mm_struct *);
+unsigned long *page_table_alloc(struct mm_struct *, gfp_t);
 struct page *page_table_alloc_pgste(struct mm_struct *mm);
 void page_table_free(struct mm_struct *, unsigned long *);
 void page_table_free_rcu(struct mmu_gather *, unsigned long *, unsigned long);
@@ -48,9 +48,10 @@ static inline unsigned long pgd_entry_type(struct mm_struct *mm)
 int crst_table_upgrade(struct mm_struct *mm, unsigned long limit);
 void crst_table_downgrade(struct mm_struct *);
 
-static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, gfp);
 
 	if (table)
 		crst_table_init(table, _REGION2_ENTRY_EMPTY);
@@ -58,18 +59,20 @@ static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long address)
 }
 #define p4d_free(mm, p4d) crst_table_free(mm, (unsigned long *) p4d)
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address,
+				   gfp_t gfp)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, gfp);
 	if (table)
 		crst_table_init(table, _REGION3_ENTRY_EMPTY);
 	return (pud_t *) table;
 }
 #define pud_free(mm, pud) crst_table_free(mm, (unsigned long *) pud)
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr,
+				   gfp_t gfp)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, gfp);
 
 	if (!table)
 		return NULL;
@@ -104,7 +107,7 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	unsigned long *table = crst_table_alloc(mm);
+	unsigned long *table = crst_table_alloc(mm, GFP_KERNEL);
 
 	if (!table)
 		return NULL;
@@ -139,8 +142,8 @@ static inline void pmd_populate(struct mm_struct *mm,
 /*
  * page table entry allocation/free routines.
  */
-#define pte_alloc_one_kernel(mm) ((pte_t *)page_table_alloc(mm))
-#define pte_alloc_one(mm) ((pte_t *)page_table_alloc(mm))
+#define pte_alloc_one_kernel(mm, gfp) ((pte_t *)page_table_alloc(mm, gfp))
+#define pte_alloc_one(mm) ((pte_t *)page_table_alloc(mm, GFP_KERNEL))
 
 #define pte_free_kernel(mm, pte) page_table_free(mm, (unsigned long *) pte)
 #define pte_free(mm, pte) page_table_free(mm, (unsigned long *) pte)
diff --git a/arch/s390/mm/hugetlbpage.c b/arch/s390/mm/hugetlbpage.c
index b0246c705a..eeb1468369 100644
--- a/arch/s390/mm/hugetlbpage.c
+++ b/arch/s390/mm/hugetlbpage.c
@@ -192,14 +192,14 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pmd_t *pmdp = NULL;
 
 	pgdp = pgd_offset(mm, addr);
-	p4dp = p4d_alloc(mm, pgdp, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr, GFP_KERNEL);
 	if (p4dp) {
-		pudp = pud_alloc(mm, p4dp, addr);
+		pudp = pud_alloc(mm, p4dp, addr, GFP_KERNEL);
 		if (pudp) {
 			if (sz == PUD_SIZE)
 				return (pte_t *) pudp;
 			else if (sz == PMD_SIZE)
-				pmdp = pmd_alloc(mm, pudp, addr);
+				pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 		}
 	}
 	return (pte_t *) pmdp;
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index db6bb2f97a..b8c309de98 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -53,9 +53,9 @@ __initcall(page_table_register_sysctl);
 
 #endif /* CONFIG_PGSTE */
 
-unsigned long *crst_table_alloc(struct mm_struct *mm)
+unsigned long *crst_table_alloc(struct mm_struct *mm, gfp_t gfp)
 {
-	struct page *page = alloc_pages(GFP_KERNEL, 2);
+	struct page *page = alloc_pages(gfp, 2);
 
 	if (!page)
 		return NULL;
@@ -87,7 +87,7 @@ int crst_table_upgrade(struct mm_struct *mm, unsigned long end)
 	rc = 0;
 	notify = 0;
 	while (mm->context.asce_limit < end) {
-		table = crst_table_alloc(mm);
+		table = crst_table_alloc(mm, GFP_KERNEL);
 		if (!table) {
 			rc = -ENOMEM;
 			break;
@@ -179,7 +179,7 @@ void page_table_free_pgste(struct page *page)
 /*
  * page table entry allocation/free routines.
  */
-unsigned long *page_table_alloc(struct mm_struct *mm)
+unsigned long *page_table_alloc(struct mm_struct *mm, gfp_t gfp)
 {
 	unsigned long *table;
 	struct page *page;
@@ -209,7 +209,7 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
 			return table;
 	}
 	/* Allocate a fresh page */
-	page = alloc_page(GFP_KERNEL);
+	page = alloc_page(gfp);
 	if (!page)
 		return NULL;
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 8485d6dc27..0bc3249927 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -418,13 +418,13 @@ static pmd_t *pmd_alloc_map(struct mm_struct *mm, unsigned long addr)
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	return pmd;
 }
 
diff --git a/arch/s390/mm/vmem.c b/arch/s390/mm/vmem.c
index 0472e27feb..47ffefab75 100644
--- a/arch/s390/mm/vmem.c
+++ b/arch/s390/mm/vmem.c
@@ -54,7 +54,7 @@ pte_t __ref *vmem_pte_alloc(void)
 	pte_t *pte;
 
 	if (slab_is_available())
-		pte = (pte_t *) page_table_alloc(&init_mm);
+		pte = (pte_t *) page_table_alloc(&init_mm, GFP_KERNEL);
 	else
 		pte = (pte_t *) memblock_phys_alloc(size, size);
 	if (!pte)
diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index 8ad73cb311..bd51502e8b 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -12,7 +12,8 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
 #if PAGETABLE_LEVELS > 2
 extern void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd);
-extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address);
+extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+			    gfp_t gfp);
 extern void pmd_free(struct mm_struct *mm, pmd_t *pmd);
 #endif
 
@@ -32,9 +33,9 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 /*
  * Allocate and free page tables.
  */
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return quicklist_alloc(QUICK_PT, GFP_KERNEL, NULL);
+	return quicklist_alloc(QUICK_PT, gfp, NULL);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 960deb1f24..1eb4932cdb 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -32,9 +32,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 
 	pgd = pgd_offset(mm, addr);
 	if (pgd) {
-		pud = pud_alloc(mm, pgd, addr);
+		pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 		if (pud) {
-			pmd = pmd_alloc(mm, pud, addr);
+			pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 			if (pmd)
 				pte = pte_alloc_map(mm, pmd, addr);
 		}
diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c
index 70621324db..4bd118c32e 100644
--- a/arch/sh/mm/init.c
+++ b/arch/sh/mm/init.c
@@ -53,13 +53,13 @@ static pte_t *__get_pte_phys(unsigned long addr)
 		return NULL;
 	}
 
-	pud = pud_alloc(NULL, pgd, addr);
+	pud = pud_alloc(NULL, pgd, addr, GFP_KERNEL);
 	if (unlikely(!pud)) {
 		pud_ERROR(*pud);
 		return NULL;
 	}
 
-	pmd = pmd_alloc(NULL, pud, addr);
+	pmd = pmd_alloc(NULL, pud, addr, GFP_KERNEL);
 	if (unlikely(!pmd)) {
 		pmd_ERROR(*pmd);
 		return NULL;
diff --git a/arch/sh/mm/pgtable.c b/arch/sh/mm/pgtable.c
index 5c8f9247c3..972f54fa09 100644
--- a/arch/sh/mm/pgtable.c
+++ b/arch/sh/mm/pgtable.c
@@ -2,8 +2,6 @@
 #include <linux/mm.h>
 #include <linux/slab.h>
 
-#define PGALLOC_GFP GFP_KERNEL | __GFP_ZERO
-
 static struct kmem_cache *pgd_cachep;
 #if PAGETABLE_LEVELS > 2
 static struct kmem_cache *pmd_cachep;
@@ -32,7 +30,7 @@ void pgtable_cache_init(void)
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgd_cachep, PGALLOC_GFP);
+	return kmem_cache_alloc(pgd_cachep, GFP_KERNEL|__GFP_ZERO);
 }
 
 void pgd_free(struct mm_struct *mm, pgd_t *pgd)
@@ -46,9 +44,9 @@ void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 	set_pud(pud, __pud((unsigned long)pmd));
 }
 
-pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address, gfp_t gfp)
 {
-	return kmem_cache_alloc(pmd_cachep, PGALLOC_GFP);
+	return kmem_cache_alloc(pmd_cachep, gfp|__GFP_ZERO);
 }
 
 void pmd_free(struct mm_struct *mm, pmd_t *pmd)
diff --git a/arch/sparc/include/asm/pgalloc_32.h b/arch/sparc/include/asm/pgalloc_32.h
index 282be50a4a..51dea1c004 100644
--- a/arch/sparc/include/asm/pgalloc_32.h
+++ b/arch/sparc/include/asm/pgalloc_32.h
@@ -38,7 +38,8 @@ static inline void pgd_set(pgd_t * pgdp, pmd_t * pmdp)
 #define pgd_populate(MM, PGD, PMD)      pgd_set(PGD, PMD)
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm,
-				   unsigned long address)
+				   unsigned long address,
+				   gfp_t gfp)
 {
 	return srmmu_get_nocache(SRMMU_PMD_TABLE_SIZE,
 				 SRMMU_PMD_TABLE_SIZE);
@@ -60,12 +61,11 @@ void pmd_set(pmd_t *pmdp, pte_t *ptep);
 
 pgtable_t pte_alloc_one(struct mm_struct *mm);
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	return srmmu_get_nocache(PTE_SIZE, PTE_SIZE);
 }
 
-
 static inline void free_pte_fast(pte_t *pte)
 {
 	srmmu_free_nocache(pte, PTE_SIZE);
diff --git a/arch/sparc/include/asm/pgalloc_64.h b/arch/sparc/include/asm/pgalloc_64.h
index 48abccba49..e772ee60ee 100644
--- a/arch/sparc/include/asm/pgalloc_64.h
+++ b/arch/sparc/include/asm/pgalloc_64.h
@@ -40,9 +40,10 @@ static inline void __pud_populate(pud_t *pud, pmd_t *pmd)
 
 #define pud_populate(MM, PUD, PMD)	__pud_populate(PUD, PMD)
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+	return kmem_cache_alloc(pgtable_cache, gfp);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
@@ -50,9 +51,10 @@ static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 	kmem_cache_free(pgtable_cache, pud);
 }
 
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	return kmem_cache_alloc(pgtable_cache, GFP_KERNEL);
+	return kmem_cache_alloc(pgtable_cache, gfp);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
@@ -60,7 +62,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 	kmem_cache_free(pgtable_cache, pmd);
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp);
 pgtable_t pte_alloc_one(struct mm_struct *mm);
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
 void pte_free(struct mm_struct *mm, pgtable_t ptepage);
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f78793a06b..aeacfb0aab 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -281,12 +281,12 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	pud = pud_alloc(mm, pgd, addr);
+	pud = pud_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
 	if (sz >= PUD_SIZE)
 		return (pte_t *)pud;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return NULL;
 	if (sz >= PMD_SIZE)
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index f2d70ff7a2..bd81b148f4 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2933,15 +2933,9 @@ void __flush_tlb_all(void)
 			     : : "r" (pstate));
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	pte_t *pte = NULL;
-
-	if (page)
-		pte = (pte_t *) page_address(page);
-
-	return pte;
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index aaebbc00d2..143a5bc7ce 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -375,7 +375,7 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 	unsigned long pte;
 	struct page *page;
 
-	if ((pte = (unsigned long)pte_alloc_one_kernel(mm)) == 0)
+	if ((pte = (unsigned long)pte_alloc_one_kernel(mm, GFP_KERNEL)) == 0)
 		return NULL;
 	page = pfn_to_page(__nocache_pa(pte) >> PAGE_SHIFT);
 	if (!pgtable_page_ctor(page)) {
diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h
index 99eb568279..71090e43d0 100644
--- a/arch/um/include/asm/pgalloc.h
+++ b/arch/um/include/asm/pgalloc.h
@@ -25,7 +25,7 @@
 extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *, gfp_t);
 extern pgtable_t pte_alloc_one(struct mm_struct *);
 
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
diff --git a/arch/um/include/asm/pgtable-3level.h b/arch/um/include/asm/pgtable-3level.h
index c4d876dfb9..7f5fd79234 100644
--- a/arch/um/include/asm/pgtable-3level.h
+++ b/arch/um/include/asm/pgtable-3level.h
@@ -80,7 +80,8 @@ static inline void pgd_mkuptodate(pgd_t pgd) { pgd_val(pgd) &= ~_PAGE_NEWPAGE; }
 #endif
 
 struct mm_struct;
-extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address);
+extern pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+			    gfp_t gfp);
 
 static inline void pud_clear (pud_t *pud)
 {
diff --git a/arch/um/kernel/mem.c b/arch/um/kernel/mem.c
index 99aa11bf53..3e0ce9f645 100644
--- a/arch/um/kernel/mem.c
+++ b/arch/um/kernel/mem.c
@@ -215,12 +215,9 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long) pgd);
 }
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	pte_t *pte;
-
-	pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_ZERO);
-	return pte;
+	return (pte_t *)get_zeroed_page(gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
@@ -238,14 +235,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
 }
 
 #ifdef CONFIG_3_LEVEL_PGTABLES
-pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
+pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address,
+		     gfp_t gfp)
 {
-	pmd_t *pmd = (pmd_t *) __get_free_page(GFP_KERNEL);
-
-	if (pmd)
-		memset(pmd, 0, PAGE_SIZE);
-
-	return pmd;
+	return (pmd_t *) get_zeroed_page(gfp);
 }
 #endif
 
diff --git a/arch/um/kernel/skas/mmu.c b/arch/um/kernel/skas/mmu.c
index 7a1f2a936f..b677b615d6 100644
--- a/arch/um/kernel/skas/mmu.c
+++ b/arch/um/kernel/skas/mmu.c
@@ -24,11 +24,11 @@ static int init_stub_pte(struct mm_struct *mm, unsigned long proc,
 	pte_t *pte;
 
 	pgd = pgd_offset(mm, proc);
-	pud = pud_alloc(mm, pgd, proc);
+	pud = pud_alloc(mm, pgd, proc, GFP_KERNEL);
 	if (!pud)
 		goto out;
 
-	pmd = pmd_alloc(mm, pud, proc);
+	pmd = pmd_alloc(mm, pud, proc, GFP_KERNEL);
 	if (!pmd)
 		goto out_pmd;
 
diff --git a/arch/unicore32/include/asm/pgalloc.h b/arch/unicore32/include/asm/pgalloc.h
index 7cceabecf4..e5f6c1ae64 100644
--- a/arch/unicore32/include/asm/pgalloc.h
+++ b/arch/unicore32/include/asm/pgalloc.h
@@ -28,17 +28,15 @@ extern void free_pgd_slow(struct mm_struct *mm, pgd_t *pgd);
 #define pgd_alloc(mm)			get_pgd_slow(mm)
 #define pgd_free(mm, pgd)		free_pgd_slow(mm, pgd)
 
-#define PGALLOC_GFP	(GFP_KERNEL | __GFP_ZERO)
-
 /*
  * Allocate one PTE table.
  */
 static inline pte_t *
-pte_alloc_one_kernel(struct mm_struct *mm)
+pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *pte;
 
-	pte = (pte_t *)__get_free_page(PGALLOC_GFP);
+	pte = (pte_t *)get_zeroed_page(gfp);
 	if (pte)
 		clean_dcache_area(pte, PTRS_PER_PTE * sizeof(pte_t));
 
@@ -50,7 +48,7 @@ pte_alloc_one(struct mm_struct *mm)
 {
 	struct page *pte;
 
-	pte = alloc_pages(PGALLOC_GFP, 0);
+	pte = alloc_pages(GFP_KERNEL|__GFP_ZERO, 0);
 	if (!pte)
 		return NULL;
 	if (!PageHighMem(pte)) {
diff --git a/arch/unicore32/mm/pgd.c b/arch/unicore32/mm/pgd.c
index a830a300aa..b9c628a55f 100644
--- a/arch/unicore32/mm/pgd.c
+++ b/arch/unicore32/mm/pgd.c
@@ -50,7 +50,7 @@ pgd_t *get_pgd_slow(struct mm_struct *mm)
 		 * On UniCore, first page must always be allocated since it
 		 * contains the machine vectors.
 		 */
-		new_pmd = pmd_alloc(mm, (pud_t *)new_pgd, 0);
+		new_pmd = pmd_alloc(mm, (pud_t *)new_pgd, 0, GFP_KERNEL);
 		if (!new_pmd)
 			goto no_pmd;
 
diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index a281e61ec6..1909a8dfaf 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -47,7 +47,7 @@ extern gfp_t __userpte_alloc_gfp;
 extern pgd_t *pgd_alloc(struct mm_struct *);
 extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 
-extern pte_t *pte_alloc_one_kernel(struct mm_struct *);
+extern pte_t *pte_alloc_one_kernel(struct mm_struct *, gfp_t);
 extern pgtable_t pte_alloc_one(struct mm_struct *);
 
 /* Should really implement gc for free page table pages. This could be
@@ -99,14 +99,14 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
 #if CONFIG_PGTABLE_LEVELS > 2
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
 	struct page *page;
-	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
-	page = alloc_pages(gfp, 0);
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
+	page = alloc_pages(gfp|__GFP_ZERO, 0);
 	if (!page)
 		return NULL;
 	if (!pgtable_pmd_page_ctor(page)) {
@@ -160,12 +160,11 @@ static inline void p4d_populate_safe(struct mm_struct *mm, p4d_t *p4d, pud_t *pu
 	set_p4d_safe(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
 }
 
-static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	gfp_t gfp = GFP_KERNEL_ACCOUNT;
-
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
 	return (pud_t *)get_zeroed_page(gfp);
 }
 
@@ -200,12 +199,11 @@ static inline void pgd_populate_safe(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4
 	set_pgd_safe(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
 }
 
-static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr,
+				   gfp_t gfp)
 {
-	gfp_t gfp = GFP_KERNEL_ACCOUNT;
-
-	if (mm == &init_mm)
-		gfp &= ~__GFP_ACCOUNT;
+	if (mm != &init_mm)
+		gfp |= __GFP_ACCOUNT;
 	return (p4d_t *)get_zeroed_page(gfp);
 }
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index aebd0d5bc0..46df9bb51f 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -126,7 +126,7 @@ void __init init_espfix_bsp(void)
 
 	/* Install the espfix pud into the kernel page directory */
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
-	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
+	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR, GFP_KERNEL);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
 	/* Randomize the locations */
diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 6e5ef8fb8a..9a4f0fa6d6 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -124,13 +124,13 @@ static int map_tboot_page(unsigned long vaddr, unsigned long pfn,
 	pte_t *pte;
 
 	pgd = pgd_offset(&tboot_mm, vaddr);
-	p4d = p4d_alloc(&tboot_mm, pgd, vaddr);
+	p4d = p4d_alloc(&tboot_mm, pgd, vaddr, GFP_KERNEL);
 	if (!p4d)
 		return -1;
-	pud = pud_alloc(&tboot_mm, p4d, vaddr);
+	pud = pud_alloc(&tboot_mm, p4d, vaddr, GFP_KERNEL);
 	if (!pud)
 		return -1;
-	pmd = pmd_alloc(&tboot_mm, pud, vaddr);
+	pmd = pmd_alloc(&tboot_mm, pud, vaddr, GFP_KERNEL);
 	if (!pmd)
 		return -1;
 	pte = pte_alloc_map(&tboot_mm, pmd, vaddr);
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 7bd01709a0..04cb5aec7b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -23,9 +23,9 @@ EXPORT_SYMBOL(physical_mask);
 
 gfp_t __userpte_alloc_gfp = PGALLOC_GFP | PGALLOC_USER_GFP;
 
-pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
-	return (pte_t *)__get_free_page(PGALLOC_GFP & ~__GFP_ACCOUNT);
+	return (pte_t *) get_zeroed_page(gfp);
 }
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index cf0347f61b..9cb455ba28 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -106,7 +106,7 @@ pgd_t * __init efi_call_phys_prolog(void)
 		pgd_efi = pgd_offset_k(addr_pgd);
 		save_pgd[pgd] = *pgd_efi;
 
-		p4d = p4d_alloc(&init_mm, pgd_efi, addr_pgd);
+		p4d = p4d_alloc(&init_mm, pgd_efi, addr_pgd, GFP_KERNEL);
 		if (!p4d) {
 			pr_err("Failed to allocate p4d table!\n");
 			goto out;
@@ -116,7 +116,8 @@ pgd_t * __init efi_call_phys_prolog(void)
 			addr_p4d = addr_pgd + i * P4D_SIZE;
 			p4d_efi = p4d + p4d_index(addr_p4d);
 
-			pud = pud_alloc(&init_mm, p4d_efi, addr_p4d);
+			pud = pud_alloc(&init_mm, p4d_efi, addr_p4d,
+					GFP_KERNEL);
 			if (!pud) {
 				pr_err("Failed to allocate pud table!\n");
 				goto out;
@@ -217,13 +218,13 @@ int __init efi_alloc_page_tables(void)
 		return -ENOMEM;
 
 	pgd = efi_pgd + pgd_index(EFI_VA_END);
-	p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END);
+	p4d = p4d_alloc(&init_mm, pgd, EFI_VA_END, GFP_KERNEL);
 	if (!p4d) {
 		free_page((unsigned long)efi_pgd);
 		return -ENOMEM;
 	}
 
-	pud = pud_alloc(&init_mm, p4d, EFI_VA_END);
+	pud = pud_alloc(&init_mm, p4d, EFI_VA_END, GFP_KERNEL);
 	if (!pud) {
 		if (pgtable_l5_enabled())
 			free_page((unsigned long) pgd_page_vaddr(*pgd));
diff --git a/arch/xtensa/include/asm/pgalloc.h b/arch/xtensa/include/asm/pgalloc.h
index b3b388ff2f..cc7ec6dd09 100644
--- a/arch/xtensa/include/asm/pgalloc.h
+++ b/arch/xtensa/include/asm/pgalloc.h
@@ -38,12 +38,12 @@ static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
-static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm, gfp_t gfp)
 {
 	pte_t *ptep;
 	int i;
 
-	ptep = (pte_t *)__get_free_page(GFP_KERNEL);
+	ptep = (pte_t *)__get_free_page(gfp);
 	if (!ptep)
 		return NULL;
 	for (i = 0; i < 1024; i++)
diff --git a/drivers/staging/media/ipu3/ipu3-dmamap.c b/drivers/staging/media/ipu3/ipu3-dmamap.c
index d978a00e1e..f74221ab2b 100644
--- a/drivers/staging/media/ipu3/ipu3-dmamap.c
+++ b/drivers/staging/media/ipu3/ipu3-dmamap.c
@@ -137,7 +137,7 @@ void *imgu_dmamap_alloc(struct imgu_device *imgu, struct imgu_css_map *map,
 
 	map->vma->pages = pages;
 	/* And map it in KVA */
-	if (map_vm_area(map->vma, PAGE_KERNEL, pages))
+	if (map_vm_area(map->vma, GFP_KERNEL, PAGE_KERNEL, pages))
 		goto out_vunmap;
 
 	map->size = size;
diff --git a/include/asm-generic/4level-fixup.h b/include/asm-generic/4level-fixup.h
index e3667c9a33..652b68f475 100644
--- a/include/asm-generic/4level-fixup.h
+++ b/include/asm-generic/4level-fixup.h
@@ -12,9 +12,9 @@
 
 #define pud_t				pgd_t
 
-#define pmd_alloc(mm, pud, address) \
-	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address))? \
- 		NULL: pmd_offset(pud, address))
+#define pmd_alloc(mm, pud, address, gfp) \
+	((unlikely(pgd_none(*(pud))) && __pmd_alloc(mm, pud, address, gfp)) \
+		? NULL : pmd_offset(pud, address))
 
 #define pud_offset(pgd, start)		(pgd)
 #define pud_none(pud)			0
diff --git a/include/asm-generic/5level-fixup.h b/include/asm-generic/5level-fixup.h
index bb6cb34701..c6f68f6a9f 100644
--- a/include/asm-generic/5level-fixup.h
+++ b/include/asm-generic/5level-fixup.h
@@ -13,11 +13,11 @@
 
 #define p4d_t				pgd_t
 
-#define pud_alloc(mm, p4d, address) \
-	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address)) ? \
+#define pud_alloc(mm, p4d, address, gfp) \
+	((unlikely(pgd_none(*(p4d))) && __pud_alloc(mm, p4d, address, gfp)) ? \
 		NULL : pud_offset(p4d, address))
 
-#define p4d_alloc(mm, pgd, address)	(pgd)
+#define p4d_alloc(mm, pgd, address, gfp) (pgd)
 #define p4d_offset(pgd, start)		(pgd)
 #define p4d_none(p4d)			0
 #define p4d_bad(p4d)			0
diff --git a/include/asm-generic/pgtable-nop4d-hack.h b/include/asm-generic/pgtable-nop4d-hack.h
index 829bdb0d63..3ba3c7e4b9 100644
--- a/include/asm-generic/pgtable-nop4d-hack.h
+++ b/include/asm-generic/pgtable-nop4d-hack.h
@@ -53,7 +53,7 @@ static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
  * allocating and freeing a pud is trivial: the 1-entry pud is
  * inside the pgd, so has no extra memory associated with it.
  */
-#define pud_alloc_one(mm, address)		NULL
+#define pud_alloc_one(mm, address, gfp)		NULL
 #define pud_free(mm, x)				do { } while (0)
 #define __pud_free_tlb(tlb, x, a)		do { } while (0)
 
diff --git a/include/asm-generic/pgtable-nop4d.h b/include/asm-generic/pgtable-nop4d.h
index aebab905e6..7c9e00e44d 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -48,7 +48,7 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
  * allocating and freeing a p4d is trivial: the 1-entry p4d is
  * inside the pgd, so has no extra memory associated with it.
  */
-#define p4d_alloc_one(mm, address)		NULL
+#define p4d_alloc_one(mm, address, gfp)		NULL
 #define p4d_free(mm, x)				do { } while (0)
 #define __p4d_free_tlb(tlb, x, a)		do { } while (0)
 
diff --git a/include/asm-generic/pgtable-nopmd.h b/include/asm-generic/pgtable-nopmd.h
index b85b8271a7..e4a51cbdef 100644
--- a/include/asm-generic/pgtable-nopmd.h
+++ b/include/asm-generic/pgtable-nopmd.h
@@ -56,7 +56,7 @@ static inline pmd_t * pmd_offset(pud_t * pud, unsigned long address)
  * allocating and freeing a pmd is trivial: the 1-entry pmd is
  * inside the pud, so has no extra memory associated with it.
  */
-#define pmd_alloc_one(mm, address)		NULL
+#define pmd_alloc_one(mm, address, gfp)		NULL
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
 }
diff --git a/include/asm-generic/pgtable-nopud.h b/include/asm-generic/pgtable-nopud.h
index c77a1d3011..e7aacf134c 100644
--- a/include/asm-generic/pgtable-nopud.h
+++ b/include/asm-generic/pgtable-nopud.h
@@ -57,7 +57,7 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
  * allocating and freeing a pud is trivial: the 1-entry pud is
  * inside the p4d, so has no extra memory associated with it.
  */
-#define pud_alloc_one(mm, address)		NULL
+#define pud_alloc_one(mm, address, gfp)		NULL
 #define pud_free(mm, x)				do { } while (0)
 #define __pud_free_tlb(tlb, x, a)		do { } while (0)
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6b10c21630..d6f315e106 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1749,17 +1749,18 @@ static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
 
 #ifdef __PAGETABLE_P4D_FOLDED
 static inline int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
-						unsigned long address)
+			      unsigned long address, gfp_t gfp)
 {
 	return 0;
 }
 #else
-int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address);
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
+		unsigned long address, gfp_t gfp);
 #endif
 
 #if defined(__PAGETABLE_PUD_FOLDED) || !defined(CONFIG_MMU)
 static inline int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
-						unsigned long address)
+			      unsigned long address, gfp_t gfp)
 {
 	return 0;
 }
@@ -1767,7 +1768,8 @@ static inline void mm_inc_nr_puds(struct mm_struct *mm) {}
 static inline void mm_dec_nr_puds(struct mm_struct *mm) {}
 
 #else
-int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address);
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d,
+		unsigned long address, gfp_t gfp);
 
 static inline void mm_inc_nr_puds(struct mm_struct *mm)
 {
@@ -1786,7 +1788,7 @@ static inline void mm_dec_nr_puds(struct mm_struct *mm)
 
 #if defined(__PAGETABLE_PMD_FOLDED) || !defined(CONFIG_MMU)
 static inline int __pmd_alloc(struct mm_struct *mm, pud_t *pud,
-						unsigned long address)
+			      unsigned long address, gfp_t gfp)
 {
 	return 0;
 }
@@ -1795,7 +1797,8 @@ static inline void mm_inc_nr_pmds(struct mm_struct *mm) {}
 static inline void mm_dec_nr_pmds(struct mm_struct *mm) {}
 
 #else
-int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address);
+int __pmd_alloc(struct mm_struct *mm, pud_t *pud,
+		unsigned long address, gfp_t gfp);
 
 static inline void mm_inc_nr_pmds(struct mm_struct *mm)
 {
@@ -1845,7 +1848,7 @@ static inline void mm_dec_nr_ptes(struct mm_struct *mm) {}
 #endif
 
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd);
-int __pte_alloc_kernel(pmd_t *pmd);
+int __pte_alloc_kernel(pmd_t *pmd, gfp_t gfp);
 
 /*
  * The following ifdef needed to get the 4level-fixup.h header to work.
@@ -1855,24 +1858,25 @@ int __pte_alloc_kernel(pmd_t *pmd);
 
 #ifndef __ARCH_HAS_5LEVEL_HACK
 static inline p4d_t *p4d_alloc(struct mm_struct *mm, pgd_t *pgd,
-		unsigned long address)
+		unsigned long address, gfp_t gfp)
 {
-	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address)) ?
-		NULL : p4d_offset(pgd, address);
+	return (unlikely(pgd_none(*pgd)) && __p4d_alloc(mm, pgd, address, gfp))
+	    ? NULL : p4d_offset(pgd, address);
 }
 
 static inline pud_t *pud_alloc(struct mm_struct *mm, p4d_t *p4d,
-		unsigned long address)
+		unsigned long address, gfp_t gfp)
 {
-	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address)) ?
-		NULL : pud_offset(p4d, address);
+	return (unlikely(p4d_none(*p4d)) && __pud_alloc(mm, p4d, address, gfp))
+	    ? NULL : pud_offset(p4d, address);
 }
 #endif /* !__ARCH_HAS_5LEVEL_HACK */
 
-static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud,
+		unsigned long address, gfp_t gfp)
 {
-	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address))?
-		NULL: pmd_offset(pud, address);
+	return (unlikely(pud_none(*pud)) && __pmd_alloc(mm, pud, address, gfp))
+	    ? NULL : pmd_offset(pud, address);
 }
 #endif /* CONFIG_MMU && !__ARCH_HAS_4LEVEL_HACK */
 
@@ -1985,8 +1989,8 @@ static inline void pgtable_page_dtor(struct page *page)
 	(pte_alloc(mm, pmd) ?			\
 		 NULL : pte_offset_map_lock(mm, pmd, address, ptlp))
 
-#define pte_alloc_kernel(pmd, address)			\
-	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd))? \
+#define pte_alloc_kernel(pmd, address, gfp)			\
+	((unlikely(pmd_none(*(pmd))) && __pte_alloc_kernel(pmd, gfp))? \
 		NULL: pte_offset_kernel(pmd, address))
 
 #if USE_SPLIT_PMD_PTLOCKS
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 398e9c95cd..11788d5ba3 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -135,7 +135,7 @@ extern struct vm_struct *__get_vm_area_caller(unsigned long size,
 extern struct vm_struct *remove_vm_area(const void *addr);
 extern struct vm_struct *find_vm_area(const void *addr);
 
-extern int map_vm_area(struct vm_struct *area, pgprot_t prot,
+extern int map_vm_area(struct vm_struct *area, gfp_t gfp, pgprot_t prot,
 			struct page **pages);
 #ifdef CONFIG_MMU
 extern int map_kernel_range_noflush(unsigned long start, unsigned long size,
diff --git a/lib/ioremap.c b/lib/ioremap.c
index 0632136855..a4e21ef50c 100644
--- a/lib/ioremap.c
+++ b/lib/ioremap.c
@@ -65,7 +65,7 @@ static int ioremap_pte_range(pmd_t *pmd, unsigned long addr,
 	u64 pfn;
 
 	pfn = phys_addr >> PAGE_SHIFT;
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel(pmd, addr, GFP_KERNEL);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -101,7 +101,7 @@ static inline int ioremap_pmd_range(pud_t *pud, unsigned long addr,
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc(&init_mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return -ENOMEM;
 	do {
@@ -141,7 +141,7 @@ static inline int ioremap_pud_range(p4d_t *p4d, unsigned long addr,
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc(&init_mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -181,7 +181,7 @@ static inline int ioremap_p4d_range(pgd_t *pgd, unsigned long addr,
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc(&init_mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return -ENOMEM;
 	do {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6cdc7b2d91..245d4a2585 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4683,7 +4683,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	spinlock_t *ptl;
 
 	if (!vma_shareable(vma, addr))
-		return (pte_t *)pmd_alloc(mm, pud, addr);
+		return (pte_t *)pmd_alloc(mm, pud, addr, GFP_KERNEL);
 
 	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
@@ -4714,7 +4714,7 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	}
 	spin_unlock(ptl);
 out:
-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
+	pte = (pte_t *)pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	i_mmap_unlock_write(mapping);
 	return pte;
 }
@@ -4776,10 +4776,10 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 	pte_t *pte = NULL;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (pud) {
 		if (sz == PUD_SIZE) {
 			pte = (pte_t *)pud;
@@ -4788,7 +4788,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 			if (want_pmd_share() && pud_none(*pud))
 				pte = huge_pmd_share(mm, addr, pud);
 			else
-				pte = (pte_t *)pmd_alloc(mm, pud, addr);
+				pte = (pte_t *)pmd_alloc(mm, pud, addr,
+							 GFP_KERNEL);
 		}
 	}
 	BUG_ON(pte && pte_present(*pte) && !pte_huge(*pte));
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index ce45c491eb..3ed63dcb7a 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -129,7 +129,7 @@ static int __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 			pte_t *p;
 
 			if (slab_is_available())
-				p = pte_alloc_one_kernel(&init_mm);
+				p = pte_alloc_one_kernel(&init_mm, GFP_KERNEL);
 			else
 				p = early_alloc(PAGE_SIZE, NUMA_NO_NODE);
 			if (!p)
@@ -166,7 +166,7 @@ static int __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 			pmd_t *p;
 
 			if (slab_is_available()) {
-				p = pmd_alloc(&init_mm, pud, addr);
+				p = pmd_alloc(&init_mm, pud, addr, GFP_KERNEL);
 				if (!p)
 					return -ENOMEM;
 			} else {
@@ -207,7 +207,7 @@ static int __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 			pud_t *p;
 
 			if (slab_is_available()) {
-				p = pud_alloc(&init_mm, p4d, addr);
+				p = pud_alloc(&init_mm, p4d, addr, GFP_KERNEL);
 				if (!p)
 					return -ENOMEM;
 			} else {
@@ -280,7 +280,7 @@ int __ref kasan_populate_early_shadow(const void *shadow_start,
 			p4d_t *p;
 
 			if (slab_is_available()) {
-				p = p4d_alloc(&init_mm, pgd, addr);
+				p = p4d_alloc(&init_mm, pgd, addr, GFP_KERNEL);
 				if (!p)
 					return -ENOMEM;
 			} else {
diff --git a/mm/memory.c b/mm/memory.c
index ab650c21bc..f599cdd1bc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -435,9 +435,9 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 	return 0;
 }
 
-int __pte_alloc_kernel(pmd_t *pmd)
+int __pte_alloc_kernel(pmd_t *pmd, gfp_t gfp)
 {
-	pte_t *new = pte_alloc_one_kernel(&init_mm);
+	pte_t *new = pte_alloc_one_kernel(&init_mm, gfp);
 	if (!new)
 		return -ENOMEM;
 
@@ -884,7 +884,7 @@ static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src
 	pmd_t *src_pmd, *dst_pmd;
 	unsigned long next;
 
-	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr);
+	dst_pmd = pmd_alloc(dst_mm, dst_pud, addr, GFP_KERNEL);
 	if (!dst_pmd)
 		return -ENOMEM;
 	src_pmd = pmd_offset(src_pud, addr);
@@ -918,7 +918,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 	pud_t *src_pud, *dst_pud;
 	unsigned long next;
 
-	dst_pud = pud_alloc(dst_mm, dst_p4d, addr);
+	dst_pud = pud_alloc(dst_mm, dst_p4d, addr, GFP_KERNEL);
 	if (!dst_pud)
 		return -ENOMEM;
 	src_pud = pud_offset(src_p4d, addr);
@@ -952,7 +952,7 @@ static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src
 	p4d_t *src_p4d, *dst_p4d;
 	unsigned long next;
 
-	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr);
+	dst_p4d = p4d_alloc(dst_mm, dst_pgd, addr, GFP_KERNEL);
 	if (!dst_p4d)
 		return -ENOMEM;
 	src_p4d = p4d_offset(src_pgd, addr);
@@ -1422,13 +1422,13 @@ pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return NULL;
 
@@ -1768,7 +1768,7 @@ static inline int remap_pmd_range(struct mm_struct *mm, pud_t *pud,
 	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return -ENOMEM;
 	VM_BUG_ON(pmd_trans_huge(*pmd));
@@ -1791,7 +1791,7 @@ static inline int remap_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -1813,7 +1813,7 @@ static inline int remap_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	int err;
 
 	pfn -= addr >> PAGE_SHIFT;
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return -ENOMEM;
 	do {
@@ -1956,7 +1956,7 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
 	spinlock_t *uninitialized_var(ptl);
 
 	pte = (mm == &init_mm) ?
-		pte_alloc_kernel(pmd, addr) :
+		pte_alloc_kernel(pmd, addr, GFP_KERNEL) :
 		pte_alloc_map_lock(mm, pmd, addr, &ptl);
 	if (!pte)
 		return -ENOMEM;
@@ -1990,7 +1990,7 @@ static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
 
 	BUG_ON(pud_huge(*pud));
 
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return -ENOMEM;
 	do {
@@ -2010,7 +2010,7 @@ static int apply_to_pud_range(struct mm_struct *mm, p4d_t *p4d,
 	unsigned long next;
 	int err;
 
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return -ENOMEM;
 	do {
@@ -2030,7 +2030,7 @@ static int apply_to_p4d_range(struct mm_struct *mm, pgd_t *pgd,
 	unsigned long next;
 	int err;
 
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return -ENOMEM;
 	do {
@@ -3868,11 +3868,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	vm_fault_t ret;
 
 	pgd = pgd_offset(mm, address);
-	p4d = p4d_alloc(mm, pgd, address);
+	p4d = p4d_alloc(mm, pgd, address, GFP_KERNEL);
 	if (!p4d)
 		return VM_FAULT_OOM;
 
-	vmf.pud = pud_alloc(mm, p4d, address);
+	vmf.pud = pud_alloc(mm, p4d, address, GFP_KERNEL);
 	if (!vmf.pud)
 		return VM_FAULT_OOM;
 	if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
@@ -3898,7 +3898,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		}
 	}
 
-	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
+	vmf.pmd = pmd_alloc(mm, vmf.pud, address, GFP_KERNEL);
 	if (!vmf.pmd)
 		return VM_FAULT_OOM;
 	if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
@@ -3991,9 +3991,10 @@ EXPORT_SYMBOL_GPL(handle_mm_fault);
  * Allocate p4d page table.
  * We've already handled the fast-path in-line.
  */
-int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
+int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address,
+		gfp_t gfp)
 {
-	p4d_t *new = p4d_alloc_one(mm, address);
+	p4d_t *new = p4d_alloc_one(mm, address, gfp);
 	if (!new)
 		return -ENOMEM;
 
@@ -4014,9 +4015,10 @@ int __p4d_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long address)
  * Allocate page upper directory.
  * We've already handled the fast-path in-line.
  */
-int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
+int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address,
+		gfp_t gfp)
 {
-	pud_t *new = pud_alloc_one(mm, address);
+	pud_t *new = pud_alloc_one(mm, address, gfp);
 	if (!new)
 		return -ENOMEM;
 
@@ -4046,10 +4048,11 @@ int __pud_alloc(struct mm_struct *mm, p4d_t *p4d, unsigned long address)
  * Allocate page middle directory.
  * We've already handled the fast-path in-line.
  */
-int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
+int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address,
+		gfp_t gfp)
 {
 	spinlock_t *ptl;
-	pmd_t *new = pmd_alloc_one(mm, address);
+	pmd_t *new = pmd_alloc_one(mm, address, gfp);
 	if (!new)
 		return -ENOMEM;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 663a544936..917ff0b3f7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2616,13 +2616,13 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto abort;
 
 	pgdp = pgd_offset(mm, addr);
-	p4dp = p4d_alloc(mm, pgdp, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr, GFP_KERNEL);
 	if (!p4dp)
 		goto abort;
-	pudp = pud_alloc(mm, p4dp, addr);
+	pudp = pud_alloc(mm, p4dp, addr, GFP_KERNEL);
 	if (!pudp)
 		goto abort;
-	pmdp = pmd_alloc(mm, pudp, addr);
+	pmdp = pmd_alloc(mm, pudp, addr, GFP_KERNEL);
 	if (!pmdp)
 		goto abort;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index e3edef6b7a..b1f9605fad 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -65,14 +65,14 @@ static pmd_t *alloc_new_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
 	pmd_t *pmd;
 
 	pgd = pgd_offset(mm, addr);
-	p4d = p4d_alloc(mm, pgd, addr);
+	p4d = p4d_alloc(mm, pgd, addr, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, addr);
+	pud = pud_alloc(mm, p4d, addr, GFP_KERNEL);
 	if (!pud)
 		return NULL;
 
-	pmd = pmd_alloc(mm, pud, addr);
+	pmd = pmd_alloc(mm, pud, addr, GFP_KERNEL);
 	if (!pmd)
 		return NULL;
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d59b5a73df..9bb9d44834 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -153,10 +153,10 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	pud_t *pud;
 
 	pgd = pgd_offset(mm, address);
-	p4d = p4d_alloc(mm, pgd, address);
+	p4d = p4d_alloc(mm, pgd, address, GFP_KERNEL);
 	if (!p4d)
 		return NULL;
-	pud = pud_alloc(mm, p4d, address);
+	pud = pud_alloc(mm, p4d, address, GFP_KERNEL);
 	if (!pud)
 		return NULL;
 	/*
@@ -164,7 +164,7 @@ static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 	 * missing, the *pmd may be already established and in
 	 * turn it may also be a trans_huge_pmd.
 	 */
-	return pmd_alloc(mm, pud, address);
+	return pmd_alloc(mm, pud, address, GFP_KERNEL);
 }
 
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index e86ba6e74b..288d078ab6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -132,7 +132,8 @@ static void vunmap_page_range(unsigned long addr, unsigned long end)
 }
 
 static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	pte_t *pte;
 
@@ -141,7 +142,7 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 	 * callers keep track of where we're up to.
 	 */
 
-	pte = pte_alloc_kernel(pmd, addr);
+	pte = pte_alloc_kernel(pmd, addr, gfp);
 	if (!pte)
 		return -ENOMEM;
 	do {
@@ -158,51 +159,54 @@ static int vmap_pte_range(pmd_t *pmd, unsigned long addr,
 }
 
 static int vmap_pmd_range(pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	pmd_t *pmd;
 	unsigned long next;
 
-	pmd = pmd_alloc(&init_mm, pud, addr);
+	pmd = pmd_alloc(&init_mm, pud, addr, gfp);
 	if (!pmd)
 		return -ENOMEM;
 	do {
 		next = pmd_addr_end(addr, end);
-		if (vmap_pte_range(pmd, addr, next, prot, pages, nr))
+		if (vmap_pte_range(pmd, addr, next, gfp, prot, pages, nr))
 			return -ENOMEM;
 	} while (pmd++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_pud_range(p4d_t *p4d, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	pud_t *pud;
 	unsigned long next;
 
-	pud = pud_alloc(&init_mm, p4d, addr);
+	pud = pud_alloc(&init_mm, p4d, addr, gfp);
 	if (!pud)
 		return -ENOMEM;
 	do {
 		next = pud_addr_end(addr, end);
-		if (vmap_pmd_range(pud, addr, next, prot, pages, nr))
+		if (vmap_pmd_range(pud, addr, next, gfp, prot, pages, nr))
 			return -ENOMEM;
 	} while (pud++, addr = next, addr != end);
 	return 0;
 }
 
 static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
-		unsigned long end, pgprot_t prot, struct page **pages, int *nr)
+		unsigned long end, gfp_t gfp, pgprot_t prot,
+		struct page **pages, int *nr)
 {
 	p4d_t *p4d;
 	unsigned long next;
 
-	p4d = p4d_alloc(&init_mm, pgd, addr);
+	p4d = p4d_alloc(&init_mm, pgd, addr, gfp);
 	if (!p4d)
 		return -ENOMEM;
 	do {
 		next = p4d_addr_end(addr, end);
-		if (vmap_pud_range(p4d, addr, next, prot, pages, nr))
+		if (vmap_pud_range(p4d, addr, next, gfp, prot, pages, nr))
 			return -ENOMEM;
 	} while (p4d++, addr = next, addr != end);
 	return 0;
@@ -215,7 +219,8 @@ static int vmap_p4d_range(pgd_t *pgd, unsigned long addr,
  * Ie. pte at addr+N*PAGE_SIZE shall point to pfn corresponding to pages[N]
  */
 static int vmap_page_range_noflush(unsigned long start, unsigned long end,
-				   pgprot_t prot, struct page **pages)
+				   gfp_t gfp, pgprot_t prot,
+				   struct page **pages)
 {
 	pgd_t *pgd;
 	unsigned long next;
@@ -227,7 +232,7 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
 	pgd = pgd_offset_k(addr);
 	do {
 		next = pgd_addr_end(addr, end);
-		err = vmap_p4d_range(pgd, addr, next, prot, pages, &nr);
+		err = vmap_p4d_range(pgd, addr, next, gfp, prot, pages, &nr);
 		if (err)
 			return err;
 	} while (pgd++, addr = next, addr != end);
@@ -236,11 +241,11 @@ static int vmap_page_range_noflush(unsigned long start, unsigned long end,
 }
 
 static int vmap_page_range(unsigned long start, unsigned long end,
-			   pgprot_t prot, struct page **pages)
+			   gfp_t gfp, pgprot_t prot, struct page **pages)
 {
 	int ret;
 
-	ret = vmap_page_range_noflush(start, end, prot, pages);
+	ret = vmap_page_range_noflush(start, end, gfp, prot, pages);
 	flush_cache_vmap(start, end);
 	return ret;
 }
@@ -1182,7 +1187,7 @@ void *vm_map_ram(struct page **pages, unsigned int count, int node, pgprot_t pro
 		addr = va->va_start;
 		mem = (void *)addr;
 	}
-	if (vmap_page_range(addr, addr + size, prot, pages) < 0) {
+	if (vmap_page_range(addr, addr + size, GFP_KERNEL, prot, pages) < 0) {
 		vm_unmap_ram(mem, count);
 		return NULL;
 	}
@@ -1298,7 +1303,8 @@ void __init vmalloc_init(void)
 int map_kernel_range_noflush(unsigned long addr, unsigned long size,
 			     pgprot_t prot, struct page **pages)
 {
-	return vmap_page_range_noflush(addr, addr + size, prot, pages);
+	return vmap_page_range_noflush(addr, addr + size, GFP_KERNEL, prot,
+				       pages);
 }
 
 /**
@@ -1339,13 +1345,14 @@ void unmap_kernel_range(unsigned long addr, unsigned long size)
 }
 EXPORT_SYMBOL_GPL(unmap_kernel_range);
 
-int map_vm_area(struct vm_struct *area, pgprot_t prot, struct page **pages)
+int map_vm_area(struct vm_struct *area, gfp_t gfp,
+		pgprot_t prot, struct page **pages)
 {
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long end = addr + get_vm_area_size(area);
 	int err;
 
-	err = vmap_page_range(addr, end, prot, pages);
+	err = vmap_page_range(addr, end, gfp, prot, pages);
 
 	return err > 0 ? 0 : err;
 }
@@ -1661,7 +1668,7 @@ void *vmap(struct page **pages, unsigned int count,
 	if (!area)
 		return NULL;
 
-	if (map_vm_area(area, prot, pages)) {
+	if (map_vm_area(area, GFP_KERNEL, prot, pages)) {
 		vunmap(area->addr);
 		return NULL;
 	}
@@ -1720,7 +1727,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			cond_resched();
 	}
 
-	if (map_vm_area(area, prot, pages))
+	if (map_vm_area(area, gfp_mask, prot, pages))
 		goto fail;
 	return area->addr;
 
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0787d33b80..d369e5bf27 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1151,7 +1151,7 @@ static inline void __zs_cpu_down(struct mapping_area *area)
 static inline void *__zs_map_object(struct mapping_area *area,
 				struct page *pages[2], int off, int size)
 {
-	BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, pages));
+	BUG_ON(map_vm_area(area->vm, GFP_KERNEL, PAGE_KERNEL, pages));
 	area->vm_addr = area->vm->addr;
 	return area->vm_addr + off;
 }
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index a39dcfdbcc..0829eefb61 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -645,7 +645,7 @@ static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 		BUG_ON(pmd_sect(*pmd));
 
 		if (pmd_none(*pmd)) {
-			pte = pte_alloc_one_kernel(NULL);
+			pte = pte_alloc_one_kernel(NULL, GFP_KERNEL);
 			if (!pte) {
 				kvm_err("Cannot allocate Hyp pte\n");
 				return -ENOMEM;
@@ -677,7 +677,7 @@ static int create_hyp_pud_mappings(pgd_t *pgd, unsigned long start,
 		pud = pud_offset(pgd, addr);
 
 		if (pud_none_or_clear_bad(pud)) {
-			pmd = pmd_alloc_one(NULL, addr);
+			pmd = pmd_alloc_one(NULL, addr, GFP_KERNEL);
 			if (!pmd) {
 				kvm_err("Cannot allocate Hyp pmd\n");
 				return -ENOMEM;
@@ -712,7 +712,7 @@ static int __create_hyp_mappings(pgd_t *pgdp, unsigned long ptrs_per_pgd,
 		pgd = pgdp + kvm_pgd_index(addr, ptrs_per_pgd);
 
 		if (pgd_none(*pgd)) {
-			pud = pud_alloc_one(NULL, addr);
+			pud = pud_alloc_one(NULL, addr, GFP_KERNEL);
 			if (!pud) {
 				kvm_err("Cannot allocate Hyp pud\n");
 				err = -ENOMEM;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 08/12] block: Add some exports for bcachefs
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (6 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 09/12] bcache: optimize continue_at_nobarrier() Kent Overstreet
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 block/bio.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/bio.c b/block/bio.c
index 716510ecd7..a67aa6e0de 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -958,6 +958,7 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
 
 static void submit_bio_wait_endio(struct bio *bio)
 {
@@ -1658,6 +1659,7 @@ void bio_set_pages_dirty(struct bio *bio)
 			set_page_dirty_lock(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
 
 static void bio_release_pages(struct bio *bio)
 {
@@ -1731,6 +1733,7 @@ void bio_check_pages_dirty(struct bio *bio)
 	spin_unlock_irqrestore(&bio_dirty_lock, flags);
 	schedule_work(&bio_dirty_work);
 }
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
 void update_io_ticks(struct hd_struct *part, unsigned long now)
 {
-- 
2.20.1



* [PATCH 09/12] bcache: optimize continue_at_nobarrier()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (7 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 08/12] block: Add some exports for bcachefs Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/closure.h | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/bcache/closure.h b/drivers/md/bcache/closure.h
index c88cdc4ae4..376c5e659c 100644
--- a/drivers/md/bcache/closure.h
+++ b/drivers/md/bcache/closure.h
@@ -245,7 +245,7 @@ static inline void closure_queue(struct closure *cl)
 		     != offsetof(struct work_struct, func));
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
-		BUG_ON(!queue_work(wq, &cl->work));
+		queue_work(wq, &cl->work);
 	} else
 		cl->fn(cl);
 }
@@ -340,8 +340,13 @@ do {									\
  */
 #define continue_at_nobarrier(_cl, _fn, _wq)				\
 do {									\
-	set_closure_fn(_cl, _fn, _wq);					\
-	closure_queue(_cl);						\
+	closure_set_ip(_cl);						\
+	if (_wq) {							\
+		INIT_WORK(&(_cl)->work, (void *) _fn);			\
+		queue_work((_wq), &(_cl)->work);			\
+	} else {							\
+		(_fn)(_cl);						\
+	}								\
 } while (0)
 
 /**
-- 
2.20.1



* [PATCH 10/12] bcache: move closures to lib/
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (8 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 09/12] bcache: optimize continue_at_nobarrier() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-11 10:25   ` Coly Li
  2019-06-13  7:28   ` Christoph Hellwig
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Prep work for bcachefs - being a fork of bcache, it also uses closures.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 drivers/md/bcache/Kconfig                     | 10 +------
 drivers/md/bcache/Makefile                    |  6 ++--
 drivers/md/bcache/bcache.h                    |  2 +-
 drivers/md/bcache/super.c                     |  1 -
 drivers/md/bcache/util.h                      |  3 +-
 .../md/bcache => include/linux}/closure.h     | 17 ++++++-----
 lib/Kconfig                                   |  3 ++
 lib/Kconfig.debug                             |  9 ++++++
 lib/Makefile                                  |  2 ++
 {drivers/md/bcache => lib}/closure.c          | 28 ++++++-------------
 10 files changed, 37 insertions(+), 44 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (89%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index f6e0a8b3a6..3dd1d48987 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -2,6 +2,7 @@
 config BCACHE
 	tristate "Block device as cache"
 	select CRC64
+	select CLOSURES
 	help
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -16,12 +17,3 @@ config BCACHE_DEBUG
 
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
-
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	help
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index d26b351958..2b790fb813 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,8 +2,8 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
-	util.o writeback.o
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
+	writeback.o
 
 CFLAGS_request.o	+= -Iblock
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index fdf75352e1..ced9f1526c 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -180,6 +180,7 @@
 
 #include <linux/bcache.h>
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -192,7 +193,6 @@
 
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index a697a3a923..da6803f280 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2487,7 +2487,6 @@ static int __init bcache_init(void)
 		goto err;
 
 	bch_debug_init();
-	closure_debug_init();
 
 	return 0;
 err:
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 00aab6abcf..8a75100c0b 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -13,8 +14,6 @@
 #include <linux/workqueue.h>
 #include <linux/crc64.h>
 
-#include "closure.h"
-
 #define PAGE_SECTORS		(PAGE_SIZE / 512)
 
 struct closure;
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index 376c5e659c..308e38028c 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -155,7 +155,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _RET_IP_;
 #endif
 }
 
 static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->waiting_on = f;
 #endif
 }
@@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
 	 */
 	BUILD_BUG_ON(offsetof(struct closure, fn)
 		     != offsetof(struct work_struct, func));
+
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
 		queue_work(wq, &cl->work);
@@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
  */
 static inline void closure_get(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	BUG_ON((atomic_inc_return(&cl->remaining) &
 		CLOSURE_REMAINING_MASK) <= 1);
 #else
@@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
  */
 static inline void closure_init(struct closure *cl, struct closure *parent)
 {
-	memset(cl, 0, sizeof(struct closure));
+	cl->fn = NULL;
 	cl->parent = parent;
 	if (parent)
 		closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index a9e56539bd..09a25af0d0 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -427,6 +427,9 @@ config ASSOCIATIVE_ARRAY
 
 	  for more information.
 
+config CLOSURES
+	bool
+
 config HAS_IOMEM
 	bool
 	depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d5a4a4036d..6d97985e7e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1397,6 +1397,15 @@ config DEBUG_CREDENTIALS
 
 source "kernel/rcu/Kconfig.debug"
 
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgits)"
+	depends on CLOSURES
+	select DEBUG_FS
+	help
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config DEBUG_WQ_FORCE_RR_CPU
 	bool "Force round-robin CPU selection for unbound work items"
 	depends on DEBUG_KERNEL
diff --git a/lib/Makefile b/lib/Makefile
index 18c2be516a..2003eda127 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -193,6 +193,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
+obj-$(CONFIG_CLOSURES) += closure.o
+
 obj-$(CONFIG_CORDIC) += cordic.o
 
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 89%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index 73f5319295..46cfe4c382 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -6,13 +6,12 @@
  * Copyright 2012 Google, Inc.
  */
 
+#include <linux/closure.h>
 #include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-#include "closure.h"
-
 static inline void closure_put_after_sub(struct closure *cl, int flags)
 {
 	int r = flags & CLOSURE_REMAINING_MASK;
@@ -127,7 +126,7 @@ void __sched __closure_sync(struct closure *cl)
 }
 EXPORT_SYMBOL(__closure_sync);
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
 static LIST_HEAD(closure_list);
 static DEFINE_SPINLOCK(closure_list_lock);
@@ -158,8 +157,6 @@ void closure_debug_destroy(struct closure *cl)
 }
 EXPORT_SYMBOL(closure_debug_destroy);
 
-static struct dentry *closure_debug;
-
 static int debug_seq_show(struct seq_file *f, void *data)
 {
 	struct closure *cl;
@@ -182,7 +179,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
 			seq_printf(f, " W %pS\n",
 				   (void *) cl->waiting_on);
 
-		seq_printf(f, "\n");
+		seq_puts(f, "\n");
 	}
 
 	spin_unlock_irq(&closure_list_lock);
@@ -201,18 +198,11 @@ static const struct file_operations debug_ops = {
 	.release	= single_release
 };
 
-void  __init closure_debug_init(void)
+static int __init closure_debug_init(void)
 {
-	if (!IS_ERR_OR_NULL(bcache_debug))
-		/*
-		 * it is unnecessary to check return value of
-		 * debugfs_create_file(), we should not care
-		 * about this.
-		 */
-		closure_debug = debugfs_create_file(
-			"closures", 0400, bcache_debug, NULL, &debug_ops);
+	debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
+	return 0;
 }
-#endif
+late_initcall(closure_debug_init)
 
-MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
-MODULE_LICENSE("GPL");
+#endif
-- 
2.20.1



* [PATCH 11/12] closures: closure_wait_event()
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (9 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-06-11 10:25   ` Coly Li
  2019-06-12 17:17   ` Greg KH
  2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 include/linux/closure.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 308e38028c..abacb91c35 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -379,4 +379,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
 	continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)				\
+do {									\
+	struct closure cl;						\
+									\
+	closure_init_stack(&cl);					\
+									\
+	while (1) {							\
+		closure_wait(waitlist, &cl);				\
+		if (_cond)						\
+			break;						\
+		closure_sync(&cl);					\
+	}								\
+	closure_wake_up(waitlist);					\
+	closure_sync(&cl);						\
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)				\
+do {									\
+	if (!(_cond))							\
+		__closure_wait_event(waitlist, _cond);			\
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.20.1



* [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (10 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
@ 2019-06-10 19:14 ` Kent Overstreet
  2019-07-16 10:47   ` Coly Li
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  2019-07-03  5:59 ` Stefan K
  13 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2019-06-10 19:14 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcache; +Cc: Kent Overstreet

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 lib/closure.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/lib/closure.c b/lib/closure.c
index 46cfe4c382..3e6366c262 100644
--- a/lib/closure.c
+++ b/lib/closure.c
@@ -104,8 +104,14 @@ struct closure_syncer {
 
 static void closure_sync_fn(struct closure *cl)
 {
-	cl->s->done = 1;
-	wake_up_process(cl->s->task);
+	struct closure_syncer *s = cl->s;
+	struct task_struct *p;
+
+	rcu_read_lock();
+	p = READ_ONCE(s->task);
+	s->done = 1;
+	wake_up_process(p);
+	rcu_read_unlock();
 }
 
 void __sched __closure_sync(struct closure *cl)
-- 
2.20.1



* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (11 preceding siblings ...)
  2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
@ 2019-06-10 20:46 ` Linus Torvalds
  2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:10   ` Dave Chinner
  2019-07-03  5:59 ` Stefan K
  13 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2019-06-10 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Dave Chinner, Darrick J . Wong, Zach Brown, Peter Zijlstra,
	Jens Axboe, Josef Bacik, Alexander Viro, Andrew Morton,
	Tejun Heo

On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.

Honestly, it really isn't.

There are obvious things wrong with it - like the fact that you've
rebased it so that the original history is gone, yet you've not
actually *fixed* the history, so you find things like reverts of
commits that should simply have been removed, and fixes for things
that should just have been fixed in the original commit the fix is
for.

And this isn't just "if you rebase, just fix things". You have actual
bogus commit messages as a result of this all.

So to point at the revert, for example. The commit message is

    This reverts commit 36f389604294dfc953e6f5624ceb683818d32f28.

which is wrong to begin with - you should always explain *why* the
revert was done, not just state that it's a revert.

But since you rebased things, that commit 36f3896042 doesn't exist any
more to begin with.  So now you have a revert commit that doesn't
explain why it reverts things, but the only thing it *does* say is
simply wrong and pointless.

Some of the "fixes" commits have the exact same issue - they say things like

    gc lock must be held while invalidating buckets - fixes
    "1f7a95698e bcachefs: Invalidate buckets when writing to alloc btree"

and

    fixes b0f3e786995cb3b12975503f963e469db5a4f09b

both of which are dead and stale git object pointers since the commits
in question got rebased and have some other hash these days.

But note that the cleanup should go further than just fix those kinds
of technical issues. If you rebase, and you have fixes in your tree
for things you rebase, just fix things as you rewrite history anyway
(there are cases where the fix may be informative in itself and it's
worth leaving around, but that's rare).

Anyway, aside from that, I only looked at the non-bcachefs parts. Some
of those are not acceptable either, like

    struct pagecache_lock add_lock
        ____cacheline_aligned_in_smp; /* protects adding new pages */

in 'struct address_space', which is completely bogus, since that
forces not only a potentially huge amount of padding, it also requires
alignment that that struct simply fundamentally does not have, and
_will_ not have.

You can only use ____cacheline_aligned_in_smp for top-level objects,
and honestly, it's almost never a win. That lock shouldn't be so hot.

That lock is somewhat questionable in the first place, and no, we
don't do those hacky recursive things anyway. A recursive lock is
almost always a buggy and mis-designed one.

Why does the regular page lock (at a finer granularity) not suffice?

And no, nobody has ever cared. The dio people just don't care about
page cache anyway. They have their own thing going.

Similarly, no, we're not starting to do vmalloc in non-process context. Stop it.

And the commit comments are very sparse. And not always signed off.

I also get the feeling that the "intent" part of the six-locks could
just be done as a slight extension of the rwsem, where an "intent" is
the same as a write-lock, but without waiting for existing readers,
and then the write-lock part is just the "wait for readers to be
done".

Have you talked to Waiman Long about that?

                    Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
@ 2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:33     ` Dave Chinner
  2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  2019-06-11  4:10   ` Dave Chinner
  1 sibling, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-11  1:17 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra
  Cc: Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> >
> > So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> > outside of fs/bcachefs/. My master branch has some performance optimizations for
> > the core buffered IO paths, but those are fairly tricky and invasive so I want
> > to hold off on those for now - this branch is intended to be more or less
> > suitable for merging as is.
> 
> Honestly, it really isn't.

Heh, I suppose that's what review is for :)

> There are obvious things wrong with it - like the fact that you've
> rebased it so that the original history is gone, yet you've not
> actually *fixed* the history, so you find things like reverts of
> commits that should simply have been removed, and fixes for things
> that should just have been fixed in the original commit the fix is
> for.

Yeah, I suppose I have dropped the ball on that lately. 
 
> But note that the cleanup should go further than just fix those kinds
> of technical issues. If you rebase, and you have fixes in your tree
> for things you rebase, just fix things as you rewrite history anyway
> (there are cases where the fix may be informative in itself and it's
> worth leaving around, but that's rare).

Yeah that has historically been my practice, I've just been moving away from
that kind of history editing as bcachefs has been getting more users. Hence the
in-between, worst-of-both-workflows state of the current tree.

But, I can certainly go through and clean things up like that one last time and
make everything bisectable again - I'll go through and write proper commit
messages too. Unless you'd be ok with just squashing most of the history down to
one commit - which would you prefer?

> Anyway, aside from that, I only looked at the non-bcachefs parts. Some
> of those are not acceptable either, like
> 
>     struct pagecache_lock add_lock
>         ____cacheline_aligned_in_smp; /* protects adding new pages */
> 
> in 'struct address_space', which is completely bogus, since that
> forces not only a potentially huge amount of padding, it also requires
> alignment that that struct simply fundamentally does not have, and
> _will_ not have.

Oh, good point.

> You can only use ____cacheline_aligned_in_smp for top-level objects,
> and honestly, it's almost never a win. That lock shouldn't be so hot.
> 
> That lock is somewhat questionable in the first place, and no, we
> don't do those hacky recursive things anyway. A recursive lock is
> almost always a buggy and mis-designed one.

You're preaching to the choir there, I still feel dirty about that code and I'd
love nothing more than for someone else to come along and point out how stupid
I've been with a much better way of doing it. 

> Why does the regular page lock (at a finer granularity) not suffice?

Because the lock needs to prevent pages from being _added_ to the page cache -
to do it with a page-granularity lock it'd have to be part of the radix tree.

> And no, nobody has ever cared. The dio people just don't care about
> page cache anyway. They have their own thing going.

It's not just dio, it's even worse with the various fallocate operations. And
the xfs people care, but IIRC even they don't have locking for pages being
faulted in. This is an issue I've talked to other filesystem people quite a bit
about - especially Dave Chinner, maybe we can get him to weigh in here.

And this inconsistency does result in _real_ bugs. It goes something like this:
 - dio write shoots down the range of the page cache for the file it's writing
   to, using invalidate_inode_pages_range2
 - After the page cache shoot down, but before the write actually happens,
   another process pulls those pages back in to the page cache
 - Now the write happens: if that write was e.g. an allocating write, you're
   going to have page cache state (buffer heads) that say that page doesn't have
   anything on disk backing it, but it actually does because of the dio write.

xfs has additional locking (that the vfs does _not_ do) around both the buffered
and dio IO paths to prevent this happening because of a buffered read pulling
the pages back in, but no one has a solution for pages getting _faulted_ back in
- either because of mmap or gup().

And there are some filesystem people who do know about this race, because at
some point the dio code has been changed to shoot down the page cache _again_
after the write completes. But that doesn't eliminate the race, it just makes it
harder to trigger.

And dio writes actually aren't the worst of it, it's even worse with fallocate
FALLOC_FL_INSERT_RANGE/COLLAPSE_RANGE. Last time I looked at the ext4 fallocate
code, it looked _completely_ broken to me - the code seemed to think it was
using the same mechanism truncate uses for shooting down the page cache and
keeping pages from being readded - but that only works for truncate because it's
changing i_size and shooting down pages above i_size. Fallocate needs to shoot
down pages that are still within i_size, so... yeah...

The recursiveness is needed because otherwise, if you mmap a file, then do a dio
write where you pass the address you mmapped to pwrite(), gup() from the dio
write path will be trying to fault in the exact pages it's blocking from being
added.

A better solution would be for gup() to detect that and return an error, so we
can just fall back to buffered writes. Or just return an error to userspace
because fuck anyone who would actually do that.

But I fear plumbing that through gup() is going to be a hell of a lot uglier
than this patch.

I would really like Dave to weigh in here.

> Similarly, no, we're not starting to do vmalloc in non-process context. Stop it.

I don't want to do vmalloc in non process context - but I do need to call
vmalloc when reading in btree nodes, from the filesystem IO path.

But I just learned today about this new memalloc_nofs_save() thing, so if that
works I'm more than happy to drop that patch.

> And the commit comments are very sparse. And not always signed off.

Yeah, I'll fix that.

> I also get the feeling that the "intent" part of the six-locks could
> just be done as a slight extension of the rwsem, where an "intent" is
> the same as a write-lock, but without waiting for existing readers,
> and then the write-lock part is just the "wait for readers to be
> done".
> 
> Have you talked to Waiman Long about that?

No, I haven't, but I'm adding him to the list.

I really hate the idea of adding these sorts of special case features to the
core locking primitives though - I mean, look what's happened to the mutex code,
and the intent state isn't the only special feature they have. As is, they're
small and clean and they do their job well, I'd really prefer to have them just
remain their own thing instead of trying to cram it all into the
hyper-optimized rw semaphore code.

Also, six locks used to be in fs/bcachefs/, but last time I was mailing stuff
out for review Peter Zijlstra was dead set against exporting the osq lock stuff
- moving six locks to kernel/locking/ was actually his idea. 

I can say more about six locks tomorrow when I'm less sleep deprived, if you're
still not convinced.

Cheers.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  2019-06-11  1:17   ` Kent Overstreet
@ 2019-06-11  4:10   ` Dave Chinner
  2019-06-11  4:39     ` Linus Torvalds
  1 sibling, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-06-11  4:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> I also get the feeling that the "intent" part of the six-locks could
> just be done as a slight extension of the rwsem, where an "intent" is
> the same as a write-lock, but without waiting for existing readers,
> and then the write-lock part is just the "wait for readers to be
> done".

Please, no, let's not make the rwsems even more fragile than they
already are. I'm tired of the ongoing XFS customer escalations that
end up being root caused to yet another rwsem memory barrier bug.

> Have you talked to Waiman Long about that?

Unfortunately, Waiman has been unable to find/debug multiple rwsem
exclusion violations we've seen in XFS bug reports over the past 2-3
years. Those memory barrier bugs have all been fixed by other people
long after Waiman has said "I can't reproduce any problems in my
testing" and essentially walked away from the problem. We've been
left multiple times wondering how the hell we even prove it's a
rwsem bug because there's no way to reproduce the inconsistent rwsem
state we see in the kernel crash dumps.

Hence, as a downstream rwsem user, I have relatively little
confidence in upstream's ability to integrate new functionality into
rwsems without introducing yet more subtle regressions that are only
exposed by heavy rwsem users like XFS. As such, I consider rwsems to
be extremely fragile, and they are now a prime suspect whenever I see
some one-off memory corruption in a structure protected by a rwsem.

As such, please keep SIX locks separate from rwsems to minimise the
merge risk of bcachefs.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  1:17   ` Kent Overstreet
@ 2019-06-11  4:33     ` Dave Chinner
  2019-06-12 16:21       ` Kent Overstreet
  2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
  1 sibling, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-06-11  4:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 09:17:37PM -0400, Kent Overstreet wrote:
> On Mon, Jun 10, 2019 at 10:46:35AM -1000, Linus Torvalds wrote:
> > On Mon, Jun 10, 2019 at 9:14 AM Kent Overstreet
> > <kent.overstreet@gmail.com> wrote:
> > That lock is somewhat questionable in the first place, and no, we
> > don't do those hacky recursive things anyway. A recursive lock is
> > almost always a buggy and mis-designed one.
> 
> You're preaching to the choir there, I still feel dirty about that code and I'd
> love nothing more than for someone else to come along and point out how stupid
> I've been with a much better way of doing it. 
> 
> > Why does the regular page lock (at a finer granularity) not suffice?
> 
> Because the lock needs to prevent pages from being _added_ to the page cache -
> to do it with a page-granularity lock it'd have to be part of the radix tree.
> 
> > And no, nobody has ever cared. The dio people just don't care about
> > page cache anyway. They have their own thing going.
> 
> It's not just dio, it's even worse with the various fallocate operations. And
> the xfs people care, but IIRC even they don't have locking for pages being
> faulted in. This is an issue I've talked to other filesystem people quite a bit
> about - especially Dave Chinner, maybe we can get him to weigh in here.
> 
> And this inconsistency does result in _real_ bugs. It goes something like this:
>  - dio write shoots down the range of the page cache for the file it's writing
>    to, using invalidate_inode_pages_range2
>  - After the page cache shoot down, but before the write actually happens,
>    another process pulls those pages back in to the page cache
>  - Now the write happens: if that write was e.g. an allocating write, you're
>    going to have page cache state (buffer heads) that say that page doesn't have
>    anything on disk backing it, but it actually does because of the dio write.
> 
> xfs has additional locking (that the vfs does _not_ do) around both the buffered
> and dio IO paths to prevent this happening because of a buffered read pulling
> the pages back in, but no one has a solution for pages getting _faulted_ back in
> - either because of mmap or gup().
> 
> And there are some filesystem people who do know about this race, because at
> some point the dio code has been changed to shoot down the page cache _again_
> after the write completes. But that doesn't eliminate the race, it just makes it
> harder to trigger.
> 
> And dio writes actually aren't the worst of it, it's even worse with fallocate
> FALLOC_FL_INSERT_RANGE/COLLAPSE_RANGE. Last time I looked at the ext4 fallocate
> code, it looked _completely_ broken to me - the code seemed to think it was
> using the same mechanism truncate uses for shooting down the page cache and
> keeping pages from being readded - but that only works for truncate because it's
> changing i_size and shooting down pages above i_size. Fallocate needs to shoot
> down pages that are still within i_size, so... yeah...

Yes, that ext4 code is broken, and Jan Kara is trying to work out
how to fix it. His recent patchset fell foul of taking the same lock
either side of the mmap_sem in this path:

> The recursiveness is needed because otherwise, if you mmap a file, then do a dio
> write where you pass the address you mmapped to pwrite(), gup() from the dio
> write path will be trying to fault in the exact pages it's blocking from being
> added.
> 
> A better solution would be for gup() to detect that and return an error, so we
> can just fall back to buffered writes. Or just return an error to userspace
> because fuck anyone who would actually do that.

I just recently said this with reference to the range lock stuff I'm
working on in the background:

	FWIW, it's to avoid problems with stupid userspace stuff
	that nobody really should be doing that I want range locks
	for the XFS inode locks.  If userspace overlaps the ranges
	and deadlocks in that case, then they get to keep all the
	broken bits because, IMO, they are doing something
	monumentally stupid. I'd probably be making it return
	EDEADLOCK back out to userspace in the case rather than
	deadlocking but, fundamentally, I think it's broken
	behaviour that we should be rejecting with an error rather
	than adding complexity trying to handle it.

So I think this recursive locking across a page fault case should
just fail, not add yet more complexity to try to handle a rare
corner case that exists more in theory than in reality. i.e. put the
lock context in the current task, then if the page fault requires a
conflicting lock context to be taken, we terminate the page fault,
back out of the IO and return EDEADLOCK out to userspace. This works
for all types of lock contexts - only the filesystem itself needs to
know what the lock context pointer contains....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:10   ` Dave Chinner
@ 2019-06-11  4:39     ` Linus Torvalds
  2019-06-11  7:10       ` Dave Chinner
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2019-06-11  4:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 6:11 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Please, no, let's not make the rwsems even more fragile than they
> already are. I'm tired of the ongoing XFS customer escalations that
> end up being root caused to yet another rwsem memory barrier bug.
>
> > Have you talked to Waiman Long about that?
>
> Unfortunately, Waiman has been unable to find/debug multiple rwsem
> exclusion violations we've seen in XFS bug reports over the past 2-3
> years.

Inside xfs you can do whatever you want.

But in generic code, no, we're not saying "we don't trust the generic
locking, so we cook our own random locking".

If there really are exclusion issues, they should be fairly easy to
try to find with a generic test-suite. Have a bunch of readers that
assert that some shared variable has a particular value, and a bunch of
writers that then modify the value and set it back. Add some random
timing and "yield" to them all, and show that the serialization is
wrong.

Some kind of "XFS load Y shows problems" is undebuggable, and not
necessarily due to locking.

Because if the locking issues are real (and we did fix one bug
recently in a9e9bcb45b15: "locking/rwsem: Prevent decrement of reader
count before increment") it needs to be fixed. Some kind of "let's do
something else entirely" is simply not acceptable.

                  Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  1:17   ` Kent Overstreet
  2019-06-11  4:33     ` Dave Chinner
@ 2019-06-11  4:55     ` Linus Torvalds
  2019-06-11 14:26       ` Matthew Wilcox
  1 sibling, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2019-06-11  4:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J. Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
<kent.overstreet@gmail.com> wrote:
>
>
> > Why does the regular page lock (at a finer granularity) not suffice?
>
> Because the lock needs to prevent pages from being _added_ to the page cache -
> to do it with a page-granularity lock it'd have to be part of the radix tree.

No, I understand that part, but I still think we should be able to do
the locking per-page rather than over the whole mapping.

When doing dio, you need to iterate over old existing pages anyway in
that range (otherwise the "no _new_ pages" part is kind of pointless
when there are old pages there), so my gut feel is that you might as
well at that point also "poison" the range you are doing dio on. With
the xarray changes, we might be better at handling ranges. That was
one of the arguments for the xarrays over the old radix tree model,
after all.

And I think the dio code would ideally want to have a range-based lock
anyway, rather than one global one. No?

Anyway, don't get me wrong. I'm not entirely against a "stop adding
pages" model per-mapping if it's just fundamentally simpler and nobody
wants anything fancier. So I'm certainly open to it, assuming it
doesn't add any real overhead to the normal case.

But I *am* against it when it has ad-hoc locking and random
anti-recursion things.

So I'm with Dave on the "I hope we can avoid the recursive hacks" by
making better rules. Even if I disagree with him on the locking thing
- I'd rather put _more_ stress on the standard locking and make sure it
really works, over having multiple similar locking models because they
don't trust each other.

               Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:39     ` Linus Torvalds
@ 2019-06-11  7:10       ` Dave Chinner
  2019-06-12  2:07         ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-06-11  7:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J. Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 06:39:00PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 6:11 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Please, no, let's not make the rwsems even more fragile than they
> > already are. I'm tired of the ongoing XFS customer escalations that
> > end up being root caused to yet another rwsem memory barrier bug.
> >
> > > Have you talked to Waiman Long about that?
> >
> > Unfortunately, Waiman has been unable to find/debug multiple rwsem
> > exclusion violations we've seen in XFS bug reports over the past 2-3
> > years.
> 
> Inside xfs you can do whatever you want.
>
> But in generic code, no, we're not saying "we don't trust the generic
> locking, so we cook our own random locking".

We use the generic rwsems in XFS, too, and it's the generic
rwsems that have been the cause of the problems I'm talking about.

The same rwsem issues were seen on the mmap_sem, the shrinker rwsem,
in a couple of device drivers, and so on. i.e. This isn't an XFS
issue I'm raising here - I'm raising a concern about the lack of
validation of core infrastructure and its suitability for
functionality extensions.

> If there really are exclusion issues, they should be fairly easy to
> try to find with a generic test-suite.  Have a bunch of readers that
> assert that some shared variable has a particular value, and a bunch of
> writers that then modify the value and set it back. Add some random
> timing and "yield" to them all, and show that the serialization is
> wrong.

Writing such a test suite would be the responsibility of the rwsem
maintainers, yes?

> Some kind of "XFS load Y shows problems" is undebuggable, and not
> necessarily due to locking.

Sure, but this wasn't isolated to XFS, and it wasn't one workload.

We had a growing pile of kernel crash dumps all with the same
signatures across multiple subsystems. When this happens, it falls
to the maintainer of that common element to more deeply analyse the
issue. One of the rwsem maintainers was unable to reproduce or find
the root cause of the pile of rwsem state corruptions, and so we've
been left hanging telling people "we think it's rwsems because the
state is valid right up to the rwsem state going bad, but we can't
prove it's a rwsem problem because the debug we've added to the
rwsem code makes the problem go away". Sometime later, a bug has
been found in the upstream rwsem code....

This has played out several times over the past couple of years. No
locking bugs have been found in XFS, with the mmap_sem, the shrinker
rwsem, etc, but 4 or 5 bugs have been found in the rwsem code and
backports of those commits have been proven to solve _all_ the
issues that were reported.

That's the painful reality I'm telling you about here - that poor
upstream core infrastructure quality has had quite severe downstream
knock-on effects that cost a lot of time, resources, money and
stress to diagnose and rectify.  I don't want those same mistakes to
be made again for many reasons, not the least that the stress of
these situations has a direct and adverse impact on my mental
health....

> Because if the locking issues are real (and we did fix one bug
> recently in a9e9bcb45b15: "locking/rwsem: Prevent decrement of reader
> count before increment") it needs to be fixed.

That's just one of the bugs we've tripped over. There's been a
couple of missed wakeups bugs that caused rwsem state hangs (e.g.
readers waiting with no holder), there was a power arch specific
memory barrier bug that caused read/write exclusion bugs, the
optimistic spinning caused some severe performance degradations on
the mmap_sem with some highly threaded workloads, the rwsem bias
changed from read biased to write biased (might be the other way
around, can't remember) some time around 4.10 causing a complete
inversion in mixed read-write IO characteristics, there was a
botched RHEL7 backport that had memory barrier bugs in it that
upstream didn't have that occurred because of the complexity of the
code, etc.

But this is all off-topic for bcachefs review - all we need to do
here is keep the SIX locking in a separate module and everything
rwsem related will be just fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 11/12] closures: closure_wait_event()
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
@ 2019-06-11 10:25   ` Coly Li
  2019-06-12 17:17   ` Greg KH
  1 sibling, 0 replies; 63+ messages in thread
From: Coly Li @ 2019-06-11 10:25 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcache

On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

Coly Li

> ---
>  include/linux/closure.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/include/linux/closure.h b/include/linux/closure.h
> index 308e38028c..abacb91c35 100644
> --- a/include/linux/closure.h
> +++ b/include/linux/closure.h
> @@ -379,4 +379,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
>  	continue_at_nobarrier(cl, fn, wq);
>  }
>  
> +#define __closure_wait_event(waitlist, _cond)				\
> +do {									\
> +	struct closure cl;						\
> +									\
> +	closure_init_stack(&cl);					\
> +									\
> +	while (1) {							\
> +		closure_wait(waitlist, &cl);				\
> +		if (_cond)						\
> +			break;						\
> +		closure_sync(&cl);					\
> +	}								\
> +	closure_wake_up(waitlist);					\
> +	closure_sync(&cl);						\
> +} while (0)
> +
> +#define closure_wait_event(waitlist, _cond)				\
> +do {									\
> +	if (!(_cond))							\
> +		__closure_wait_event(waitlist, _cond);			\
> +} while (0)
> +
>  #endif /* _LINUX_CLOSURE_H */
> 


-- 

Coly Li

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 10/12] bcache: move closures to lib/
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
@ 2019-06-11 10:25   ` Coly Li
  2019-06-13  7:28   ` Christoph Hellwig
  1 sibling, 0 replies; 63+ messages in thread
From: Coly Li @ 2019-06-11 10:25 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcache

On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

Coly Li


> ---
>  drivers/md/bcache/Kconfig                     | 10 +------
>  drivers/md/bcache/Makefile                    |  6 ++--
>  drivers/md/bcache/bcache.h                    |  2 +-
>  drivers/md/bcache/super.c                     |  1 -
>  drivers/md/bcache/util.h                      |  3 +-
>  .../md/bcache => include/linux}/closure.h     | 17 ++++++-----
>  lib/Kconfig                                   |  3 ++
>  lib/Kconfig.debug                             |  9 ++++++
>  lib/Makefile                                  |  2 ++
>  {drivers/md/bcache => lib}/closure.c          | 28 ++++++-------------
>  10 files changed, 37 insertions(+), 44 deletions(-)
>  rename {drivers/md/bcache => include/linux}/closure.h (97%)
>  rename {drivers/md/bcache => lib}/closure.c (89%)
> 
> diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
> index f6e0a8b3a6..3dd1d48987 100644
> --- a/drivers/md/bcache/Kconfig
> +++ b/drivers/md/bcache/Kconfig
> @@ -2,6 +2,7 @@
>  config BCACHE
>  	tristate "Block device as cache"
>  	select CRC64
> +	select CLOSURES
>  	help
>  	Allows a block device to be used as cache for other devices; uses
>  	a btree for indexing and the layout is optimized for SSDs.
> @@ -16,12 +17,3 @@ config BCACHE_DEBUG
>  
>  	Enables extra debugging tools, allows expensive runtime checks to be
>  	turned on.
> -
> -config BCACHE_CLOSURES_DEBUG
> -	bool "Debug closures"
> -	depends on BCACHE
> -	select DEBUG_FS
> -	help
> -	Keeps all active closures in a linked list and provides a debugfs
> -	interface to list them, which makes it possible to see asynchronous
> -	operations that get stuck.
> diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
> index d26b351958..2b790fb813 100644
> --- a/drivers/md/bcache/Makefile
> +++ b/drivers/md/bcache/Makefile
> @@ -2,8 +2,8 @@
>  
>  obj-$(CONFIG_BCACHE)	+= bcache.o
>  
> -bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
> -	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
> -	util.o writeback.o
> +bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
> +	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o util.o\
> +	writeback.o
>  
>  CFLAGS_request.o	+= -Iblock
> diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
> index fdf75352e1..ced9f1526c 100644
> --- a/drivers/md/bcache/bcache.h
> +++ b/drivers/md/bcache/bcache.h
> @@ -180,6 +180,7 @@
>  
>  #include <linux/bcache.h>
>  #include <linux/bio.h>
> +#include <linux/closure.h>
>  #include <linux/kobject.h>
>  #include <linux/list.h>
>  #include <linux/mutex.h>
> @@ -192,7 +193,6 @@
>  
>  #include "bset.h"
>  #include "util.h"
> -#include "closure.h"
>  
>  struct bucket {
>  	atomic_t	pin;
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index a697a3a923..da6803f280 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -2487,7 +2487,6 @@ static int __init bcache_init(void)
>  		goto err;
>  
>  	bch_debug_init();
> -	closure_debug_init();
>  
>  	return 0;
>  err:
> diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
> index 00aab6abcf..8a75100c0b 100644
> --- a/drivers/md/bcache/util.h
> +++ b/drivers/md/bcache/util.h
> @@ -4,6 +4,7 @@
>  #define _BCACHE_UTIL_H
>  
>  #include <linux/blkdev.h>
> +#include <linux/closure.h>
>  #include <linux/errno.h>
>  #include <linux/kernel.h>
>  #include <linux/sched/clock.h>
> @@ -13,8 +14,6 @@
>  #include <linux/workqueue.h>
>  #include <linux/crc64.h>
>  
> -#include "closure.h"
> -
>  #define PAGE_SECTORS		(PAGE_SIZE / 512)
>  
>  struct closure;
> diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
> similarity index 97%
> rename from drivers/md/bcache/closure.h
> rename to include/linux/closure.h
> index 376c5e659c..308e38028c 100644
> --- a/drivers/md/bcache/closure.h
> +++ b/include/linux/closure.h
> @@ -155,7 +155,7 @@ struct closure {
>  
>  	atomic_t		remaining;
>  
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  #define CLOSURE_MAGIC_DEAD	0xc054dead
>  #define CLOSURE_MAGIC_ALIVE	0xc054a11e
>  
> @@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
>  		__closure_sync(cl);
>  }
>  
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  
> -void closure_debug_init(void);
>  void closure_debug_create(struct closure *cl);
>  void closure_debug_destroy(struct closure *cl);
>  
>  #else
>  
> -static inline void closure_debug_init(void) {}
>  static inline void closure_debug_create(struct closure *cl) {}
>  static inline void closure_debug_destroy(struct closure *cl) {}
>  
> @@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
>  
>  static inline void closure_set_ip(struct closure *cl)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	cl->ip = _THIS_IP_;
>  #endif
>  }
>  
>  static inline void closure_set_ret_ip(struct closure *cl)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	cl->ip = _RET_IP_;
>  #endif
>  }
>  
>  static inline void closure_set_waiting(struct closure *cl, unsigned long f)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	cl->waiting_on = f;
>  #endif
>  }
> @@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
>  	 */
>  	BUILD_BUG_ON(offsetof(struct closure, fn)
>  		     != offsetof(struct work_struct, func));
> +
>  	if (wq) {
>  		INIT_WORK(&cl->work, cl->work.func);
>  		queue_work(wq, &cl->work);
> @@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
>   */
>  static inline void closure_get(struct closure *cl)
>  {
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  	BUG_ON((atomic_inc_return(&cl->remaining) &
>  		CLOSURE_REMAINING_MASK) <= 1);
>  #else
> @@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
>   */
>  static inline void closure_init(struct closure *cl, struct closure *parent)
>  {
> -	memset(cl, 0, sizeof(struct closure));
> +	cl->fn = NULL;
>  	cl->parent = parent;
>  	if (parent)
>  		closure_get(parent);
> diff --git a/lib/Kconfig b/lib/Kconfig
> index a9e56539bd..09a25af0d0 100644
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -427,6 +427,9 @@ config ASSOCIATIVE_ARRAY
>  
>  	  for more information.
>  
> +config CLOSURES
> +	bool
> +
>  config HAS_IOMEM
>  	bool
>  	depends on !NO_IOMEM
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index d5a4a4036d..6d97985e7e 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1397,6 +1397,15 @@ config DEBUG_CREDENTIALS
>  
>  source "kernel/rcu/Kconfig.debug"
>  
> +config DEBUG_CLOSURES
> +	bool "Debug closures (bcache async widgits)"
> +	depends on CLOSURES
> +	select DEBUG_FS
> +	help
> +	Keeps all active closures in a linked list and provides a debugfs
> +	interface to list them, which makes it possible to see asynchronous
> +	operations that get stuck.
> +
>  config DEBUG_WQ_FORCE_RR_CPU
>  	bool "Force round-robin CPU selection for unbound work items"
>  	depends on DEBUG_KERNEL
> diff --git a/lib/Makefile b/lib/Makefile
> index 18c2be516a..2003eda127 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -193,6 +193,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
>  
>  obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
>  
> +obj-$(CONFIG_CLOSURES) += closure.o
> +
>  obj-$(CONFIG_CORDIC) += cordic.o
>  
>  obj-$(CONFIG_DQL) += dynamic_queue_limits.o
> diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
> similarity index 89%
> rename from drivers/md/bcache/closure.c
> rename to lib/closure.c
> index 73f5319295..46cfe4c382 100644
> --- a/drivers/md/bcache/closure.c
> +++ b/lib/closure.c
> @@ -6,13 +6,12 @@
>   * Copyright 2012 Google, Inc.
>   */
>  
> +#include <linux/closure.h>
>  #include <linux/debugfs.h>
> -#include <linux/module.h>
> +#include <linux/export.h>
>  #include <linux/seq_file.h>
>  #include <linux/sched/debug.h>
>  
> -#include "closure.h"
> -
>  static inline void closure_put_after_sub(struct closure *cl, int flags)
>  {
>  	int r = flags & CLOSURE_REMAINING_MASK;
> @@ -127,7 +126,7 @@ void __sched __closure_sync(struct closure *cl)
>  }
>  EXPORT_SYMBOL(__closure_sync);
>  
> -#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
> +#ifdef CONFIG_DEBUG_CLOSURES
>  
>  static LIST_HEAD(closure_list);
>  static DEFINE_SPINLOCK(closure_list_lock);
> @@ -158,8 +157,6 @@ void closure_debug_destroy(struct closure *cl)
>  }
>  EXPORT_SYMBOL(closure_debug_destroy);
>  
> -static struct dentry *closure_debug;
> -
>  static int debug_seq_show(struct seq_file *f, void *data)
>  {
>  	struct closure *cl;
> @@ -182,7 +179,7 @@ static int debug_seq_show(struct seq_file *f, void *data)
>  			seq_printf(f, " W %pS\n",
>  				   (void *) cl->waiting_on);
>  
> -		seq_printf(f, "\n");
> +		seq_puts(f, "\n");
>  	}
>  
>  	spin_unlock_irq(&closure_list_lock);
> @@ -201,18 +198,11 @@ static const struct file_operations debug_ops = {
>  	.release	= single_release
>  };
>  
> -void  __init closure_debug_init(void)
> +static int __init closure_debug_init(void)
>  {
> -	if (!IS_ERR_OR_NULL(bcache_debug))
> -		/*
> -		 * it is unnecessary to check return value of
> -		 * debugfs_create_file(), we should not care
> -		 * about this.
> -		 */
> -		closure_debug = debugfs_create_file(
> -			"closures", 0400, bcache_debug, NULL, &debug_ops);
> +	debugfs_create_file("closures", 0400, NULL, NULL, &debug_ops);
> +	return 0;
>  }
> -#endif
> +late_initcall(closure_debug_init)
>  
> -MODULE_AUTHOR("Kent Overstreet <koverstreet@google.com>");
> -MODULE_LICENSE("GPL");
> +#endif
> 


-- 

Coly Li

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
@ 2019-06-11 14:26       ` Matthew Wilcox
  0 siblings, 0 replies; 63+ messages in thread
From: Matthew Wilcox @ 2019-06-11 14:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 06:55:15PM -1000, Linus Torvalds wrote:
> On Mon, Jun 10, 2019 at 3:17 PM Kent Overstreet
> <kent.overstreet@gmail.com> wrote:
> > > Why does the regular page lock (at a finer granularity) not suffice?
> >
> > Because the lock needs to prevent pages from being _added_ to the page cache -
> > to do it with a page granularity lock it'd have to be part of the radix tree,
> 
> No, I understand that part, but I still think we should be able to do
> the locking per-page rather than over the whole mapping.
> 
> When doing dio, you need to iterate over old existing pages anyway in
> that range (otherwise the "no _new_ pages" part is kind of pointless
> when there are old pages there), so my gut feel is that you might as
> well at that point also "poison" the range you are doing dio on. With
> the xarray changes, we might be better at handling ranges. That was
> one of the arguments for the xarrays over the old radix tree model,
> after all.

We could do that -- if there are pages (or shadow entries) in the XArray,
replace them with "lock entries".  I think we'd want the behaviour of
faults / buffered IO be to wait on those entries being removed.  I think
the DAX code is just about ready to switch over to lock entries instead
of having a special DAX lock bit.

The question is what to do when there are _no_ pages in the tree for a
range that we're about to do DIO on.  This should be the normal case --
as you say, DIO users typically have their own schemes for caching in
userspace, and rather resent the other users starting to cache their
file in the kernel.

Adding lock entries in the page cache for every DIO starts to look pretty
painful in terms of allocating radix tree nodes.  And it gets worse when
you have sub-page-size DIOs -- do we embed a count in the lock entry?
Or delay DIOs which hit in the same page as an existing DIO?

And then I start to question the whole reasoning behind how we do mixed
DIO and buffered IO; if there's a page in the page cache, why are we
writing it out, then doing a direct IO instead of doing a memcpy to the
page first, then writing the page back?

IOW, move to a model where:

 - If i_dio_count is non-zero, buffered I/O waits for i_dio_count to
   drop to zero before bringing pages in.
 - If i_pages is empty, DIOs increment i_dio_count, do the IO and
   decrement i_dio_count.
 - If i_pages is not empty, DIO is implemented by doing buffered I/O
   and waiting for the pages to finish writeback.

(needs a slight tweak to ensure that new DIOs can't hold off a buffered
I/O indefinitely)
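
Condensed into code, the model above reduces to a small decision procedure. Here is a toy, single-threaded userspace sketch of it - i_dio_count and i_pages mirror the fields named above, but everything else (the toy_* names, the counters standing in for real locking, the yes/no answer instead of actual waiting) is invented purely for illustration; this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed DIO/buffered arbitration; decision logic only. */
struct toy_mapping {
	int i_dio_count;	/* direct IOs in flight */
	int i_pages;		/* pages cached in the mapping */
};

enum io_path { PATH_DIRECT, PATH_BUFFERED };

/* DIO goes direct only while the mapping is empty; otherwise it is
 * implemented as buffered IO plus waiting for writeback. */
static enum io_path toy_dio_begin(struct toy_mapping *m)
{
	if (m->i_pages == 0) {
		m->i_dio_count++;
		return PATH_DIRECT;
	}
	return PATH_BUFFERED;
}

static void toy_dio_end(struct toy_mapping *m)
{
	m->i_dio_count--;
}

/* Buffered IO may bring pages in only when no DIO is in flight; a real
 * implementation would wait for i_dio_count to drop to zero. */
static bool toy_buffered_try_begin(struct toy_mapping *m)
{
	if (m->i_dio_count > 0)
		return false;
	m->i_pages++;
	return true;
}
```

Even the toy shows the interesting property: DIO stays direct only while the mapping is empty, and buffered IO cannot repopulate the mapping while a DIO is in flight.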


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  7:10       ` Dave Chinner
@ 2019-06-12  2:07         ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2019-06-12  2:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Dave Chinner, Darrick J . Wong, Zach Brown,
	Peter Zijlstra, Jens Axboe, Josef Bacik, Alexander Viro,
	Andrew Morton, Tejun Heo

On Mon, Jun 10, 2019 at 9:11 PM Dave Chinner <david@fromorbit.com> wrote:
>
> The same rwsem issues were seen on the mmap_sem, the shrinker rwsem,
> in a couple of device drivers, and so on. i.e. This isn't an XFS
> issue I'm raising here - I'm raising a concern about the lack of
> validation of core infrastructure and its suitability for
> functionality extensions.

I haven't actually seen the reports.

That said, I do think this should be improving. The random
architecture-specific code is largely going away, and we'll have a
unified rwsem.

It might obviously cause some pain initially, but I think long-term we
should be much better off, at least avoiding the "on particular
configurations" issue..

              Linus


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-11  4:33     ` Dave Chinner
@ 2019-06-12 16:21       ` Kent Overstreet
  2019-06-12 23:02         ` Dave Chinner
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2019-06-12 16:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Tue, Jun 11, 2019 at 02:33:36PM +1000, Dave Chinner wrote:
> I just recently said this with reference to the range lock stuff I'm
> working on in the background:
> 
> 	FWIW, it's to avoid problems with stupid userspace stuff
> 	that nobody really should be doing that I want range locks
> 	for the XFS inode locks.  If userspace overlaps the ranges
> 	and deadlocks in that case, then they get to keep all the
> 	broken bits because, IMO, they are doing something
> 	monumentally stupid. I'd probably be making it return
> 	EDEADLOCK back out to userspace in the case rather than
> 	deadlocking but, fundamentally, I think it's broken
> 	behaviour that we should be rejecting with an error rather
> 	than adding complexity trying to handle it.
> 
> So I think this recursive locking across a page fault case should
> just fail, not add yet more complexity to try to handle a rare
> corner case that exists more in theory than in reality. i.e. put the
> lock context in the current task, then if the page fault requires a
> conflicting lock context to be taken, we terminate the page fault,
> back out of the IO and return EDEADLOCK out to userspace. This works
> for all types of lock contexts - only the filesystem itself needs to
> know what the lock context pointer contains....

Ok, I'm totally on board with returning EDEADLOCK.

Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
in the same address space as the file being read/written to, even if the buffer
and the IO don't technically overlap?

This would simplify things a lot and eliminate a really nasty corner case - page
faults trigger readahead. Even if the buffer and the direct IO don't overlap,
readahead can pull in pages that do overlap with the dio.

And on getting EDEADLOCK we could fall back to buffered IO, so userspace would
never know...


* Re: [PATCH 01/12] Compiler Attributes: add __flatten
  2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
@ 2019-06-12 17:16   ` Greg KH
  0 siblings, 0 replies; 63+ messages in thread
From: Greg KH @ 2019-06-12 17:16 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Mon, Jun 10, 2019 at 03:14:09PM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  include/linux/compiler_attributes.h | 5 +++++
>  1 file changed, 5 insertions(+)

I know I reject patches with no changelog text at all.  You shouldn't
rely on other maintainers being more lax.

You need to describe why you are doing this at the very least, as I sure
do not know...

thanks,

greg k-h


* Re: [PATCH 11/12] closures: closure_wait_event()
  2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
  2019-06-11 10:25   ` Coly Li
@ 2019-06-12 17:17   ` Greg KH
  1 sibling, 0 replies; 63+ messages in thread
From: Greg KH @ 2019-06-12 17:17 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Mon, Jun 10, 2019 at 03:14:19PM -0400, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> ---
>  include/linux/closure.h | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)

Again, no changelog?  You are a daring developer... :)

greg k-h


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-12 16:21       ` Kent Overstreet
@ 2019-06-12 23:02         ` Dave Chinner
  2019-06-13 18:36           ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet
  2019-06-19  8:21           ` bcachefs status update (it's done cooking; let's get this sucker merged) Jan Kara
  0 siblings, 2 replies; 63+ messages in thread
From: Dave Chinner @ 2019-06-12 23:02 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo

On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> On Tue, Jun 11, 2019 at 02:33:36PM +1000, Dave Chinner wrote:
> > I just recently said this with reference to the range lock stuff I'm
> > working on in the background:
> > 
> > 	FWIW, it's to avoid problems with stupid userspace stuff
> > 	that nobody really should be doing that I want range locks
> > 	for the XFS inode locks.  If userspace overlaps the ranges
> > 	and deadlocks in that case, then they get to keep all the
> > 	broken bits because, IMO, they are doing something
> > 	monumentally stupid. I'd probably be making it return
> > 	EDEADLOCK back out to userspace in the case rather than
> > 	deadlocking but, fundamentally, I think it's broken
> > 	behaviour that we should be rejecting with an error rather
> > 	than adding complexity trying to handle it.
> > 
> > So I think this recursive locking across a page fault case should
> > just fail, not add yet more complexity to try to handle a rare
> > corner case that exists more in theory than in reality. i.e. put the
> > lock context in the current task, then if the page fault requires a
> > conflicting lock context to be taken, we terminate the page fault,
> > back out of the IO and return EDEADLOCK out to userspace. This works
> > for all types of lock contexts - only the filesystem itself needs to
> > know what the lock context pointer contains....
> 
> Ok, I'm totally on board with returning EDEADLOCK.
> 
> Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> in the same address space as the file being read/written to, even if the buffer
> and the IO don't technically overlap?

I'd say that depends on the lock granularity. For a range lock,
we'd be able to do the IO for non-overlapping ranges. For a normal
mutex or rwsem, then we risk deadlock if the page fault triggers on
the same address space host as we already have locked for IO. That's
the case we currently handle with the second IO lock in XFS, ext4,
btrfs, etc (XFS_MMAPLOCK_* in XFS).

One of the reasons I'm looking at range locks for XFS is to get rid
of the need for this second mmap lock, as there is no reason for it
existing if we can lock ranges and EDEADLOCK inside page faults and
return errors.
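
The "lock context in the current task" idea can be sketched in a few lines of userspace C. Everything here is invented for illustration - toy_task/toy_current stand in for task_struct/current, and none of this is real kernel API; it only models the self-deadlock check a page fault would perform:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy model: the IO path records which mapping it has locked in the
 * task, and a page fault that needs the same mapping bails out with
 * -EDEADLK instead of blocking on a lock it already holds. */
struct toy_mapping { int dummy; };

struct toy_task {
	struct toy_mapping *io_lock_ctx;	/* mapping held for IO, or NULL */
};

static struct toy_task toy_current;

static void toy_io_lock(struct toy_mapping *m)
{
	toy_current.io_lock_ctx = m;
}

static void toy_io_unlock(struct toy_mapping *m)
{
	(void) m;
	toy_current.io_lock_ctx = NULL;
}

/* Page fault against @m: detect the self-deadlock instead of recursing. */
static int toy_fault(struct toy_mapping *m)
{
	if (toy_current.io_lock_ctx == m)
		return -EDEADLK;	/* caller backs out of the IO */
	return 0;
}
```

On -EDEADLK the IO path backs out, and the filesystem is free to retry the operation via buffered IO so userspace never sees the error.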

> This would simplify things a lot and eliminate a really nasty corner case - page
> faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> readahead can pull in pages that do overlap with the dio.

Page cache readahead needs to be moved under the filesystem IO
locks. There was a recent thread about how readahead can race with
hole punching and other fallocate() operations because page cache
readahead bypasses the filesystem IO locks used to serialise page
cache invalidation.

e.g. Readahead can be directed by userspace via fadvise, so we now
have file->f_op->fadvise() so that filesystems can lock the inode
before calling generic_fadvise() such that page cache instantiation
and readahead dispatch can be serialised against page cache
invalidation. I have a patch for XFS sitting around somewhere that
implements the ->fadvise method.

I think there are some other patches floating around to address the
other readahead mechanisms to only be done under filesystem IO locks,
but I haven't had time to dig into it any further. Readahead from
page faults most definitely needs to be under the MMAPLOCK at
least so it serialises against fallocate()...

> And on getting EDEADLOCK we could fall back to buffered IO, so
> userspace would never know....

Yup, that's a choice that individual filesystems can make.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 10/12] bcache: move closures to lib/
  2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
  2019-06-11 10:25   ` Coly Li
@ 2019-06-13  7:28   ` Christoph Hellwig
  2019-06-13 11:04     ` Kent Overstreet
  1 sibling, 1 reply; 63+ messages in thread
From: Christoph Hellwig @ 2019-06-13  7:28 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Mon, Jun 10, 2019 at 03:14:18PM -0400, Kent Overstreet wrote:
> Prep work for bcachefs - being a fork of bcache it also uses closures

NAK.  This obfuscation needs to go away from bcache and not actually be
spread further, especially not as an API with multiple users which will
make it even harder to get rid of it.


* Re: [PATCH 10/12] bcache: move closures to lib/
  2019-06-13  7:28   ` Christoph Hellwig
@ 2019-06-13 11:04     ` Kent Overstreet
  0 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-13 11:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Thu, Jun 13, 2019 at 12:28:41AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 10, 2019 at 03:14:18PM -0400, Kent Overstreet wrote:
> > Prep work for bcachefs - being a fork of bcache it also uses closures
> 
> NAK.  This obsfucation needs to go away from bcache and not actually be
> spread further, especially not as an API with multiple users which will
> make it even harder to get rid of it.

Christoph, you've made it plenty clear how much you dislike closures in the past
but "I don't like it" is not remotely the kind of objection that is appropriate
or useful for technical discussions, and that's pretty much all you've ever
given.

If you really think that code should be gotten rid of, then maybe you should
actually _look at what they do that's not covered by other kernel
infrastructure_ and figure out something better. Otherwise... no one else seems
to care all that much about closures, and you're not actually giving any
technical feedback, so I'm not sure what you expect.


* pagecache locking (was: bcachefs status update) merged)
  2019-06-12 23:02         ` Dave Chinner
@ 2019-06-13 18:36           ` Kent Overstreet
  2019-06-13 21:13             ` Andreas Dilger
  2019-06-13 23:55             ` Dave Chinner
  2019-06-19  8:21           ` bcachefs status update (it's done cooking; let's get this sucker merged) Jan Kara
  1 sibling, 2 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-13 18:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > Ok, I'm totally on board with returning EDEADLOCK.
> > 
> > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> > in the same address space as the file being read/written to, even if the buffer
> > and the IO don't technically overlap?
> 
> I'd say that depends on the lock granularity. For a range lock,
> we'd be able to do the IO for non-overlapping ranges. For a normal
> mutex or rwsem, then we risk deadlock if the page fault triggers on
> the same address space host as we already have locked for IO. That's
> the case we currently handle with the second IO lock in XFS, ext4,
> btrfs, etc (XFS_MMAPLOCK_* in XFS).
> 
> One of the reasons I'm looking at range locks for XFS is to get rid
> of the need for this second mmap lock, as there is no reason for it
> existing if we can lock ranges and EDEADLOCK inside page faults and
> return errors.

My concern is that range locks are going to turn out to be both more complicated
and heavier weight, performance wise, than the approach I've taken of just a
single lock per address space.

Reason being range locks only help when you've got multiple operations going on
simultaneously that don't conflict - i.e. it's really only going to be useful
for applications that are doing buffered IO and direct IO simultaneously to the
same file. Personally, I think that would be a pretty gross thing to do and I'm
not particularly interested in optimizing for that case myself... but, if you
know of applications that do depend on that I might change my opinion. If not, I
want to try and get the simpler, one-lock-per-address space approach to work.

That said though - range locks on the page cache can be viewed as just a
performance optimization over my approach, they've got the same semantics
(locking a subset of the page cache vs. the entire thing). So that's a bit of a
digression.
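
For concreteness, the only extra thing a range lock arbitrates on is interval overlap - the single per-address-space lock is the degenerate case where every locker claims the whole range. A minimal sketch (invented names, illustrative only):

```c
#include <assert.h>
#include <stdbool.h>

/* Two range lockers conflict only if their byte ranges intersect.
 * A whole-address-space lock behaves as if every locker claimed
 * [0, ULONG_MAX), so any concurrent pair conflicts. */
struct toy_range {
	unsigned long start, end;	/* [start, end) */
};

static bool toy_ranges_conflict(struct toy_range a, struct toy_range b)
{
	return a.start < b.end && b.start < a.end;
}
```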

> > This would simplify things a lot and eliminate a really nasty corner case - page
> > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > readahead can pull in pages that do overlap with the dio.
> 
> Page cache readahead needs to be moved under the filesystem IO
> locks. There was a recent thread about how readahead can race with
> hole punching and other fallocate() operations because page cache
> readahead bypasses the filesystem IO locks used to serialise page
> cache invalidation.
> 
> e.g. Readahead can be directed by userspace via fadvise, so we now
> have file->f_op->fadvise() so that filesystems can lock the inode
> before calling generic_fadvise() such that page cache instantiation
> and readahead dispatch can be serialised against page cache
> invalidation. I have a patch for XFS sitting around somewhere that
> implements the ->fadvise method.

I just puked a little in my mouth.

> I think there are some other patches floating around to address the
> other readahead mechanisms to only be done under filesystem IO locks,
> but I haven't had time to dig into it any further. Readahead from
> page faults most definitely needs to be under the MMAPLOCK at
> least so it serialises against fallocate()...

So I think there's two different approaches we should distinguish between. We
can either add the locking to all the top level IO paths - what you just
described - or, the locking can be pushed down to protect _only_ adding pages to
the page cache, which is the approach I've been taking.

I think both approaches are workable, but I do think that pushing the locking
down to __add_to_page_cache_locked is fundamentally the better, more correct
approach.

 - It better matches the semantics of what we're trying to do. All these
   operations we're trying to protect - dio, fallocate, truncate - they all have
   in common that they just want to shoot down a range of the page cache and
   keep it from being readded. And in general, it's better to have locks that
   protect specific data structures ("adding to this radix tree"), vs. large
   critical sections ("the io path").

   In bcachefs, at least for buffered IO I don't currently need any per-inode IO
   locks, page granularity locks suffice, so I'd like to keep that - under the
   theory that buffered IO to pages already in cache is more of a fast path than
   faulting pages in.

 - If we go with the approach of using the filesystem IO locks, we need to be
   really careful about auditing and adding assertions to make sure we've found
   and fixed all the code paths that can potentially add pages to the page
   cache. I didn't even know about the fadvise case, eesh.

 - We still need to do something about the case where we're recursively faulting
   pages back in, which means we need _something_ in place to even detect that
   that's happening. Just trying to cover everything with the filesystem IO
   locks isn't useful here.

So to summarize - if we have locking specifically for adding pages to the page
cache, we don't need to extend the filesystem IO locks to all these places, and
we need something at that level anyways to handle recursive faults from gup().

The tricky part is that there's multiple places that want to call
add_to_page_cache() while holding this pagecache_add_lock.

 - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I
   like way better than my recursive locking approach.

 - the other case is truncate/fpunch/etc - they make use of buffered IO to
   handle operations that aren't page/block aligned. But those look a lot more
   tractable than dio, since they're calling find_get_page()/readpage() directly
   instead of via gup(), and possibly they could be structured to not have to
   truncate/punch the partial page while holding the pagecache_add_lock at all
   (but that's going to be heavily filesystem dependent).

The more I think about it, the more convinced I am that this is fundamentally
the correct approach. So, I'm going to work on an improved version of this
patch.

One other tricky thing we need is a way to write out and then evict a page
without allowing it to be redirtied - i.e. something that combines
filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise,
a process continuously redirtying a page is going to make truncate/dio
operations spin trying to shoot down the page cache - in bcachefs I'm currently
taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I
really want to get rid of that. If we can get that combined
write_and_invalidate() operation, then I think the locking will turn out fairly
clean.
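
Sketched as a toy shared/exclusive model, the semantics wanted from a lock protecting page cache insertion look roughly like this - insertion takes the lock shared, truncate/fpunch/dio take it exclusive so pages can't be re-added mid-shootdown. All names are invented, a real lock would sleep rather than report failure, and this is not the bcachefs implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-address-space "pagecache add lock"; only reports whether
 * the taker would have to block. */
struct toy_addrspace {
	int adders;	/* shared holders: add_to_page_cache() callers */
	bool blocked;	/* exclusive holder: truncate/dio in progress */
};

static bool toy_add_lock_shared(struct toy_addrspace *as)
{
	if (as->blocked)
		return false;	/* insertion must wait */
	as->adders++;
	return true;
}

static void toy_add_unlock_shared(struct toy_addrspace *as)
{
	as->adders--;
}

static bool toy_add_lock_excl(struct toy_addrspace *as)
{
	if (as->adders || as->blocked)
		return false;	/* wait out in-flight insertions */
	as->blocked = true;
	return true;
}

static void toy_add_unlock_excl(struct toy_addrspace *as)
{
	as->blocked = false;
}
```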


* Re: pagecache locking (was: bcachefs status update) merged)
  2019-06-13 18:36           ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet
@ 2019-06-13 21:13             ` Andreas Dilger
  2019-06-13 21:21               ` Kent Overstreet
  2019-06-13 23:55             ` Dave Chinner
  1 sibling, 1 reply; 63+ messages in thread
From: Andreas Dilger @ 2019-06-13 21:13 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Dave Chinner, Linus Torvalds, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

[-- Attachment #1.1: Type: text/plain, Size: 2948 bytes --]

On Jun 13, 2019, at 12:36 PM, Kent Overstreet <kent.overstreet@gmail.com> wrote:
> 
> On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote:
>> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
>>> Ok, I'm totally on board with returning EDEADLOCK.
>>> 
>>> Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
>>> in the same address space as the file being read/written to, even if the buffer
>>> and the IO don't technically overlap?
>> 
>> I'd say that depends on the lock granularity. For a range lock,
>> we'd be able to do the IO for non-overlapping ranges. For a normal
>> mutex or rwsem, then we risk deadlock if the page fault triggers on
>> the same address space host as we already have locked for IO. That's
>> the case we currently handle with the second IO lock in XFS, ext4,
>> btrfs, etc (XFS_MMAPLOCK_* in XFS).
>> 
>> One of the reasons I'm looking at range locks for XFS is to get rid
>> of the need for this second mmap lock, as there is no reason for it
>> existing if we can lock ranges and EDEADLOCK inside page faults and
>> return errors.
> 
> My concern is that range locks are going to turn out to be both more complicated
> and heavier weight, performance wise, than the approach I've taken of just a
> single lock per address space.
> 
> Reason being range locks only help when you've got multiple operations going on
> simultaneously that don't conflict - i.e. it's really only going to be useful
> for applications that are doing buffered IO and direct IO simultaneously to the
> same file. Personally, I think that would be a pretty gross thing to do and I'm
> not particularly interested in optimizing for that case myself... but, if you
> know of applications that do depend on that I might change my opinion. If not, I
> want to try and get the simpler, one-lock-per-address space approach to work.

There are definitely workloads that require multiple threads doing non-overlapping
writes to a single file in HPC.  This is becoming an increasingly common problem
as the number of cores on a single client increases, since there is typically one
thread per core trying to write to a shared file.  Using multiple files (one per
core) is possible, but that has file management issues for users when there are a
million cores running on the same job/file (obviously not on the same client node)
dumping data every hour.

We were just looking at this exact problem last week, and most of the threads are
spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty()
when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are
writing O_DIRECT instead of buffered).  Flame graph is attached for 16-thread case,
but high-end systems today easily have 2-4x that many cores.

Any approach for range locks can't be worse than spending 80% of time spinning.

Cheers, Andreas





[-- Attachment #1.2: shared_file_write.svg --]
[-- Type: image/svg+xml, Size: 161674 bytes --]

[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 873 bytes --]

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-13 21:13             ` Andreas Dilger
@ 2019-06-13 21:21               ` Kent Overstreet
  2019-06-14  0:35                 ` Dave Chinner
  0 siblings, 1 reply; 63+ messages in thread
From: Kent Overstreet @ 2019-06-13 21:21 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Dave Chinner, Linus Torvalds, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote:
> There are definitely workloads that require multiple threads doing non-overlapping
> writes to a single file in HPC.  This is becoming an increasingly common problem
> as the number of cores on a single client increases, since there is typically one
> thread per core trying to write to a shared file.  Using multiple files (one per
> core) is possible, but that has file management issues for users when there are a
> million cores running on the same job/file (obviously not on the same client node)
> dumping data every hour.

Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to
me.

> We were just looking at this exact problem last week, and most of the threads are
> spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty()
> when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are
> writing O_DIRECT instead of buffered).  Flame graph is attached for 16-thread case,
> but high-end systems today easily have 2-4x that many cores.

Yeah I've been spending some time on buffered IO performance too - 4k page
overhead is a killer.

bcachefs has a buffered write path that looks up multiple pages at a time and
locks them, and then copies the data to all the pages at once (I stole the idea
from btrfs). It was a very significant performance increase.

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n1498
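The batching idea can be sketched in userspace C roughly as follows. This is a minimal model of the scheme described above (lock a run of pages covering the write, copy into all of them in one pass, then unlock), not the actual fs/bcachefs/fs-io.c code; all names are illustrative:

```c
/*
 * Userspace sketch of a batched buffered write: take every page lock
 * covering the write up front (in ascending index order), copy the data
 * in one pass, then drop the locks.  Illustrative only -- not the real
 * bcachefs or btrfs code.
 */
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define PAGE_SIZE 4096
#define NR_PAGES  8

struct fake_page {
	pthread_mutex_t lock;
	char data[PAGE_SIZE];
};

static struct fake_page pagecache[NR_PAGES];

/* Copy 'len' bytes at byte offset 'off', taking each page lock once. */
static void batched_write(size_t off, const char *buf, size_t len)
{
	size_t first = off / PAGE_SIZE;
	size_t last  = (off + len - 1) / PAGE_SIZE;
	size_t i;

	/* Ascending lock order keeps concurrent writers deadlock-free. */
	for (i = first; i <= last; i++)
		pthread_mutex_lock(&pagecache[i].lock);

	/* One pass over the data; page locks are not retaken per page. */
	for (i = 0; i < len; i++)
		pagecache[(off + i) / PAGE_SIZE].data[(off + i) % PAGE_SIZE] = buf[i];

	for (i = first; i <= last; i++)
		pthread_mutex_unlock(&pagecache[i].lock);
}
```

The win over the generic per-page path is that lock traffic and bookkeeping are amortized over the whole write rather than paid once per 4k page.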

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-13 18:36           ` pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged)) Kent Overstreet
  2019-06-13 21:13             ` Andreas Dilger
@ 2019-06-13 23:55             ` Dave Chinner
  2019-06-14  2:30               ` Linus Torvalds
                                 ` (2 more replies)
  1 sibling, 3 replies; 63+ messages in thread
From: Dave Chinner @ 2019-06-13 23:55 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote:
> On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote:
> > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > > Ok, I'm totally on board with returning EDEADLOCK.
> > > 
> > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> > > in the same address space as the file being read/written to, even if the buffer
> > > and the IO don't technically overlap?
> > 
> > I'd say that depends on the lock granularity. For a range lock,
> > we'd be able to do the IO for non-overlapping ranges. For a normal
> > mutex or rwsem, then we risk deadlock if the page fault triggers on
> > the same address space host as we already have locked for IO. That's
> > the case we currently handle with the second IO lock in XFS, ext4,
> > btrfs, etc (XFS_MMAPLOCK_* in XFS).
> > 
> > One of the reasons I'm looking at range locks for XFS is to get rid
> > of the need for this second mmap lock, as there is no reason for it
> > existing if we can lock ranges and EDEADLOCK inside page faults and
> > return errors.
> 
> My concern is that range locks are going to turn out to be both more complicated
> and heavier weight, performance wise, than the approach I've taken of just a
> single lock per address space.

That's the battle I'm fighting at the moment with them for direct
IO(*), but range locks are something I'm doing for XFS and I don't
really care if anyone else wants to use them or not.

(*)Direct IO on XFS is a pure shared lock workload, so the rwsem
scales until single atomic update cache line bouncing limits
throughput. That means I can max out my hardware at 1.6 million
random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file
with a rwsem at 32 AIO+DIO dispatch threads. I've only got range
locks to about 1.1M IOPS on the same workload, though it's within a
couple of percent of a rwsem up to 16 threads...

(**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/
than pmem that is emulated with RAM, let alone real pmem which is
much slower at random writes than RAM.

> Reason being range locks only help when you've got multiple operations going on
> simultaneously that don't conflict - i.e. it's really only going to be useful
> for applications that are doing buffered IO and direct IO simultaneously to the
> same file.

Yes, they do that, but that's not why I'm looking at this.  Range
locks are primarily for applications that mix multiple different
types of operations to the same file concurrently. e.g:

- fallocate and read/write() can be run concurrently if they
don't overlap, but right now we serialise them because we have no
visibility into what other operations require.

- buffered read and buffered write can run concurrently if they
don't overlap, but right now they are serialised because that's the
only way to provide POSIX atomic write vs read semantics (only XFS
provides userspace with that guarantee).

- Sub-block direct IO is serialised against all other direct IO
because we can't tell if it overlaps with some other direct IO and
so we have to take the slow but safe option - range locks solve that
problem, too.

- there's inode_dio_wait() for DIO truncate serialisation
because AIO doesn't hold inode locks across IO - range locks can be
held all the way to AIO completion so we can get rid of
inode_dio_wait() in XFS and that allows truncate/fallocate to run
concurrently with non-overlapping direct IO.

- holding non-overlapping range locks on either side of page
faults which then gets rid of the need for the special mmap locking
path to serialise it against invalidation operations.

IOWs, range locks for IO solve a bunch of long term problems we have
in XFS and largely simplify the lock algorithms within the
filesystem. And it puts us on the path to introduce range locks for
extent mapping serialisation, allowing concurrent mapping lookups
and allocation within a single file. It also has the potential to
allow us to do concurrent directory modifications....
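The core of the range-lock idea above can be modelled in a few dozen lines of userspace C. This is a deliberately minimal sketch (a flat list of held ranges plus a condvar); a real kernel implementation for XFS would use an interval tree and shared/exclusive modes, and every name here is made up for illustration:

```c
/*
 * Minimal byte-range lock: holders of non-overlapping ranges proceed
 * concurrently, overlapping lockers block until the conflict clears.
 * A flat list keeps the sketch short; real code wants an interval tree.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct range {
	unsigned long start, end;	/* inclusive byte range */
	struct range *next;
};

struct range_lock_tree {
	pthread_mutex_t lock;
	pthread_cond_t	wait;
	struct range   *held;
};

#define RANGE_LOCK_TREE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

static bool ranges_overlap(const struct range *r,
			   unsigned long start, unsigned long end)
{
	return r->start <= end && start <= r->end;
}

static struct range *range_lock(struct range_lock_tree *t,
				unsigned long start, unsigned long end)
{
	struct range *r = malloc(sizeof(*r)), *i;

	r->start = start;
	r->end = end;
	pthread_mutex_lock(&t->lock);
retry:
	for (i = t->held; i; i = i->next)
		if (ranges_overlap(i, start, end)) {
			/* Conflict: sleep, then rescan from the top. */
			pthread_cond_wait(&t->wait, &t->lock);
			goto retry;
		}
	r->next = t->held;
	t->held = r;
	pthread_mutex_unlock(&t->lock);
	return r;
}

static void range_unlock(struct range_lock_tree *t, struct range *r)
{
	struct range **p;

	pthread_mutex_lock(&t->lock);
	for (p = &t->held; *p != r; p = &(*p)->next)
		;
	*p = r->next;
	pthread_cond_broadcast(&t->wait);
	pthread_mutex_unlock(&t->lock);
	free(r);
}
```

Each of the operations listed above (fallocate vs read/write, buffered read vs write, sub-block DIO, truncate vs AIO) would take its byte range through `range_lock()` instead of a whole-file rwsem, so only genuinely overlapping operations serialise.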

> Personally, I think that would be a pretty gross thing to do and I'm
> not particularly interested in optimizing for that case myself... but, if you
> know of applications that do depend on that I might change my opinion. If not, I
> want to try and get the simpler, one-lock-per-address space approach to work.
> 
> That said though - range locks on the page cache can be viewed as just a
> performance optimization over my approach, they've got the same semantics
> (locking a subset of the page cache vs. the entire thing). So that's a bit of a
> digression.

IO range locks are not "locking the page cache". IO range locks are
purely for managing concurrent IO state in a fine grained manner.
The page cache already has its own locking - that just needs to
nest inside IO range locks as the IO locks are what provide the high
level exclusion from overlapping page cache operations...

> > > This would simplify things a lot and eliminate a really nasty corner case - page
> > > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > > readahead can pull in pages that do overlap with the dio.
> > 
> > Page cache readahead needs to be moved under the filesystem IO
> > locks. There was a recent thread about how readahead can race with
> > hole punching and other fallocate() operations because page cache
> > readahead bypasses the filesystem IO locks used to serialise page
> > cache invalidation.
> > 
> > e.g. Readahead can be directed by userspace via fadvise, so we now
> > have file->f_op->fadvise() so that filesystems can lock the inode
> > before calling generic_fadvise() such that page cache instantiation
> > and readahead dispatch can be serialised against page cache
> > invalidation. I have a patch for XFS sitting around somewhere that
> > implements the ->fadvise method.
> 
> I just puked a little in my mouth.

Yeah, it's pretty gross. But the page cache simply isn't designed to
allow atomic range operations to be performed. We haven't been able to
drag it out of the 1980s - we wrote the fs/iomap.c code so we could
do range based extent mapping for IOs rather than the horrible,
inefficient page-by-page block mapping the generic page cache code
does - that gave us a 30+% increase in buffered IO throughput
because we only do a single mapping lookup per IO rather than one
per page...

That said, the page cache is still far, far slower than direct IO,
and the gap is just getting wider and wider as nvme SSDs get faster
and faster. PCIe 4 SSDs are just going to make this even more
obvious - it's getting to the point where the only reason for having
a page cache is to support mmap() and cheap systems with spinning
rust storage.

> > I think there are some other patches floating around to address the
> > other readahead mechanisms to only be done under filesystem IO locks,
> > but I haven't had time to dig into it any further. Readahead from
> > page faults most definitely needs to be under the MMAPLOCK at
> > least so it serialises against fallocate()...
> 
> So I think there's two different approaches we should distinguish between. We
> can either add the locking to all the top level IO paths - what you just
> described - or, the locking can be pushed down to protect _only_ adding pages to
> the page cache, which is the approach I've been taking.

I don't think just serialising adding pages is sufficient for
filesystems like XFS because, e.g., DAX.

> I think both approaches are workable, but I do think that pushing the locking
> down to __add_to_page_cache_locked is fundamentally the better, more correct
> approach.
> 
>  - It better matches the semantics of what we're trying to do. All these
>    operations we're trying to protect - dio, fallocate, truncate - they all have
>    in common that they just want to shoot down a range of the page cache and
>    keep it from being readded. And in general, it's better to have locks that
>    protect specific data structures ("adding to this radix tree"), vs. large
>    critical sections ("the io path").

I disagree :)

The high level IO locks provide the IO concurrency policy for the
filesystem. The page cache is an internal structure for caching
pages - it is not a structure for efficiently and cleanly
implementing IO concurrency policy. That's the mistake the current
page cache architecture makes - it tries to be the central control
for all the filesystem IO (because filesystems are dumb and the page
cache knows best!) but, unfortunately, this does not provide the
semantics or functionality that all filesystems want and/or need.

Just look at the truncate detection mess we have every time we
lookup and lock a page anywhere in the mm/ code - do you see any
code in all that which detects a hole punch race? Nope, you don't
because the filesystems take responsibility for serialising that
functionality. Unfortunately, we have so much legacy filesystem
cruft we'll never get rid of those truncate hacks.

That's my beef with relying on the page cache - the page cache is
rapidly becoming a legacy structure that only serves to slow modern
IO subsystems down. More and more we are going to bypass the page
cache and push/pull data via DMA directly into user buffers because
that's the only way we have enough CPU power in the systems to keep
the storage we have fully utilised.

That means, IMO, making the page cache central to solving an IO
concurrency problem is going the wrong way....

>    In bcachefs, at least for buffered IO I don't currently need any per-inode IO
>    locks, page granularity locks suffice, so I'd like to keep that - under the
>    theory that buffered IO to pages already in cache is more of a fast path than
>    faulting pages in.
> 
>  - If we go with the approach of using the filesystem IO locks, we need to be
>    really careful about auditing and adding assertions to make sure we've found
>    and fixed all the code paths that can potentially add pages to the page
>    cache. I didn't even know about the fadvise case, eesh.

Sure, but we've largely done that already. There aren't a lot of
places that add pages to the page cache.

>  - We still need to do something about the case where we're recursively faulting
>    pages back in, which means we need _something_ in place to even detect that
>    that's happening. Just trying to cover everything with the filesystem IO
>    locks isn't useful here.

Haven't we just addressed that with setting a current task lock
context and returning -EDEADLOCK?
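That mechanism can be modelled concretely: mark the task while it holds the IO lock, and have a recursive fault into the same mapping bail with -EDEADLK instead of self-deadlocking. The sketch below is userspace C with a thread-local variable standing in for a field on task_struct; all names are hypothetical, not an existing kernel API:

```c
/*
 * Model of per-task lock-context deadlock detection: record which
 * address space this task has IO-locked; a page fault that recurses
 * into the same address space returns -EDEADLK so the caller can fall
 * back (e.g. fault the buffer in without the lock and retry).
 */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct address_space { int dummy; };

/* Stand-in for a hypothetical current->pagecache_lock field. */
static __thread struct address_space *current_locked_mapping;

static int io_lock(struct address_space *mapping)
{
	if (current_locked_mapping == mapping)
		return -EDEADLK;	/* recursion: would self-deadlock */
	current_locked_mapping = mapping;
	/* ... take the real per-address_space lock here ... */
	return 0;
}

static void io_unlock(struct address_space *mapping)
{
	(void)mapping;
	/* ... drop the real lock here ... */
	current_locked_mapping = NULL;
}

/* A page fault during IO recurses into io_lock() on the faulted mapping. */
static int fault_in_page(struct address_space *mapping)
{
	int ret = io_lock(mapping);

	if (ret)
		return ret;	/* the buffer lives in the file under IO */
	io_unlock(mapping);
	return 0;
}
```

This is exactly the shape of the gup()-from-dio case discussed earlier in the thread: the detection costs one task-local compare, and the policy (retry without the lock, or fail the IO) stays with the caller.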

> So to summarize - if we have locking specifically for adding pages to the page
> cache, we don't need to extend the filesystem IO locks to all these places, and
> we need something at that level anyways to handle recursive faults from gup()
> anyways.
> 
> The tricky part is that there's multiple places that want to call
> add_to_page_cache() while holding this pagecache_add_lock.
> 
>  - dio -> gup(): but, you had the idea of just returning -EDEADLOCK here which I
>    like way better than my recursive locking approach.
> 
>  - the other case is truncate/fpunch/etc - they make use of buffered IO to
>    handle operations that aren't page/block aligned.  But those look a lot more
>    tractable than dio, since they're calling find_get_page()/readpage() directly
>    instead of via gup(), and possibly they could be structured to not have to
>    truncate/punch the partial page while holding the pagecache_add_lock at all
>    (but that's going to be heavily filesystem dependent).

That sounds like it is going to be messy. :(

> One other tricky thing we need is a way to write out and then evict a page
> without allowing it to be redirtied - i.e. something that combines
> filemap_write_and_wait_range() with invalidate_inode_pages2_range(). Otherwise,
> a process continuously redirtying a page is going to make truncate/dio
> operations spin trying to shoot down the page cache - in bcachefs I'm currently
> taking pagecache_add_lock in write_begin and mkwrite to prevent this, but I
> really want to get rid of that. If we can get that combined
> write_and_invalidate() operation, then I think the locking will turn out fairly
> clean.

IMO, this isn't a page cache problem - this is a filesystem
operation vs page fault serialisation issue. Why? Because the
problem exists for DAX, and it doesn't add pages to the page cache
for mappings...

i.e.  We've already solved these problems with the same high level
IO locks that solve all the truncate, hole punch, etc issues for
both the page cache and DAX operation. i.e. The MMAPLOCK prevents
the PTE being re-dirtied across the entire filesystem operation, not
just the writeback and invalidation. The XFS flush+inval code
looks like this:

	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	xfs_flush_unmap_range(ip, offset, len);

	<do operation that requires invalidated page cache>
	<flush modifications if necessary>

	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);

With range locks it looks like this;

	range_lock_init(&rl, offset, len);
	range_lock_write(&ip->i_iolock, &rl);
	xfs_flush_unmap_range(ip, offset, len);

	<do operation that requires invalidated page cache>
	<flush modified range if necessary>

	range_unlock_write(&ip->i_iolock, &rl);
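The invariant both fragments above rely on - a page cannot be redirtied between writeback and invalidation because the dirtying paths take the same lock - can be shown in a small userspace model. Everything here is an illustrative stand-in (one page, one mutex), not XFS code:

```c
/*
 * Model of flush+invalidate under one lock: page_mkwrite-style
 * redirtying takes the lock, and flush_unmap() holds that lock across
 * both writeback and eviction, so there is no window in which a
 * continuously-redirtying process can make the invalidation spin.
 */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct fake_page {
	bool present;	/* still in the "page cache"? */
	bool dirty;
};

struct fake_inode {
	pthread_mutex_t mmaplock;	/* stands in for XFS_MMAPLOCK / range lock */
	struct fake_page page;
};

/* Dirtying path (e.g. ->page_mkwrite) must take the lock first. */
static bool redirty(struct fake_inode *ip)
{
	bool ok;

	pthread_mutex_lock(&ip->mmaplock);
	ok = ip->page.present;		/* an evicted page can't be dirtied */
	if (ok)
		ip->page.dirty = true;
	pthread_mutex_unlock(&ip->mmaplock);
	return ok;
}

/* Writeback then evict, both under the one lock. */
static void flush_unmap(struct fake_inode *ip)
{
	pthread_mutex_lock(&ip->mmaplock);
	ip->page.dirty = false;		/* "filemap_write_and_wait_range()" */
	ip->page.present = false;	/* "invalidate_inode_pages2_range()" */
	/* <do operation that requires invalidated page cache> */
	pthread_mutex_unlock(&ip->mmaplock);
}
```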

---

In summary, I can see why the page cache add lock works well for
bcachefs, but on the other hand I can say that it really doesn't
match well for filesystems like XFS or for filesystems that
implement DAX and so may not be using the page cache at all...

Don't get me wrong - I'm not opposed to including page cache add
locking - I'm just saying that the problems it tries to address (and
ones it cannot address) are already largely solved in existing
filesystems. I suspect that if we do merge this code, whatever
locking is added would have to be optional....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-13 21:21               ` Kent Overstreet
@ 2019-06-14  0:35                 ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2019-06-14  0:35 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Andreas Dilger, Linus Torvalds, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 05:21:12PM -0400, Kent Overstreet wrote:
> On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote:
> > There are definitely workloads that require multiple threads doing non-overlapping
> > writes to a single file in HPC.  This is becoming an increasingly common problem
> > as the number of cores on a single client increases, since there is typically one
> > thread per core trying to write to a shared file.  Using multiple files (one per
> > core) is possible, but that has file management issues for users when there are a
> > million cores running on the same job/file (obviously not on the same client node)
> > dumping data every hour.
> 
> Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to
> me.
> 
> > We were just looking at this exact problem last week, and most of the threads are
> > spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty()
> > when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are
> > writing O_DIRECT instead of buffered).  Flame graph is attached for 16-thread case,
> > but high-end systems today easily have 2-4x that many cores.
> 
> Yeah I've been spending some time on buffered IO performance too - 4k page
> overhead is a killer.
> 
> bcachefs has a buffered write path that looks up multiple pages at a time and
> locks them, and then copies the data to all the pages at once (I stole the idea
> from btrfs). It was a very significant performance increase.

Careful with that - locking multiple pages is also a deadlock vector
that triggers unexpectedly when something conspires to lock pages in
non-ascending order. e.g.

64081362e8ff mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock

The fs/iomap.c code avoids this problem by mapping the IO first,
then iterating pages one at a time until the mapping is consumed,
then it gets another mapping. It also avoids needing to put a page
array on stack....
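The map-once-then-iterate pattern is easy to see in miniature. The sketch below is userspace C with a stubbed-out mapping callback (real code would call the filesystem's ->iomap_begin); it just counts how many mapping lookups a multi-page operation costs under each scheme:

```c
/*
 * Sketch of the fs/iomap.c pattern: get one (possibly multi-page)
 * extent mapping, consume it a page at a time, and only refill when
 * the IO crosses out of the mapping -- versus one lookup per page.
 */
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

static unsigned long nr_mapping_lookups;

struct iomap {
	unsigned long offset;	/* byte offset this mapping starts at */
	unsigned long length;	/* bytes covered by the mapping */
};

/* Stub mapping callback: pretend the file is laid out in 1 MiB extents. */
static void get_mapping(unsigned long pos, struct iomap *map)
{
	nr_mapping_lookups++;
	map->offset = pos & ~((1UL << 20) - 1);
	map->length = 1UL << 20;
}

/* Process 'len' bytes at 'pos': one page per iteration, one map per extent. */
static unsigned long iomap_apply(unsigned long pos, unsigned long len)
{
	struct iomap map = { 0, 0 };
	unsigned long pages = 0;

	while (len) {
		unsigned long step;

		if (map.length == 0 || pos >= map.offset + map.length)
			get_mapping(pos, &map);	/* mapping consumed: refill */

		step = PAGE_SIZE - (pos & (PAGE_SIZE - 1));
		if (step > len)
			step = len;
		/* lock one page, copy 'step' bytes, unlock the page ... */
		pos += step;
		len -= step;
		pages++;
	}
	return pages;
}
```

Because only one page is locked at a time, the ascending-order deadlock problem never arises, and a 1 MiB IO costs a single mapping lookup instead of 256.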

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-13 23:55             ` Dave Chinner
@ 2019-06-14  2:30               ` Linus Torvalds
  2019-06-14  7:30                 ` Dave Chinner
  2019-06-14  3:08               ` Linus Torvalds
  2019-06-14 17:08               ` pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged)) Kent Overstreet
  2 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2019-06-14  2:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
>
> That said, the page cache is still far, far slower than direct IO,

Bullshit, Dave.

You've made that claim before, and it's been complete bullshit before
too, and I've called you out on it then too.

Why do you continue to make this obviously garbage argument?

The key word in the "page cache" name is "cache".

Caches work, Dave. Anybody who thinks caches don't work is
incompetent. 99% of all filesystem accesses are cached, and they never
do any IO at all, and the page cache handles them beautifully.

When you say that the page cache is slower than direct IO, it's
because you don't even see or care about the *fast* case. You only get
involved once there is actual IO to be done.

So you're making that statement without taking into account all the
cases that you don't see, and that you don't care about, because the
page cache has already handled them for you, and done so much better
than DIO can do or ever _will_ do.

Is direct IO faster when you *know* it's not cached, and shouldn't be
cached? Sure. But that's actually quite rare.

How often do you use non-temporal stores when you do non-IO
programming? Approximately never, perhaps? Because caches work.

And no, SSD's haven't made caches irrelevant. Not doing IO at all is
still orders of magnitude faster than doing IO. And it's not clear
nvdimms will either.

So stop with the stupid and dishonest argument already, where you
ignore the effects of caching.

                Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-13 23:55             ` Dave Chinner
  2019-06-14  2:30               ` Linus Torvalds
@ 2019-06-14  3:08               ` Linus Torvalds
  2019-06-15  4:01                 ` Linus Torvalds
  2019-06-14 17:08               ` pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged)) Kent Overstreet
  2 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2019-06-14  3:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
>
> - buffered read and buffered write can run concurrently if they
> don't overlap, but right now they are serialised because that's the
> only way to provide POSIX atomic write vs read semantics (only XFS
> provides userspace with that guarantee).

I do not believe that posix itself actually requires that at all,
although extended standards may.

That said, from a quality of implementation standpoint, it's obviously
a good thing to do, so it might be worth looking at if something
reasonable can be done. The XFS atomicity guarantees are better than
what other filesystems give, but they might also not be exactly
required.

But POSIX actually ends up being pretty lax, and says

  "Writes can be serialized with respect to other reads and writes. If
a read() of file data can be proven (by any means) to occur after a
write() of the data, it must reflect that write(), even if the calls
are made by different processes. A similar requirement applies to
multiple write operations to the same file position. This is needed to
guarantee the propagation of data from write() calls to subsequent
read() calls. This requirement is particularly significant for
networked file systems, where some caching schemes violate these
semantics."

Note the "can" in "can be serialized", not "must". Also note that
whole language about how the read file data must match the written
data only if the read can be proven to have occurred after a write of
that data.  Concurrency is very much left in the air, only provably
serial operations matter.

(There is also language that talks about "after the write has
successfully returned" etc - again, it's about reads that occur
_after_ the write, not concurrently with the write).

The only atomicity guarantees are about the usual pipe writes and
PIPE_BUF. Those are very explicit.

Of course, there are lots of standards outside of just the POSIX
read/write thing, so you may be thinking of some other stricter
standard. POSIX itself has always been pretty permissive.

And as mentioned, I do agree from a QoI standpoint that atomicity is
nice, and that the XFS behavior is better. However, it does seem that
nobody really cares, because I'm not sure we've ever done it in
general (although we do have that i_rwsem, but I think it's mainly
used to give the proper lseek behavior). And so the XFS behavior may
not necessarily be *worth* it, although I presume you have some test
for this as part of xfstests.

                Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-14  2:30               ` Linus Torvalds
@ 2019-06-14  7:30                 ` Dave Chinner
  2019-06-15  1:15                   ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-06-14  7:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 04:30:36PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 1:56 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > That said, the page cache is still far, far slower than direct IO,
> 
> Bullshit, Dave.
> 
> You've made that claim before, and it's been complete bullshit before
> too, and I've called you out on it then too.

Yes, your last run of insulting rants on this topic resulted in me
pointing out your CoC violations because you were unable to listen
or discuss the subject matter in a civil manner. And you've started
right where you left off last time....

> Why do you continue to make this obviously garbage argument?
> 
> The key word in the "page cache" name is "cache".
> 
> Caches work, Dave.

Yes, they do, I see plenty of cases where the page cache works just
fine because it is still faster than most storage. But that's _not
what I said_.

Indeed, you haven't even bothered to ask me to clarify what I was
refering to in the statement you quoted. IOWs, you've taken _one
single statement_ I made from a huge email about complexities in
dealing with IO concurency, the page cache and architectural flaws n
the existing code, quoted it out of context, fabricated a completely
new context and started ranting about how I know nothing about how
caches or the page cache work.

Not very professional but, unfortunately, an entirely predictable
and _expected_ response.

Linus, nobody can talk about direct IO without you screaming and
tossing all your toys out of the crib. If you can't be civil or you
find yourself writing some condescending "caching 101" explanation
to someone who has spent the last 15+ years working with filesystems
and caches, then you're far better off not saying anything.

---

So, in the interests of further _civil_ discussion, let me clarify
my statement for you: for a highly concurrent application that is
crunching through bulk data on large files on high throughput
storage, the page cache is still far, far slower than direct IO.

Which comes back to this statement you made:

> Is direct IO faster when you *know* it's not cached, and shouldn't
> be cached? Sure. But that's actually quite rare.

This is where I think you get the wrong end of the stick, Linus.

The world I work in has a significant proportion of applications
where the data set is too large to be cached effectively or is
better cached by the application than the kernel. IOWs, data being
cached efficiently by the page cache is the exception rather than
the rule. Hence, they use direct IO because it is faster than the
page cache. This is common in applications like major enterprise
databases, HPC apps, data mining/analysis applications, etc. and
there's an awful lot of the world that runs on these apps....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-13 23:55             ` Dave Chinner
  2019-06-14  2:30               ` Linus Torvalds
  2019-06-14  3:08               ` Linus Torvalds
@ 2019-06-14 17:08               ` Kent Overstreet
  2 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-06-14 17:08 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Fri, Jun 14, 2019 at 09:55:24AM +1000, Dave Chinner wrote:
> On Thu, Jun 13, 2019 at 02:36:25PM -0400, Kent Overstreet wrote:
> > On Thu, Jun 13, 2019 at 09:02:24AM +1000, Dave Chinner wrote:
> > > On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > > > Ok, I'm totally on board with returning EDEADLOCK.
> > > > 
> > > > Question: Would we be ok with returning EDEADLOCK for any IO where the buffer is
> > > > in the same address space as the file being read/written to, even if the buffer
> > > > and the IO don't technically overlap?
> > > 
> > > I'd say that depends on the lock granularity. For a range lock,
> > > we'd be able to do the IO for non-overlapping ranges. For a normal
> > > mutex or rwsem, then we risk deadlock if the page fault triggers on
> > > the same address space host as we already have locked for IO. That's
> > > the case we currently handle with the second IO lock in XFS, ext4,
> > > btrfs, etc (XFS_MMAPLOCK_* in XFS).
> > > 
> > > One of the reasons I'm looking at range locks for XFS is to get rid
> > > of the need for this second mmap lock, as there is no reason for it
> > > existing if we can lock ranges and EDEADLOCK inside page faults and
> > > return errors.
> > 
> > My concern is that range locks are going to turn out to be both more complicated
> > and heavier weight, performance wise, than the approach I've taken of just a
> > single lock per address space.
> 
> That's the battle I'm fighting at the moment with them for direct
> IO(*), but range locks are something I'm doing for XFS and I don't
> really care if anyone else wants to use them or not.

I'm not saying I won't use them :)

I just want to do the simple thing first, and then if range locks turn out well
I don't think it'll be hard to switch bcachefs to them. Or if the simple thing
turns out to be good enough, even better :)

> (*)Direct IO on XFS is a pure shared lock workload, so the rwsem
> scales until single atomic update cache line bouncing limits
> throughput. That means I can max out my hardware at 1.6 million
> random 4k read/write IOPS (a bit over 6GB/s)(**) to a single file
> with a rwsem at 32 AIO+DIO dispatch threads. I've only got range
> locks to about 1.1M IOPS on the same workload, though it's within a
> couple of percent of a rwsem up to 16 threads...
> 
> (**) A small handful of nvme SSDs fed by AIO+DIO are /way faster/
> than pmem that is emulated with RAM, let alone real pmem which is
> much slower at random writes than RAM.

So something I should mention is that I've been 


> 
> > Reason being range locks only help when you've got multiple operations going on
> > simultaneously that don't conflict - i.e. it's really only going to be useful
> > for applications that are doing buffered IO and direct IO simultaneously to the
> > same file.
> 
> Yes, they do that, but that's not why I'm looking at this.  Range
> locks are primarily for applications that mix multiple different
> types of operations to the same file concurrently. e.g:
> 
> - fallocate and read/write() can be run concurrently if they
> don't overlap, but right now we serialise them because we have no
> visibility into what other operations require.

True true. Have you ever seen this be an issue for real applications?

> - buffered read and buffered write can run concurrently if they
> don't overlap, but right now they are serialised because that's the
> only way to provide POSIX atomic write vs read semantics (only XFS
> provides userspace with that guarantee).

We already talked about this on IRC, but it's not the _only_ way - page locks
suffice if we lock all the pages for the read/write at once, and that's actually
a really good thing to do for performance.

bcachefs doesn't currently provide any guarantees here, but since we already are
batching up the page operations for buffered writes (and I have patches to do so
for the buffered read path in filemap.c) I will tweak that slightly to provide
some sort of guarantee.

This is something I care about in bcachefs because I'm pretty sure I can make
both the buffered read and write paths work without taking any per inode locks -
so unrelated IOs to the same file won't be modifying shared cachelines at
all. Haven't gotten around to it yet for the buffered write path, but it's on
the todo list.

> - Sub-block direct IO is serialised against all other direct IO
> because we can't tell if it overlaps with some other direct IO and
> so we have to take the slow but safe option - range locks solve that
> problem, too.

This feels like an internal filesystem implementation detail (not objecting,
just doesn't sound like something applicable outside xfs to me).
 
> - there's inode_dio_wait() for DIO truncate serialisation
> because AIO doesn't hold inode locks across IO - range locks can be
> held all the way to AIO completion so we can get rid of
> inode_dio_wait() in XFS and that allows truncate/fallocate to run
> concurrently with non-overlapping direct IO.

getting rid of inode_dio_wait() and unifying that with other locking would be
great, for sure.

> - holding non-overlapping range locks on either side of page
> faults which then gets rid of the need for the special mmap locking
> path to serialise it against invalidation operations.
> 
> IOWs, range locks for IO solve a bunch of long term problems we have
> in XFS and largely simplify the lock algorithms within the
> filesystem. And it puts us on the path to introduce range locks for
> extent mapping serialisation, allowing concurrent mapping lookups
> and allocation within a single file. It also has the potential to
> allow us to do concurrent directory modifications....

*nod*

Don't take any of what I'm saying as arguing against range locks. They don't
seem _as_ useful to me for bcachefs because a lot of what you want them for I
can already do concurrently - but if they turn out to work better than my
pagecache add lock I'll happily switch to them later. I'm just reserving
judgement until I see them and get to play with them.

> IO range locks are not "locking the page cache". IO range locks are
> purely for managing concurrent IO state in a fine grained manner.
> The page cache already has its own locking - that just needs to
> nest inside IO range locks, as the IO locks are what provide the high
> level exclusion from overlapping page cache operations...

I suppose I worded it badly, my point was just that range locks can be viewed as
a superset of my pagecache add lock so switching to them would be an easy
conversion.

> > > > This would simplify things a lot and eliminate a really nasty corner case - page
> > > > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > > > readahead can pull in pages that do overlap with the dio.
> > > 
> > > Page cache readahead needs to be moved under the filesystem IO
> > > locks. There was a recent thread about how readahead can race with
> > > hole punching and other fallocate() operations because page cache
> > > readahead bypasses the filesystem IO locks used to serialise page
> > > cache invalidation.
> > > 
> > > e.g. Readahead can be directed by userspace via fadvise, so we now
> > > have file->f_op->fadvise() so that filesystems can lock the inode
> > > before calling generic_fadvise() such that page cache instantiation
> > > and readahead dispatch can be serialised against page cache
> > > invalidation. I have a patch for XFS sitting around somewhere that
> > > implements the ->fadvise method.
> > 
> > I just puked a little in my mouth.
> 
> Yeah, it's pretty gross. But the page cache simply isn't designed to
> allow atomic range operations to be performed. We haven't been able to
> drag it out of the 1980s - we wrote the fs/iomap.c code so we could
> do range based extent mapping for IOs rather than the horrible,
> inefficient page-by-page block mapping the generic page cache code
> does - that gave us a 30+% increase in buffered IO throughput
> because we only do a single mapping lookup per IO rather than one
> per page...

I don't think there's anything _fundamentally_ wrong with the pagecache design -
i.e. a radix tree of pages.

We do _badly_ need support for arbitrary power-of-2 sized pages in the
pagecache (and it _really_ needs to be compound pages, not just hugepages - a
lot of the work that's been done just sprinkles if (hugepage) special cases
through page cache code, which is fundamentally misguided) - the overhead of
working with 4k pages has gotten to be completely absurd. But AFAIK the radix
tree/xarray side of that is done; what
needs to be fixed is all the stuff in e.g. filemap.c that assumes it's iterating
over fixed size pages - that code needs to be reworked to be more like iterating
over extents.

> That said, the page cache is still far, far slower than direct IO,
> and the gap is just getting wider and wider as nvme SSDs get faster
> and faster. PCIe 4 SSDs are just going to make this even more
> obvious - it's getting to the point where the only reason for having
> a page cache is to support mmap() and cheap systems with spinning
> rust storage.

It is true that buffered IO performance is _way_ behind direct IO performance
today in a lot of situations, but this is mostly due to the fact that we've
completely dropped the ball and it's 2019 and we're STILL DOING EVERYTHING 4k AT
A TIME.

> > I think both approaches are workable, but I do think that pushing the locking
> > down to __add_to_page_cache_locked is fundamentally the better, more correct
> > approach.
> > 
> >  - It better matches the semantics of what we're trying to do. All these
> >    operations we're trying to protect - dio, fallocate, truncate - they all have
> >    in common that they just want to shoot down a range of the page cache and
> >    keep it from being readded. And in general, it's better to have locks that
> >    protect specific data structures ("adding to this radix tree"), vs. large
> >    critical sections ("the io path").
> 
> I disagree :)
> 
> The high level IO locks provide the IO concurrency policy for the
> filesystem. The page cache is an internal structure for caching
> pages - it is not a structure for efficiently and cleanly
> implementing IO concurrency policy. That's the mistake the current
> page cache architecture makes - it tries to be the central control
> for all the filesystem IO (because filesystems are dumb and the page
> cache knows best!) but, unfortunately, this does not provide the
> semantics or functionality that all filesystems want and/or need.

We might be talking past each other a bit.

I think we both agree that there should be a separation of concerns w.r.t.
locking, and that hanging too much stuff off the page cache has been a mistake
in the past (e.g. buffer heads).

But that doesn't mean the page cache shouldn't have locking, just that that
locking should only protect the page cache, not other filesystem state.

I am _not_ arguing that this should replace filesystem IO path locks for other,
not page cache purposes.

> Just look at the truncate detection mess we have every time we
> lookup and lock a page anywhere in the mm/ code - do you see any
> code in all that which detects a hole punch race? Nope, you don't
> because the filesystems take responsibility for serialising that
> functionality.

Yeah, I know.

> Unfortunately, we have so much legacy filesystem cruft we'll never get rid of
> those truncate hacks.

Actually - I honestly think that this page cache add lock is simple and
straightforward enough to use that converting legacy filesystems to it wouldn't
be crazypants, and getting rid of all those hacks would be _awesome_. The legacy
filesystems tend not to implement the really crazy fallocate stuff; it's just
truncate that really has to be dealt with.

> Don't get me wrong - I'm not opposed to including page cache add
> locking - I'm just saying that the problems it tries to address (and
> ones it cannot address) are already largely solved in existing
> filesystems. I suspect that if we do merge this code, whatever
> locking is added would have to be optional....

That's certainly a fair point. There is more than one way to skin a cat.

Not sure "largely solved" is true though - solved in XFS yes, but these issues
have not been taken seriously in other filesystems up until pretty recently...

That said, I do think this will prove useful for non bcachefs filesystems at
some point. Like you alluded to, truncate synchronization is hot garbage in most
filesystems, and this would give us an easy way of improving that and possibly
getting rid of those truncate hacks in filemap.c.

And because (on the add side at least) this lock only needs to taken when adding
pages to the page cache, not when the pages needed are already present - in
bcachefs at least this is a key enabler for making buffered IO not modify any
per-inode cachelines in the fast path, and I think other filesystems could do
the same thing.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-14  7:30                 ` Dave Chinner
@ 2019-06-15  1:15                   ` Linus Torvalds
  0 siblings, 0 replies; 63+ messages in thread
From: Linus Torvalds @ 2019-06-15  1:15 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 9:31 PM Dave Chinner <david@fromorbit.com> wrote:
>
> Yes, they do, I see plenty of cases where the page cache works just
> fine because it is still faster than most storage. But that's _not
> what I said_.

I only quoted one small part of your email, because I wanted to point
out how you again dismissed caches.

And yes, that literally _is_ what you said. In other parts of that
same email you said

   "..it's getting to the point where the only reason for having
    a page cache is to support mmap() and cheap systems with spinning
    rust storage"

and

  "That's my beef with relying on the page cache - the page cache is
   rapidly becoming a legacy structure that only serves to slow modern
   IO subsystems down"

and your whole email was basically a rant against the page cache.

So I only quoted the bare minimum, and pointed out that caching is
still damn important.

Because most loads cache well.

Now you're back-tracking a bit from your statements, but don't go
saying I was misreading you. How else could the above be read? You
really were saying that caching was "legacy". I called you out on it.
Now you're trying to back-track.

Yes, you have loads that don't cache well. But that does not mean that
caching has somehow become irrelevant in the big picture or a "legacy"
thing at all.

The thing is, I don't even hate DIO. But we always end up clashing
because you seem to have this mindset where nothing else matters
(which really came through in that email I replied to).

Do you really wonder why I point out that caching is important?
Because you seem to actively claim caching doesn't matter. Are you
happier now that I quoted more of your emails back to you?

>         IOWs, you've taken _one
> single statement_ I made from a huge email about complexities in
> dealing with IO concurrency, the page cache and architectural flaws in
> the existing code, quoted it out of context, fabricated a completely
> new context and started ranting about how I know nothing about how
> caches or the page cache work.

See above. I cut things down a lot, but it wasn't a single statement
at all. I just boiled it down to the basics.

> Linus, nobody can talk about direct IO without you screaming and
> tossing all your toys out of the crib.

Dave, look in the mirror some day. You might be surprised.

> So, in the interests of further _civil_ discussion, let me clarify
> my statement for you: for a highly concurrent application that is
> crunching through bulk data on large files on high throughput
> storage, the page cache is still far, far slower than direct IO.

.. and Christ, Dave, we even _agree_ on this.

But when DIO becomes an issue is when you try to claim it makes the
page cache irrelevant, or a problem.

I also take issue with you then making statements that seem to be
explicitly designed to be misleading. For DIO, you talk about how XFS
has no serialization and gets great performance. Then in the very next
email, you talk about how you think buffered IO has to be excessively
serialized, and how XFS is the only one who does it properly, and how
that is a problem for performance. But as far as I can tell, the
serialization rule you quote is simply not true. But for you it is,
and only for buffered IO.

It's really as if you were actively trying to make the non-DIO case
look bad by picking and choosing your rules.

And the thing is, I suspect that the overlap between DIO and cached IO
shouldn't even need to be there. We've generally tried to just not
have them interact at all, by just having DIO invalidate the caches
(which is really really cheap if they don't exist - which should be
the common case by far!). People almost never mix the two at all, and
we might be better off aiming to separate them out even more than we
do now.

That's actually the part I like best about the page cache add lock - I
may not be a great fan of yet another ad-hoc lock - but I do like how
it adds minimal overhead to the cached case (because by definition,
the good cached case is when you don't need to add new pages), while
hopefully working well together with the whole "invalidate existing
caches" case for DIO.

I know you don't like the cache flush and invalidation stuff for some
reason, but I don't even understand why you care. Again, if you're
actually just doing all DIO, the caches will be empty and not be in
your way. So normally all that should be really really cheap. Flushing
and invalidating caches that don't exists isn't really complicated, is
it?

And if cached state *does* exist, and if it can't be invalidated (for
example, existing busy mmap or whatever), maybe the solution there is
"always fall back to buffered/cached IO".

For the cases you care about, that should never happen, after all.

IOW, if anything, I think we should strive for a situation where the
whole DIO vs cached becomes even _more_ independent. If there are busy
caches, just fall back to cached IO. It will have lower IO throughput,
but that's one of the _points_ of caches - they should decrease the
need for IO, and less IO is what it's all about.

So I don't understand why you hate the page cache so much. For the
cases you care about, the page cache should be a total non-issue. And
if the page cache does exist, then it almost by definition means that
it's not a case you care about.

And yes, yes, maybe some day people won't have SSD's at all, and it's
all nvdimm's and all filesystem data accesses are DAX, and caching is
all done by hardware and the page cache will never exist at all. At
that point a page cache will be legacy.

But honestly, that day is not today. It's decades away, and might
never happen at all.

So in the meantime, don't pooh-pooh the page cache. It works very well
indeed, and I say that as somebody who has refused to touch spinning
media (or indeed bad SSD's) for a decade.

              Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-14  3:08               ` Linus Torvalds
@ 2019-06-15  4:01                 ` Linus Torvalds
  2019-06-17 22:47                   ` Dave Chinner
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2019-06-15  4:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I do not believe that posix itself actually requires that at all,
> although extended standards may.

So I tried to see if I could find what this perhaps alludes to.

And I suspect it's not in the read/write thing, but the pthreads side
talks about atomicity.

Interesting, but I doubt if that's actually really intentional, since
the non-thread read/write behavior specifically seems to avoid the
whole concurrency issue.

The pthreads atomicity thing seems to be about not splitting up IO and
doing it in chunks when you have m:n threading models, but can be
(mis-)construed to have threads given higher atomicity guarantees than
processes.

               Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-15  4:01                 ` Linus Torvalds
@ 2019-06-17 22:47                   ` Dave Chinner
  2019-06-17 23:38                     ` Linus Torvalds
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-06-17 22:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Fri, Jun 14, 2019 at 06:01:07PM -1000, Linus Torvalds wrote:
> On Thu, Jun 13, 2019 at 5:08 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > I do not believe that posix itself actually requires that at all,
> > although extended standards may.
> 
> So I tried to see if I could find what this perhaps alludes to.
> 
> And I suspect it's not in the read/write thing, but the pthreads side
> talks about atomicity.
>
> Interesting, but I doubt if that's actually really intentional, since
> the non-thread read/write behavior specifically seems to avoid the
> whole concurrency issue.

The wording of posix changes every time they release a new version
of the standard, and it's _never_ obvious what behaviour the
standard is actually meant to define. They are always written with
sufficient ambiguity and wiggle room that they could mean
_anything_. The POSIX 2017.1 standard you quoted is quite different
to older versions, but it's no less ambiguous...

> The pthreads atomicity thing seems to be about not splitting up IO and
> doing it in chunks when you have m:n threading models, but can be
> (mis-)construed to have threads given higher atomicity guarantees than
> processes.

Right, but regardless of the spec we have to consider that the
> behaviour of XFS comes from its Irix heritage (actually from EFS,
the predecessor of XFS from the late 1980s). i.e. the IO exclusion
model dates to long before POSIX had anything to say about pthreads,
> and its wording about atomicity could only refer to
multi-process interactions.

> These days, however, there is the unfortunate reality of a long tail of
applications developed on other Unix systems under older POSIX
specifications that are still being ported to and deployed on Linux.
Hence the completely ambiguous behaviours defined in the older specs
are still just as important these days as the completely ambiguous
behaviours defined in the new specifications. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-17 22:47                   ` Dave Chinner
@ 2019-06-17 23:38                     ` Linus Torvalds
  2019-06-18  4:21                       ` Amir Goldstein
  0 siblings, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2019-06-17 23:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Amir Goldstein, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Mon, Jun 17, 2019 at 3:48 PM Dave Chinner <david@fromorbit.com> wrote:
>
> The wording of posix changes every time they release a new version
> of the standard, and it's _never_ obvious what behaviour the
> standard is actually meant to define. They are always written with
> sufficient ambiguity and wiggle room that they could mean
> _anything_. The POSIX 2017.1 standard you quoted is quite different
> to older versions, but it's no less ambiguous...

POSIX has always been pretty lax, partly because all the Unixes did
> things differently, but partly because it then also ended up
> trying to work for the VMS and Windows POSIX subsystems..

So yes, the language tends to be intentionally not all that strict.

> > The pthreads atomicity thing seems to be about not splitting up IO and
> > doing it in chunks when you have m:n threading models, but can be
> > (mis-)construed to have threads given higher atomicity guarantees than
> > processes.
>
> Right, but regardless of the spec we have to consider that the
> behaviour of XFS comes from its Irix heritage (actually from EFS,
> the predecessor of XFS from the late 1980s)

Sure. And as I mentioned, I think it's technically the nicer guarantee.

That said, it's a pretty *expensive* guarantee. It's one that you
yourself are not willing to give for O_DIRECT IO.

And it's not a guarantee that Linux has ever had. In fact, it's not
even something I've ever seen anybody ever depend on.

I agree that it's possible that some app out there might depend on
that kind of guarantee, but I also suspect it's much much more likely
that it's the other way around: XFS is being unnecessarily strict,
because everybody is testing against filesystems that don't actually
give the total atomicity guarantees.

Nobody develops for other unixes any more (and nobody really ever did
it by reading standards papers - even if they had been very explicit).

And honestly, the only people who really do threaded accesses to the same file

 (a) don't want that guarantee in the first place

 (b) are likely to use direct-io that apparently doesn't give that
atomicity guarantee even on xfs

so I do think it's moot.

End result: if we had a really cheap range lock, I think it would be a
good idea to use it (for the whole QoI implementation), but for
practical reasons it's likely better to just stick to the current lack
of serialization because it performs better and nobody really seems to
want anything else anyway.

                  Linus

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-17 23:38                     ` Linus Torvalds
@ 2019-06-18  4:21                       ` Amir Goldstein
  2019-06-19 10:38                         ` Jan Kara
  0 siblings, 1 reply; 63+ messages in thread
From: Amir Goldstein @ 2019-06-18  4:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

> > Right, but regardless of the spec we have to consider that the
> > behaviour of XFS comes from its Irix heritage (actually from EFS,
> > the predecessor of XFS from the late 1980s)
>
> Sure. And as I mentioned, I think it's technically the nicer guarantee.
>
> That said, it's a pretty *expensive* guarantee. It's one that you
> yourself are not willing to give for O_DIRECT IO.
>
> And it's not a guarantee that Linux has ever had. In fact, it's not
> even something I've ever seen anybody ever depend on.
>
> I agree that it's possible that some app out there might depend on
> that kind of guarantee, but I also suspect it's much much more likely
> that it's the other way around: XFS is being unnecessarily strict,
> because everybody is testing against filesystems that don't actually
> give the total atomicity guarantees.
>
> Nobody develops for other unixes any more (and nobody really ever did
> it by reading standards papers - even if they had been very explicit).
>
> And honestly, the only people who really do threaded accesses to the same file
>
>  (a) don't want that guarantee in the first place
>
>  (b) are likely to use direct-io that apparently doesn't give that
> atomicity guarantee even on xfs
>
> so I do think it's moot.
>
> End result: if we had a really cheap range lock, I think it would be a
> good idea to use it (for the whole QoI implementation), but for
> practical reasons it's likely better to just stick to the current lack
> of serialization because it performs better and nobody really seems to
> want anything else anyway.
>

This is the point in the conversation where somebody usually steps in
and says "let the user/distro decide". Distro maintainers are in a much
better position to take the risk of breaking hypothetical applications.

I should point out that even if "strict atomic rw" behavior is desired, then
page cache warmup [1] significantly improves performance.
Having mentioned that, the discussion can now return to what is the
preferred way to solve the punch hole vs. page cache add race.

XFS may end up with specially tailored range locks, which brings some
other benefits to XFS, but all filesystems need a solution for the punch
hole vs. page cache add race.
Jan recently took a stab at it for ext4 [2], but that didn't work out.
So I wonder what everyone thinks about Kent's page add lock as the
solution to the problem.
Allegedly, all filesystems (XFS included) are potentially exposed to
stale data exposure/data corruption.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/20190404165737.30889-1-amir73il@gmail.com/
[2] https://lore.kernel.org/linux-fsdevel/20190603132155.20600-3-jack@suse.cz/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-12 23:02         ` Dave Chinner
  2019-06-13 18:36           ` pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged)) Kent Overstreet
@ 2019-06-19  8:21           ` Jan Kara
  2019-07-03  1:04             ` [PATCH] mm: Support madvise_willneed override by Filesystems Boaz Harrosh
  1 sibling, 1 reply; 63+ messages in thread
From: Jan Kara @ 2019-06-19  8:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Kent Overstreet, Linus Torvalds, Dave Chinner, Waiman Long,
	Peter Zijlstra, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Darrick J . Wong, Zach Brown, Jens Axboe,
	Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo

On Thu 13-06-19 09:02:24, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > This would simplify things a lot and eliminate a really nasty corner case - page
> > faults trigger readahead. Even if the buffer and the direct IO don't overlap,
> > readahead can pull in pages that do overlap with the dio.
> 
> Page cache readahead needs to be moved under the filesystem IO
> locks. There was a recent thread about how readahead can race with
> hole punching and other fallocate() operations because page cache
> readahead bypasses the filesystem IO locks used to serialise page
> cache invalidation.
> 
> e.g. Readahead can be directed by userspace via fadvise, so we now
> have file->f_op->fadvise() so that filesystems can lock the inode
> before calling generic_fadvise() such that page cache instantiation
> and readahead dispatch can be serialised against page cache
> invalidation. I have a patch for XFS sitting around somewhere that
> implements the ->fadvise method.
> 
> I think there are some other patches floating around to address the
> other readahead mechanisms to only be done under filesytem IO locks,
> but I haven't had time to dig into it any further. Readahead from
> page faults most definitely needs to be under the MMAPLOCK at
> least so it serialises against fallocate()...

Yes, I have patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
well. I'll post it soon since the rest of the series isn't really dependent
on it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-18  4:21                       ` Amir Goldstein
@ 2019-06-19 10:38                         ` Jan Kara
  2019-06-19 22:37                           ` Dave Chinner
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Kara @ 2019-06-19 10:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Linus Torvalds, Dave Chinner, Kent Overstreet, Dave Chinner,
	Darrick J . Wong, Christoph Hellwig, Matthew Wilcox, Jan Kara,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Tue 18-06-19 07:21:56, Amir Goldstein wrote:
> > > Right, but regardless of the spec we have to consider that the
> > > behaviour of XFS comes from its Irix heritage (actually from EFS,
> > > the predecessor of XFS from the late 1980s)
> >
> > Sure. And as I mentioned, I think it's technically the nicer guarantee.
> >
> > That said, it's a pretty *expensive* guarantee. It's one that you
> > yourself are not willing to give for O_DIRECT IO.
> >
> > And it's not a guarantee that Linux has ever had. In fact, it's not
> > even something I've ever seen anybody ever depend on.
> >
> > I agree that it's possible that some app out there might depend on
> > that kind of guarantee, but I also suspect it's much much more likely
> > that it's the other way around: XFS is being unnecessarily strict,
> > because everybody is testing against filesystems that don't actually
> > give the total atomicity guarantees.
> >
> > Nobody develops for other unixes any more (and nobody really ever did
> > it by reading standards papers - even if they had been very explicit).
> >
> > And honestly, the only people who really do threaded accesses to the same file
> >
> >  (a) don't want that guarantee in the first place
> >
> >  (b) are likely to use direct-io that apparently doesn't give that
> > atomicity guarantee even on xfs
> >
> > so I do think it's moot.
> >
> > End result: if we had a really cheap range lock, I think it would be a
> > good idea to use it (for the whole QoI implementation), but for
> > practical reasons it's likely better to just stick to the current lack
> > of serialization because it performs better and nobody really seems to
> > want anything else anyway.
> >
> 
> This is the point in the conversation where somebody usually steps in
> and says "let the user/distro decide". Distro maintainers are in a much
> better position to take the risk of breaking hypothetical applications.
> 
> I should point out that even if "strict atomic rw" behavior is desired, then
> page cache warmup [1] significantly improves performance.
> Having mentioned that, the discussion can now return to what is the
> preferred way to solve the punch hole vs. page cache add race.
> 
> XFS may end up with special tailored range locks, which bring some
> other benefits to XFS, but all filesystems need the solution for the punch
> hole vs. page cache add race.
> Jan recently took a stab at it for ext4 [2], but that didn't work out.

Yes, but I have an idea how to fix it. I just need to push acquiring ext4's
i_mmap_sem down a bit further so that only page cache filling is protected
by it but not copying of data out to userspace. But I didn't get to coding
it last week due to other stuff.

> So I wonder what everyone thinks about Kent's page add lock as the
> solution to the problem.
> Allegedly, all filesystems (XFS included) are potentially exposed to
> stale data exposure/data corruption.

When we first realized that hole-punching vs. page-fault races are an issue, I
was looking into a generic solution but at that time there was a sentiment
against adding another rwsem to struct address_space (or inode) so we ended
up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other
filesystems these days. If people think that growing struct inode for
everybody is OK, we can think about lifting private filesystem solutions
into a generic one. I'm fine with that.

That being said, as Dave said, we also use those fs-private locks for
serializing against equivalent issues arising with DAX. So the problem is
not only about page cache but generally about doing IO and caching
block mapping information for a file range. So the solution should not be
too tied to page cache.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking (was: bcachefs status update (it's done cooking; let's get this sucker merged))
  2019-06-19 10:38                         ` Jan Kara
@ 2019-06-19 22:37                           ` Dave Chinner
  2019-07-03  0:04                             ` pagecache locking Boaz Harrosh
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-06-19 22:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner,
	Darrick J . Wong, Christoph Hellwig, Matthew Wilcox,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On Wed, Jun 19, 2019 at 12:38:38PM +0200, Jan Kara wrote:
> On Tue 18-06-19 07:21:56, Amir Goldstein wrote:
> > > > Right, but regardless of the spec we have to consider that the
> > > > behaviour of XFS comes from its Irix heritage (actually from EFS,
> > > > the predecessor of XFS from the late 1980s)
> > >
> > > Sure. And as I mentioned, I think it's technically the nicer guarantee.
> > >
> > > That said, it's a pretty *expensive* guarantee. It's one that you
> > > yourself are not willing to give for O_DIRECT IO.
> > >
> > > And it's not a guarantee that Linux has ever had. In fact, it's not
> > > even something I've ever seen anybody ever depend on.
> > >
> > > I agree that it's possible that some app out there might depend on
> > > that kind of guarantee, but I also suspect it's much much more likely
> > > that it's the other way around: XFS is being unnecessarily strict,
> > > because everybody is testing against filesystems that don't actually
> > > give the total atomicity guarantees.
> > >
> > > Nobody develops for other unixes any more (and nobody really ever did
> > > it by reading standards papers - even if they had been very explicit).
> > >
> > > And honestly, the only people who really do threaded accesses to the same file
> > >
> > >  (a) don't want that guarantee in the first place
> > >
> > >  (b) are likely to use direct-io that apparently doesn't give that
> > > atomicity guarantee even on xfs
> > >
> > > so I do think it's moot.
> > >
> > > End result: if we had a really cheap range lock, I think it would be a
> > > good idea to use it (for the whole QoI implementation), but for
> > > practical reasons it's likely better to just stick to the current lack
> > > of serialization because it performs better and nobody really seems to
> > > want anything else anyway.
> > >
> > 
> > This is the point in the conversation where somebody usually steps in
> > and says "let the user/distro decide". Distro maintainers are in a much
> > better position to take the risk of breaking hypothetical applications.
> > 
> > I should point out that even if "strict atomic rw" behavior is desired, then
> > page cache warmup [1] significantly improves performance.
> > Having mentioned that, the discussion can now return to what is the
> > preferred way to solve the punch hole vs. page cache add race.
> > 
> > XFS may end up with special tailored range locks, which bring some
> > other benefits to XFS, but all filesystems need the solution for the punch
> > hole vs. page cache add race.
> > Jan recently took a stab at it for ext4 [2], but that didn't work out.
> 
> Yes, but I have an idea how to fix it. I just need to push acquiring ext4's
> i_mmap_sem down a bit further so that only page cache filling is protected
> by it but not copying of data out to userspace. But I didn't get to coding
> it last week due to other stuff.
> 
> > So I wonder what everyone thinks about Kent's page add lock as the
> > solution to the problem.
> > Allegedly, all filesystems (XFS included) are potentially exposed to
> > stale data exposure/data corruption.
> 
> When we first realized that hole-punching vs. page-fault races are an issue, I
> was looking into a generic solution but at that time there was a sentiment
> against adding another rwsem to struct address_space (or inode) so we ended
> up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other
> filesystems these days. If people think that growing struct inode for
> everybody is OK, we can think about lifting private filesystem solutions
> into a generic one. I'm fine with that.

I'd prefer it doesn't get lifted to the VFS because I'm planning on
getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
likely to go away in the near term because a range lock can be
taken on either side of the mmap_sem in the page fault path.

> That being said, as Dave said, we also use those fs-private locks for
> serializing against equivalent issues arising with DAX. So the problem is
> not only about page cache but generally about doing IO and caching
> block mapping information for a file range. So the solution should not be
> too tied to page cache.

Yup, that was the point I was trying to make when Linus started
shouting at me about how caches work and how essential they are.  I
guess the fact that DAX doesn't use the page cache isn't as widely
known as I assumed it was...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking
  2019-06-19 22:37                           ` Dave Chinner
@ 2019-07-03  0:04                             ` Boaz Harrosh
       [not found]                               ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>
  2019-07-05 23:31                               ` Dave Chinner
  0 siblings, 2 replies; 63+ messages in thread
From: Boaz Harrosh @ 2019-07-03  0:04 UTC (permalink / raw)
  To: Dave Chinner, Jan Kara
  Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner,
	Darrick J . Wong, Christoph Hellwig, Matthew Wilcox,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On 20/06/2019 01:37, Dave Chinner wrote:
<>
> 
> I'd prefer it doesn't get lifted to the VFS because I'm planning on
> getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> likely to go away in the near term because a range lock can be
> taken on either side of the mmap_sem in the page fault path.
> 
<>
Sir Dave

Sorry if this was answered before; I am very curious. In the zufs
project I have an equivalent rw_MMAPLOCK that I _read_lock on page faults.
(Reads & writes all take read-locks ...)
The only reason I have it is because of lockdep, actually.

Specifically for those xfstests that mmap a buffer then direct_IO in/out
of that buffer from/to another file in the same FS or the same file.
(For lockdep it's the same case).
I would be perfectly happy to recursively _read_lock both from the top
of the DIO path and underneath, in the page fault. I'm _read_locking after
all. But lockdep is hard to convince. So I stole the xfs idea of having an
rw_MMAPLOCK, and grab yet another _write_lock at truncate/punch/clone time,
when all mapping traversal needs to stop for the destructive change to take
place. (Allocations are done another way and are race safe with traversal.)

How do you intend to address this problem with range locks, i.e. recursively
taking the same "lock"? Because if not for the recursion and lockdep, I would
not need the extra lock object per inode.

Thanks
Boaz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH] mm: Support madvise_willneed override by Filesystems
  2019-06-19  8:21           ` bcachefs status update (it's done cooking; let's get this sucker merged) Jan Kara
@ 2019-07-03  1:04             ` Boaz Harrosh
  2019-07-03 17:21               ` Jan Kara
  0 siblings, 1 reply; 63+ messages in thread
From: Boaz Harrosh @ 2019-07-03  1:04 UTC (permalink / raw)
  To: Jan Kara, Dave Chinner
  Cc: Kent Overstreet, Linus Torvalds, Dave Chinner, Waiman Long,
	Peter Zijlstra, Linux List Kernel Mailing, linux-fsdevel,
	linux-bcache, Darrick J . Wong, Zach Brown, Jens Axboe,
	Josef Bacik, Alexander Viro, Andrew Morton, Tejun Heo,
	Amir Goldstein

On 19/06/2019 11:21, Jan Kara wrote:
<>
> Yes, I have a patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> well. I'll post it soon since the rest of the series isn't really dependent
> on it.
> 
> 								Honza
> 

Hi Jan

Funny, I have been sitting on the same patch since last LSF. I need it too,
for other reasons. I have not seen it; have you pushed your patch yet?
(Mine is based on old v4.20)

~~~~~~~~~
From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
From: Boaz Harrosh <boazh@netapp.com>
Date: Thu, 16 May 2019 20:02:14 +0300
Subject: [PATCH] mm: Support madvise_willneed override by Filesystems

In the patchset:
	[b833a3660394] ovl: add ovl_fadvise()
	[3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
	[45cd0faae371] vfs: add the fadvise() file operation

Amir Goldstein introduced a way for filesystems to override fadvise.
madvise_willneed is exactly like fadvise_willneed, except that it always
returns 0.

In this patch we call the FS vector if it exists.

NOTE: I called vfs_fadvise(..,POSIX_FADV_WILLNEED);
      (Which is my artistic preference)

I could also selectively call
	if (file->f_op->fadvise)
		return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
If we fear theoretical side effects. I don't mind either way.

CC: Amir Goldstein <amir73il@gmail.com>
CC: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Boaz Harrosh <boazh@netapp.com>
---
 mm/madvise.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6cb1ca93e290..6b84ddcaaaf2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -24,6 +24,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
+#include <linux/fadvise.h>
 
 #include <asm/tlb.h>
 
@@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
 		end = vma->vm_end;
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	force_page_cache_readahead(file->f_mapping, file, start, end - start);
+	vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
+		    POSIX_FADV_WILLNEED);
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking
       [not found]                               ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>
@ 2019-07-03  1:25                                 ` Boaz Harrosh
  0 siblings, 0 replies; 63+ messages in thread
From: Boaz Harrosh @ 2019-07-03  1:25 UTC (permalink / raw)
  To: Patrick Farrell, Dave Chinner, Jan Kara
  Cc: Amir Goldstein, Linus Torvalds, Kent Overstreet, Dave Chinner,
	Darrick J . Wong, Christoph Hellwig, Matthew Wilcox,
	Linux List Kernel Mailing, linux-xfs, linux-fsdevel, Josef Bacik,
	Alexander Viro, Andrew Morton

On 03/07/2019 04:07, Patrick Farrell wrote:
> Recursive read locking is generally unsafe; that’s why lockdep
> complains about it.  The common RW lock primitives are queued in
> their implementation, meaning this recursive read-lock sequence:
> P1 - read (gets lock)
> P2 - write
> P1 - read
> 
> Results not in a successful read lock, but P1 blocking behind P2,
> which is blocked behind P1.  

> Readers are not allowed to jump past waiting writers.

OK, thanks, that makes sense. I did not know about that last part. It's a kind
of lock fairness I did not know we had.

So I guess I'll keep my two locks then. The write_locker is the SLOW
path for me anyway, right?

[While we are on the subject: do mutexes have the same lock fairness as
 above? Does the write_lock side of rw_sem have the same fairness? Something
 I never figured out]

Thanks
Boaz

> 
> - Patrick

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
  2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
                   ` (12 preceding siblings ...)
  2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
@ 2019-07-03  5:59 ` Stefan K
  13 siblings, 0 replies; 63+ messages in thread
From: Stefan K @ 2019-07-03  5:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcache, Dave Chinner,
	Darrick J . Wong ,
	Zach Brown, Peter Zijlstra, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Linus Torvalds, Tejun Heo

Hello,

is there a chance to get this into Kernel 5.3?
And thanks for this fs!


On Monday, June 10, 2019 9:14:08 PM CEST Kent Overstreet wrote:
> Last status update: https://lkml.org/lkml/2018/12/2/46
>
> Current status - I'm pretty much running out of things to polish and excuses to
> keep tinkering. The core featureset is _done_ and the list of known outstanding
> bugs is getting to be short and unexciting. The next big things on my todo list
> are finishing erasure coding and reflink, but there's no reason for merging to
> wait on those.
>
> So. Here's my bcachefs-for-review branch - this has the minimal set of patches
> outside of fs/bcachefs/. My master branch has some performance optimizations for
> the core buffered IO paths, but those are fairly tricky and invasive so I want
> to hold off on those for now - this branch is intended to be more or less
> suitable for merging as is.
>
> https://evilpiepirate.org/git/bcachefs.git/log/?h=bcachefs-for-review
>
> The list of non bcachefs patches is:
>
> closures: fix a race on wakeup from closure_sync
> closures: closure_wait_event()
> bcache: move closures to lib/
> bcache: optimize continue_at_nobarrier()
> block: Add some exports for bcachefs
> Propagate gfp_t when allocating pte entries from __vmalloc
> fs: factor out d_mark_tmpfile()
> fs: insert_inode_locked2()
> mm: export find_get_pages()
> mm: pagecache add lock
> locking: SIX locks (shared/intent/exclusive)
> Compiler Attributes: add __flatten
>
> Most of the patches are pretty small, of the ones that aren't:
>
>  - SIX locks have already been discussed, and seem to be pretty uncontroversial.
>
>  - pagecache add lock: it's kind of ugly, but necessary to rigorously prevent
>    page cache inconsistencies with dio and other operations, in particular
>    racing vs. page faults - honestly, it's criminal that we still don't have a
>    mechanism in the kernel to address this, other filesystems are susceptible to
>    these kinds of bugs too.
>
>    My patch is intentionally ugly in the hopes that someone else will come up
>    with a magical elegant solution, but in the meantime it's an "it's ugly but
>    it works" sort of thing, and I suspect in real world scenarios it's going to
>    beat any kind of range locking performance wise, which is the only
>    alternative I've heard discussed.
>
>  - Propagate gfp_t from __vmalloc() - bcachefs needs __vmalloc() to respect
>    GFP_NOFS, that's all that is.
>
>  - and, moving closures out of drivers/md/bcache to lib/.
>
> The rest of the tree is 62k lines of code in fs/bcachefs. So, I obviously won't
> be mailing out all of that as patches, but if any code reviewers have
> suggestions on what would make that go easier go ahead and speak up. The last
> time I was mailing things out for review the main thing that came up was ioctls,
> but the ioctl interface hasn't really changed since then. I'm pretty confident
> in the on disk format stuff, which was the other thing that was mentioned.
>
> ----------
>
> This has been a monumental effort over a lot of years, and I'm _really_ happy
> with how it's turned out. I'm excited to finally unleash this upon the world.
>




^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mm: Support madvise_willneed override by Filesystems
  2019-07-03  1:04             ` [PATCH] mm: Support madvise_willneed override by Filesystems Boaz Harrosh
@ 2019-07-03 17:21               ` Jan Kara
  2019-07-03 18:03                 ` Boaz Harrosh
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Kara @ 2019-07-03 17:21 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Jan Kara, Dave Chinner, Kent Overstreet, Linus Torvalds,
	Dave Chinner, Waiman Long, Peter Zijlstra,
	Linux List Kernel Mailing, linux-fsdevel, linux-bcache,
	Darrick J . Wong, Zach Brown, Jens Axboe, Josef Bacik,
	Alexander Viro, Andrew Morton, Tejun Heo, Amir Goldstein

On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
> On 19/06/2019 11:21, Jan Kara wrote:
> <>
> > Yes, I have a patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> > well. I'll post it soon since the rest of the series isn't really dependent
> > on it.
> > 
> > 								Honza
> > 
> 
> Hi Jan
> 
> Funny, I have been sitting on the same patch since last LSF. I need it too,
> for other reasons. I have not seen it; have you pushed your patch yet?
> (Mine is based on old v4.20)

Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
under mmap_sem, so we need to do a similar dance to madvise_remove(). I
have to get to writing at least the XFS fix so that the madvise change gets
used, and post the madvise patch with it... Sorry it is taking me so long.

								Honza
> 
> ~~~~~~~~~
> From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
> From: Boaz Harrosh <boazh@netapp.com>
> Date: Thu, 16 May 2019 20:02:14 +0300
> Subject: [PATCH] mm: Support madvise_willneed override by Filesystems
> 
> In the patchset:
> 	[b833a3660394] ovl: add ovl_fadvise()
> 	[3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
> 	[45cd0faae371] vfs: add the fadvise() file operation
> 
> Amir Goldstein introduced a way for filesystems to override fadvise.
> madvise_willneed is exactly like fadvise_willneed, except that it always
> returns 0.
> 
> In this patch we call the FS vector if it exists.
> 
> NOTE: I called vfs_fadvise(..,POSIX_FADV_WILLNEED);
>       (Which is my artistic preference)
> 
> I could also selectively call
> 	if (file->f_op->fadvise)
> 		return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
> If we fear theoretical side effects. I don't mind either way.
> 
> CC: Amir Goldstein <amir73il@gmail.com>
> CC: Miklos Szeredi <mszeredi@redhat.com>
> Signed-off-by: Boaz Harrosh <boazh@netapp.com>
> ---
>  mm/madvise.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6cb1ca93e290..6b84ddcaaaf2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -24,6 +24,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/mmu_notifier.h>
> +#include <linux/fadvise.h>
>  
>  #include <asm/tlb.h>
>  
> @@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
>  		end = vma->vm_end;
>  	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>  
> -	force_page_cache_readahead(file->f_mapping, file, start, end - start);
> +	vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
> +		    POSIX_FADV_WILLNEED);
>  	return 0;
>  }
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] mm: Support madvise_willneed override by Filesystems
  2019-07-03 17:21               ` Jan Kara
@ 2019-07-03 18:03                 ` Boaz Harrosh
  0 siblings, 0 replies; 63+ messages in thread
From: Boaz Harrosh @ 2019-07-03 18:03 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Kent Overstreet, Linus Torvalds, Dave Chinner,
	Waiman Long, Peter Zijlstra, Linux List Kernel Mailing,
	linux-fsdevel, linux-bcache, Darrick J . Wong, Zach Brown,
	Jens Axboe, Josef Bacik, Alexander Viro, Andrew Morton,
	Tejun Heo, Amir Goldstein

On 03/07/2019 20:21, Jan Kara wrote:
> On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
>> On 19/06/2019 11:21, Jan Kara wrote:
>> <>
<>
>> Hi Jan
>>
>> Funny I'm sitting on the same patch since LSF last. I need it too for other
>> reasons. I have not seen, have you pushed your patch yet?
>> (Is based on old v4.20)
> 
> Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
> under mmap_sem, so we need to do a similar dance to madvise_remove(). I
> have to get to writing at least the XFS fix so that the madvise change gets
> used, and post the madvise patch with it... Sorry it is taking me so long.
> 
> 								Honza

Ha, sorry, I was not aware of this. Lockdep did not catch it on my setup
because my setup does not have any locking conflicts with mmap_sem on the
WILL_NEED path.

But surely you are right, because the whole effort is to fix the locking problems.

I will also try in a day or two to do as you suggest, and look at madvise_remove()
once I have a bit of time. Whoever gets to be less busy first ...

Thank you for your help
Boaz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking
  2019-07-03  0:04                             ` pagecache locking Boaz Harrosh
       [not found]                               ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>
@ 2019-07-05 23:31                               ` Dave Chinner
  2019-07-07 15:05                                 ` Boaz Harrosh
  2019-07-08 13:31                                 ` Jan Kara
  1 sibling, 2 replies; 63+ messages in thread
From: Dave Chinner @ 2019-07-05 23:31 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet,
	Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Matthew Wilcox, Linux List Kernel Mailing, linux-xfs,
	linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote:
> On 20/06/2019 01:37, Dave Chinner wrote:
> <>
> > 
> > I'd prefer it doesn't get lifted to the VFS because I'm planning on
> > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> > likely to go away in the near term because a range lock can be
> > taken on either side of the mmap_sem in the page fault path.
> > 
> <>
> Sir Dave
> 
> Sorry if this was answered before. I am please very curious. In the zufs
> project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults.
> (Read & writes all take read-locks ...)
> The only reason I have it is because of lockdep actually.
> 
> Specifically for those xfstests that mmap a buffer then direct_IO in/out
> of that buffer from/to another file in the same FS or the same file.
> (For lockdep it's the same case).

Which can deadlock if the same inode rwsem is taken on both sides of
the mmap_sem, as lockdep tells you...

> I would be perfectly happy to recursively _read_lock both from the top
> of the page_fault at the DIO path, and under in the page_fault. I'm
> _read_locking after all. But lockdep is hard to convince. So I stole the
> xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at
> truncate/punch/clone time when all mapping traversal needs to stop for
> the destructive change to take place. (Allocations are done another way
> and are race safe with traversal)
> 
> How do you intend to address this problem with range-locks? ie recursively
> taking the same "lock"? because if not for the recursive-ity and lockdep I would
> not need the extra lock-object per inode.

As long as the IO ranges to the same file *don't overlap*, it should
be perfectly safe to take separate range locks (in read or write
mode) on either side of the mmap_sem as non-overlapping range locks
can be nested and will not self-deadlock.

The "recursive lock problem" still arises with DIO and page faults
inside gup, but it only occurs when the user buffer range overlaps
the DIO range to the same file. IOWs, the application is trying to
do something that has an undefined result and is likely to result in
data corruption. So, in that case I plan to have the gup page faults
fail and the DIO return -EDEADLOCK to userspace....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking
  2019-07-05 23:31                               ` Dave Chinner
@ 2019-07-07 15:05                                 ` Boaz Harrosh
  2019-07-07 23:55                                   ` Dave Chinner
  2019-07-08 13:31                                 ` Jan Kara
  1 sibling, 1 reply; 63+ messages in thread
From: Boaz Harrosh @ 2019-07-07 15:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet,
	Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Matthew Wilcox, Linux List Kernel Mailing, linux-xfs,
	linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On 06/07/2019 02:31, Dave Chinner wrote:

> 
> As long as the IO ranges to the same file *don't overlap*, it should
> be perfectly safe to take separate range locks (in read or write
> mode) on either side of the mmap_sem as non-overlapping range locks
> can be nested and will not self-deadlock.
> 
> The "recursive lock problem" still arises with DIO and page faults
> inside gup, but it only occurs when the user buffer range overlaps
> the DIO range to the same file. IOWs, the application is trying to
> do something that has an undefined result and is likely to result in
> data corruption. So, in that case I plan to have the gup page faults
> fail and the DIO return -EDEADLOCK to userspace....
> 

This sounds very cool. I now understand. I hope you put all the tools
for this in generic places so it will be easier to salvage.

One thing I will be very curious to see is how you teach lockdep
about the "range locks can be nested" thing. I know it's possible,
other places do it, but it's something I never understood.

> Cheers,
> Dave.

[ Ha one more question if you have time:

  In one of the mails, and you also mentioned it before, you said that
  the rw_read_lock is not able to scale well on mammoth machines with
  tens of cores (maybe you said over 20).
  I wonder why that happens. Is it because of the atomic operations,
  or something in the lock algorithm. In my theoretical understanding,
  as long as there are no write-lock-grabbers, why would the readers
  interfere with each other?
]

Thanks
Boaz

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking
  2019-07-07 15:05                                 ` Boaz Harrosh
@ 2019-07-07 23:55                                   ` Dave Chinner
  0 siblings, 0 replies; 63+ messages in thread
From: Dave Chinner @ 2019-07-07 23:55 UTC (permalink / raw)
  To: Boaz Harrosh
  Cc: Jan Kara, Amir Goldstein, Linus Torvalds, Kent Overstreet,
	Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Matthew Wilcox, Linux List Kernel Mailing, linux-xfs,
	linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Sun, Jul 07, 2019 at 06:05:16PM +0300, Boaz Harrosh wrote:
> On 06/07/2019 02:31, Dave Chinner wrote:
> 
> > 
> > As long as the IO ranges to the same file *don't overlap*, it should
> > be perfectly safe to take separate range locks (in read or write
> > mode) on either side of the mmap_sem as non-overlapping range locks
> > can be nested and will not self-deadlock.
> > 
> > The "recursive lock problem" still arises with DIO and page faults
> > inside gup, but it only occurs when the user buffer range overlaps
> > the DIO range to the same file. IOWs, the application is trying to
> > do something that has an undefined result and is likely to result in
> > data corruption. So, in that case I plan to have the gup page faults
> > fail and the DIO return -EDEADLOCK to userspace....
> > 
> 
> This sounds very cool. I now understand. I hope you put all the tools
> for this in generic places so it will be easier to salvage.

That's the plan, though I'm not really caring about anything outside
XFS for the moment.

> One thing I will be very curious to see is how you teach lockdep
> about the "range locks can be nested" thing. I know it's possible,
> other places do it, but it's something I never understood.

The issue with lockdep is not nested locks, it's that there is no
concept of ranges. e.g.  This is fine:

P0				P1
read_lock(A, 0, 1000)
				read_lock(B, 0, 1000)
write_lock(B, 1001, 2000)
				write_lock(A, 1001, 2000)

Because the read/write lock ranges on file A don't overlap and so
can be held concurrently, similarly the ranges on file B. i.e. This
lock pattern does not result in deadlock.

However, this very similar lock pattern is not fine:

P0				P1
read_lock(A, 0, 1000)
				read_lock(B, 0, 1000)
write_lock(B, 500, 1500)
				write_lock(A, 900, 1900)

i.e. it's an ABBA deadlock because the lock ranges partially
overlap.

IOWs, the problem with lockdep is not nesting read lock or nesting
write locks (because that's valid, too), the problem is that it
needs to be taught about ranges. Once it knows about ranges, nested
read/write locking contexts don't require any special support...

As it is, tracking overlapping lock ranges in lockdep will be
interesting, given that I've been taking several thousand
non-overlapping range locks concurrently on a single file in my
testing. Tracking this sort of usage without completely killing the
machine looking for conflicts and order violations likely makes
lockdep validation of range locks a non-starter....

> [ Ha one more question if you have time:
> 
>   In one of the mails, and you also mentioned it before, you said about
>   the rw_read_lock not being able to scale well on mammoth machines
>   over tens of cores (maybe you said over 20).
>   I wonder why that happens. Is it because of the atomic operations,
>   or something in the lock algorithm. In my theoretical understanding,
>   as long as there are no write-lock-grabbers, why would the readers
>   interfere with each other?

Concurrent shared read lock/unlock are still atomic counting
operations.  Hence they bounce exclusive cachelines from CPU to
CPU...
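The point about atomic counting can be seen in a few lines: even a pure
"shared" read lock is a read-modify-write on one shared counter, so every
reader dirties the same cacheline. A user-space sketch with C11 atomics
(an illustration only; the kernel rwsem fast path differs in detail):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch: reader count in one word; a negative value means
 * write-locked.  Every read_trylock/read_unlock is an atomic RMW on
 * this single location, which is what bounces the cacheline. */
static atomic_long readers;

static int read_trylock(void)
{
	long v = atomic_load(&readers);

	while (v >= 0) {
		/* CAS: the exclusive-ownership step readers contend on */
		if (atomic_compare_exchange_weak(&readers, &v, v + 1))
			return 1;
	}
	return 0;	/* write-locked */
}

static void read_unlock(void)
{
	atomic_fetch_sub(&readers, 1);	/* another RMW on the same line */
}
```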

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: pagecache locking
  2019-07-05 23:31                               ` Dave Chinner
  2019-07-07 15:05                                 ` Boaz Harrosh
@ 2019-07-08 13:31                                 ` Jan Kara
  2019-07-09 23:47                                   ` Dave Chinner
  1 sibling, 1 reply; 63+ messages in thread
From: Jan Kara @ 2019-07-08 13:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Boaz Harrosh, Jan Kara, Amir Goldstein, Linus Torvalds,
	Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing,
	linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro,
	Andrew Morton

On Sat 06-07-19 09:31:57, Dave Chinner wrote:
> On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote:
> > On 20/06/2019 01:37, Dave Chinner wrote:
> > <>
> > > 
> > > I'd prefer it doesn't get lifted to the VFS because I'm planning on
> > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> > > likely to go away in the near term because a range lock can be
> > > taken on either side of the mmap_sem in the page fault path.
> > > 
> > <>
> > Sir Dave
> > 
> > Sorry if this was answered before. Please bear with me, I am very
> > curious. In the zufs project I have an equivalent rw_MMAPLOCK that I
> > _read_lock on page_faults. (Reads and writes all take read-locks ...)
> > The only reason I have it is because of lockdep actually.
> > 
> > Specifically for those xfstests that mmap a buffer then direct_IO in/out
> > of that buffer from/to another file in the same FS or the same file.
> > (For lockdep it's the same case).
> 
> Which can deadlock if the same inode rwsem is taken on both sides of
> the mmap_sem, as lockdep tells you...
> 
> > I would be perfectly happy to recursively _read_lock both from the top
> > of the page_fault at the DIO path, and under in the page_fault. I'm
> > _read_locking after all. But lockdep is hard to convince. So I stole the
> > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at
> > truncate/punch/clone time when all mapping traversal needs to stop for
> > the destructive change to take place. (Allocations are done another way
> > and are race safe with traversal)
> > 
> > How do you intend to address this problem with range locks, i.e.
> > recursively taking the same "lock"? Because if not for the recursion
> > and lockdep I would not need the extra lock object per inode.
> 
> As long as the IO ranges to the same file *don't overlap*, it should
> be perfectly safe to take separate range locks (in read or write
> mode) on either side of the mmap_sem as non-overlapping range locks
> can be nested and will not self-deadlock.

I'd be really careful with nesting range locks. You can have nasty
situations like:

Thread 1		Thread 2
read_lock(0,1000)	
			write_lock(500,1500) -> blocks due to read lock
read_lock(1001,1500) -> blocks due to write lock (or you have to break
  fairness and then deal with starvation issues).

So once you allow nesting of range locks, you have to very carefully define
what is and what is not allowed. That's why, in my range lock
implementation ages back, I decided to treat the range lock as a rwsem
for deadlock verification purposes.
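The self-deadlock above falls out of the fairness policy itself: if a
queued writer blocks new overlapping readers, Thread 1's second
read_lock can never be granted. A toy model of that policy (hypothetical
names, purely illustrative, tracking a single queued writer):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a "fair" range lock: one queued writer is enough to
 * refuse any new reader whose range overlaps the writer's range. */
struct waiter {
	unsigned long s, e;
	bool active;
};
static struct waiter pending_writer;

static bool overlap(unsigned long s1, unsigned long e1,
		    unsigned long s2, unsigned long e2)
{
	return s1 <= e2 && s2 <= e1;
}

/* a new read lock is granted only if no queued writer overlaps it */
static bool fair_read_trylock(unsigned long s, unsigned long e)
{
	return !(pending_writer.active &&
		 overlap(s, e, pending_writer.s, pending_writer.e));
}
```

Once Thread 2's write_lock(500,1500) is queued, Thread 1's
read_lock(1001,1500) is refused, and neither thread can make progress.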

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: pagecache locking
  2019-07-08 13:31                                 ` Jan Kara
@ 2019-07-09 23:47                                   ` Dave Chinner
  2019-07-10  8:41                                     ` Jan Kara
  0 siblings, 1 reply; 63+ messages in thread
From: Dave Chinner @ 2019-07-09 23:47 UTC (permalink / raw)
  To: Jan Kara
  Cc: Boaz Harrosh, Amir Goldstein, Linus Torvalds, Kent Overstreet,
	Dave Chinner, Darrick J . Wong, Christoph Hellwig,
	Matthew Wilcox, Linux List Kernel Mailing, linux-xfs,
	linux-fsdevel, Josef Bacik, Alexander Viro, Andrew Morton

On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote:
> On Sat 06-07-19 09:31:57, Dave Chinner wrote:
> > On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote:
> > > On 20/06/2019 01:37, Dave Chinner wrote:
> > > <>
> > > > 
> > > > I'd prefer it doesn't get lifted to the VFS because I'm planning on
> > > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> > > > likely to go away in the near term because a range lock can be
> > > > taken on either side of the mmap_sem in the page fault path.
> > > > 
> > > <>
> > > Sir Dave
> > > 
> > > Sorry if this was answered before. Please bear with me, I am very
> > > curious. In the zufs project I have an equivalent rw_MMAPLOCK that I
> > > _read_lock on page_faults. (Reads and writes all take read-locks ...)
> > > The only reason I have it is because of lockdep actually.
> > > 
> > > Specifically for those xfstests that mmap a buffer then direct_IO in/out
> > > of that buffer from/to another file in the same FS or the same file.
> > > (For lockdep it's the same case).
> > 
> > Which can deadlock if the same inode rwsem is taken on both sides of
> > the mmap_sem, as lockdep tells you...
> > 
> > > I would be perfectly happy to recursively _read_lock both from the top
> > > of the page_fault at the DIO path, and under in the page_fault. I'm
> > > _read_locking after all. But lockdep is hard to convince. So I stole the
> > > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at
> > > truncate/punch/clone time when all mapping traversal needs to stop for
> > > the destructive change to take place. (Allocations are done another way
> > > and are race safe with traversal)
> > > 
> > > How do you intend to address this problem with range locks, i.e.
> > > recursively taking the same "lock"? Because if not for the recursion
> > > and lockdep I would not need the extra lock object per inode.
> > 
> > As long as the IO ranges to the same file *don't overlap*, it should
> > be perfectly safe to take separate range locks (in read or write
> > mode) on either side of the mmap_sem as non-overlapping range locks
> > can be nested and will not self-deadlock.
> 
> I'd be really careful with nesting range locks. You can have nasty
> situations like:
> 
> Thread 1		Thread 2
> read_lock(0,1000)	
> 			write_lock(500,1500) -> blocks due to read lock
> read_lock(1001,1500) -> blocks due to write lock (or you have to break
>   fairness and then deal with starvation issues).
>
> So once you allow nesting of range locks, you have to very carefully define
> what is and what is not allowed.

Yes. I do understand the problem with rwsem read nesting and how
that can translate to range locks.

That's why my range locks don't even try to block on other pending
waiters. The case where read nesting vs write might occur like above
is something like copy_file_range() vs truncate, but otherwise for
IO locks we simply don't have arbitrarily deep nesting of range
locks.

i.e. for your example my range lock would result in:

Thread 1		Thread 2
read_lock(0,1000)	
			write_lock(500,1500)
			<finds conflicting read lock>
			<marks read lock as having a write waiter>
			<parks on range lock wait list>
<...>
read_lock_nested(1001,1500)
<no overlapping range in tree>
<gets nested range lock>

<....>
read_unlock(1001,1500)	<stays blocked because nothing is waiting
		         on (1001,1500) so no wakeup>
<....>
read_unlock(0,1000)
<sees write waiter flag, runs wakeup>
			<gets woken>
			<retries write lock>
			<write lock gained>

IOWs, I'm not trying to complicate the range lock implementation
with complex stuff like waiter fairness or anti-starvation semantics
at this point in time. Waiters simply don't impact whether a new lock
can be gained or not, and hence the limitations of rwsem semantics
don't apply.

If such functionality is necessary (and I suspect it will be to
prevent AIO from delaying truncate and fallocate-like operations
indefinitely) then I'll add a "barrier" lock type (e.g.
range_lock_write_barrier()) that will block new range lock attempts
across its span.

However, because this can cause deadlocks like the above, a barrier
lock will not block new *_lock_nested() or *_trylock() calls, hence
allowing runs of nested range locking to complete and drain without
deadlocking on a conflicting barrier range. And that still can't be
modelled by existing rwsem semantics....
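The key property described above, that granting a lock consults only
current *holders* and never queued waiters, can be sketched as follows
(a toy model with hypothetical names, not the actual XFS range lock):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: a fixed table of currently *held* ranges.  Waiters are
 * not tracked here at all, so they cannot affect whether a new lock
 * is granted - the property that lets read_lock_nested(1001,1500)
 * succeed while write_lock(500,1500) is parked. */
#define MAX_HELD 8

struct held {
	unsigned long s, e;
	bool write;
	bool used;
};
static struct held held[MAX_HELD];

static bool overlap(unsigned long s1, unsigned long e1,
		    unsigned long s2, unsigned long e2)
{
	return s1 <= e2 && s2 <= e1;
}

/* returns true and records the lock if no conflicting holder exists;
 * false means the caller must park on the wait list */
static bool try_grant(unsigned long s, unsigned long e, bool write)
{
	int i;

	for (i = 0; i < MAX_HELD; i++)
		if (held[i].used && (write || held[i].write) &&
		    overlap(s, e, held[i].s, held[i].e))
			return false;
	for (i = 0; i < MAX_HELD; i++)
		if (!held[i].used) {
			held[i] = (struct held){ s, e, write, true };
			return true;
		}
	return false;	/* table full (toy-model limitation) */
}
```

Replaying the example: the write lock finds a conflicting holder and
parks, but the nested read lock sees no overlapping *held* range and is
granted immediately.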

> That's why, in my range lock implementation ages back, I decided to
> treat the range lock as a rwsem for deadlock verification purposes.

IMO, treating a range lock as a rwsem for deadlock purposes defeats
the purpose of adding range locks in the first place. The
concurrency models are completely different, and some of the
limitations on rwsems are a result of implementation choices rather
than limitations of a rwsem construct.

In reality I couldn't care less about what lockdep can or can't
verify. I've always said lockdep is a crutch for people who don't
understand locks and the concurrency model of the code they
maintain. That obviously extends to the fact that lockdep
verification limitations should not limit what we allow new locking
primitives to do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: pagecache locking
  2019-07-09 23:47                                   ` Dave Chinner
@ 2019-07-10  8:41                                     ` Jan Kara
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Kara @ 2019-07-10  8:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jan Kara, Boaz Harrosh, Amir Goldstein, Linus Torvalds,
	Kent Overstreet, Dave Chinner, Darrick J . Wong,
	Christoph Hellwig, Matthew Wilcox, Linux List Kernel Mailing,
	linux-xfs, linux-fsdevel, Josef Bacik, Alexander Viro,
	Andrew Morton

On Wed 10-07-19 09:47:12, Dave Chinner wrote:
> On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote:
> > I'd be really careful with nesting range locks. You can have nasty
> > situations like:
> > 
> > Thread 1		Thread 2
> > read_lock(0,1000)	
> > 			write_lock(500,1500) -> blocks due to read lock
> > read_lock(1001,1500) -> blocks due to write lock (or you have to break
> >   fairness and then deal with starvation issues).
> >
> > So once you allow nesting of range locks, you have to very carefully define
> > what is and what is not allowed.
> 
> Yes. I do understand the problem with rwsem read nesting and how
> that can translate to range locks.
> 
> That's why my range locks don't even try to block on other pending
> waiters. The case where read nesting vs write might occur like above
> is something like copy_file_range() vs truncate, but otherwise for
> IO locks we simply don't have arbitrarily deep nesting of range
> locks.
> 
> i.e. for your example my range lock would result in:
> 
> Thread 1		Thread 2
> read_lock(0,1000)	
> 			write_lock(500,1500)
> 			<finds conflicting read lock>
> 			<marks read lock as having a write waiter>
> 			<parks on range lock wait list>
> <...>
> read_lock_nested(1001,1500)
> <no overlapping range in tree>
> <gets nested range lock>
> 
> <....>
> read_unlock(1001,1500)	<stays blocked because nothing is waiting
> 		         on (1001,1500) so no wakeup>
> <....>
> read_unlock(0,1000)
> <sees write waiter flag, runs wakeup>
> 			<gets woken>
> 			<retries write lock>
> 			<write lock gained>
> 
> IOWs, I'm not trying to complicate the range lock implementation
> with complex stuff like waiter fairness or anti-starvation semantics
> at this point in time. Waiters simply don't impact whether a new lock
> can be gained or not, and hence the limitations of rwsem semantics
> don't apply.
> 
> If such functionality is necessary (and I suspect it will be to
> prevent AIO from delaying truncate and fallocate-like operations
> indefinitely) then I'll add a "barrier" lock type (e.g.
> range_lock_write_barrier()) that will block new range lock attempts
> across its span.
> 
> However, because this can cause deadlocks like the above, a barrier
> lock will not block new *_lock_nested() or *_trylock() calls, hence
> allowing runs of nested range locking to complete and drain without
> deadlocking on a conflicting barrier range. And that still can't be
> modelled by existing rwsem semantics....

Clever :). Thanks for the explanation.

> > That's why, in my range lock implementation ages back, I decided to
> > treat the range lock as a rwsem for deadlock verification purposes.
> 
> IMO, treating a range lock as a rwsem for deadlock purposes defeats
> the purpose of adding range locks in the first place. The
> concurrency models are completely different, and some of the
> limitations on rwsems are a result of implementation choices rather
> than limitations of a rwsem construct.

Well, even a range lock that cannot nest allows concurrent
non-overlapping reads and writes to the same file, which a rwsem
doesn't allow. But I agree that your nesting variant offers more
(though at the cost of more complex lock semantics).

> In reality I couldn't care less about what lockdep can or can't
> verify. I've always said lockdep is a crutch for people who don't
> understand locks and the concurrency model of the code they
> maintain. That obviously extends to the fact that lockdep
> verification limitations should not limit what we allow new locking
> primitives to do.

I didn't say we have to constrain new locking primitives to what lockdep
can support. It is just a locking correctness verification tool, so
naturally lockdep should be taught what it needs to know about any
locking scheme we come up with. And sometimes that is just too much
effort, which is why e.g. the page lock still doesn't have lockdep
coverage.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
@ 2019-07-16 10:47   ` Coly Li
  2019-07-18  7:46     ` Coly Li
  0 siblings, 1 reply; 63+ messages in thread
From: Coly Li @ 2019-07-16 10:47 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

Hi Kent,

On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Acked-by: Coly Li <colyli@suse.de>

Also, I received a report of a suspicious closure race condition in
bcache, and people are asking for this patch to go into Linux v5.3.

So before this patch gets merged upstream, I plan to rebase it onto
drivers/md/bcache/closure.c for now. You remain the author, of course.

When lib/closure.c is merged upstream, I will rebase all closure usage
in bcache to use lib/closure.{c,h}.

Thanks in advance.

Coly Li

> ---
>  lib/closure.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/closure.c b/lib/closure.c
> index 46cfe4c382..3e6366c262 100644
> --- a/lib/closure.c
> +++ b/lib/closure.c
> @@ -104,8 +104,14 @@ struct closure_syncer {
>  
>  static void closure_sync_fn(struct closure *cl)
>  {
> -	cl->s->done = 1;
> -	wake_up_process(cl->s->task);
> +	struct closure_syncer *s = cl->s;
> +	struct task_struct *p;
> +
> +	rcu_read_lock();
> +	p = READ_ONCE(s->task);
> +	s->done = 1;
> +	wake_up_process(p);
> +	rcu_read_unlock();
>  }
>  
>  void __sched __closure_sync(struct closure *cl)
> 


-- 

Coly Li


* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-07-16 10:47   ` Coly Li
@ 2019-07-18  7:46     ` Coly Li
  2019-07-22 17:22       ` Kent Overstreet
  0 siblings, 1 reply; 63+ messages in thread
From: Coly Li @ 2019-07-18  7:46 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On 2019/7/16 6:47 PM, Coly Li wrote:
> Hi Kent,
> 
> On 2019/6/11 3:14 AM, Kent Overstreet wrote:
>> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> Acked-by: Coly Li <colyli@suse.de>
> 
> Also, I received a report of a suspicious closure race condition in
> bcache, and people are asking for this patch to go into Linux v5.3.
> 
> So before this patch gets merged upstream, I plan to rebase it onto
> drivers/md/bcache/closure.c for now. You remain the author, of course.
> 
> When lib/closure.c is merged upstream, I will rebase all closure usage
> in bcache to use lib/closure.{c,h}.

Hi Kent,

The race bug reporter tells me the closure race is very rare to
reproduce; after applying the patch and testing, they are not sure
whether their closure race problem is fixed or not.

Also, I notice rcu_read_lock()/rcu_read_unlock() is used here, but it is
not clear to me what the RCU read lock accomplishes in
closure_sync_fn(). I believe you have a reason to use RCU here; could
you please give some hints to help me understand the change better?

Thanks in advance.

Coly Li

>> ---
>>  lib/closure.c | 10 ++++++++--
>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/closure.c b/lib/closure.c
>> index 46cfe4c382..3e6366c262 100644
>> --- a/lib/closure.c
>> +++ b/lib/closure.c
>> @@ -104,8 +104,14 @@ struct closure_syncer {
>>  
>>  static void closure_sync_fn(struct closure *cl)
>>  {
>> -	cl->s->done = 1;
>> -	wake_up_process(cl->s->task);
>> +	struct closure_syncer *s = cl->s;
>> +	struct task_struct *p;
>> +
>> +	rcu_read_lock();
>> +	p = READ_ONCE(s->task);
>> +	s->done = 1;
>> +	wake_up_process(p);
>> +	rcu_read_unlock();
>>  }
>>  
>>  void __sched __closure_sync(struct closure *cl)


* Re: [PATCH 12/12] closures: fix a race on wakeup from closure_sync
  2019-07-18  7:46     ` Coly Li
@ 2019-07-22 17:22       ` Kent Overstreet
  0 siblings, 0 replies; 63+ messages in thread
From: Kent Overstreet @ 2019-07-22 17:22 UTC (permalink / raw)
  To: Coly Li; +Cc: linux-kernel, linux-fsdevel, linux-bcache

On Thu, Jul 18, 2019 at 03:46:46PM +0800, Coly Li wrote:
> On 2019/7/16 6:47 PM, Coly Li wrote:
> > Hi Kent,
> > 
> > On 2019/6/11 3:14 AM, Kent Overstreet wrote:
> >> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
> > Acked-by: Coly Li <colyli@suse.de>
> > 
> > Also, I received a report of a suspicious closure race condition in
> > bcache, and people are asking for this patch to go into Linux v5.3.
> > 
> > So before this patch gets merged upstream, I plan to rebase it onto
> > drivers/md/bcache/closure.c for now. You remain the author, of course.
> > 
> > When lib/closure.c is merged upstream, I will rebase all closure usage
> > in bcache to use lib/closure.{c,h}.
> 
> Hi Kent,
> 
> The race bug reporter tells me the closure race is very rare to
> reproduce; after applying the patch and testing, they are not sure
> whether their closure race problem is fixed or not.
> 
> Also, I notice rcu_read_lock()/rcu_read_unlock() is used here, but it is
> not clear to me what the RCU read lock accomplishes in
> closure_sync_fn(). I believe you have a reason to use RCU here; could
> you please give some hints to help me understand the change better?

The race was when a thread using closure_sync() notices cl->s->done == 1 before
the thread calling closure_put() calls wake_up_process(). Then, it's possible
for that thread to return and exit just before wake_up_process() is called - so
we're trying to wake up a process that no longer exists.

rcu_read_lock() is sufficient to protect against this, as there's an rcu barrier
somewhere in the process teardown path.
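A simplified, single-threaded stand-in for the fixed ordering
(hypothetical types and a fake wake_up(); the real code uses struct
closure_syncer, READ_ONCE() and rcu_read_lock()):

```c
#include <assert.h>

/* Sketch of the fix: the waiter may observe done == 1 and free the
 * syncer the instant it is set, so the waker must not touch *s after
 * that store.  Reading the task identity first (and, in the kernel,
 * holding rcu_read_lock() so the task struct itself cannot be freed)
 * closes the window. */
struct syncer {
	int done;
	int task_id;	/* stand-in for the task_struct pointer */
};

static int last_woken = -1;

static void wake_up(int task_id)	/* stand-in for wake_up_process() */
{
	last_woken = task_id;
}

static void closure_sync_fn_fixed(struct syncer *s)
{
	/* rcu_read_lock() would go here in the kernel */
	int task = s->task_id;	/* READ_ONCE(s->task): last touch of *s */

	s->done = 1;		/* after this store, *s may be freed */
	wake_up(task);		/* safe: uses only the saved value */
	/* rcu_read_unlock() */
}
```

The buggy version dereferenced cl->s again after setting done, which is
exactly the use-after-free window described above.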


* Re: bcachefs status update (it's done cooking; let's get this sucker merged)
@ 2019-06-29 16:39 Luke Kenneth Casson Leighton
  0 siblings, 0 replies; 63+ messages in thread
From: Luke Kenneth Casson Leighton @ 2019-06-29 16:39 UTC (permalink / raw)
  To: torvalds
  Cc: akpm, axboe, darrick.wong, david, dchinner, josef,
	kent.overstreet, linux-bcache, linux-fsdevel, linux-kernel,
	peterz, tj, viro, zach.brown

hey linus, you made news again, all blown up and pointless again.
you're doing great: you're being honest. remember the offer i made to
put you in touch with my friend.

anecdotal story: andrew tridgell worked on the fujitsu sparc
supercomputer a couple decades ago: it had a really weird DMA ring
bus.

* memory-to-memory copy (in the same core) was 10mbytes/sec
* DMA memory-to-memory copy (in the same core) was 20mbytes/sec
* memory-memory copy (across the ring bus i.e. to another machine) was
100mbytes/sec
* DMA memory-memory copy (across the ring bus) was *200* mbytes/sec.

when andrew tried asking people, "hey everyone, we need a filesystem
that can work really well on this fast parallel system", he had to
continuously fend off "i got a great idea for in-core memory-to-memory
cacheing!!!!" suggestions, because they *just would never work*.

the point being: caches aren't always "fast".

/salutes.

l.


Thread overview: 63+ messages
2019-06-10 19:14 bcachefs status update (it's done cooking; let's get this sucker merged) Kent Overstreet
2019-06-10 19:14 ` [PATCH 01/12] Compiler Attributes: add __flatten Kent Overstreet
2019-06-12 17:16   ` Greg KH
2019-06-10 19:14 ` [PATCH 02/12] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
2019-06-10 19:14 ` [PATCH 03/12] mm: pagecache add lock Kent Overstreet
2019-06-10 19:14 ` [PATCH 04/12] mm: export find_get_pages() Kent Overstreet
2019-06-10 19:14 ` [PATCH 05/12] fs: insert_inode_locked2() Kent Overstreet
2019-06-10 19:14 ` [PATCH 06/12] fs: factor out d_mark_tmpfile() Kent Overstreet
2019-06-10 19:14 ` [PATCH 07/12] Propagate gfp_t when allocating pte entries from __vmalloc Kent Overstreet
2019-06-10 19:14 ` [PATCH 08/12] block: Add some exports for bcachefs Kent Overstreet
2019-06-10 19:14 ` [PATCH 09/12] bcache: optimize continue_at_nobarrier() Kent Overstreet
2019-06-10 19:14 ` [PATCH 10/12] bcache: move closures to lib/ Kent Overstreet
2019-06-11 10:25   ` Coly Li
2019-06-13  7:28   ` Christoph Hellwig
2019-06-13 11:04     ` Kent Overstreet
2019-06-10 19:14 ` [PATCH 11/12] closures: closure_wait_event() Kent Overstreet
2019-06-11 10:25   ` Coly Li
2019-06-12 17:17   ` Greg KH
2019-06-10 19:14 ` [PATCH 12/12] closures: fix a race on wakeup from closure_sync Kent Overstreet
2019-07-16 10:47   ` Coly Li
2019-07-18  7:46     ` Coly Li
2019-07-22 17:22       ` Kent Overstreet
2019-06-10 20:46 ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
2019-06-11  1:17   ` Kent Overstreet
2019-06-11  4:33     ` Dave Chinner
2019-06-12 16:21       ` Kent Overstreet
2019-06-12 23:02         ` Dave Chinner
2019-06-13 18:36           ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet
2019-06-13 21:13             ` Andreas Dilger
2019-06-13 21:21               ` Kent Overstreet
2019-06-14  0:35                 ` Dave Chinner
2019-06-13 23:55             ` Dave Chinner
2019-06-14  2:30               ` Linus Torvalds
2019-06-14  7:30                 ` Dave Chinner
2019-06-15  1:15                   ` Linus Torvalds
2019-06-14  3:08               ` Linus Torvalds
2019-06-15  4:01                 ` Linus Torvalds
2019-06-17 22:47                   ` Dave Chinner
2019-06-17 23:38                     ` Linus Torvalds
2019-06-18  4:21                       ` Amir Goldstein
2019-06-19 10:38                         ` Jan Kara
2019-06-19 22:37                           ` Dave Chinner
2019-07-03  0:04                             ` pagecache locking Boaz Harrosh
     [not found]                               ` <DM6PR19MB250857CB8A3A1C8279D6F2F3C5FB0@DM6PR19MB2508.namprd19.prod.outlook.com>
2019-07-03  1:25                                 ` Boaz Harrosh
2019-07-05 23:31                               ` Dave Chinner
2019-07-07 15:05                                 ` Boaz Harrosh
2019-07-07 23:55                                   ` Dave Chinner
2019-07-08 13:31                                 ` Jan Kara
2019-07-09 23:47                                   ` Dave Chinner
2019-07-10  8:41                                     ` Jan Kara
2019-06-14 17:08               ` pagecache locking (was: bcachefs status update) merged) Kent Overstreet
2019-06-19  8:21           ` bcachefs status update (it's done cooking; let's get this sucker merged) Jan Kara
2019-07-03  1:04             ` [PATCH] mm: Support madvise_willneed override by Filesystems Boaz Harrosh
2019-07-03 17:21               ` Jan Kara
2019-07-03 18:03                 ` Boaz Harrosh
2019-06-11  4:55     ` bcachefs status update (it's done cooking; let's get this sucker merged) Linus Torvalds
2019-06-11 14:26       ` Matthew Wilcox
2019-06-11  4:10   ` Dave Chinner
2019-06-11  4:39     ` Linus Torvalds
2019-06-11  7:10       ` Dave Chinner
2019-06-12  2:07         ` Linus Torvalds
2019-07-03  5:59 ` Stefan K
2019-06-29 16:39 Luke Kenneth Casson Leighton
