* RFC [PATCH 0/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
@ 2015-11-10 18:44 Andrea Arcangeli
  2015-11-10 18:44 ` [PATCH 1/1] " Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2015-11-10 18:44 UTC (permalink / raw)
  To: Hugh Dickins, Davidlohr Bueso
  Cc: linux-mm, Petr Holasek, Andrew Morton, Arjan van de Ven

Hello,

This patch solves KSM computational complexity issues in the rmap walk
that created stalls during enterprise usage on large systems with lots
of RAM and CPUs. It is incremental to the patches posted earlier and
should apply cleanly to upstream.

Special review should be done on the KSM page migration code with
merge_across_nodes == 0. I tested merge_across_nodes == 0, but not very
hard. The old code in this area in fact looked flaky in case two KSM
pages of equal content ended up being migrated to the same node. With
the new code they instead end up in the same "chain" in the right
stable_tree, or the page under migration is merged into an existing KSM
page of the right node if page_mapcount allows for it.

At the moment the chains aren't defragmented, but they could be. It'd
be enough to refile a couple of rmap_items for each prune of the
stable_node_chain, from a "dup" with the lowest rmap_hlist_len to the
dup with the highest (not yet full) rmap_hlist_len. The rmap_hlist_len
of the "dup" that got its rmap_item removed would drop to zero, it
would be garbage collected at the next prune pass, and the
pages_sharing/pages_shared ratio would increase up to the peak possible
given the current max_page_sharing. However the "chain" prune logic
already tries to compact the "dups" into as few stable_nodes as
possible. Overall, if any defragmentation logic for the stable_node
"chains" really turns out to be a good idea, it's better done in an
incremental patch as it's an orthogonal problem.
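
To illustrate the refile idea just described, here is a toy,
standalone sketch. This is deliberately not mm/ksm.c code: the struct
and function names are made up and only the rmap_hlist_len counter of
each "dup" is modeled.

/* Toy sketch of the defragmentation idea above (not mm/ksm.c code). */
#include <stdio.h>

#define TOY_MAX_PAGE_SHARING 256	/* stands in for max_page_sharing */

struct toy_dup {
	int rmap_hlist_len;		/* rmap_items held by this "dup" */
};

/*
 * One prune step: move one item from the emptiest dup to the fullest
 * not-yet-full dup, so the emptied dup can be garbage collected later.
 */
static void toy_defrag_step(struct toy_dup *dups, int n)
{
	int i, src = -1, dst = -1;

	for (i = 0; i < n; i++) {
		if (!dups[i].rmap_hlist_len)
			continue;	/* already empty, nothing to refile */
		if (src < 0 ||
		    dups[i].rmap_hlist_len < dups[src].rmap_hlist_len)
			src = i;
		if (dups[i].rmap_hlist_len < TOY_MAX_PAGE_SHARING &&
		    (dst < 0 ||
		     dups[i].rmap_hlist_len > dups[dst].rmap_hlist_len))
			dst = i;
	}
	if (src < 0 || dst < 0 || src == dst)
		return;

	dups[src].rmap_hlist_len--;	/* refile one "rmap_item"... */
	dups[dst].rmap_hlist_len++;	/* ...into the fuller dup */
}

int main(void)
{
	struct toy_dup dups[] = { { 3 }, { 250 }, { 1 } };

	toy_defrag_step(dups, 3);
	/* prints "3 251 0": the emptiest dup is drained first */
	printf("%d %d %d\n", dups[0].rmap_hlist_len,
	       dups[1].rmap_hlist_len, dups[2].rmap_hlist_len);
	return 0;
}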

Changing max_page_sharing without first doing "echo 2
>/sys/kernel/mm/ksm/run" (which gets rid of the entire stable rbtree)
would also be possible, but I didn't think it was worth it, as certain
asserts would then no longer be enforceable.
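
With this patch applied, raising the limit therefore takes a sequence
along these lines (a hedged example, not text from the patch; the
large value is arbitrary):

	echo 2 >/sys/kernel/mm/ksm/run       # unmerge and drop the stable rbtree
	echo 1000000 >/sys/kernel/mm/ksm/max_page_sharing
	echo 1 >/sys/kernel/mm/ksm/run       # restart ksmd with the new limit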

The code has perhaps too many asserts, though they are mostly
VM_BUG_ONs (the few BUG_ONs are there because, in those few cases, if
the assert triggered and ksmd didn't stop there, it would corrupt
memory randomly). I can remove the VM_BUG_ONs when this gets more
testing and gets out of RFC. Also note there's no VM_WARN_ON
available.

Some testsuites that pretend to know the internals of KSM and predict
exact pages_sharing/pages_shared values may give false positives with
this patch applied, but it's enough to set max_page_sharing to a very
large value in order to pass the old tests. Ideally those testsuites
should learn about the max_page_sharing limit and predict the
pages_sharing/pages_shared results of the new code in order to
validate it.

Comments welcome, thanks,
Andrea

Andrea Arcangeli (1):
  ksm: introduce ksm_max_page_sharing per page deduplication limit

 Documentation/vm/ksm.txt |  63 ++++
 mm/ksm.c                 | 731 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 728 insertions(+), 66 deletions(-)


* [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-11-10 18:44 RFC [PATCH 0/1] ksm: introduce ksm_max_page_sharing per page deduplication limit Andrea Arcangeli
@ 2015-11-10 18:44 ` Andrea Arcangeli
  2015-12-09 16:19   ` Petr Holasek
                     ` (3 more replies)
  0 siblings, 4 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2015-11-10 18:44 UTC (permalink / raw)
  To: Hugh Dickins, Davidlohr Bueso
  Cc: linux-mm, Petr Holasek, Andrew Morton, Arjan van de Ven

Without a max deduplication limit for each KSM page, the list of the
rmap_items associated with each stable_node can grow arbitrarily
large.

During the rmap walk each entry can take up to ~10usec to process
because of IPIs for the TLB flushing (both for the primary MMU and the
secondary MMUs with the MMU notifier). With only 16GB of address space
shared in the same KSM page, that would amount to dozens of seconds of
kernel runtime.

A ~256 max deduplication factor will reduce the latencies of the rmap
walks on KSM pages to the order of a few msec. Just doing the
cond_resched() during the rmap walks is not enough; the list size must
have a limit too, otherwise the caller could get blocked in (schedule
friendly) kernel computations for seconds, unexpectedly.
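
For reference, the rough arithmetic behind those figures, assuming 4KB
pages and the ~10usec per-entry cost quoted above:

	16GB / 4KB         =  4,194,304 rmap_items on one stable_node
	4,194,304 * 10usec ~= 42 seconds per rmap walk (unlimited sharing)
	256 * 10usec       ~=  2.5 msec per rmap walk (with the limit)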

There's room for optimization to significantly reduce the IPI delivery
cost during page_referenced(), but at least for page migration in the
KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
it may be inevitable to send lots of IPIs if each rmap_item->mm is
active on a different CPU and there are lots of CPUs. Even if we
ignore the IPI delivery cost, we still have to walk the whole KSM rmap
list, so we can't allow an unlimited number (millions or billions) of
entries in the KSM stable_node rmap_item lists.

The limit is enforced efficiently by adding a second dimension to the
stable rbtree. So there are three types of stable_nodes: the regular
ones (identical to before, living in the first flat dimension of the
stable rbtree), the "chains" and the "dups".

Every "chain" and all "dups" linked into a "chain" enforce the
invariant that they represent the same write protected memory content,
even if each "dup" will be pointed by a different KSM page copy of
that content. This way the stable rbtree lookup computational
complexity is unaffected if compared to an unlimited
max_sharing_limit. It is still enforced that there cannot be KSM page
content duplicates in the stable rbtree itself.
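
Schematically, the resulting layout looks roughly like this (a
simplified sketch, not the exact in-memory representation):

	stable rbtree (first dimension, one node per unique page content)
	 |
	 +- stable_node (regular)    - one KSM page, one hlist of rmap_items
	 |
	 +- stable_node "chain"      - no KSM page of its own
	      hlist: "dup" -> "dup"  - second dimension: each "dup" has its
	                               own KSM page and rmap_items, and all
	                               dups of one chain hold identical content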

Adding the second dimension to the stable rbtree only after the
max_page_sharing limit is hit provides for a zero memory footprint
increase on 64bit archs. The memory overhead of the per-KSM page
stable_node and per virtual mapping rmap_item is unchanged. Only after
the max_page_sharing limit is hit do we need to allocate a stable_node
"chain" and rb_replace_node() the "regular" stable_node with the newly
allocated stable_node "chain". After that we simply add the "regular"
stable_node to the chain as a stable_node "dup" by linking its
hlist_dup into the stable_node_chain->hlist. This way the "regular"
(flat) stable_node is converted to a stable_node "dup" living in the
second dimension of the stable rbtree.

During stable rbtree lookups the stable_node "chain" is identified as
stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
is_stable_node_chain()).

When dropping stable_nodes, the stable_node "dup" is identified as
stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
elsewhere in any stable_node->head/node, to avoid clashes with the
stable_node->node.rb_parent_color pointer, and it must be different
from &migrate_nodes. So the second field of &migrate_nodes is picked
and verified as always safe with a BUILD_BUG_ON, in case the list_head
implementation changes in the future.

The STABLE_NODE_CHAIN marker is picked as a random negative value in
stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative
when it's a "regular" stable_node or a stable_node "dup".

The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
is aliased in a union with a time field used to rate limit the
stable_node_chain->hlist prunes.

The garbage collection of the stable_node_chain happens lazily during
stable rbtree lookups (as for all other kinds of stable_nodes), or
while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run", which
collects the entire stable rbtree.

While the "regular" stable_nodes and the stable_node "dups" must wait
for their underlying tree_page to be freed before they can be freed
themselves, the stable_node "chains" can be freed immediately if the
stable_node->hlist becomes empty. This is because the "chains" are
never pointed to by any page->mapping and they're effectively
self-contained stable rbtree metadata internal to KSM.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/ksm.txt |  63 ++++
 mm/ksm.c                 | 731 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 728 insertions(+), 66 deletions(-)

diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index f34a8ee..a7ef759 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -80,6 +80,50 @@ run              - set 0 to stop ksmd from running but keep merged pages,
                    Default: 0 (must be changed to 1 to activate KSM,
                                except if CONFIG_SYSFS is disabled)
 
+max_page_sharing - Maximum sharing allowed for each KSM page. This
+                   enforces a deduplication limit to avoid the virtual
+                   memory rmap lists from growing too large. The minimum
+                   value is 2 as a newly created KSM page will have at
+                   least two sharers. The rmap walk has O(N)
+                   complexity where N is the number of rmap_items
+                   (i.e. virtual mappings) that are sharing the page,
+                   which is in turn capped by max_page_sharing. So
+                   this effectively spreads the linear O(N)
+                   computational complexity from rmap walk context
+                   over different KSM pages. The ksmd walk over the
+                   stable_node "chains" is also O(N), but N is the
+                   number of stable_node "dups", not the number of
+                   rmap_items, so it does not have a significant impact on
+                   ksmd performance. In practice the best stable_node
+                   "dup" candidate will be kept and found at the head
+                   of the "dups" list. The higher this value the
+                   faster KSM will merge the memory (because there
+                   will be fewer stable_node dups queued into the
+                   stable_node chain->hlist to check for pruning) and
+                   the higher the deduplication factor will be, but
+                   the slower the worst case rmap walk could be for
+                   any given KSM page. Slowing down the rmap_walk
+                   means there will be higher latency for certain
+                   virtual memory operations happening during
+                   swapping, compaction, NUMA balancing and page
+                   migration, in turn decreasing responsiveness for
+                   the caller of those virtual memory operations. The
+                   scheduler latency of other tasks not involved with
+                   the VM operations doing the rmap walk is not
+                   affected by this parameter as the rmap walks are
+                   always schedule friendly themselves.
+
+stable_node_chains_prune_millisecs - How frequently to walk the whole
+                   list of stable_node "dups" linked in the
+                   stable_node "chains" in order to prune stale
+                   stable_nodes. Smaller millisecs values will free
+                   up the KSM metadata with lower latency, but they
+                   will make ksmd use more CPU during the scan. This
+                   only applies to the stable_node chains so it's a
+                   noop if not a single KSM page has hit the
+                   max_page_sharing limit yet (there would be no stable_node
+                   chains in that case).
+
 The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
 
 pages_shared     - how many shared pages are being used
@@ -88,10 +132,29 @@ pages_unshared   - how many pages unique but repeatedly checked for merging
 pages_volatile   - how many pages changing too fast to be placed in a tree
 full_scans       - how many times all mergeable areas have been scanned
 
+stable_node_chains - number of stable node chains allocated; this is
+		     effectively the number of KSM pages that hit the
+		     max_page_sharing limit
+stable_node_dups   - number of stable node dups queued into the
+		     stable_node chains
+
 A high ratio of pages_sharing to pages_shared indicates good sharing, but
 a high ratio of pages_unshared to pages_sharing indicates wasted effort.
 pages_volatile embraces several different kinds of activity, but a high
 proportion there would also indicate poor use of madvise MADV_MERGEABLE.
 
+The maximum possible pages_sharing/pages_shared ratio is limited by the
+max_page_sharing tunable. To increase the ratio max_page_sharing must
+be increased accordingly.
+
+The stable_node_dups/stable_node_chains ratio is also affected by the
+max_page_sharing tunable, and a high ratio may indicate fragmentation
+in the stable_node dups, which could be solved by introducing
+defragmentation algorithms in ksmd which would refile rmap_items from
+one stable_node dup to another stable_node dup, in order to free up
+stable_node "dups" with few rmap_items in them, but that may increase
+the ksmd CPU usage and possibly slow down the read-only computations on
+the KSM pages of the applications.
+
 Izik Eidus,
 Hugh Dickins, 17 Nov 2009
diff --git a/mm/ksm.c b/mm/ksm.c
index b5cd647..fd0bf51 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -126,9 +126,12 @@ struct ksm_scan {
  * struct stable_node - node of the stable rbtree
  * @node: rb node of this ksm page in the stable tree
  * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
+ * @hlist_dup: linked into the stable_node->hlist with a stable_node chain
  * @list: linked into migrate_nodes, pending placement in the proper node tree
  * @hlist: hlist head of rmap_items using this ksm page
  * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
+ * @chain_prune_time: time of the last full garbage collection
+ * @rmap_hlist_len: number of rmap_item entries in hlist or STABLE_NODE_CHAIN
  * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
  */
 struct stable_node {
@@ -136,11 +139,24 @@ struct stable_node {
 		struct rb_node node;	/* when node of stable tree */
 		struct {		/* when listed for migration */
 			struct list_head *head;
-			struct list_head list;
+			struct {
+				struct hlist_node hlist_dup;
+				struct list_head list;
+			};
 		};
 	};
 	struct hlist_head hlist;
-	unsigned long kpfn;
+	union {
+		unsigned long kpfn;
+		unsigned long chain_prune_time;
+	};
+	/*
+	 * STABLE_NODE_CHAIN can be any negative number in
+	 * the rmap_hlist_len negative range, but better not -1 so that
+	 * underflows can be reliably detected.
+	 */
+#define STABLE_NODE_CHAIN -1024
+	int rmap_hlist_len;
 #ifdef CONFIG_NUMA
 	int nid;
 #endif
@@ -190,6 +206,7 @@ static struct rb_root *root_unstable_tree = one_unstable_tree;
 
 /* Recently migrated nodes of stable tree, pending proper placement */
 static LIST_HEAD(migrate_nodes);
+#define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)
 
 #define MM_SLOTS_HASH_BITS 10
 static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
@@ -217,6 +234,18 @@ static unsigned long ksm_pages_unshared;
 /* The number of rmap_items in use: to calculate pages_volatile */
 static unsigned long ksm_rmap_items;
 
+/* The number of stable_node chains */
+static unsigned long ksm_stable_node_chains;
+
+/* The number of stable_node dups linked to the stable_node chains */
+static unsigned long ksm_stable_node_dups;
+
+/* Delay in pruning stale stable_node_dups in the stable_node_chains */
+static int ksm_stable_node_chains_prune_millisecs = 2000;
+
+/* Maximum number of page slots sharing a stable node */
+static int ksm_max_page_sharing = 256;
+
 /* Number of pages ksmd should scan in one batch */
 static unsigned int ksm_thread_pages_to_scan = 100;
 
@@ -279,6 +308,44 @@ static void __init ksm_slab_free(void)
 	mm_slot_cache = NULL;
 }
 
+static __always_inline bool is_stable_node_chain(struct stable_node *chain)
+{
+	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
+}
+
+static __always_inline bool is_stable_node_dup(struct stable_node *dup)
+{
+	return dup->head == STABLE_NODE_DUP_HEAD;
+}
+
+static inline void stable_node_chain_add_dup(struct stable_node *dup,
+					     struct stable_node *chain)
+{
+	VM_BUG_ON(is_stable_node_dup(dup));
+	dup->head = STABLE_NODE_DUP_HEAD;
+	VM_BUG_ON(!is_stable_node_chain(chain));
+	hlist_add_head(&dup->hlist_dup, &chain->hlist);
+	ksm_stable_node_dups++;
+}
+
+static inline void __stable_node_dup_del(struct stable_node *dup)
+{
+	hlist_del(&dup->hlist_dup);
+	ksm_stable_node_dups--;
+}
+
+static inline void stable_node_dup_del(struct stable_node *dup)
+{
+	VM_BUG_ON(is_stable_node_chain(dup));
+	if (is_stable_node_dup(dup))
+		__stable_node_dup_del(dup);
+	else
+		rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid));
+#ifdef CONFIG_DEBUG_VM
+	dup->head = NULL;
+#endif
+}
+
 static inline struct rmap_item *alloc_rmap_item(void)
 {
 	struct rmap_item *rmap_item;
@@ -303,6 +370,8 @@ static inline struct stable_node *alloc_stable_node(void)
 
 static inline void free_stable_node(struct stable_node *stable_node)
 {
+	VM_BUG_ON(stable_node->rmap_hlist_len &&
+		  !is_stable_node_chain(stable_node));
 	kmem_cache_free(stable_node_cache, stable_node);
 }
 
@@ -493,25 +562,80 @@ static inline int get_kpfn_nid(unsigned long kpfn)
 	return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn));
 }
 
+static struct stable_node *alloc_stable_node_chain(struct stable_node *dup,
+						   struct rb_root *root)
+{
+	struct stable_node *chain = alloc_stable_node();
+	VM_BUG_ON(is_stable_node_chain(dup));
+	if (likely(chain)) {
+		INIT_HLIST_HEAD(&chain->hlist);
+		chain->chain_prune_time = jiffies;
+		chain->rmap_hlist_len = STABLE_NODE_CHAIN;
+#ifdef CONFIG_DEBUG_VM
+		chain->nid = -1; /* debug */
+#endif
+		ksm_stable_node_chains++;
+
+		/*
+		 * Put the stable node chain in the first dimension of
+		 * the stable tree and at the same time remove the old
+		 * stable node.
+		 */
+		rb_replace_node(&dup->node, &chain->node, root);
+
+		/*
+		 * Move the old stable node to the second dimension
+		 * queued in the hlist_dup. The invariant is that all
+		 * dup stable_nodes in the chain->hlist point to pages
+		 * that are wrprotected and have the exact same
+		 * content.
+		 */
+		stable_node_chain_add_dup(dup, chain);
+	}
+	return chain;
+}
+
+static inline void free_stable_node_chain(struct stable_node *chain,
+					  struct rb_root *root)
+{
+	rb_erase(&chain->node, root);
+	free_stable_node(chain);
+	ksm_stable_node_chains--;
+}
+
 static void remove_node_from_stable_tree(struct stable_node *stable_node)
 {
 	struct rmap_item *rmap_item;
 
+	/* check it's not STABLE_NODE_CHAIN or negative */
+	BUG_ON(stable_node->rmap_hlist_len < 0);
+
 	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
 		if (rmap_item->hlist.next)
 			ksm_pages_sharing--;
 		else
 			ksm_pages_shared--;
+		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
+		stable_node->rmap_hlist_len--;
 		put_anon_vma(rmap_item->anon_vma);
 		rmap_item->address &= PAGE_MASK;
 		cond_resched();
 	}
 
+	/*
+	 * We need the second aligned pointer of the migrate_nodes
+	 * list_head to stay clear from the rb_parent_color union
+	 * (aligned and different than any node) and also different
+	 * from &migrate_nodes. This will verify that future list.h changes
+	 * don't break STABLE_NODE_DUP_HEAD.
+	 */
+	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes);
+	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1);
+
 	if (stable_node->head == &migrate_nodes)
 		list_del(&stable_node->list);
 	else
-		rb_erase(&stable_node->node,
-			 root_stable_tree + NUMA(stable_node->nid));
+		stable_node_dup_del(stable_node);
 	free_stable_node(stable_node);
 }
 
@@ -630,6 +754,8 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
 			ksm_pages_sharing--;
 		else
 			ksm_pages_shared--;
+		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
+		stable_node->rmap_hlist_len--;
 
 		put_anon_vma(rmap_item->anon_vma);
 		rmap_item->address &= PAGE_MASK;
@@ -738,6 +864,31 @@ static int remove_stable_node(struct stable_node *stable_node)
 	return err;
 }
 
+static int remove_stable_node_chain(struct stable_node *stable_node,
+				    struct rb_root *root)
+{
+	struct stable_node *dup;
+	struct hlist_node *hlist_safe;
+
+	if (!is_stable_node_chain(stable_node)) {
+		VM_BUG_ON(is_stable_node_dup(stable_node));
+		if (remove_stable_node(stable_node))
+			return true;
+		else
+			return false;
+	}
+
+	hlist_for_each_entry_safe(dup, hlist_safe,
+				  &stable_node->hlist, hlist_dup) {
+		VM_BUG_ON(!is_stable_node_dup(dup));
+		if (remove_stable_node(dup))
+			return true;
+	}
+	BUG_ON(!hlist_empty(&stable_node->hlist));
+	free_stable_node_chain(stable_node, root);
+	return false;
+}
+
 static int remove_all_stable_nodes(void)
 {
 	struct stable_node *stable_node;
@@ -749,7 +900,8 @@ static int remove_all_stable_nodes(void)
 		while (root_stable_tree[nid].rb_node) {
 			stable_node = rb_entry(root_stable_tree[nid].rb_node,
 						struct stable_node, node);
-			if (remove_stable_node(stable_node)) {
+			if (remove_stable_node_chain(stable_node,
+						     root_stable_tree + nid)) {
 				err = -EBUSY;
 				break;	/* proceed to next nid */
 			}
@@ -1136,6 +1288,163 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
 	return err ? NULL : page;
 }
 
+static __always_inline
+bool __is_page_sharing_candidate(struct stable_node *stable_node, int offset)
+{
+	VM_BUG_ON(stable_node->rmap_hlist_len < 0);
+	/*
+	 * Check that at least one mapping still exists, otherwise
+	 * there's no much point to merge and share with this
+	 * there's not much point to merge and share with this
+	 * sharer is going to be freed soon.
+	 */
+	return stable_node->rmap_hlist_len &&
+		stable_node->rmap_hlist_len + offset < ksm_max_page_sharing;
+}
+
+static __always_inline
+bool is_page_sharing_candidate(struct stable_node *stable_node)
+{
+	return __is_page_sharing_candidate(stable_node, 0);
+}
+
+static struct stable_node *stable_node_dup(struct stable_node *stable_node,
+					   struct page **tree_page,
+					   struct rb_root *root,
+					   bool prune_stale_stable_nodes)
+{
+	struct stable_node *dup, *found = NULL;
+	struct hlist_node *hlist_safe;
+	struct page *_tree_page;
+	int nr = 0;
+	int found_rmap_hlist_len;
+
+	if (!prune_stale_stable_nodes ||
+	    time_before(jiffies, stable_node->chain_prune_time +
+			msecs_to_jiffies(
+				ksm_stable_node_chains_prune_millisecs)))
+		prune_stale_stable_nodes = false;
+	else
+		stable_node->chain_prune_time = jiffies;
+
+	hlist_for_each_entry_safe(dup, hlist_safe,
+				  &stable_node->hlist, hlist_dup) {
+		cond_resched();
+		/*
+		 * We must walk all stable_node_dup to prune the stale
+		 * stable nodes during lookup.
+		 *
+		 * get_ksm_page can drop the nodes from the
+		 * stable_node->hlist if they point to freed pages
+		 * (that's why we do a _safe walk). The "dup"
+		 * stable_node parameter itself will be freed from
+		 * under us if it returns NULL.
+		 */
+		_tree_page = get_ksm_page(dup, false);
+		if (!_tree_page)
+			continue;
+		nr += 1;
+		if (is_page_sharing_candidate(dup)) {
+			if (!found ||
+			    dup->rmap_hlist_len > found_rmap_hlist_len) {
+				if (found)
+					put_page(*tree_page);
+				found = dup;
+				found_rmap_hlist_len = found->rmap_hlist_len;
+				*tree_page = _tree_page;
+
+				if (!prune_stale_stable_nodes)
+					break;
+				/* skip put_page */
+				continue;
+			}
+		}
+		put_page(_tree_page);
+	}
+
+	/*
+	 * nr is relevant only if prune_stale_stable_nodes is true,
+	 * otherwise we may break the loop at nr == 1 even if there
+	 * are multiple entries.
+	 */
+	if (prune_stale_stable_nodes && found) {
+		if (nr == 1) {
+			/*
+			 * If there's not just one entry it would
+			 * corrupt memory, better BUG_ON. In KSM
+			 * context with no lock held it's not even
+			 * fatal.
+			 */
+			BUG_ON(stable_node->hlist.first->next);
+
+			/*
+			 * There's just one entry and it is below the
+			 * deduplication limit so drop the chain.
+			 */
+			rb_replace_node(&stable_node->node, &found->node,
+					root);
+			free_stable_node(stable_node);
+			ksm_stable_node_chains--;
+			ksm_stable_node_dups--;
+		} else if (__is_page_sharing_candidate(found, 1)) {
+			/*
+			 * Refile our candidate at the head
+			 * after the prune if our candidate
+			 * can accept one more future sharing
+			 * in addition to the one underway.
+			 */
+			hlist_del(&found->hlist_dup);
+			hlist_add_head(&found->hlist_dup,
+				       &stable_node->hlist);
+		}
+	}
+
+	return found;
+}
+
+static struct stable_node *stable_node_dup_any(struct stable_node *stable_node,
+					       struct rb_root *root)
+{
+	if (!is_stable_node_chain(stable_node))
+		return stable_node;
+	if (hlist_empty(&stable_node->hlist)) {
+		free_stable_node_chain(stable_node, root);
+		return NULL;
+	}
+	return hlist_entry(stable_node->hlist.first,
+			   typeof(*stable_node), hlist_dup);
+}
+
+static struct stable_node *__stable_node_chain(struct stable_node *stable_node,
+					       struct page **tree_page,
+					       struct rb_root *root,
+					       bool prune_stale_stable_nodes)
+{
+	if (!is_stable_node_chain(stable_node)) {
+		if (is_page_sharing_candidate(stable_node)) {
+			*tree_page = get_ksm_page(stable_node, false);
+			return stable_node;
+		}
+		return NULL;
+	}
+	return stable_node_dup(stable_node, tree_page, root,
+			       prune_stale_stable_nodes);
+}
+
+static __always_inline struct stable_node *chain_prune(struct stable_node *s_n,
+						       struct page **t_p,
+						       struct rb_root *root)
+{
+	return __stable_node_chain(s_n, t_p, root, true);
+}
+
+static __always_inline struct stable_node *chain(struct stable_node *s_n,
+						 struct page **t_p,
+						 struct rb_root *root)
+{
+	return __stable_node_chain(s_n, t_p, root, false);
+}
+
 /*
  * stable_tree_search - search for page inside the stable tree
  *
@@ -1151,7 +1460,7 @@ static struct page *stable_tree_search(struct page *page)
 	struct rb_root *root;
 	struct rb_node **new;
 	struct rb_node *parent;
-	struct stable_node *stable_node;
+	struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
 	struct stable_node *page_node;
 
 	page_node = page_stable_node(page);
@@ -1173,7 +1482,32 @@ again:
 
 		cond_resched();
 		stable_node = rb_entry(*new, struct stable_node, node);
-		tree_page = get_ksm_page(stable_node, false);
+		stable_node_any = NULL;
+		stable_node_dup = chain_prune(stable_node, &tree_page, root);
+		if (!stable_node_dup) {
+			/*
+			 * Either all stable_node dups were full in
+			 * this stable_node chain, or this chain was
+			 * empty and should be rb_erased.
+			 */
+			stable_node_any = stable_node_dup_any(stable_node,
+							      root);
+			if (!stable_node_any) {
+				/* rb_erase just run */
+				goto again;
+			}
+			/*
+			 * Take any of the stable_node dups page of
+			 * this stable_node chain to let the tree walk
+			 * continue. All KSM pages belonging to the
+			 * stable_node dups in a stable_node chain
+			 * have the same content and they're
+			 * wrprotected at all times. Any will work
+			 * fine to continue the walk.
+			 */
+			tree_page = get_ksm_page(stable_node_any, false);
+		}
+		VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
 		if (!tree_page) {
 			/*
 			 * If we walked over a stale stable_node,
@@ -1196,6 +1530,34 @@ again:
 		else if (ret > 0)
 			new = &parent->rb_right;
 		else {
+			if (page_node) {
+				VM_BUG_ON(page_node->head != &migrate_nodes);
+				/*
+				 * Test if the migrated page should be merged
+				 * into a stable node dup. If the mapcount is
+				 * 1 we can migrate it with another KSM page
+				 * without adding it to the chain.
+				 */
+				if (page_mapcount(page) > 1)
+					goto chain_append;
+			}
+
+			if (!stable_node_dup) {
+				/*
+				 * If the stable_node is a chain and
+				 * we got a payload match in memcmp
+				 * but we cannot merge the scanned
+				 * page in any of the existing
+				 * stable_node dups because they're
+				 * all full, we need to wait the
+				 * scanned page to find itself a match
+				 * in the unstable tree to create a
+				 * brand new KSM page to add later to
+				 * the dups of this stable_node.
+				 */
+				return NULL;
+			}
+
 			/*
 			 * Lock and unlock the stable_node's page (which
 			 * might already have been migrated) so that page
@@ -1203,23 +1565,21 @@ again:
 			 * It would be more elegant to return stable_node
 			 * than kpage, but that involves more changes.
 			 */
-			tree_page = get_ksm_page(stable_node, true);
-			if (tree_page) {
-				unlock_page(tree_page);
-				if (get_kpfn_nid(stable_node->kpfn) !=
-						NUMA(stable_node->nid)) {
-					put_page(tree_page);
-					goto replace;
-				}
-				return tree_page;
-			}
-			/*
-			 * There is now a place for page_node, but the tree may
-			 * have been rebalanced, so re-evaluate parent and new.
-			 */
-			if (page_node)
+			tree_page = get_ksm_page(stable_node_dup, true);
+			if (unlikely(!tree_page))
+				/*
+				 * The tree may have been rebalanced,
+				 * so re-evaluate parent and new.
+				 */
 				goto again;
-			return NULL;
+			unlock_page(tree_page);
+
+			if (get_kpfn_nid(stable_node_dup->kpfn) !=
+			    NUMA(stable_node_dup->nid)) {
+				put_page(tree_page);
+				goto replace;
+			}
+			return tree_page;
 		}
 	}
 
@@ -1230,22 +1590,72 @@ again:
 	DO_NUMA(page_node->nid = nid);
 	rb_link_node(&page_node->node, parent, new);
 	rb_insert_color(&page_node->node, root);
-	get_page(page);
-	return page;
+out:
+	if (is_page_sharing_candidate(page_node)) {
+		get_page(page);
+		return page;
+	} else
+		return NULL;
 
 replace:
-	if (page_node) {
-		list_del(&page_node->list);
-		DO_NUMA(page_node->nid = nid);
-		rb_replace_node(&stable_node->node, &page_node->node, root);
-		get_page(page);
+	if (stable_node_dup == stable_node) {
+		/* there is no chain */
+		if (page_node) {
+			VM_BUG_ON(page_node->head != &migrate_nodes);
+			list_del(&page_node->list);
+			DO_NUMA(page_node->nid = nid);
+			rb_replace_node(&stable_node->node, &page_node->node,
+					root);
+			if (is_page_sharing_candidate(page_node))
+				get_page(page);
+			else
+				page = NULL;
+		} else {
+			rb_erase(&stable_node->node, root);
+			page = NULL;
+		}
 	} else {
-		rb_erase(&stable_node->node, root);
-		page = NULL;
+		VM_BUG_ON(!is_stable_node_chain(stable_node));
+		__stable_node_dup_del(stable_node_dup);
+		if (page_node) {
+			VM_BUG_ON(page_node->head != &migrate_nodes);
+			list_del(&page_node->list);
+			DO_NUMA(page_node->nid = nid);
+			stable_node_chain_add_dup(page_node, stable_node);
+			if (is_page_sharing_candidate(page_node))
+				get_page(page);
+			else
+				page = NULL;
+		} else {
+			page = NULL;
+		}
 	}
-	stable_node->head = &migrate_nodes;
-	list_add(&stable_node->list, stable_node->head);
+	stable_node_dup->head = &migrate_nodes;
+	list_add(&stable_node_dup->list, stable_node_dup->head);
 	return page;
+
+chain_append:
+	/* stable_node_dup could be null if it reached the limit */
+	if (!stable_node_dup)
+		stable_node_dup = stable_node_any;
+	if (stable_node_dup == stable_node) {
+		/* chain is missing so create it */
+		stable_node = alloc_stable_node_chain(stable_node_dup,
+						      root);
+		if (!stable_node)
+			return NULL;
+	}
+	/*
+	 * Add this stable_node dup that was
+	 * migrated to the stable_node chain
+	 * of the current nid for this page
+	 * content.
+	 */
+	VM_BUG_ON(page_node->head != &migrate_nodes);
+	list_del(&page_node->list);
+	DO_NUMA(page_node->nid = nid);
+	stable_node_chain_add_dup(page_node, stable_node);
+	goto out;
 }
 
 /*
@@ -1262,7 +1672,8 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
 	struct rb_root *root;
 	struct rb_node **new;
 	struct rb_node *parent;
-	struct stable_node *stable_node;
+	struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
+	bool need_chain = false;
 
 	kpfn = page_to_pfn(kpage);
 	nid = get_kpfn_nid(kpfn);
@@ -1277,7 +1688,32 @@ again:
 
 		cond_resched();
 		stable_node = rb_entry(*new, struct stable_node, node);
-		tree_page = get_ksm_page(stable_node, false);
+		stable_node_any = NULL;
+		stable_node_dup = chain(stable_node, &tree_page, root);
+		if (!stable_node_dup) {
+			/*
+			 * Either all stable_node dups were full in
+			 * this stable_node chain, or this chain was
+			 * empty and should be rb_erased.
+			 */
+			stable_node_any = stable_node_dup_any(stable_node,
+							      root);
+			if (!stable_node_any) {
+				/* rb_erase just run */
+				goto again;
+			}
+			/*
+			 * Take any of the stable_node dups page of
+			 * this stable_node chain to let the tree walk
+			 * continue. All KSM pages belonging to the
+			 * stable_node dups in a stable_node chain
+			 * have the same content and they're
+			 * wrprotected at all times. Any will work
+			 * fine to continue the walk.
+			 */
+			tree_page = get_ksm_page(stable_node_any, false);
+		}
+		VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
 		if (!tree_page) {
 			/*
 			 * If we walked over a stale stable_node,
@@ -1300,27 +1736,37 @@ again:
 		else if (ret > 0)
 			new = &parent->rb_right;
 		else {
-			/*
-			 * It is not a bug that stable_tree_search() didn't
-			 * find this node: because at that time our page was
-			 * not yet write-protected, so may have changed since.
-			 */
-			return NULL;
+			need_chain = true;
+			break;
 		}
 	}
 
-	stable_node = alloc_stable_node();
-	if (!stable_node)
+	stable_node_dup = alloc_stable_node();
+	if (!stable_node_dup)
 		return NULL;
 
-	INIT_HLIST_HEAD(&stable_node->hlist);
-	stable_node->kpfn = kpfn;
-	set_page_stable_node(kpage, stable_node);
-	DO_NUMA(stable_node->nid = nid);
-	rb_link_node(&stable_node->node, parent, new);
-	rb_insert_color(&stable_node->node, root);
+	INIT_HLIST_HEAD(&stable_node_dup->hlist);
+	stable_node_dup->kpfn = kpfn;
+	set_page_stable_node(kpage, stable_node_dup);
+	stable_node_dup->rmap_hlist_len = 0;
+	DO_NUMA(stable_node_dup->nid = nid);
+	if (!need_chain) {
+		rb_link_node(&stable_node_dup->node, parent, new);
+		rb_insert_color(&stable_node_dup->node, root);
+	} else {
+		if (!is_stable_node_chain(stable_node)) {
+			struct stable_node *orig = stable_node;
+			/* chain is missing so create it */
+			stable_node = alloc_stable_node_chain(orig, root);
+			if (!stable_node) {
+				free_stable_node(stable_node_dup);
+				return NULL;
+			}
+		}
+		stable_node_chain_add_dup(stable_node_dup, stable_node);
+	}
 
-	return stable_node;
+	return stable_node_dup;
 }
 
 /*
@@ -1410,8 +1856,27 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
  * the same ksm page.
  */
 static void stable_tree_append(struct rmap_item *rmap_item,
-			       struct stable_node *stable_node)
+			       struct stable_node *stable_node,
+			       bool max_page_sharing_bypass)
 {
+	/*
+	 * rmap won't find this mapping if we don't insert the
+	 * rmap_item in the right stable_node
+	 * duplicate. page_migration could break later if rmap breaks,
+	 * so we can as well crash here. We really need to check for
+	 * rmap_hlist_len == STABLE_NODE_CHAIN, but we can as well check
+	 * for any other negative value, as an underflow detected here
+	 * for the first time (and not when decreasing rmap_hlist_len)
+	 * would be a sign of memory corruption in the stable_node.
+	 */
+	BUG_ON(stable_node->rmap_hlist_len < 0);
+
+	stable_node->rmap_hlist_len++;
+	if (!max_page_sharing_bypass)
+		/* possibly non fatal but unexpected overflow, only warn */
+		WARN_ON_ONCE(stable_node->rmap_hlist_len >
+			     ksm_max_page_sharing);
+
 	rmap_item->head = stable_node;
 	rmap_item->address |= STABLE_FLAG;
 	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
@@ -1439,19 +1904,26 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 	struct page *kpage;
 	unsigned int checksum;
 	int err;
+	bool max_page_sharing_bypass = false;
 
 	stable_node = page_stable_node(page);
 	if (stable_node) {
 		if (stable_node->head != &migrate_nodes &&
-		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
-			rb_erase(&stable_node->node,
-				 root_stable_tree + NUMA(stable_node->nid));
+		    get_kpfn_nid(READ_ONCE(stable_node->kpfn)) !=
+		    NUMA(stable_node->nid)) {
+			stable_node_dup_del(stable_node);
 			stable_node->head = &migrate_nodes;
 			list_add(&stable_node->list, stable_node->head);
 		}
 		if (stable_node->head != &migrate_nodes &&
 		    rmap_item->head == stable_node)
 			return;
+		/*
+		 * If it's a KSM fork, allow it to go over the sharing limit
+		 * without warnings.
+		 */
+		if (!is_page_sharing_candidate(stable_node))
+			max_page_sharing_bypass = true;
 	}
 
 	/* We first start with searching the page inside the stable tree */
@@ -1471,7 +1943,8 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 			 * add its rmap_item to the stable tree.
 			 */
 			lock_page(kpage);
-			stable_tree_append(rmap_item, page_stable_node(kpage));
+			stable_tree_append(rmap_item, page_stable_node(kpage),
+					   max_page_sharing_bypass);
 			unlock_page(kpage);
 		}
 		put_page(kpage);
@@ -1504,8 +1977,10 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 			lock_page(kpage);
 			stable_node = stable_tree_insert(kpage);
 			if (stable_node) {
-				stable_tree_append(tree_rmap_item, stable_node);
-				stable_tree_append(rmap_item, stable_node);
+				stable_tree_append(tree_rmap_item, stable_node,
+						   false);
+				stable_tree_append(rmap_item, stable_node,
+						   false);
 			}
 			unlock_page(kpage);
 
@@ -2009,6 +2484,48 @@ static void wait_while_offlining(void)
 	}
 }
 
+static bool stable_node_dup_remove_range(struct stable_node *stable_node,
+					 unsigned long start_pfn,
+					 unsigned long end_pfn)
+{
+	if (stable_node->kpfn >= start_pfn &&
+	    stable_node->kpfn < end_pfn) {
+		/*
+		 * Don't get_ksm_page, page has already gone:
+		 * which is why we keep kpfn instead of page*
+		 */
+		remove_node_from_stable_tree(stable_node);
+		return true;
+	}
+	return false;
+}
+
+static bool stable_node_chain_remove_range(struct stable_node *stable_node,
+					   unsigned long start_pfn,
+					   unsigned long end_pfn,
+					   struct rb_root *root)
+{
+	struct stable_node *dup;
+	struct hlist_node *hlist_safe;
+
+	if (!is_stable_node_chain(stable_node)) {
+		VM_BUG_ON(is_stable_node_dup(stable_node));
+		return stable_node_dup_remove_range(stable_node, start_pfn,
+						    end_pfn);
+	}
+
+	hlist_for_each_entry_safe(dup, hlist_safe,
+				  &stable_node->hlist, hlist_dup) {
+		VM_BUG_ON(!is_stable_node_dup(dup));
+		stable_node_dup_remove_range(dup, start_pfn, end_pfn);
+	}
+	if (hlist_empty(&stable_node->hlist)) {
+		free_stable_node_chain(stable_node, root);
+		return true; /* notify caller that tree was rebalanced */
+	} else
+		return false;
+}
+
 static void ksm_check_stable_tree(unsigned long start_pfn,
 				  unsigned long end_pfn)
 {
@@ -2021,15 +2538,12 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
 		node = rb_first(root_stable_tree + nid);
 		while (node) {
 			stable_node = rb_entry(node, struct stable_node, node);
-			if (stable_node->kpfn >= start_pfn &&
-			    stable_node->kpfn < end_pfn) {
-				/*
-				 * Don't get_ksm_page, page has already gone:
-				 * which is why we keep kpfn instead of page*
-				 */
-				remove_node_from_stable_tree(stable_node);
+			if (stable_node_chain_remove_range(stable_node,
+							   start_pfn, end_pfn,
+							   root_stable_tree +
+							   nid))
 				node = rb_first(root_stable_tree + nid);
-			} else
+			else
 				node = rb_next(node);
 			cond_resched();
 		}
@@ -2254,6 +2768,47 @@ static ssize_t merge_across_nodes_store(struct kobject *kobj,
 KSM_ATTR(merge_across_nodes);
 #endif
 
+static ssize_t max_page_sharing_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_max_page_sharing);
+}
+
+static ssize_t max_page_sharing_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int err;
+	int knob;
+
+	err = kstrtoint(buf, 10, &knob);
+	if (err)
+		return err;
+	/*
+	 * When a KSM page is created it is shared by 2 mappings. This
+	 * being a signed comparison, it implicitly verifies it's not
+	 * negative.
+	 */
+	if (knob < 2)
+		return -EINVAL;
+
+	if (READ_ONCE(ksm_max_page_sharing) == knob)
+		return count;
+
+	mutex_lock(&ksm_thread_mutex);
+	wait_while_offlining();
+	if (ksm_max_page_sharing != knob) {
+		if (ksm_pages_shared || remove_all_stable_nodes())
+			err = -EBUSY;
+		else
+			ksm_max_page_sharing = knob;
+	}
+	mutex_unlock(&ksm_thread_mutex);
+
+	return err ? err : count;
+}
+KSM_ATTR(max_page_sharing);
+
 static ssize_t pages_shared_show(struct kobject *kobj,
 				 struct kobj_attribute *attr, char *buf)
 {
@@ -2292,6 +2847,46 @@ static ssize_t pages_volatile_show(struct kobject *kobj,
 }
 KSM_ATTR_RO(pages_volatile);
 
+static ssize_t stable_node_dups_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", ksm_stable_node_dups);
+}
+KSM_ATTR_RO(stable_node_dups);
+
+static ssize_t stable_node_chains_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%lu\n", ksm_stable_node_chains);
+}
+KSM_ATTR_RO(stable_node_chains);
+
+static ssize_t
+stable_node_chains_prune_millisecs_show(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					char *buf)
+{
+	return sprintf(buf, "%u\n", ksm_stable_node_chains_prune_millisecs);
+}
+
+static ssize_t
+stable_node_chains_prune_millisecs_store(struct kobject *kobj,
+					 struct kobj_attribute *attr,
+					 const char *buf, size_t count)
+{
+	unsigned long msecs;
+	int err;
+
+	err = kstrtoul(buf, 10, &msecs);
+	if (err || msecs > UINT_MAX)
+		return -EINVAL;
+
+	ksm_stable_node_chains_prune_millisecs = msecs;
+
+	return count;
+}
+KSM_ATTR(stable_node_chains_prune_millisecs);
+
 static ssize_t full_scans_show(struct kobject *kobj,
 			       struct kobj_attribute *attr, char *buf)
 {
@@ -2311,6 +2906,10 @@ static struct attribute *ksm_attrs[] = {
 #ifdef CONFIG_NUMA
 	&merge_across_nodes_attr.attr,
 #endif
+	&max_page_sharing_attr.attr,
+	&stable_node_chains_attr.attr,
+	&stable_node_dups_attr.attr,
+	&stable_node_chains_prune_millisecs_attr.attr,
 	NULL,
 };
 


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-11-10 18:44 ` [PATCH 1/1] " Andrea Arcangeli
@ 2015-12-09 16:19   ` Petr Holasek
  2015-12-09 17:15     ` Andrea Arcangeli
  2015-12-11  0:31   ` Andrew Morton
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Petr Holasek @ 2015-12-09 16:19 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Andrew Morton, Arjan van de Ven

On Tue, 10 Nov 2015, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Without a max deduplication limit for each KSM page, the list of the
> rmap_items associated to each stable_node can grow infinitely
> large.
> 
> During the rmap walk each entry can take up to ~10usec to process
> because of IPIs for the TLB flushing (both for the primary MMU and the
> secondary MMUs with the MMU notifier). With only 16GB of address space
> shared in the same KSM page, that would amount to dozens of seconds of
> kernel runtime.
> 
> A ~256 max deduplication factor will reduce the latencies of the rmap
> walks on KSM pages to order of a few msec. Just doing the
> cond_resched() during the rmap walks is not enough, the list size must
> have a limit too, otherwise the caller could get blocked in (schedule
> friendly) kernel computations for seconds, unexpectedly.
> 
> There's room for optimization to significantly reduce the IPI delivery
> cost during the page_referenced(), but at least for page_migration in
> the KSM case (used by hard NUMA bindings, compaction and NUMA
> balancing) it may be inevitable to send lots of IPIs if each
> rmap_item->mm is active on a different CPU and there are lots of
> CPUs. Even if we ignore the IPI delivery cost, we've still to walk the
> whole KSM rmap list, so we can't allow millions or billions (ulimited)
> number of entries in the KSM stable_node rmap_item lists.
> 
> The limit is enforced efficiently by adding a second dimension to the
> stable rbtree. So there are three types of stable_nodes: the regular
> ones (identical as before, living in the first flat dimension of the
> stable rbtree), the "chains" and the "dups".
> 
> Every "chain" and all "dups" linked into a "chain" enforce the
> invariant that they represent the same write protected memory content,
> even if each "dup" will be pointed by a different KSM page copy of
> that content. This way the stable rbtree lookup computational
> complexity is unaffected if compared to an unlimited
> max_sharing_limit. It is still enforced that there cannot be KSM page
> content duplicates in the stable rbtree itself.
> 
> Adding the second dimension to the stable rbtree only after the
> max_page_sharing limit hits, provides for a zero memory footprint
> increase on 64bit archs. The memory overhead of the per-KSM page
> stable_tree and per virtual mapping rmap_item is unchanged. Only after
> the max_page_sharing limit hits, we need to allocate a stable_tree
> "chain" and rb_replace() the "regular" stable_node with the newly
> allocated stable_node "chain". After that we simply add the "regular"
> stable_node to the chain as a stable_node "dup" by linking hlist_dup
> in the stable_node_chain->hlist. This way the "regular" (flat)
> stable_node is converted to a stable_node "dup" living in the second
> dimension of the stable rbtree.
> 
> During stable rbtree lookups the stable_node "chain" is identified as
> stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
> is_stable_node_chain()).
> 
> When dropping stable_nodes, the stable_node "dup" is identified as
> stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
> 
> The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
> elsewhere in any stable_node->head/node to avoid a clashes with the
> stable_node->node.rb_parent_color pointer, and different from
> &migrate_nodes. So the second field of &migrate_nodes is picked and
> verified as always safe with a BUILD_BUG_ON in case the list_head
> implementation changes in the future.
> 
> The STABLE_NODE_DUP is picked as a random negative value in
> stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative
> when it's a "regular" stable_node or a stable_node "dup".
> 
> The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
> is aliased in a union with a time field used to rate limit the
> stable_node_chain->hlist prunes.
> 
> The garbage collection of the stable_node_chain happens lazily during
> stable rbtree lookups (as for all other kind of stable_nodes), or
> while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while
> collecting the entire stable rbtree.

Hi Andrea,

I've been running stress tests against this patchset for a couple of hours
and everything was ok. However, I allocated ~1TB of memory and got the
following lockup while disabling KSM with 'echo 2 > /sys/kernel/mm/ksm/run':

[13201.060601] INFO: task ksmd:351 blocked for more than 120 seconds.
[13201.066812]       Not tainted 4.4.0-rc4+ #5
[13201.070996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[13201.078830] ksmd            D ffff883f65eb7dc8     0   351      2 0x00000000
[13201.085903]  ffff883f65eb7dc8 ffff887f66e26400 ffff883f65d5e400 ffff883f65eb8000
[13201.093343]  ffffffff81a65144 ffff883f65d5e400 00000000ffffffff ffffffff81a65148
[13201.100792]  ffff883f65eb7de0 ffffffff816907e5 ffffffff81a65140 ffff883f65eb7df0
[13201.108242] Call Trace:
[13201.110708]  [<ffffffff816907e5>] schedule+0x35/0x80
[13201.115676]  [<ffffffff81690ace>] schedule_preempt_disabled+0xe/0x10
[13201.122044]  [<ffffffff81692524>] __mutex_lock_slowpath+0xb4/0x130
[13201.128237]  [<ffffffff816925bf>] mutex_lock+0x1f/0x2f
[13201.133395]  [<ffffffff811debd2>] ksm_scan_thread+0x62/0x1f0
[13201.139068]  [<ffffffff810c8ac0>] ? wait_woken+0x80/0x80
[13201.144391]  [<ffffffff811deb70>] ? ksm_do_scan+0x1140/0x1140
[13201.150164]  [<ffffffff810a4378>] kthread+0xd8/0xf0
[13201.155056]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
[13201.160551]  [<ffffffff8169460f>] ret_from_fork+0x3f/0x70
[13201.165961]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60

It seems this is not connected with the new code, but it would be nice to
also make unmerge_and_remove_all_rmap_items() more scheduler friendly.

thanks,
Petr

> 
> While the "regular" stable_nodes and the stable_node "dups" must wait
> for their underlying tree_page to be freed before they can be freed
> themselves, the stable_node "chains" can be freed immediately if the
> stable_node->hlist turns empty. This is because the "chains" are never
> pointed by any page->mapping and they're effectively stable rbtree KSM
> self contained metadata.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/ksm.txt |  63 ++++
>  mm/ksm.c                 | 731 ++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 728 insertions(+), 66 deletions(-)
> 
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index f34a8ee..a7ef759 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -80,6 +80,50 @@ run              - set 0 to stop ksmd from running but keep merged pages,
>                     Default: 0 (must be changed to 1 to activate KSM,
>                                 except if CONFIG_SYSFS is disabled)
>  
> +max_page_sharing - Maximum sharing allowed for each KSM page. This
> +                   enforces a deduplication limit to avoid the virtual
> +                   memory rmap lists to grow too large. The minimum
> +                   value is 2 as a newly created KSM page will have at
> +                   least two sharers. The rmap walk has O(N)
> +                   complexity where N is the number of rmap_items
> +                   (i.e. virtual mappings) that are sharing the page,
> +                   which is in turn capped by max_page_sharing. So
> +                   this effectively spread the the linear O(N)
> +                   computational complexity from rmap walk context
> +                   over different KSM pages. The ksmd walk over the
> +                   stable_node "chains" is also O(N), but N is the
> +                   number of stable_node "dups", not the number of
> +                   rmap_items, so it has not a significant impact on
> +                   ksmd performance. In practice the best stable_node
> +                   "dup" candidate will be kept and found at the head
> +                   of the "dups" list. The higher this value the
> +                   faster KSM will merge the memory (because there
> +                   will be fewer stable_node dups queued into the
> +                   stable_node chain->hlist to check for pruning) and
> +                   the higher the deduplication factor will be, but
> +                   the slowest the worst case rmap walk could be for
> +                   any given KSM page. Slowing down the rmap_walk
> +                   means there will be higher latency for certain
> +                   virtual memory operations happening during
> +                   swapping, compaction, NUMA balancing and page
> +                   migration, in turn decreasing responsiveness for
> +                   the caller of those virtual memory operations. The
> +                   scheduler latency of other tasks not involved with
> +                   the VM operations doing the rmap walk is not
> +                   affected by this parameter as the rmap walks are
> +                   always schedule friendly themselves.
> +
> +stable_node_chains_prune_millisecs - How frequently to walk the whole
> +                   list of stable_node "dups" linked in the
> +                   stable_node "chains" in order to prune stale
> +                   stable_nodes. Smaller milllisecs values will free
> +                   up the KSM metadata with lower latency, but they
> +                   will make ksmd use more CPU during the scan. This
> +                   only applies to the stable_node chains so it's a
> +                   noop if not a single KSM page hit the
> +                   max_page_sharing yet (there would be no stable_node
> +                   chains in such case).
> +
>  The effectiveness of KSM and MADV_MERGEABLE is shown in /sys/kernel/mm/ksm/:
>  
>  pages_shared     - how many shared pages are being used
> @@ -88,10 +132,29 @@ pages_unshared   - how many pages unique but repeatedly checked for merging
>  pages_volatile   - how many pages changing too fast to be placed in a tree
>  full_scans       - how many times all mergeable areas have been scanned
>  
> +stable_node_chains - number of stable node chains allocated, this is
> +		     effectively the number of KSM pages that hit the
> +		     max_page_sharing limit
> +stable_node_dups   - number of stable node dups queued into the
> +		     stable_node chains
> +
>  A high ratio of pages_sharing to pages_shared indicates good sharing, but
>  a high ratio of pages_unshared to pages_sharing indicates wasted effort.
>  pages_volatile embraces several different kinds of activity, but a high
>  proportion there would also indicate poor use of madvise MADV_MERGEABLE.
>  
> +The maximum possible page_sharing/page_shared ratio is limited by the
> +max_page_sharing tunable. To increase the ratio max_page_sharing must
> +be increased accordingly.
> +
> +The stable_node_dups/stable_node_chains ratio is also affected by the
> +max_page_sharing tunable, and an high ratio may indicate fragmentation
> +in the stable_node dups, which could be solved by introducing
> +fragmentation algorithms in ksmd which would refile rmap_items from
> +one stable_node dup to another stable_node dup, in order to freeup
> +stable_node "dups" with few rmap_items in them, but that may increase
> +the ksmd CPU usage and possibly slowdown the readonly computations on
> +the KSM pages of the applications.
> +
>  Izik Eidus,
>  Hugh Dickins, 17 Nov 2009
> diff --git a/mm/ksm.c b/mm/ksm.c
> index b5cd647..fd0bf51 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -126,9 +126,12 @@ struct ksm_scan {
>   * struct stable_node - node of the stable rbtree
>   * @node: rb node of this ksm page in the stable tree
>   * @head: (overlaying parent) &migrate_nodes indicates temporarily on that list
> + * @hlist_dup: linked into the stable_node->hlist with a stable_node chain
>   * @list: linked into migrate_nodes, pending placement in the proper node tree
>   * @hlist: hlist head of rmap_items using this ksm page
>   * @kpfn: page frame number of this ksm page (perhaps temporarily on wrong nid)
> + * @chain_prune_time: time of the last full garbage collection
> + * @rmap_hlist_len: number of rmap_item entries in hlist or STABLE_NODE_CHAIN
>   * @nid: NUMA node id of stable tree in which linked (may not match kpfn)
>   */
>  struct stable_node {
> @@ -136,11 +139,24 @@ struct stable_node {
>  		struct rb_node node;	/* when node of stable tree */
>  		struct {		/* when listed for migration */
>  			struct list_head *head;
> -			struct list_head list;
> +			struct {
> +				struct hlist_node hlist_dup;
> +				struct list_head list;
> +			};
>  		};
>  	};
>  	struct hlist_head hlist;
> -	unsigned long kpfn;
> +	union {
> +		unsigned long kpfn;
> +		unsigned long chain_prune_time;
> +	};
> +	/*
> +	 * STABLE_NODE_CHAIN can be any negative number in
> +	 * rmap_hlist_len negative range, but better not -1 to be able
> +	 * to reliably detect underflows.
> +	 */
> +#define STABLE_NODE_CHAIN -1024
> +	int rmap_hlist_len;
>  #ifdef CONFIG_NUMA
>  	int nid;
>  #endif
> @@ -190,6 +206,7 @@ static struct rb_root *root_unstable_tree = one_unstable_tree;
>  
>  /* Recently migrated nodes of stable tree, pending proper placement */
>  static LIST_HEAD(migrate_nodes);
> +#define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)
>  
>  #define MM_SLOTS_HASH_BITS 10
>  static DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> @@ -217,6 +234,18 @@ static unsigned long ksm_pages_unshared;
>  /* The number of rmap_items in use: to calculate pages_volatile */
>  static unsigned long ksm_rmap_items;
>  
> +/* The number of stable_node chains */
> +static unsigned long ksm_stable_node_chains;
> +
> +/* The number of stable_node dups linked to the stable_node chains */
> +static unsigned long ksm_stable_node_dups;
> +
> +/* Delay in pruning stale stable_node_dups in the stable_node_chains */
> +static int ksm_stable_node_chains_prune_millisecs = 2000;
> +
> +/* Maximum number of page slots sharing a stable node */
> +static int ksm_max_page_sharing = 256;
> +
>  /* Number of pages ksmd should scan in one batch */
>  static unsigned int ksm_thread_pages_to_scan = 100;
>  
> @@ -279,6 +308,44 @@ static void __init ksm_slab_free(void)
>  	mm_slot_cache = NULL;
>  }
>  
> +static __always_inline bool is_stable_node_chain(struct stable_node *chain)
> +{
> +	return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
> +}
> +
> +static __always_inline bool is_stable_node_dup(struct stable_node *dup)
> +{
> +	return dup->head == STABLE_NODE_DUP_HEAD;
> +}
> +
> +static inline void stable_node_chain_add_dup(struct stable_node *dup,
> +					     struct stable_node *chain)
> +{
> +	VM_BUG_ON(is_stable_node_dup(dup));
> +	dup->head = STABLE_NODE_DUP_HEAD;
> +	VM_BUG_ON(!is_stable_node_chain(chain));
> +	hlist_add_head(&dup->hlist_dup, &chain->hlist);
> +	ksm_stable_node_dups++;
> +}
> +
> +static inline void __stable_node_dup_del(struct stable_node *dup)
> +{
> +	hlist_del(&dup->hlist_dup);
> +	ksm_stable_node_dups--;
> +}
> +
> +static inline void stable_node_dup_del(struct stable_node *dup)
> +{
> +	VM_BUG_ON(is_stable_node_chain(dup));
> +	if (is_stable_node_dup(dup))
> +		__stable_node_dup_del(dup);
> +	else
> +		rb_erase(&dup->node, root_stable_tree + NUMA(dup->nid));
> +#ifdef CONFIG_DEBUG_VM
> +	dup->head = NULL;
> +#endif
> +}
> +
>  static inline struct rmap_item *alloc_rmap_item(void)
>  {
>  	struct rmap_item *rmap_item;
> @@ -303,6 +370,8 @@ static inline struct stable_node *alloc_stable_node(void)
>  
>  static inline void free_stable_node(struct stable_node *stable_node)
>  {
> +	VM_BUG_ON(stable_node->rmap_hlist_len &&
> +		  !is_stable_node_chain(stable_node));
>  	kmem_cache_free(stable_node_cache, stable_node);
>  }
>  
> @@ -493,25 +562,80 @@ static inline int get_kpfn_nid(unsigned long kpfn)
>  	return ksm_merge_across_nodes ? 0 : NUMA(pfn_to_nid(kpfn));
>  }
>  
> +static struct stable_node *alloc_stable_node_chain(struct stable_node *dup,
> +						   struct rb_root *root)
> +{
> +	struct stable_node *chain = alloc_stable_node();
> +	VM_BUG_ON(is_stable_node_chain(dup));
> +	if (likely(chain)) {
> +		INIT_HLIST_HEAD(&chain->hlist);
> +		chain->chain_prune_time = jiffies;
> +		chain->rmap_hlist_len = STABLE_NODE_CHAIN;
> +#ifdef CONFIG_DEBUG_VM
> +		chain->nid = -1; /* debug */
> +#endif
> +		ksm_stable_node_chains++;
> +
> +		/*
> +		 * Put the stable node chain in the first dimension of
> +		 * the stable tree and at the same time remove the old
> +		 * stable node.
> +		 */
> +		rb_replace_node(&dup->node, &chain->node, root);
> +
> +		/*
> +		 * Move the old stable node to the second dimension
> +		 * queued in the hlist_dup. The invariant is that all
> +		 * dup stable_nodes in the chain->hlist point to pages
> +		 * that are wrprotected and have the exact same
> +		 * content.
> +		 */
> +		stable_node_chain_add_dup(dup, chain);
> +	}
> +	return chain;
> +}
> +
> +static inline void free_stable_node_chain(struct stable_node *chain,
> +					  struct rb_root *root)
> +{
> +	rb_erase(&chain->node, root);
> +	free_stable_node(chain);
> +	ksm_stable_node_chains--;
> +}
> +
>  static void remove_node_from_stable_tree(struct stable_node *stable_node)
>  {
>  	struct rmap_item *rmap_item;
>  
> +	/* check it's not STABLE_NODE_CHAIN or negative */
> +	BUG_ON(stable_node->rmap_hlist_len < 0);
> +
>  	hlist_for_each_entry(rmap_item, &stable_node->hlist, hlist) {
>  		if (rmap_item->hlist.next)
>  			ksm_pages_sharing--;
>  		else
>  			ksm_pages_shared--;
> +		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
> +		stable_node->rmap_hlist_len--;
>  		put_anon_vma(rmap_item->anon_vma);
>  		rmap_item->address &= PAGE_MASK;
>  		cond_resched();
>  	}
>  
> +	/*
> +	 * We need the second aligned pointer of the migrate_nodes
> +	 * list_head to stay clear from the rb_parent_color union
> +	 * (aligned and different than any node) and also different
> +	 * from &migrate_nodes. This will verify that future list.h changes
> +	 * don't break STABLE_NODE_DUP_HEAD.
> +	 */
> +	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes);
> +	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1);
> +
>  	if (stable_node->head == &migrate_nodes)
>  		list_del(&stable_node->list);
>  	else
> -		rb_erase(&stable_node->node,
> -			 root_stable_tree + NUMA(stable_node->nid));
> +		stable_node_dup_del(stable_node);
>  	free_stable_node(stable_node);
>  }
>  
> @@ -630,6 +754,8 @@ static void remove_rmap_item_from_tree(struct rmap_item *rmap_item)
>  			ksm_pages_sharing--;
>  		else
>  			ksm_pages_shared--;
> +		VM_BUG_ON(stable_node->rmap_hlist_len <= 0);
> +		stable_node->rmap_hlist_len--;
>  
>  		put_anon_vma(rmap_item->anon_vma);
>  		rmap_item->address &= PAGE_MASK;
> @@ -738,6 +864,31 @@ static int remove_stable_node(struct stable_node *stable_node)
>  	return err;
>  }
>  
> +static int remove_stable_node_chain(struct stable_node *stable_node,
> +				    struct rb_root *root)
> +{
> +	struct stable_node *dup;
> +	struct hlist_node *hlist_safe;
> +
> +	if (!is_stable_node_chain(stable_node)) {
> +		VM_BUG_ON(is_stable_node_dup(stable_node));
> +		if (remove_stable_node(stable_node))
> +			return true;
> +		else
> +			return false;
> +	}
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe,
> +				  &stable_node->hlist, hlist_dup) {
> +		VM_BUG_ON(!is_stable_node_dup(dup));
> +		if (remove_stable_node(dup))
> +			return true;
> +	}
> +	BUG_ON(!hlist_empty(&stable_node->hlist));
> +	free_stable_node_chain(stable_node, root);
> +	return false;
> +}
> +
>  static int remove_all_stable_nodes(void)
>  {
>  	struct stable_node *stable_node;
> @@ -749,7 +900,8 @@ static int remove_all_stable_nodes(void)
>  		while (root_stable_tree[nid].rb_node) {
>  			stable_node = rb_entry(root_stable_tree[nid].rb_node,
>  						struct stable_node, node);
> -			if (remove_stable_node(stable_node)) {
> +			if (remove_stable_node_chain(stable_node,
> +						     root_stable_tree + nid)) {
>  				err = -EBUSY;
>  				break;	/* proceed to next nid */
>  			}
> @@ -1136,6 +1288,163 @@ static struct page *try_to_merge_two_pages(struct rmap_item *rmap_item,
>  	return err ? NULL : page;
>  }
>  
> +static __always_inline
> +bool __is_page_sharing_candidate(struct stable_node *stable_node, int offset)
> +{
> +	VM_BUG_ON(stable_node->rmap_hlist_len < 0);
> +	/*
> +	 * Check that at least one mapping still exists, otherwise
> +	 * there's not much point in merging and sharing with this
> +	 * stable_node, as the underlying tree_page of the other
> +	 * sharer is going to be freed soon.
> +	 */
> +	return stable_node->rmap_hlist_len &&
> +		stable_node->rmap_hlist_len + offset < ksm_max_page_sharing;
> +}
> +
> +static __always_inline
> +bool is_page_sharing_candidate(struct stable_node *stable_node)
> +{
> +	return __is_page_sharing_candidate(stable_node, 0);
> +}
> +
> +static struct stable_node *stable_node_dup(struct stable_node *stable_node,
> +					   struct page **tree_page,
> +					   struct rb_root *root,
> +					   bool prune_stale_stable_nodes)
> +{
> +	struct stable_node *dup, *found = NULL;
> +	struct hlist_node *hlist_safe;
> +	struct page *_tree_page;
> +	int nr = 0;
> +	int found_rmap_hlist_len;
> +
> +	if (!prune_stale_stable_nodes ||
> +	    time_before(jiffies, stable_node->chain_prune_time +
> +			msecs_to_jiffies(
> +				ksm_stable_node_chains_prune_millisecs)))
> +		prune_stale_stable_nodes = false;
> +	else
> +		stable_node->chain_prune_time = jiffies;
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe,
> +				  &stable_node->hlist, hlist_dup) {
> +		cond_resched();
> +		/*
> +		 * We must walk all stable_node_dup to prune the stale
> +		 * stable nodes during lookup.
> +		 *
> +		 * get_ksm_page can drop the nodes from the
> +		 * stable_node->hlist if they point to freed pages
> +		 * (that's why we do a _safe walk). The "dup"
> +		 * stable_node parameter itself will be freed from
> +		 * under us if it returns NULL.
> +		 */
> +		_tree_page = get_ksm_page(dup, false);
> +		if (!_tree_page)
> +			continue;
> +		nr += 1;
> +		if (is_page_sharing_candidate(dup)) {
> +			if (!found ||
> +			    dup->rmap_hlist_len > found_rmap_hlist_len) {
> +				if (found)
> +					put_page(*tree_page);
> +				found = dup;
> +				found_rmap_hlist_len = found->rmap_hlist_len;
> +				*tree_page = _tree_page;
> +
> +				if (!prune_stale_stable_nodes)
> +					break;
> +				/* skip put_page */
> +				continue;
> +			}
> +		}
> +		put_page(_tree_page);
> +	}
> +
> +	/*
> +	 * nr is relevant only if prune_stale_stable_nodes is true,
> +	 * otherwise we may break the loop at nr == 1 even if there
> +	 * are multiple entries.
> +	 */
> +	if (prune_stale_stable_nodes && found) {
> +		if (nr == 1) {
> +			/*
> +			 * If there's not just one entry it would
> +			 * corrupt memory, so better BUG_ON. In the
> +			 * KSM context with no locks held it's not
> +			 * even fatal.
> +			 */
> +			BUG_ON(stable_node->hlist.first->next);
> +
> +			/*
> +			 * There's just one entry and it is below the
> +			 * deduplication limit so drop the chain.
> +			 */
> +			rb_replace_node(&stable_node->node, &found->node,
> +					root);
> +			free_stable_node(stable_node);
> +			ksm_stable_node_chains--;
> +			ksm_stable_node_dups--;
> +		} else if (__is_page_sharing_candidate(found, 1)) {
> +			/*
> +			 * Refile our candidate at the head
> +			 * after the prune if our candidate
> +			 * can accept one more future sharing
> +			 * in addition to the one underway.
> +			 */
> +			hlist_del(&found->hlist_dup);
> +			hlist_add_head(&found->hlist_dup,
> +				       &stable_node->hlist);
> +		}
> +	}
> +
> +	return found;
> +}
> +
> +static struct stable_node *stable_node_dup_any(struct stable_node *stable_node,
> +					       struct rb_root *root)
> +{
> +	if (!is_stable_node_chain(stable_node))
> +		return stable_node;
> +	if (hlist_empty(&stable_node->hlist)) {
> +		free_stable_node_chain(stable_node, root);
> +		return NULL;
> +	}
> +	return hlist_entry(stable_node->hlist.first,
> +			   typeof(*stable_node), hlist_dup);
> +}
> +
> +static struct stable_node *__stable_node_chain(struct stable_node *stable_node,
> +					       struct page **tree_page,
> +					       struct rb_root *root,
> +					       bool prune_stale_stable_nodes)
> +{
> +	if (!is_stable_node_chain(stable_node)) {
> +		if (is_page_sharing_candidate(stable_node)) {
> +			*tree_page = get_ksm_page(stable_node, false);
> +			return stable_node;
> +		}
> +		return NULL;
> +	}
> +	return stable_node_dup(stable_node, tree_page, root,
> +			       prune_stale_stable_nodes);
> +}
> +
> +static __always_inline struct stable_node *chain_prune(struct stable_node *s_n,
> +						       struct page **t_p,
> +						       struct rb_root *root)
> +{
> +	return __stable_node_chain(s_n, t_p, root, true);
> +}
> +
> +static __always_inline struct stable_node *chain(struct stable_node *s_n,
> +						 struct page **t_p,
> +						 struct rb_root *root)
> +{
> +	return __stable_node_chain(s_n, t_p, root, false);
> +}
> +
>  /*
>   * stable_tree_search - search for page inside the stable tree
>   *
> @@ -1151,7 +1460,7 @@ static struct page *stable_tree_search(struct page *page)
>  	struct rb_root *root;
>  	struct rb_node **new;
>  	struct rb_node *parent;
> -	struct stable_node *stable_node;
> +	struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
>  	struct stable_node *page_node;
>  
>  	page_node = page_stable_node(page);
> @@ -1173,7 +1482,32 @@ again:
>  
>  		cond_resched();
>  		stable_node = rb_entry(*new, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node, false);
> +		stable_node_any = NULL;
> +		stable_node_dup = chain_prune(stable_node, &tree_page, root);
> +		if (!stable_node_dup) {
> +			/*
> +			 * Either all stable_node dups were full in
> +			 * this stable_node chain, or this chain was
> +			 * empty and should be rb_erased.
> +			 */
> +			stable_node_any = stable_node_dup_any(stable_node,
> +							      root);
> +			if (!stable_node_any) {
> +				/* rb_erase just ran */
> +				goto again;
> +			}
> +			/*
> +			 * Take any of the stable_node dups page of
> +			 * this stable_node chain to let the tree walk
> +			 * continue. All KSM pages belonging to the
> +			 * stable_node dups in a stable_node chain
> +			 * have the same content and they're
> +			 * wrprotected at all times. Any will work
> +			 * fine to continue the walk.
> +			 */
> +			tree_page = get_ksm_page(stable_node_any, false);
> +		}
> +		VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
>  		if (!tree_page) {
>  			/*
>  			 * If we walked over a stale stable_node,
> @@ -1196,6 +1530,34 @@ again:
>  		else if (ret > 0)
>  			new = &parent->rb_right;
>  		else {
> +			if (page_node) {
> +				VM_BUG_ON(page_node->head != &migrate_nodes);
> +				/*
> +				 * Test if the migrated page should be merged
> +				 * into a stable node dup. If the mapcount is
> +				 * 1 we can migrate it with another KSM page
> +				 * without adding it to the chain.
> +				 */
> +				if (page_mapcount(page) > 1)
> +					goto chain_append;
> +			}
> +
> +			if (!stable_node_dup) {
> +				/*
> +				 * If the stable_node is a chain and
> +				 * we got a payload match in memcmp
> +				 * but we cannot merge the scanned
> +				 * page in any of the existing
> +				 * stable_node dups because they're
> +				 * all full, we need to wait the
> +				 * scanned page to find itself a match
> +				 * in the unstable tree to create a
> +				 * brand new KSM page to add later to
> +				 * the dups of this stable_node.
> +				 */
> +				return NULL;
> +			}
> +
>  			/*
>  			 * Lock and unlock the stable_node's page (which
>  			 * might already have been migrated) so that page
> @@ -1203,23 +1565,21 @@ again:
>  			 * It would be more elegant to return stable_node
>  			 * than kpage, but that involves more changes.
>  			 */
> -			tree_page = get_ksm_page(stable_node, true);
> -			if (tree_page) {
> -				unlock_page(tree_page);
> -				if (get_kpfn_nid(stable_node->kpfn) !=
> -						NUMA(stable_node->nid)) {
> -					put_page(tree_page);
> -					goto replace;
> -				}
> -				return tree_page;
> -			}
> -			/*
> -			 * There is now a place for page_node, but the tree may
> -			 * have been rebalanced, so re-evaluate parent and new.
> -			 */
> -			if (page_node)
> +			tree_page = get_ksm_page(stable_node_dup, true);
> +			if (unlikely(!tree_page))
> +				/*
> +				 * The tree may have been rebalanced,
> +				 * so re-evaluate parent and new.
> +				 */
>  				goto again;
> -			return NULL;
> +			unlock_page(tree_page);
> +
> +			if (get_kpfn_nid(stable_node_dup->kpfn) !=
> +			    NUMA(stable_node_dup->nid)) {
> +				put_page(tree_page);
> +				goto replace;
> +			}
> +			return tree_page;
>  		}
>  	}
>  
> @@ -1230,22 +1590,72 @@ again:
>  	DO_NUMA(page_node->nid = nid);
>  	rb_link_node(&page_node->node, parent, new);
>  	rb_insert_color(&page_node->node, root);
> -	get_page(page);
> -	return page;
> +out:
> +	if (is_page_sharing_candidate(page_node)) {
> +		get_page(page);
> +		return page;
> +	} else
> +		return NULL;
>  
>  replace:
> -	if (page_node) {
> -		list_del(&page_node->list);
> -		DO_NUMA(page_node->nid = nid);
> -		rb_replace_node(&stable_node->node, &page_node->node, root);
> -		get_page(page);
> +	if (stable_node_dup == stable_node) {
> +		/* there is no chain */
> +		if (page_node) {
> +			VM_BUG_ON(page_node->head != &migrate_nodes);
> +			list_del(&page_node->list);
> +			DO_NUMA(page_node->nid = nid);
> +			rb_replace_node(&stable_node->node, &page_node->node,
> +					root);
> +			if (is_page_sharing_candidate(page_node))
> +				get_page(page);
> +			else
> +				page = NULL;
> +		} else {
> +			rb_erase(&stable_node->node, root);
> +			page = NULL;
> +		}
>  	} else {
> -		rb_erase(&stable_node->node, root);
> -		page = NULL;
> +		VM_BUG_ON(!is_stable_node_chain(stable_node));
> +		__stable_node_dup_del(stable_node_dup);
> +		if (page_node) {
> +			VM_BUG_ON(page_node->head != &migrate_nodes);
> +			list_del(&page_node->list);
> +			DO_NUMA(page_node->nid = nid);
> +			stable_node_chain_add_dup(page_node, stable_node);
> +			if (is_page_sharing_candidate(page_node))
> +				get_page(page);
> +			else
> +				page = NULL;
> +		} else {
> +			page = NULL;
> +		}
>  	}
> -	stable_node->head = &migrate_nodes;
> -	list_add(&stable_node->list, stable_node->head);
> +	stable_node_dup->head = &migrate_nodes;
> +	list_add(&stable_node_dup->list, stable_node_dup->head);
>  	return page;
> +
> +chain_append:
> +	/* stable_node_dup could be null if it reached the limit */
> +	if (!stable_node_dup)
> +		stable_node_dup = stable_node_any;
> +	if (stable_node_dup == stable_node) {
> +		/* chain is missing so create it */
> +		stable_node = alloc_stable_node_chain(stable_node_dup,
> +						      root);
> +		if (!stable_node)
> +			return NULL;
> +	}
> +	/*
> +	 * Add this stable_node dup that was
> +	 * migrated to the stable_node chain
> +	 * of the current nid for this page
> +	 * content.
> +	 */
> +	VM_BUG_ON(page_node->head != &migrate_nodes);
> +	list_del(&page_node->list);
> +	DO_NUMA(page_node->nid = nid);
> +	stable_node_chain_add_dup(page_node, stable_node);
> +	goto out;
>  }
>  
>  /*
> @@ -1262,7 +1672,8 @@ static struct stable_node *stable_tree_insert(struct page *kpage)
>  	struct rb_root *root;
>  	struct rb_node **new;
>  	struct rb_node *parent;
> -	struct stable_node *stable_node;
> +	struct stable_node *stable_node, *stable_node_dup, *stable_node_any;
> +	bool need_chain = false;
>  
>  	kpfn = page_to_pfn(kpage);
>  	nid = get_kpfn_nid(kpfn);
> @@ -1277,7 +1688,32 @@ again:
>  
>  		cond_resched();
>  		stable_node = rb_entry(*new, struct stable_node, node);
> -		tree_page = get_ksm_page(stable_node, false);
> +		stable_node_any = NULL;
> +		stable_node_dup = chain(stable_node, &tree_page, root);
> +		if (!stable_node_dup) {
> +			/*
> +			 * Either all stable_node dups were full in
> +			 * this stable_node chain, or this chain was
> +			 * empty and should be rb_erased.
> +			 */
> +			stable_node_any = stable_node_dup_any(stable_node,
> +							      root);
> +			if (!stable_node_any) {
> +				/* rb_erase just ran */
> +				goto again;
> +			}
> +			/*
> +			 * Take any of the stable_node dups page of
> +			 * this stable_node chain to let the tree walk
> +			 * continue. All KSM pages belonging to the
> +			 * stable_node dups in a stable_node chain
> +			 * have the same content and they're
> +			 * wrprotected at all times. Any will work
> +			 * fine to continue the walk.
> +			 */
> +			tree_page = get_ksm_page(stable_node_any, false);
> +		}
> +		VM_BUG_ON(!stable_node_dup ^ !!stable_node_any);
>  		if (!tree_page) {
>  			/*
>  			 * If we walked over a stale stable_node,
> @@ -1300,27 +1736,37 @@ again:
>  		else if (ret > 0)
>  			new = &parent->rb_right;
>  		else {
> -			/*
> -			 * It is not a bug that stable_tree_search() didn't
> -			 * find this node: because at that time our page was
> -			 * not yet write-protected, so may have changed since.
> -			 */
> -			return NULL;
> +			need_chain = true;
> +			break;
>  		}
>  	}
>  
> -	stable_node = alloc_stable_node();
> -	if (!stable_node)
> +	stable_node_dup = alloc_stable_node();
> +	if (!stable_node_dup)
>  		return NULL;
>  
> -	INIT_HLIST_HEAD(&stable_node->hlist);
> -	stable_node->kpfn = kpfn;
> -	set_page_stable_node(kpage, stable_node);
> -	DO_NUMA(stable_node->nid = nid);
> -	rb_link_node(&stable_node->node, parent, new);
> -	rb_insert_color(&stable_node->node, root);
> +	INIT_HLIST_HEAD(&stable_node_dup->hlist);
> +	stable_node_dup->kpfn = kpfn;
> +	set_page_stable_node(kpage, stable_node_dup);
> +	stable_node_dup->rmap_hlist_len = 0;
> +	DO_NUMA(stable_node_dup->nid = nid);
> +	if (!need_chain) {
> +		rb_link_node(&stable_node_dup->node, parent, new);
> +		rb_insert_color(&stable_node_dup->node, root);
> +	} else {
> +		if (!is_stable_node_chain(stable_node)) {
> +			struct stable_node *orig = stable_node;
> +			/* chain is missing so create it */
> +			stable_node = alloc_stable_node_chain(orig, root);
> +			if (!stable_node) {
> +				free_stable_node(stable_node_dup);
> +				return NULL;
> +			}
> +		}
> +		stable_node_chain_add_dup(stable_node_dup, stable_node);
> +	}
>  
> -	return stable_node;
> +	return stable_node_dup;
>  }
>  
>  /*
> @@ -1410,8 +1856,27 @@ struct rmap_item *unstable_tree_search_insert(struct rmap_item *rmap_item,
>   * the same ksm page.
>   */
>  static void stable_tree_append(struct rmap_item *rmap_item,
> -			       struct stable_node *stable_node)
> +			       struct stable_node *stable_node,
> +			       bool max_page_sharing_bypass)
>  {
> +	/*
> +	 * rmap won't find this mapping if we don't insert the
> +	 * rmap_item in the right stable_node
> +	 * duplicate. page_migration could break later if rmap breaks,
> +	 * so we can as well crash here. We really need to check for
> +	 * rmap_hlist_len == STABLE_NODE_CHAIN, but checking for any
> +	 * negative value also catches underflows: an underflow detected
> +	 * here for the first time (and not when decreasing rmap_hlist_len)
> +	 * would be a sign of memory corruption in the stable_node.
> +	 */
> +	BUG_ON(stable_node->rmap_hlist_len < 0);
> +
> +	stable_node->rmap_hlist_len++;
> +	if (!max_page_sharing_bypass)
> +		/* possibly non fatal but unexpected overflow, only warn */
> +		WARN_ON_ONCE(stable_node->rmap_hlist_len >
> +			     ksm_max_page_sharing);
> +
>  	rmap_item->head = stable_node;
>  	rmap_item->address |= STABLE_FLAG;
>  	hlist_add_head(&rmap_item->hlist, &stable_node->hlist);
> @@ -1439,19 +1904,26 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>  	struct page *kpage;
>  	unsigned int checksum;
>  	int err;
> +	bool max_page_sharing_bypass = false;
>  
>  	stable_node = page_stable_node(page);
>  	if (stable_node) {
>  		if (stable_node->head != &migrate_nodes &&
> -		    get_kpfn_nid(stable_node->kpfn) != NUMA(stable_node->nid)) {
> -			rb_erase(&stable_node->node,
> -				 root_stable_tree + NUMA(stable_node->nid));
> +		    get_kpfn_nid(READ_ONCE(stable_node->kpfn)) !=
> +		    NUMA(stable_node->nid)) {
> +			stable_node_dup_del(stable_node);
>  			stable_node->head = &migrate_nodes;
>  			list_add(&stable_node->list, stable_node->head);
>  		}
>  		if (stable_node->head != &migrate_nodes &&
>  		    rmap_item->head == stable_node)
>  			return;
> +		/*
> +		 * If it's a KSM fork, allow it to go over the sharing limit
> +		 * without warnings.
> +		 */
> +		if (!is_page_sharing_candidate(stable_node))
> +			max_page_sharing_bypass = true;
>  	}
>  
>  	/* We first start with searching the page inside the stable tree */
> @@ -1471,7 +1943,8 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>  			 * add its rmap_item to the stable tree.
>  			 */
>  			lock_page(kpage);
> -			stable_tree_append(rmap_item, page_stable_node(kpage));
> +			stable_tree_append(rmap_item, page_stable_node(kpage),
> +					   max_page_sharing_bypass);
>  			unlock_page(kpage);
>  		}
>  		put_page(kpage);
> @@ -1504,8 +1977,10 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
>  			lock_page(kpage);
>  			stable_node = stable_tree_insert(kpage);
>  			if (stable_node) {
> -				stable_tree_append(tree_rmap_item, stable_node);
> -				stable_tree_append(rmap_item, stable_node);
> +				stable_tree_append(tree_rmap_item, stable_node,
> +						   false);
> +				stable_tree_append(rmap_item, stable_node,
> +						   false);
>  			}
>  			unlock_page(kpage);
>  
> @@ -2009,6 +2484,48 @@ static void wait_while_offlining(void)
>  	}
>  }
>  
> +static bool stable_node_dup_remove_range(struct stable_node *stable_node,
> +					 unsigned long start_pfn,
> +					 unsigned long end_pfn)
> +{
> +	if (stable_node->kpfn >= start_pfn &&
> +	    stable_node->kpfn < end_pfn) {
> +		/*
> +		 * Don't get_ksm_page, page has already gone:
> +		 * which is why we keep kpfn instead of page*
> +		 */
> +		remove_node_from_stable_tree(stable_node);
> +		return true;
> +	}
> +	return false;
> +}
> +
> +static bool stable_node_chain_remove_range(struct stable_node *stable_node,
> +					   unsigned long start_pfn,
> +					   unsigned long end_pfn,
> +					   struct rb_root *root)
> +{
> +	struct stable_node *dup;
> +	struct hlist_node *hlist_safe;
> +
> +	if (!is_stable_node_chain(stable_node)) {
> +		VM_BUG_ON(is_stable_node_dup(stable_node));
> +		return stable_node_dup_remove_range(stable_node, start_pfn,
> +						    end_pfn);
> +	}
> +
> +	hlist_for_each_entry_safe(dup, hlist_safe,
> +				  &stable_node->hlist, hlist_dup) {
> +		VM_BUG_ON(!is_stable_node_dup(dup));
> +		stable_node_dup_remove_range(dup, start_pfn, end_pfn);
> +	}
> +	if (hlist_empty(&stable_node->hlist)) {
> +		free_stable_node_chain(stable_node, root);
> +		return true; /* notify caller that tree was rebalanced */
> +	} else
> +		return false;
> +}
> +
>  static void ksm_check_stable_tree(unsigned long start_pfn,
>  				  unsigned long end_pfn)
>  {
> @@ -2021,15 +2538,12 @@ static void ksm_check_stable_tree(unsigned long start_pfn,
>  		node = rb_first(root_stable_tree + nid);
>  		while (node) {
>  			stable_node = rb_entry(node, struct stable_node, node);
> -			if (stable_node->kpfn >= start_pfn &&
> -			    stable_node->kpfn < end_pfn) {
> -				/*
> -				 * Don't get_ksm_page, page has already gone:
> -				 * which is why we keep kpfn instead of page*
> -				 */
> -				remove_node_from_stable_tree(stable_node);
> +			if (stable_node_chain_remove_range(stable_node,
> +							   start_pfn, end_pfn,
> +							   root_stable_tree +
> +							   nid))
>  				node = rb_first(root_stable_tree + nid);
> -			} else
> +			else
>  				node = rb_next(node);
>  			cond_resched();
>  		}
> @@ -2254,6 +2768,47 @@ static ssize_t merge_across_nodes_store(struct kobject *kobj,
>  KSM_ATTR(merge_across_nodes);
>  #endif
>  
> +static ssize_t max_page_sharing_show(struct kobject *kobj,
> +				     struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_max_page_sharing);
> +}
> +
> +static ssize_t max_page_sharing_store(struct kobject *kobj,
> +				      struct kobj_attribute *attr,
> +				      const char *buf, size_t count)
> +{
> +	int err;
> +	int knob;
> +
> +	err = kstrtoint(buf, 10, &knob);
> +	if (err)
> +		return err;
> +	/*
> +	 * When a KSM page is created it is shared by 2 mappings. This
> +	 * being a signed comparison, it implicitly verifies it's not
> +	 * negative.
> +	 */
> +	if (knob < 2)
> +		return -EINVAL;
> +
> +	if (READ_ONCE(ksm_max_page_sharing) == knob)
> +		return count;
> +
> +	mutex_lock(&ksm_thread_mutex);
> +	wait_while_offlining();
> +	if (ksm_max_page_sharing != knob) {
> +		if (ksm_pages_shared || remove_all_stable_nodes())
> +			err = -EBUSY;
> +		else
> +			ksm_max_page_sharing = knob;
> +	}
> +	mutex_unlock(&ksm_thread_mutex);
> +
> +	return err ? err : count;
> +}
> +KSM_ATTR(max_page_sharing);
> +
>  static ssize_t pages_shared_show(struct kobject *kobj,
>  				 struct kobj_attribute *attr, char *buf)
>  {
> @@ -2292,6 +2847,46 @@ static ssize_t pages_volatile_show(struct kobject *kobj,
>  }
>  KSM_ATTR_RO(pages_volatile);
>  
> +static ssize_t stable_node_dups_show(struct kobject *kobj,
> +				     struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lu\n", ksm_stable_node_dups);
> +}
> +KSM_ATTR_RO(stable_node_dups);
> +
> +static ssize_t stable_node_chains_show(struct kobject *kobj,
> +				       struct kobj_attribute *attr, char *buf)
> +{
> +	return sprintf(buf, "%lu\n", ksm_stable_node_chains);
> +}
> +KSM_ATTR_RO(stable_node_chains);
> +
> +static ssize_t
> +stable_node_chains_prune_millisecs_show(struct kobject *kobj,
> +					struct kobj_attribute *attr,
> +					char *buf)
> +{
> +	return sprintf(buf, "%u\n", ksm_stable_node_chains_prune_millisecs);
> +}
> +
> +static ssize_t
> +stable_node_chains_prune_millisecs_store(struct kobject *kobj,
> +					 struct kobj_attribute *attr,
> +					 const char *buf, size_t count)
> +{
> +	unsigned long msecs;
> +	int err;
> +
> +	err = kstrtoul(buf, 10, &msecs);
> +	if (err || msecs > UINT_MAX)
> +		return -EINVAL;
> +
> +	ksm_stable_node_chains_prune_millisecs = msecs;
> +
> +	return count;
> +}
> +KSM_ATTR(stable_node_chains_prune_millisecs);
> +
>  static ssize_t full_scans_show(struct kobject *kobj,
>  			       struct kobj_attribute *attr, char *buf)
>  {
> @@ -2311,6 +2906,10 @@ static struct attribute *ksm_attrs[] = {
>  #ifdef CONFIG_NUMA
>  	&merge_across_nodes_attr.attr,
>  #endif
> +	&max_page_sharing_attr.attr,
> +	&stable_node_chains_attr.attr,
> +	&stable_node_dups_attr.attr,
> +	&stable_node_chains_prune_millisecs_attr.attr,
>  	NULL,
>  };
>  

-- 
Petr Holasek
pholasek@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-12-09 16:19   ` Petr Holasek
@ 2015-12-09 17:15     ` Andrea Arcangeli
  2015-12-09 18:10       ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2015-12-09 17:15 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Andrew Morton, Arjan van de Ven

Hello Petr,

On Wed, Dec 09, 2015 at 05:19:59PM +0100, Petr Holasek wrote:
> Hi Andrea,
> 
> I've been running stress tests against this patchset for a couple of hours
> and everything was ok. However, I've allocated ~1TB of memory and got
> following lockup during disabling KSM with 'echo 2 > /sys/kernel/mm/ksm/run':
> 
> [13201.060601] INFO: task ksmd:351 blocked for more than 120 seconds.
> [13201.066812]       Not tainted 4.4.0-rc4+ #5
> [13201.070996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [13201.078830] ksmd            D ffff883f65eb7dc8     0   351      2
> 0x00000000
> [13201.085903]  ffff883f65eb7dc8 ffff887f66e26400 ffff883f65d5e400
> ffff883f65eb8000
> [13201.093343]  ffffffff81a65144 ffff883f65d5e400 00000000ffffffff
> ffffffff81a65148
> [13201.100792]  ffff883f65eb7de0 ffffffff816907e5 ffffffff81a65140
> ffff883f65eb7df0
> [13201.108242] Call Trace:
> [13201.110708]  [<ffffffff816907e5>] schedule+0x35/0x80
> [13201.115676]  [<ffffffff81690ace>] schedule_preempt_disabled+0xe/0x10
> [13201.122044]  [<ffffffff81692524>] __mutex_lock_slowpath+0xb4/0x130
> [13201.128237]  [<ffffffff816925bf>] mutex_lock+0x1f/0x2f
> [13201.133395]  [<ffffffff811debd2>] ksm_scan_thread+0x62/0x1f0
> [13201.139068]  [<ffffffff810c8ac0>] ? wait_woken+0x80/0x80
> [13201.144391]  [<ffffffff811deb70>] ? ksm_do_scan+0x1140/0x1140
> [13201.150164]  [<ffffffff810a4378>] kthread+0xd8/0xf0
> [13201.155056]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
> [13201.160551]  [<ffffffff8169460f>] ret_from_fork+0x3f/0x70
> [13201.165961]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
> 
> It seems this is not connected with the new code, but it would be nice to
> also make unmerge_and_remove_all_rmap_items() more scheduler friendly.

Agreed. I ran echo 2 many times here with big stable_node chains but
this one never happened here; it likely shows up more easily on the 1TiB.

It was most certainly the teardown of an enormous stable_node chain.
While at it I also added one more cond_resched() in the echo 2 slow
path to make the vma list walk more schedule friendly (even though it
would never end in a softlockup in practice, but max_map_count can be
increased via sysctl so it's safer and worth it considering how slow
that path is).
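
Something like the following in the vma list walk of
unmerge_and_remove_all_rmap_items() (just a sketch of the idea, the
surrounding loop is paraphrased from memory):

==
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (ksm_test_exit(mm))
			break;
		if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
			continue;
		err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end);
		if (err)
			goto error;
		cond_resched();	/* extra reschedule point per vma */
	}
==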

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-12-09 17:15     ` Andrea Arcangeli
@ 2015-12-09 18:10       ` Andrea Arcangeli
  2015-12-10 16:06         ` Petr Holasek
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2015-12-09 18:10 UTC (permalink / raw)
  To: Petr Holasek
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Andrew Morton, Arjan van de Ven

On Wed, Dec 09, 2015 at 06:15:08PM +0100, Andrea Arcangeli wrote:
> Hello Petr,
> 
> On Wed, Dec 09, 2015 at 05:19:59PM +0100, Petr Holasek wrote:
> > Hi Andrea,
> > 
> > I've been running stress tests against this patchset for a couple of hours
> > and everything was ok. However, I've allocated ~1TB of memory and got
> > following lockup during disabling KSM with 'echo 2 > /sys/kernel/mm/ksm/run':
> > 
> > [13201.060601] INFO: task ksmd:351 blocked for more than 120 seconds.
> > [13201.066812]       Not tainted 4.4.0-rc4+ #5
> > [13201.070996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > this message.
> > [13201.078830] ksmd            D ffff883f65eb7dc8     0   351      2
> > 0x00000000
> > [13201.085903]  ffff883f65eb7dc8 ffff887f66e26400 ffff883f65d5e400
> > ffff883f65eb8000
> > [13201.093343]  ffffffff81a65144 ffff883f65d5e400 00000000ffffffff
> > ffffffff81a65148
> > [13201.100792]  ffff883f65eb7de0 ffffffff816907e5 ffffffff81a65140
> > ffff883f65eb7df0
> > [13201.108242] Call Trace:
> > [13201.110708]  [<ffffffff816907e5>] schedule+0x35/0x80
> > [13201.115676]  [<ffffffff81690ace>] schedule_preempt_disabled+0xe/0x10
> > [13201.122044]  [<ffffffff81692524>] __mutex_lock_slowpath+0xb4/0x130
> > [13201.128237]  [<ffffffff816925bf>] mutex_lock+0x1f/0x2f
> > [13201.133395]  [<ffffffff811debd2>] ksm_scan_thread+0x62/0x1f0
> > [13201.139068]  [<ffffffff810c8ac0>] ? wait_woken+0x80/0x80
> > [13201.144391]  [<ffffffff811deb70>] ? ksm_do_scan+0x1140/0x1140
> > [13201.150164]  [<ffffffff810a4378>] kthread+0xd8/0xf0
> > [13201.155056]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
> > [13201.160551]  [<ffffffff8169460f>] ret_from_fork+0x3f/0x70
> > [13201.165961]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
> > 
> > It seems this is not connected with the new code, but it would be nice to
> > also make unmerge_and_remove_all_rmap_items() more scheduler friendly.
> 
> Agreed. I ran echo 2 many times here with big stable_node chains but
> this one never happened here; it likely shows up more easily on the 1TiB.

I thought the above was a problem with "scheduler friendliness", i.e.
missing cond_resched() calls, but the above is not a softlockup.

The above can't be solved by improving scheduler friendliness: we
didn't prevent scheduling for 120sec, just the mutex_lock was stuck
and a task sat in D state for too long, which in the KSM case for
servers would be just a false positive. KSM would immediately stop
after it takes the mutex anyway, so the above only informs us that we
didn't run try_to_freeze() fast enough. The only trouble there could
be with suspend for non-server usage.

To hide the above (and reach try_to_freeze() quickly) we could just do
a trylock in ksm_scan_thread and mutex_lock_interruptible() in the other
places, but that still leaves the uninterruptible wait_on_bit to
solve.

Improving scheduler friendliness would have been more important than
avoiding the above. remove_node_from_stable_tree would also do a
cond_resched() if the rmap_item list is not empty, so it was unlikely
it could generate a softlockup for 120sec even with an enormous
chain. However, just like the &migrate_nodes list walk and the
remove_stable_node_chain caller both do a cond_resched() after
remove_stable_node(), it sounds better to do it inside
remove_stable_node_chain too, in case we run into a chain and we need
to free the dups. The previous patch alone won't help with the above.
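
The remove_stable_node_chain() change would be something along these
lines (just a sketch, untested):

==
	hlist_for_each_entry_safe(dup, hlist_safe,
				  &stable_node->hlist, hlist_dup) {
		VM_BUG_ON(!is_stable_node_dup(dup));
		if (remove_stable_node(dup))
			return true;
		cond_resched();	/* a chain can carry a huge number of dups */
	}
==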


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-12-09 18:10       ` Andrea Arcangeli
@ 2015-12-10 16:06         ` Petr Holasek
  0 siblings, 0 replies; 29+ messages in thread
From: Petr Holasek @ 2015-12-10 16:06 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Andrew Morton, Arjan van de Ven

On Wed, 09 Dec 2015, Andrea Arcangeli <aarcange@redhat.com> wrote:
> On Wed, Dec 09, 2015 at 06:15:08PM +0100, Andrea Arcangeli wrote:
> > Hello Petr,
> > 
> > On Wed, Dec 09, 2015 at 05:19:59PM +0100, Petr Holasek wrote:
> > > Hi Andrea,
> > > 
> > > I've been running stress tests against this patchset for a couple of hours
> > > and everything was ok. However, I've allocated ~1TB of memory and got
> > > following lockup during disabling KSM with 'echo 2 > /sys/kernel/mm/ksm/run':
> > > 
> > > [13201.060601] INFO: task ksmd:351 blocked for more than 120 seconds.
> > > [13201.066812]       Not tainted 4.4.0-rc4+ #5
> > > [13201.070996] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > this message.
> > > [13201.078830] ksmd            D ffff883f65eb7dc8     0   351      2
> > > 0x00000000
> > > [13201.085903]  ffff883f65eb7dc8 ffff887f66e26400 ffff883f65d5e400
> > > ffff883f65eb8000
> > > [13201.093343]  ffffffff81a65144 ffff883f65d5e400 00000000ffffffff
> > > ffffffff81a65148
> > > [13201.100792]  ffff883f65eb7de0 ffffffff816907e5 ffffffff81a65140
> > > ffff883f65eb7df0
> > > [13201.108242] Call Trace:
> > > [13201.110708]  [<ffffffff816907e5>] schedule+0x35/0x80
> > > [13201.115676]  [<ffffffff81690ace>] schedule_preempt_disabled+0xe/0x10
> > > [13201.122044]  [<ffffffff81692524>] __mutex_lock_slowpath+0xb4/0x130
> > > [13201.128237]  [<ffffffff816925bf>] mutex_lock+0x1f/0x2f
> > > [13201.133395]  [<ffffffff811debd2>] ksm_scan_thread+0x62/0x1f0
> > > [13201.139068]  [<ffffffff810c8ac0>] ? wait_woken+0x80/0x80
> > > [13201.144391]  [<ffffffff811deb70>] ? ksm_do_scan+0x1140/0x1140
> > > [13201.150164]  [<ffffffff810a4378>] kthread+0xd8/0xf0
> > > [13201.155056]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
> > > [13201.160551]  [<ffffffff8169460f>] ret_from_fork+0x3f/0x70
> > > [13201.165961]  [<ffffffff810a42a0>] ? kthread_park+0x60/0x60
> > > 
> > > It seems this is not connected with the new code, but it would be nice to
> > > also make unmerge_and_remove_all_rmap_items() more scheduler friendly.
> > 
> > Agreed. I ran echo 2 many times here with big stable_node chains but
> > this one never happened here; it likely shows up more easily on the 1TiB.
> 
> I thought the above was a problem with "scheduler friendliness" in
> turn missing cond_resched() but the above is not a softlockup.
> 
> The above can't be solved by improving scheduler friendliness, we
> didn't prevent the schedule for 120sec, just the mutex_lock was stuck
> and a stuck was in D state for too long, which in the KSM case for
> servers would be just a false positive. KSM would immediately stop
> after it takes the mutex anyway so the above only informs that we
> didn't run try_to_freeze() fast enough. The only trouble there could
> be with suspend for non-server usage.
> 
> To hide the above (and reach try_to_freeze() quick) we could just do
> trylock in ksm_scan_thread and mutex_lock_interruptible() in the other
> places, but that still leaves the uninterruptible wait_on_bit to
> solve.
> 
> Improving scheduler friendliness would have been more important than
> avoiding the above. remove_node_from_stable_tree would also do a
> cond_resched() if the rmap_item list is not empty so it was unlikely
> it could generate a softlockup for 120sec even with an enormous
> chain. However just like the &migrate_nodes list walk and like the
> remove_stable_node_chain caller both do a cond_resched() after
> remove_stable_node(), it sounds better to do it inside
> remove_stable_node_chain too in case we run into a chain and we need
> to free the dups. Just the previous patch won't help with the above.

Yep, I've applied the patch attached above and still got the hang while
cleaning memory above ~800GB.

Anyway, this issue is not connected with this RFC and no other problems
were observed -

Tested-by: Petr Holasek <pholasek@redhat.com>

-- 
Petr Holasek
pholasek@redhat.com


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-11-10 18:44 ` [PATCH 1/1] " Andrea Arcangeli
  2015-12-09 16:19   ` Petr Holasek
@ 2015-12-11  0:31   ` Andrew Morton
  2016-01-14 23:36   ` Hugh Dickins
  2016-04-06 20:33   ` Rik van Riel
  3 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2015-12-11  0:31 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek, Arjan van de Ven

On Tue, 10 Nov 2015 19:44:41 +0100 Andrea Arcangeli <aarcange@redhat.com> wrote:

> The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
> elsewhere in any stable_node->head/node to avoid a clashes with the
> stable_node->node.rb_parent_color pointer, and different from
> &migrate_nodes. So the second field of &migrate_nodes is picked and
> verified as always safe with a BUILD_BUG_ON in case the list_head
> implementation changes in the future.
> 
> ...
>
> +	/*
> +	 * We need the second aligned pointer of the migrate_nodes
> +	 * list_head to stay clear from the rb_parent_color union
> +	 * (aligned and different than any node) and also different
> +	 * from &migrate_nodes. This will verify that future list.h changes
> +	 * don't break STABLE_NODE_DUP_HEAD.
> +	 */
> +	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD <= &migrate_nodes);
> +	BUILD_BUG_ON(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1);

mm/ksm.c: In function 'remove_node_from_stable_tree':
mm/ksm.c:618: error: call to '__compiletime_assert_618' declared with attribute error: BUILD_BUG_ON failed: STABLE_NODE_DUP_HEAD <= &migrate_nodes
mm/ksm.c:619: error: call to '__compiletime_assert_619' declared with attribute error: BUILD_BUG_ON failed: STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1

That's with gcc-4.4.4.

I don't have time at present to investigate so I'll switch them to
WARN_ON_ONCE for now.
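
i.e. roughly this, as a runtime check until the BUILD_BUG_ONs can be
made to work with older gcc:

==
	WARN_ON_ONCE(STABLE_NODE_DUP_HEAD <= &migrate_nodes);
	WARN_ON_ONCE(STABLE_NODE_DUP_HEAD >= &migrate_nodes + 1);
==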



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-11-10 18:44 ` [PATCH 1/1] " Andrea Arcangeli
  2015-12-09 16:19   ` Petr Holasek
  2015-12-11  0:31   ` Andrew Morton
@ 2016-01-14 23:36   ` Hugh Dickins
  2016-01-16 17:49     ` Andrea Arcangeli
  2016-01-18 11:01     ` Mel Gorman
  2016-04-06 20:33   ` Rik van Riel
  3 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2016-01-14 23:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven, Mel Gorman

Hi Andrea,

I'm sorry, I've only now got around to looking at this, forced by
merge window to contemplate whether it should go forward or not.

While there's lots of careful and ingenious work gone into it, and
I've seen no actual problem in the source, nor when running with it,
I cannot really get to grips with the review, because I cannot shake
off the feeling that it's all a wrong direction, at least for now.

On Tue, 10 Nov 2015, Andrea Arcangeli wrote:

> Without a max deduplication limit for each KSM page, the list of the
> rmap_items associated to each stable_node can grow infinitely
> large.

Okay, yes, I can see that unlimited growth of the rmap list is a problem
which we need to address; and I can see that successful deduplication
tends towards this state much more than anon rmap or even file rmap.

But it's not entirely a problem with deduplication, and I'd rather see
it dealt with in a more general way, not by a max deduplication limit
in KSM.

I don't think you should throw away this work; but I do think there
are several other improvements, outside of ksm.c, which we should
make first; and then come back to this if those are not enough.

> 
> During the rmap walk each entry can take up to ~10usec to process
> because of IPIs for the TLB flushing (both for the primary MMU and the
> secondary MMUs with the MMU notifier). With only 16GB of address space
> shared in the same KSM page, that would amount to dozens of seconds of
> kernel runtime.

Right, IPIs and MMU notifiers: more below.

> 
> A ~256 max deduplication factor will reduce the latencies of the rmap
> walks on KSM pages to order of a few msec. Just doing the
> cond_resched() during the rmap walks is not enough, the list size must
> have a limit too, otherwise the caller could get blocked in (schedule
> friendly) kernel computations for seconds, unexpectedly.

Schedule-friendly: you make an important point there: with the rmap
mutexes and your recent cond_resched()s, I don't think we need to worry
about a task doing explicit work for itself (memory hotremove, for example),
that will take as long as it has to; but, as usual, we do need to worry
about compaction serving THP, and the high mapcounts it might encounter.

> 
> There's room for optimization to significantly reduce the IPI delivery
> cost during the page_referenced(),

You're missing that Shaohua removed x86's TLB flush from page_referenced()
a couple of years ago: b13b1d2d8692 "x86/mm: In the PTE swapout page
reclaim case clear the accessed bit instead of flushing the TLB".
Yes, it is confusing that "flush" remains in the names now there
isn't a flush.  I assume x86 is your main concern?
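
(After that commit, x86's ptep_clear_flush_young() boils down to roughly
the following, comment paraphrased:

==
int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	/*
	 * Clearing the accessed bit without flushing the TLB is
	 * tolerable for reclaim: at worst a later access through a
	 * stale TLB entry goes unnoticed and the page is evicted a
	 * little too early.
	 */
	return ptep_test_and_clear_young(vma, address, ptep);
}
==
)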

> but at least for page_migration in
> the KSM case (used by hard NUMA bindings, compaction and NUMA
> balancing) it may be inevitable to send lots of IPIs if each
> rmap_item->mm is active on a different CPU and there are lots of
> CPUs. Even if we ignore the IPI delivery cost, we've still to walk the
> whole KSM rmap list, so we can't allow millions or billions (ulimited)
                                                              (unlimited)
> number of entries in the KSM stable_node rmap_item lists.

You've got me fired up.  Mel's recent 72b252aed506 "mm: send one IPI per
CPU to TLB flush all entries after unmapping pages" and nearby commits
are very much to the point here; but because his first draft was unsafe
in the page migration area, he dropped that, and ended up submitting
for page reclaim alone.

That's the first low-hanging fruit: we should apply Mel's batching
to page migration; and it's become clear enough in my mind, that I'm
now impatient to try it myself (but maybe Mel will respond if he has
a patch for it already).  If I can't post a patch for that early next
week, someone else take over (or preempt me); yes, there's all kinds
of other things I should be doing instead, but this is too tempting.

That can help, not just KSM's long chains, but most other migration of
mapped pages too.  (The KSM case is particularly easy, because those
pages are known to be mapped write-protected, and its long chains can
benefit just from batching on the single page; but in general, I
believe we want to batch across pages there too, when we can.)

But MMU notifiers: there I'm out of my element, and you are right at
home.  I don't know how much of an issue the MMU notification per pte
clear is for you, but I assume it's nasty.  Is there any way those
can be batched too, across mms, at least for the single page?

Let me be a little rude: I think that is the problem you should be
working on, which also would benefit way beyond KSM.  (But I may be
naive to suppose that there's even a prospect of batching there.)

> 
> The limit is enforced efficiently by adding a second dimension to the
> stable rbtree. So there are three types of stable_nodes: the regular
> ones (identical as before, living in the first flat dimension of the
> stable rbtree), the "chains" and the "dups".

Okay, now you get into the implementation details (it's a very helpful
outline, thank you); but I don't agree that KSM should enforce such a
limit, I think it should continue to do as it does today.

Using dup pages does provide a "clean" way to break up the rmap
operations, and would be great if those operations were O(N^2).
But they're O(N), so I think the dup pages actually add a little
overhead, rather than solving anything.  Instead of having 1000
maps served by 1 page of mapcount 1000, they're served by 4 pages
of mapcount 250.  That lowers the max amount of work in unmapping
one page, but what about when all four of those pages have to be
unmapped in the same operation?  That will be a less common case,
sure, but it still has to be acknowledged in worst latency.  And
when considering page reclaim: we're starting out from a deficit
of 3 pages when compared with before.

Yes, there should be a limit (though I suspect not one hard limit,
but something that is adjusted according to pressure), and maybe
a tunable for it.  But this limit or limits would be imposed by the
rmap_walk callers: as a first approximation, just don't bother to call
rmap_walk if page_mapcount is higher than the limit; assume the page is
referenced and not to be unmapped - the higher its mapcount, the more
likely that is to be true, and the walks are just a waste of time.
(But we'd prefer that a page not be effectively mlocked by many
mappings.)
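
As a very crude sketch (hypothetical, the limit name is made up):

==
	/* in shrink_page_list(), before attempting the rmap walk */
	if (page_mapcount(page) > rmap_walk_mapcount_limit)
		/* assume referenced: keep widely mapped pages */
		goto keep_locked;
==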

I'm thinking there of the page reclaim case.  Page migration callers
will have different needs.  Memory hotremove will want to try as hard
as it must.  Compaction (but I'm unfamiliar with its different modes)
might want to avoid pages with high mapcount, until there's indication
that they must be attacked, preferably in the background.  mbind():
well, to consider just the KSM case, isn't it silly to be migrating
KSM pages - shouldn't it COW them instead?

> 
> Every "chain" and all "dups" linked into a "chain" enforce the
> invariant that they represent the same write protected memory content,
> even if each "dup" will be pointed by a different KSM page copy of
> that content. This way the stable rbtree lookup computational
> complexity is unaffected if compared to an unlimited
> max_sharing_limit. It is still enforced that there cannot be KSM page
> content duplicates in the stable rbtree itself.
> 
> Adding the second dimension to the stable rbtree only after the
> max_page_sharing limit hits, provides for a zero memory footprint
> increase on 64bit archs. The memory overhead of the per-KSM page
> stable_tree and per virtual mapping rmap_item is unchanged. Only after
> the max_page_sharing limit hits, we need to allocate a stable_tree
> "chain" and rb_replace() the "regular" stable_node with the newly
> allocated stable_node "chain". After that we simply add the "regular"
> stable_node to the chain as a stable_node "dup" by linking hlist_dup
> in the stable_node_chain->hlist. This way the "regular" (flat)
> stable_node is converted to a stable_node "dup" living in the second
> dimension of the stable rbtree.
> 
> During stable rbtree lookups the stable_node "chain" is identified as
> stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
> is_stable_node_chain()).
> 
> When dropping stable_nodes, the stable_node "dup" is identified as
> stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).
> 
> The STABLE_NODE_DUP_HEAD must be an unique valid pointer never used
> elsewhere in any stable_node->head/node to avoid a clashes with the
> stable_node->node.rb_parent_color pointer, and different from
> &migrate_nodes. So the second field of &migrate_nodes is picked and
> verified as always safe with a BUILD_BUG_ON in case the list_head
> implementation changes in the future.
> 
> The STABLE_NODE_DUP is picked as a random negative value in
> stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative
> when it's a "regular" stable_node or a stable_node "dup".
> 
> The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
> is aliased in a union with a time field used to rate limit the
> stable_node_chain->hlist prunes.
> 
> The garbage collection of the stable_node_chain happens lazily during
> stable rbtree lookups (as for all other kind of stable_nodes), or
> while disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while
> collecting the entire stable rbtree.

I'd like your design much better if there didn't have to be this
garbage collection, and the related possibility of fragmentation.
But I don't see how to avoid it in a seamless way.

> 
> While the "regular" stable_nodes and the stable_node "dups" must wait
> for their underlying tree_page to be freed before they can be freed
> themselves, the stable_node "chains" can be freed immediately if the
> stable_node->hlist turns empty. This is because the "chains" are never
> pointed by any page->mapping and they're effectively stable rbtree KSM
> self contained metadata.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/ksm.txt |  63 ++++
>  mm/ksm.c                 | 731 ++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 728 insertions(+), 66 deletions(-)

Nicely self-contained, but I think we should go beyond KSM.
And all those +s did put me off a bit :(

Hugh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-14 23:36   ` Hugh Dickins
@ 2016-01-16 17:49     ` Andrea Arcangeli
  2016-01-16 18:00       ` Arjan van de Ven
  2016-01-18  9:10       ` Hugh Dickins
  2016-01-18 11:01     ` Mel Gorman
  1 sibling, 2 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2016-01-16 17:49 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Davidlohr Bueso, linux-mm, Petr Holasek, Andrew Morton,
	Arjan van de Ven, Mel Gorman

Hello Hugh,

Thanks a lot for reviewing this.

On Thu, Jan 14, 2016 at 03:36:56PM -0800, Hugh Dickins wrote:
> Okay, yes, I can see that unlimited growth of the rmap list is a problem
> which we need to address; and I can see that successful deduplication
> tends towards this state much more than anon rmap or even file rmap.

The main difference from anon and file rmap is that all other
rmap_walks require very bad software running in userland to end up
in anything remotely close to what can happen with KSM. So the blame
definitely goes to userland if the rmap walk gets so slow. Secondly, in
all cases except KSM the memory footprint of creating such long rmap
walks is an order of magnitude higher. So the badness of such a
userland app would spread not just to the CPU usage but also to the
RAM utilization: the app would waste massive amounts of kernel memory
to manage little userland memory.

With KSM instead millions of entries to walk and millions of
pagetables and shadow pagetables to update during a single rmap walk
can happen with any optimal program running. The program itself won't
even know what's happening inside of KSM and has no way to limit the
sharing either.

> Schedule-friendly: you make an important point there: with the rmap
> mutexes and your recent cond_resched()s, I don't think we need to worry
> about a task doing explicit work for itself (memory hotremove, for example),
> that will take as long as it has to; but, as usual, we do need to worry
> about compaction serving THP, and the high mapcounts it might encounter.

It's not just those VM internal cases: migrate_pages of the below
program takes >60 seconds without my patch, while with my patch the max
latency is a few milliseconds. So yes, now I made the rmap_walks
schedule friendly, but it can't be ok to be stuck in R state for 60
seconds in order to move a single page. Making it interruptible
wouldn't help a lot either, as then it would just fail.

==
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <sys/mman.h>
#include <numaif.h>

#define MAXNODES 17

int main(void)
{
	unsigned long page_size;
	char *p;
	page_size = sysconf(_SC_PAGE_SIZE);
	unsigned long old_node = (1<<MAXNODES)-1, node = 1;

	/* one MADV_MERGEABLE page filled with a common pattern, so ksmd
	 * merges it into a (potentially massively shared) KSM page */
	if (posix_memalign((void **)&p, page_size, page_size))
		perror("malloc"), exit(1);
	if (madvise(p, page_size, MADV_MERGEABLE))
		perror("madvise"), exit(1);
	memset(p, 0xff, page_size);

	/* bounce that single page between two NUMA nodes forever
	 * (nodemask 0x1 <-> 0x2) */
	for (;;) {
		sleep(1);
		migrate_pages(getpid(), MAXNODES, &old_node, &node);
		if (node == 1)
			node = 2;
		else
			node = 1;
	}
	return 0;
}
==
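
(For completeness: the migrate_pages() wrapper above comes from libnuma,
so this builds with something like "gcc test.c -o test -lnuma", and the
per-syscall latencies quoted further down were captured with strace -tt.)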

> 
> > 
> > There's room for optimization to significantly reduce the IPI delivery
> > cost during the page_referenced(),
> 
> You're missing that Shaohua removed x86's TLB flush from page_referenced()
> a couple of years ago: b13b1d2d8692 "x86/mm: In the PTE swapout page
> reclaim case clear the accessed bit instead of flushing the TLB".
> Yes, it is confusing that "flush" remains in the names now there
> isn't a flush.  I assume x86 is your main concern?

All archs are my concern, as KSM is also supported on CyanogenMod ARM
for example, but the IPIs still remain in the MMU notifier. And we
don't always have a young bit there, so the spte must be invalidated
and flushed. Perhaps we could teach it to create
invalidated-but-not-yet-flushed sptes.

Still, the IPIs aren't the main problem. Let's assume for the rest of
the email that there are no IPIs during page_referenced() on any arch,
no matter if there's a secondary MMU attached.

> > but at least for page_migration in
> > the KSM case (used by hard NUMA bindings, compaction and NUMA
> > balancing) it may be inevitable to send lots of IPIs if each
> > rmap_item->mm is active on a different CPU and there are lots of
> > CPUs. Even if we ignore the IPI delivery cost, we've still to walk the
> > whole KSM rmap list, so we can't allow millions or billions (ulimited)
>                                                               (unlimited)
> > number of entries in the KSM stable_node rmap_item lists.
> 
> You've got me fired up.  Mel's recent 72b252aed506 "mm: send one IPI per
> CPU to TLB flush all entries after unmapping pages" and nearby commits
> are very much to the point here; but because his first draft was unsafe
> in the page migration area, he dropped that, and ended up submitting
> for page reclaim alone.
> 
> That's the first low-hanging fruit: we should apply Mel's batching
> to page migration; and it's become clear enough in my mind, that I'm
> now impatient to try it myself (but maybe Mel will respond if he has
> a patch for it already).  If I can't post a patch for that early next
> week, someone else take over (or preempt me); yes, there's all kinds
> of other things I should be doing instead, but this is too tempting.
> 
> That can help, not just KSM's long chains, but most other migration of
> mapped pages too.  (The KSM case is particularly easy, because those
> pages are known to be mapped write-protected, and its long chains can
> benefit just from batching on the single page; but in general, I
> believe we want to batch across pages there too, when we can.)
> 
> But MMU notifiers: there I'm out of my element, and you are right at
> home.  I don't know how much of an issue the MMU notification per pte
> clear is for you, but I assume it's nasty.  Is there any way those
> can be batched too, across mms, at least for the single page?
> 
> Let me be a little rude: I think that is the problem you should be
> working on, which also would benefit way beyond KSM.  (But I may be
> naive to suppose that there's even a prospect of batching there.)

I'm certainly not against all that work. It was precisely while looking
into how we could improve the MMU gathering to reduce the IPIs that I
found the race condition in the THP MMU gather and posted the "THP MMU
gather" patchset to linux-mm (which btw still needs to be reviewed and
merged too ;).

I mentioned some of those optimizations to point out that even if we
implement them, we've still to walk millions of rmap_items.

This is purely a hardware issue we're dealing with in KSM. The
software can be improved, but ultimately we have to update millions of
ptes: even the IPI-less ptep_clear_flush_young of
b13b1d2d8692b437203de7a404c6b809d2cc4d99 has to modify millions of
pagetables (or double that if there's a secondary MMU attached, as with
KVM) before a rmap_walk can complete. Even the lightest kind of
page_referenced() has to do that, and more work is required if it's
not just page_referenced().

> Okay, now you get into the implementation details (it's a very helpful
> outline, thank you); but I don't agree that KSM should enforce such a
> limit, I think it should continue to do as it does today.
> 
> Using dup pages does provide a "clean" way to break up the rmap
> operations, and would be great if those operations were O(N^2).
> But they're O(N), so I think the dup pages actually add a little
> overhead, rather than solving anything.  Instead of having 1000

O(N) is exactly the problem. The rmap_walk has O(N) complexity and
there's no way to improve it because it is a hardware
constraint. Either you drop the entire x86 architecture pagetable
format, or you can't fix it. And not even exotic things like shared
pagetables could fix it for KSM, because it's all non-linear, so
there's absolutely no guarantee any pagetable can be shared in the KSM
case. Every pagetable could include a pointer to a page that is
different, while all the other 511 entries point to the shared KSM page.

So we can't allow N to reach a number as high as 1 million or the
rmap_walk becomes a minefield.

This is why it is mandatory to have a sharing limit, and I can't
foresee any other solution that wouldn't look like a band-aid.

I can't fix the above program in any other way to guarantee it won't
run into a minefield.

> maps served by 1 page of mapcount 1000, they're served by 4 pages
> of mapcount 250.  That lowers the max amount of work in unmapping
> one page, but what about when all four of those pages have to be
> unmapped in the same operation?  That will be a less common case,

If the migrate_pages of the above program runs for 1 page only, we
must guarantee it returns in a short amount of time and succeeds.

If it is called for 4 pages, it's ok if it takes a little longer, as
long as a signal can interrupt it with the same latency it would take
if we migrated 1 page only. And that already happens for free with my
solution, without having to mess with all the rmap_walk interfaces. We
don't have to add signal handling to the rmap_walk; the rmap_walk can
continue to be an atomic kind of thing and it already works.

However, the main benefit is not in leaving the rmap_walk alone and
atomic so it remains reactive to signals in the 4 page case; the main
difference between what you're proposing and what I'm proposing is in
the 1 page case. Your solution requires >60 seconds before
migrate_pages can return success. My solution requires 3 milliseconds.

A program must be able to move all its memory around, without it
sometimes taking >60 seconds for a single page to be moved.

The unfixable problem without the KSM sharing limit:

15:21:56.769124 migrate_pages(10024, , 0x7ffee1fc5760, 17, , 0x7ffee1fc5758, 17) = 1
15:23:02.760836 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0

And ultimately this is an O(N) complexity issue that comes from a
hardware constraint.

> sure, but it still has to be acknowledged in worst latency.  And
> when considering page reclaim: we're starting out from a deficit
> of 3 pages when compared with before.
>
> Yes, there should be a limit (though I suspect not one hard limit,
> but something that is adjusted according to pressure), and maybe
> a tunable for it.  But this limit or limits imposed by rmap_walk
> callers: first approximation, just don't bother to call rmap_walk
> if page_mapcount is higher than limit, assume the page is referenced

How do you compact memory if you can't run the rmap_walk? That would
become a defragmentation minefield, instead of the current "stuck for
60 seconds in R state" minefield. Plus the migrate_pages syscall would
fail instead of succeeding in a few milliseconds.

> and not to be unmapped - the higher its mapcount, the more likely
> I'd like your design much better if there didn't have to be this
> garbage collection, and the related possibility of fragmentation.
> But I don't see how to avoid it in a seamless way.

Fragmentation can be dealt with, should it emerge as a problem, by just
implementing a refile operation that moves an rmap_item from one
stable_node_dup to another stable_node_dup (updating the pte).

I didn't implement the defrag of stable_node_dups because the
heuristic that compacts them works well enough, and I was afraid it
would just be a waste of CPU to defrag stable_node_chains with a low
number of rmap_items merely to increase the "compression ratio".

Garbage collection has to be done anyway by KSM for all its metadata,
and that's not a big issue; we just run it once in a while. Whatever
happens inside KSM in terms of CPU utilization is schedule
friendly. What is important is that whatever KSM does during its
schedule-friendly compute time cannot impact anything else in the
system. KSM can't leak memory either, as long as we run the garbage
collection once in a while.

> Nicely self-contained, but I think we should go beyond KSM.
> And all those +s did put me off a bit :(

Initially I thought of attacking the problem with optimizations and
heuristics in the VM as well, and not by introducing a sharing limit.
The first of those optimizations I already implemented by making the
rmap_walks schedule friendly, as that was hugely beneficial anyway (as
soon as I noticed they weren't latency friendly and we weren't fully
leveraging the higher cost of the mutexes).

However, I couldn't find a way to fix it "all" by touching only the VM
parts: no matter how I changed the VM, I would still have to deal with
a compaction minefield, a migration randomly not succeeding, or one
taking minutes to migrate a single page, etc. All showstopper problems.

All other optimizations to speed up the rmap_walks are still welcome
and needed, but they're optimizations: they can't prevent unpredictable
computational complexity issues in the kernel from emerging during
random userland workloads.

By capping N at the sysfs-configurable value, the rmap_walk of KSM goes
from O(N) to O(1). Of course nothing is free, so it costs memory in the
form of a lower potential "compression ratio", but compressing 256
times over is surely good enough and the last of my concerns.
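
To put a number on the tradeoff (just an illustration of the bound,
assuming 4KB pages and the 256 limit mentioned below): content that is
mapped 1,000,000 times needs at least 1000000/256 rounded up = 3907
stable_node_dups once the chains are fully compacted, i.e. roughly 15MB
of duplicated page data instead of a single 4KB page, but any single
rmap_walk now touches at most 256 rmap_items instead of a million.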

Implementing the sharing limit and the stable_node_chains in KSM in
such a way that the KSM metadata doesn't take more memory until we hit
the sharing limit was the higher priority concern, and I think I
achieved that.

Obviously, as the VM gets faster and smarter, the sharing limit should
increase from the current 256 to 10k or 100k, but we'll never be able
to allow it to be 1m or 10m unless the CPU gets something like 10-100
times faster. By then we'll have hundreds of terabytes of memory and
we'd be back to square one, requiring a limit in the 10m-100m range of
rmap_items per stable_node_dup.

In short I don't see the KSM sharing limit ever going to be obsolete
unless the whole pagetable format changes and we don't deal with
pagetables anymore.

Thanks,
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-16 17:49     ` Andrea Arcangeli
@ 2016-01-16 18:00       ` Arjan van de Ven
  2016-01-18  8:14         ` Hugh Dickins
  2016-01-18  9:10       ` Hugh Dickins
  1 sibling, 1 reply; 29+ messages in thread
From: Arjan van de Ven @ 2016-01-16 18:00 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins
  Cc: Davidlohr Bueso, linux-mm, Petr Holasek, Andrew Morton, Mel Gorman

On 1/16/2016 9:49 AM, Andrea Arcangeli wrote:
> In short I don't see the KSM sharing limit ever going to be obsolete
> unless the whole pagetable format changes and we don't deal with
> pagetables anymore.

just to put some weight behind Andrea's arguments: this is not theoretical.
We're running 3500 - 7000 virtual machines on a single server quite easily
nowadays, and there's quite a bit of memory that KSM will share between them
(often even multiple times)... so your N in O(N) is 7000 to many multiples
thereof in real environments.

And the long hangs do happen... once you start getting a bit of memory pressure
(say you go from 7000 to 7200 VMs and you only have memory for 7150) then you
are hitting the long delays *for every page* the VM inspects, and it will inspect
many... since initially they all (all 200GB of them) are active. My machine was
just completely "out" in this state for 24 hours before I decided to just reboot it instead.

Now, you can make it 2x faster (reboot in 12 hours? ;-) ) but there's really a much
higher-order reduction of the "long chain" problem needed...
I'm with Andrea that prevention of super long chains is the way to go; we can argue about 250
or 500 or 1000. Numbers will speak there... but from a KSM user perspective, at some point
you've reduced the cost of a page by 250x or 500x or 1000x... it's hitting diminishing returns.
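
(To make the diminishing returns concrete, assuming 4KB pages: at a limit
of 256, a fully shared page already costs each mapper only 4096/256 = 16
bytes of page data plus the per-mapping metadata; doubling the limit to
512 only saves another 8 bytes per mapping.)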



* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-16 18:00       ` Arjan van de Ven
@ 2016-01-18  8:14         ` Hugh Dickins
  2016-01-18 14:43           ` Arjan van de Ven
  0 siblings, 1 reply; 29+ messages in thread
From: Hugh Dickins @ 2016-01-18  8:14 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Andrea Arcangeli, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Mel Gorman

On Sat, 16 Jan 2016, Arjan van de Ven wrote:
> On 1/16/2016 9:49 AM, Andrea Arcangeli wrote:
> > In short I don't see the KSM sharing limit ever going to be obsolete
> > unless the whole pagetable format changes and we don't deal with
> > pagetables anymore.
> 
> just to put some weight behind Andrea's arguments: this is not theoretical.
> We're running 3500 - 7000 virtual machines on a single server quite easily
> nowadays and there's quite a bit of memory that KSM will share between them
> (often even multiple times)..  so your N in O(N) is 7000 to many multiples
> there of in real environments.

Thanks for filling in more of the picture, Arjan, that helps.

> 
> And the long hang do happen... once you start getting a bit of memory
> pressure (say you go from 7000 to 7200 VMs and you only have memory for
> 7150) then you are hitting the long delays *for every page* the VM
> inspects, and it will

I don't understand "*for every page*": why for *every* page?
I won't dispute "for many pages, many more than is bearable".

> inspect many... since initially they all (all 200Gb of them) are active.
> My machine was just completely "out" in this for 24 hours before I decided
> to just reboot it instead.
> 
> Now, you can make it 2x faster (reboot in 12 hours? ;-) ) but there's
> really a much higher order reduction of the "long chain" problem needed...
> I'm with Andrea that prevention of super long chains is the way to go, we
> can argue about 250 or 500 or 1000. Numbers will speak there... but from a
> KSM user perspective, at some point you reduced the cost of a page by 250x
> or 500x or 1000x... it's hitting diminishing returns.

I'm not for a moment denying that there's a problem to be solved,
just questioning what's the right solution.

The reclaim case you illustrate does not persuade me; I already suggested
an easier way to handle that (don't waste time on pages of high mapcount).

Or are you saying that in your usage, the majority of pages start out with
high mapcount?  That would defeat my suggestion, I think.  But it's the
compaction case I want to think more about; that may persuade me also.

And of course you're right about diminishing returns from trying to
minimize the number of dups: that wasn't my concern, merely that
managing several is more complex than managing one.

Hugh


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-16 17:49     ` Andrea Arcangeli
  2016-01-16 18:00       ` Arjan van de Ven
@ 2016-01-18  9:10       ` Hugh Dickins
  2016-01-18  9:45         ` Hugh Dickins
                           ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Hugh Dickins @ 2016-01-18  9:10 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven, Mel Gorman

On Sat, 16 Jan 2016, Andrea Arcangeli wrote:
> Hello Hugh,
> 
> Thanks a lot for reviewing this.

And thanks for your thorough reply, though I take issue with some of it :)

> 
> On Thu, Jan 14, 2016 at 03:36:56PM -0800, Hugh Dickins wrote:
> > Okay, yes, I can see that unlimited growth of the rmap list is a problem
> > which we need to address; and I can see that successful deduplication
> > tends towards this state much more than anon rmap or even file rmap.
> 
> The main difference with anon and file rmap, is that all other
> rmap_walks require a very bad software running in userland to end up
> in anything remotely close to what can happen with KSM. So the blame
> definitely goes to userland if the rmap walk gets so slow. Secondly in
> all cases except KSM the memory footprint in creating such long rmap
> walks is an order of magnitude higher. So the badness of such an
> userland app, would spread not just to the CPU usage but also to the
> RAM utilization: the app would waste massive amounts of kernel memory
> to manage little userland memory.
> 
> With KSM instead millions of entries to walk and millions of
> pagetables and shadow pagetables to update during a single rmap walk
> can happen with any optimal program running. The program itself won't
> even know what's happening inside of KSM and has no way to limit the
> sharing either.
> 
> > Schedule-friendly: you make an important point there: with the rmap
> > mutexes and your recent cond_resched()s, I don't think we need to worry
> > about a task doing explicit work for itself (memory hotremove, for example),
> > that will take as long as it has to; but, as usual, we do need to worry
> > about compaction serving THP, and the high mapcounts it might encounter.
> 
> It's not just those VM internal cases: migrate_pages of of the below
> program takes >60 seconds without my patch, with my patch the max
> latency is a few milliseconds. So yes now I made the rmap_walks
> schedule friendly, but it can't be ok to be stuck in R state for 60
> seconds in order to move a single page. Making it interruptible isn't
> helping a lot either as then it would fail.
> 
> ==
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <numaif.h>
> 
> #define MAXNODES 17
> 
> int main(void)
> {
> 	unsigned long page_size;
> 	char *p;
> 	page_size = sysconf(_SC_PAGE_SIZE);
> 	unsigned long old_node = (1<<MAXNODES)-1, node = 1;
> 	if (posix_memalign((void **)&p, page_size, page_size))
> 		perror("malloc"), exit(1);
> 	if (madvise(p, page_size, MADV_MERGEABLE))
> 		perror("madvise"), exit(1);
> 	memset(p, 0xff, page_size);
> 	for (;;) {
> 		sleep(1);
> 		migrate_pages(getpid(), MAXNODES, &old_node, &node);
> 		if (node == 1)
> 			node = 2;
> 		else
> 			node = 1;
> 	}
> 	return 0;
> }
> ==

Puhleese.  Of course we have a bug with respect to KSM pages in
migrate_pages(): I already said as much, though I used the example
of mbind(), that being the one you had mentioned.  Just do something
like mremap() does, a temporary ksm_madvise(,,,MADV_UNMERGEABLE,).
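
Roughly this, borrowing the call that move_vma() already makes (an
untested sketch just to show the shape; the error handling would need
more care):

	unsigned long vm_flags = vma->vm_flags;

	/* break the KSM sharing so only this process's copy is moved */
	err = ksm_madvise(vma, vma->vm_start, vma->vm_end,
			  MADV_UNMERGEABLE, &vm_flags);
	if (err)
		return err;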

Or are you suggesting that when MPOL_MF_MOVE_ALL meets a KSM page,
it is actually correct to move every page that shares the same data
to the node of this process's choice?

MPOL_MF_MOVE_ALL was introduced some while before KSM: it makes sense
for anon pages, and for file pages, but it was mere oversight that we
did not stop it from behaving this way on KSM pages.

Even sillier now that we have Petr's merge_across_nodes settable to 0.

> 
> > 
> > > 
> > > There's room for optimization to significantly reduce the IPI delivery
> > > cost during the page_referenced(),
> > 
> > You're missing that Shaohua removed x86's TLB flush from page_referenced()
> > a couple of years ago: b13b1d2d8692 "x86/mm: In the PTE swapout page
> > reclaim case clear the accessed bit instead of flushing the TLB".
> > Yes, it is confusing that "flush" remains in the names now there
> > isn't a flush.  I assume x86 is your main concern?
> 
> All archs are my concern as KSM also is supported on cyanogenmod ARM,
> but the IPIs still remain in the MMU notifier. And we don't always
> have a young bit there so the spte must be invalidated and
> flushed. Perhaps we could teach it to create invalidated-but-not-yet
> flushed sptes.
> 
> Still the IPI aren't the main problem.

Yes, I think I was overestimating their significance.

> Let's assume for the rest of
> the email that there are no IPI during page_referenced() in any arch
> no matter if there's a secondary MMU attached.
> 
> > > but at least for page_migration in
> > > the KSM case (used by hard NUMA bindings, compaction and NUMA
> > > balancing) it may be inevitable to send lots of IPIs if each
> > > rmap_item->mm is active on a different CPU and there are lots of
> > > CPUs. Even if we ignore the IPI delivery cost, we've still to walk the
> > > whole KSM rmap list, so we can't allow millions or billions (ulimited)
> >                                                               (unlimited)
> > > number of entries in the KSM stable_node rmap_item lists.
> > 
> > You've got me fired up.  Mel's recent 72b252aed506 "mm: send one IPI per
> > CPU to TLB flush all entries after unmapping pages" and nearby commits
> > are very much to the point here; but because his first draft was unsafe
> > in the page migration area, he dropped that, and ended up submitting
> > for page reclaim alone.
> > 
> > That's the first low-hanging fruit: we should apply Mel's batching
> > to page migration; and it's become clear enough in my mind, that I'm
> > now impatient to try it myself (but maybe Mel will respond if he has
> > a patch for it already).  If I can't post a patch for that early next
> > week, someone else take over (or preempt me); yes, there's all kinds
> > of other things I should be doing instead, but this is too tempting.
> > 
> > That can help, not just KSM's long chains, but most other migration of
> > mapped pages too.  (The KSM case is particularly easy, because those
> > pages are known to be mapped write-protected, and its long chains can
> > benefit just from batching on the single page; but in general, I
> > believe we want to batch across pages there too, when we can.)

I'll send that, but I wasn't able to take it as far as I'd hoped,
not for the file-backed pages anyway.

> > 
> > But MMU notifiers: there I'm out of my element, and you are right at
> > home.  I don't know how much of an issue the MMU notification per pte
> > clear is for you, but I assume it's nasty.  Is there any way those
> > can be batched too, across mms, at least for the single page?
> > 
> > Let me be a little rude: I think that is the problem you should be
> > working on, which also would benefit way beyond KSM.  (But I may be
> > naive to suppose that there's even a prospect of batching there.)
> 
> I'm certainly not against all that work. It was precisely looking into
> how we could improve the MMU gathering to reduce the IPI when I found
> the race condition in THP MMU gather and I posted "THP MMU gather"
> patchset to linux-mm (which btw still needs to be reviewed and merged
> too ;).
> 
> I mentioned some of those optimizations to point out that even if we
> implement them, we've still to walk millions of rmap_items.
> 
> This is purely an hardware issue we're dealing with in KSM. The
> software can be improved but ultimately we've to update millions of
> ptes, even the ptep_clear_flush_young IPI-less of
> b13b1d2d8692b437203de7a404c6b809d2cc4d99 has to modify millions of
> pagetables (or double that if there's a secondary MMU attached of KVM)
> before a rmap_walk can complete. Even the lightest kind of
> page_referenced() has to do that, and more work is required if it's
> not just page_referenced().

I already suggested that reclaim should avoid pages with high mapcount.
But more thought on compaction may show me that's not a viable strategy.
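
For reclaim I have in mind no more than something like this in
page_check_references() (a sketch; the threshold name is made up):

	/* don't pay for a huge rmap walk, just keep the page active */
	if (page_mapcount(page) > RMAP_WALK_LIMIT)	/* made-up knob */
		return PAGEREF_ACTIVATE;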

> 
> > Okay, now you get into the implementation details (it's a very helpful
> > outline, thank you); but I don't agree that KSM should enforce such a
> > limit, I think it should continue to do as it does today.
> > 
> > Using dup pages does provide a "clean" way to break up the rmap
> > operations, and would be great if those operations were O(N^2).
> > But they're O(N), so I think the dup pages actually add a little
> > overhead, rather than solving anything.  Instead of having 1000
> 
> O(N) is exactly the problem. The rmap_walk has O(N) complexity and
> there's no way to improve it because it is an hardware
> constraint. Either you drop the entire x86 architecture pagetable
> format, or you can't fix it. And not even exotic things like shared
> pageteables could fix it for KSM because it's all non linear so
> there's absolutely no guarantee any pagetable can be shared in the KSM
> case. All pagetables could include a pointer to a page that is
> different, while all other 511 entries points to the shared KSM page.

I don't know how we arrived at x86 pagetable format and shared
pagetables here, you've lost me.  Never mind, back to business.

> 
> So we can't allow N to reach a number as high as 1 million or the
> rmap_walk becomes a minefield.
> 
> This is why it is mandatory to have a sharing limit and I can't
> foresee any other solution, that wouldn't look like a band aid.
> 
> I can't fix the above program in any other way to guarantee it won't
> run into a minefield.

Forget that above program, it's easily fixed (or rather, the kernel
is easily fixed for it): please argue with a better example.

> 
> > maps served by 1 page of mapcount 1000, they're served by 4 pages
> > of mapcount 250.  That lowers the max amount of work in unmapping
> > one page, but what about when all four of those pages have to be
> > unmapped in the same operation?  That will be a less common case,
> 
> If the migrate_pages of the above program runs for 1 page only, we
> must guarantee it returns in a short amount of time and succeed.
> 
> If is called for 4 pages it's ok if it's take a little longer as long
> as a signal can interrupt it in the same latency that would take if we
> migrated 1 page only. And that already happens for free with my
> solution without having to mess with all rmap_walks interfaces. We
> don't have to add signal mess to the rmap_walk, the rmap_walk can
> continue to be an atomic kind of thing and it already works.
> 
> However the main benefit is not in leaving rmap_walk alone and atomic
> to remain reactive to signals in the 4 page case, the main difference
> with what you're proposing and what I'm proposing is in the 1 page
> case. Your solution requires >60 seconds before migrate_pages can
> return success. My solution requires 3 milliseconds.
> 
> A program must be able to move all its memory around, without sometime
> taking >60seconds for a single page to be moved.
> 
> The unfixable problem without the KSM sharing limit:
> 
> 15:21:56.769124 migrate_pages(10024, , 0x7ffee1fc5760, 17, , 0x7ffee1fc5758, 17) = 1
> 15:23:02.760836 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
> 
> And ultimately this is a O(N) complexity issue that comes from an
> hardware constraint.
> 
> > sure, but it still has to be acknowledged in worst latency.  And
> > when considering page reclaim: we're starting out from a deficit
> > of 3 pages when compared with before.
> >
> > Yes, there should be a limit (though I suspect not one hard limit,
> > but something that is adjusted according to pressure), and maybe
> > a tunable for it.  But this limit or limits imposed by rmap_walk
> > callers: first approximation, just don't bother to call rmap_walk
> > if page_mapcount is higher than limit, assume the page is referenced
> 
> How do you compact memory if you can't run the rmap_walk? That would
> become a defragmentation minefield, instead of "a stuck for 60-sec in
> R state" current minefield. Plus the migrate_pages syscall would
> fail instead of succeeding in a few milliseconds.

Phew, we've left your migrate_pages program behind now :)
I do need to think about compaction more: it may persuade me
that yours is the right way forward, but I haven't got there yet.

> 
> > and not to be unmapped - the higher its mapcount, the more likely
> > I'd like your design much better if there didn't have to be this
> > garbage collection, and the related possibility of fragmentation.
> > But I don't see how to avoid it in a seamless way.
> 
> Fragmentation can be dealt with shall it emerge as a problem, by just
> implementing a refile operation that moves a rmap_item from one
> stable_node_dup to another stable_node_dup (updating the pte).
> 
> I didn't implement the defrag of stable_node_dups because the
> heuristic that compacts them works well enough and then I was afraid
> it was just a waste of CPU to defrag stable_node_chains with low
> number of remap_items to increase the "compression ratio".
> 
> Garbage collection has to be done anyway by KSM for all its metadata,
> and that's not a big issue, we just run it once in a while. Whatever
> happens inside KSM in terms of CPU utilization is schedule
> friendly. What is important is that whatever KSM does during its
> schedule friendly compute time, cannot impact anything else in the
> system. KSM can't leak memory either as long as we run the garbage
> collection once in a while.

Fair enough.  I'm not saying it's a big worry, just that the design
would have been more elegant and appealing without this side to it.
I wonder who's going to tune stable_node_chains_prune_millisecs :)

> 
> > Nicely self-contained, but I think we should go beyond KSM.
> > And all those +s did put me off a bit :(
> 
> Initially I thought of attacking the problem also with optimizations
> and heuristics in the VM and not by introducing a sharing limit. The
> first optimizations I already implemented by making the rmap_walks
> schedule friendly as that was hugely beneficial anyway (as soon as I
> noticed they weren't latency friendly and we weren't fully leveraging
> the higher cost of the mutexes).
> 
> However I couldn't find a way to fix it "all" by touching only the VM
> parts, no matter how I changed the VM, I would still have to deal with
> a compaction minefield, a migration randomly not succeeding... or
> taking minutes to migrate a single page etc.. All showstopper problems.
> 
> All other optimizations to speedup the rmap_walks are still welcome
> and needed but they're optimization, they can't prevent unpredictable
> computation complexity issues in the kernel to emerge during random
> userland workloads.
> 
> By setting the max N to the sysfs configurable value, it makes the
> rmap_walk of KSM from O(N) to O(1). Of course nothing is free so it
> costs memory in lower potential "compression ratio" but compressing
> x256 times is surely good enough and the last of my concerns.
> 
> Implementing the sharing limit and the stable_node_chains in KSM so it
> wouldn't take more memory in the KSM metadata until we hit the sharing
> limit, was higher priority concern, and I think I made it.

Yes, thank you for paying so much attention to that aspect too.

> 
> Obviously as the VM gets faster and smarter, the sharing limit should

The VM gets slower and more complicated: the CPUs get faster and smarter.

> better increase from the current 256 to 10k or 100k, but we'll never
> be able to allow it to be 1m or 10m unless the CPU gets like 10-100
> times faster. By then we'll have hundred of Terabytes of memory and
> we'd be back to square one requiring a limit in the 10m 100m of
> rmap_items per stable_node_dup.
> 
> In short I don't see the KSM sharing limit ever going to be obsolete
> unless the whole pagetable format changes and we don't deal with
> pagetables anymore.
> 
> Thanks,
> Andrea

I'll think about it more.

Hugh


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-18  9:10       ` Hugh Dickins
@ 2016-01-18  9:45         ` Hugh Dickins
  2016-01-18 17:46         ` Andrea Arcangeli
  2016-03-17 21:34         ` Hugh Dickins
  2 siblings, 0 replies; 29+ messages in thread
From: Hugh Dickins @ 2016-01-18  9:45 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven, Mel Gorman

On Mon, 18 Jan 2016, Hugh Dickins wrote:
> On Sat, 16 Jan 2016, Andrea Arcangeli wrote:
> > On Thu, Jan 14, 2016 at 03:36:56PM -0800, Hugh Dickins wrote:
> > > 
> > > You've got me fired up.  Mel's recent 72b252aed506 "mm: send one IPI per
> > > CPU to TLB flush all entries after unmapping pages" and nearby commits
> > > are very much to the point here; but because his first draft was unsafe
> > > in the page migration area, he dropped that, and ended up submitting
> > > for page reclaim alone.
> > > 
> > > That's the first low-hanging fruit: we should apply Mel's batching
> > > to page migration; and it's become clear enough in my mind, that I'm
> > > now impatient to try it myself (but maybe Mel will respond if he has
> > > a patch for it already).  If I can't post a patch for that early next
> > > week, someone else take over (or preempt me); yes, there's all kinds
> > > of other things I should be doing instead, but this is too tempting.
> > > 
> > > That can help, not just KSM's long chains, but most other migration of
> > > mapped pages too.  (The KSM case is particularly easy, because those
> > > pages are known to be mapped write-protected, and its long chains can
> > > benefit just from batching on the single page; but in general, I
> > > believe we want to batch across pages there too, when we can.)
> 
> I'll send that, but I wasn't able to take it as far as I'd hoped,
> not for the file-backed pages anyway.

Here's what I did against v4.4; I haven't looked at current git yet.
And there's an XXX where I found the MR_MEMORY_FAILURE case a little
confusing, so I didn't yet add back the necessary variant code for that.
But I was probably too excited above, overestimating the significance
of the IPIs here.

--- v4.4/mm/migrate.c	2016-01-10 15:01:32.000000000 -0800
+++ linux/mm/migrate.c	2016-01-18 01:28:26.861853142 -0800
@@ -887,8 +887,21 @@ static int __unmap_and_move(struct page
 		/* Establish migration ptes */
 		VM_BUG_ON_PAGE(PageAnon(page) && !PageKsm(page) && !anon_vma,
 				page);
-		try_to_unmap(page,
+		try_to_unmap(page, TTU_BATCH_FLUSH|
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		/*
+		 * We must flush TLBs before copying the page if a pte was
+		 * dirty, because otherwise further mods could be made to page
+		 * without faulting, and never copied to newpage.  That's true
+		 * of anon and file pages; but it's even worse for a file page,
+		 * because once newpage is unlocked, it can be written via the
+		 * pagecache, and those mods must be visible through its ptes.
+		 * We could hold newpage lock for longer, but how much longer?
+		 */
+		if (PageAnon(page))
+			try_to_unmap_flush_dirty();
+		else
+			try_to_unmap_flush();
 		page_was_mapped = 1;
 	}
 
@@ -927,8 +940,7 @@ out:
 static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				   free_page_t put_new_page,
 				   unsigned long private, struct page *page,
-				   int force, enum migrate_mode mode,
-				   enum migrate_reason reason)
+				   int force, enum migrate_mode mode)
 {
 	int rc = MIGRATEPAGE_SUCCESS;
 	int *result = NULL;
@@ -950,27 +962,7 @@ static ICE_noinline int unmap_and_move(n
 	rc = __unmap_and_move(page, newpage, force, mode);
 	if (rc == MIGRATEPAGE_SUCCESS)
 		put_new_page = NULL;
-
 out:
-	if (rc != -EAGAIN) {
-		/*
-		 * A page that has been migrated has all references
-		 * removed and will be freed. A page that has not been
-		 * migrated will have kepts its references and be
-		 * restored.
-		 */
-		list_del(&page->lru);
-		dec_zone_page_state(page, NR_ISOLATED_ANON +
-				page_is_file_cache(page));
-		/* Soft-offlined page shouldn't go through lru cache list */
-		if (reason == MR_MEMORY_FAILURE) {
-			put_page(page);
-			if (!test_set_page_hwpoison(page))
-				num_poisoned_pages_inc();
-		} else
-			putback_lru_page(page);
-	}
-
 	/*
 	 * If migration was not successful and there's a freeing callback, use
 	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
@@ -1029,10 +1021,8 @@ static int unmap_and_move_huge_page(new_
 	 * tables or check whether the hugepage is pmd-based or not before
 	 * kicking migration.
 	 */
-	if (!hugepage_migration_supported(page_hstate(hpage))) {
-		putback_active_hugepage(hpage);
+	if (!hugepage_migration_supported(page_hstate(hpage)))
 		return -ENOSYS;
-	}
 
 	new_hpage = get_new_page(hpage, private, &result);
 	if (!new_hpage)
@@ -1051,8 +1041,9 @@ static int unmap_and_move_huge_page(new_
 		goto put_anon;
 
 	if (page_mapped(hpage)) {
-		try_to_unmap(hpage,
+		try_to_unmap(hpage, TTU_BATCH_FLUSH|
 			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
+		try_to_unmap_flush();
 		page_was_mapped = 1;
 	}
 
@@ -1076,9 +1067,6 @@ put_anon:
 
 	unlock_page(hpage);
 out:
-	if (rc != -EAGAIN)
-		putback_active_hugepage(hpage);
-
 	/*
 	 * If migration was not successful and there's a freeing callback, use
 	 * it.  Otherwise, put_page() will drop the reference grabbed during
@@ -1123,6 +1111,7 @@ int migrate_pages(struct list_head *from
 		free_page_t put_new_page, unsigned long private,
 		enum migrate_mode mode, int reason)
 {
+	LIST_HEAD(putback_pages);
 	int retry = 1;
 	int nr_failed = 0;
 	int nr_succeeded = 0;
@@ -1147,8 +1136,20 @@ int migrate_pages(struct list_head *from
 						pass > 2, mode);
 			else
 				rc = unmap_and_move(get_new_page, put_new_page,
-						private, page, pass > 2, mode,
-						reason);
+						private, page, pass > 2, mode);
+
+			if (rc != -EAGAIN) {
+				/*
+				 * A page that has been migrated has had all
+				 * references removed and will be freed once
+				 * TLBs have been flushed. A page that has not
+				 * been migrated will have kept its references
+				 * and been restored.
+				 *
+				 * XXX: Get the MR_MEMORY_FAILURE case right.
+				 */
+				list_move_tail(&page->lru, &putback_pages);
+			}
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1183,6 +1184,10 @@ out:
 	if (!swapwrite)
 		current->flags &= ~PF_SWAPWRITE;
 
+	try_to_unmap_flush();
+	putback_movable_pages(&putback_pages);
+	/* But our caller has to putback the -EAGAIN pages left on from list */
+
 	return rc;
 }
 


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-14 23:36   ` Hugh Dickins
  2016-01-16 17:49     ` Andrea Arcangeli
@ 2016-01-18 11:01     ` Mel Gorman
  2016-01-18 22:19       ` Andrea Arcangeli
  1 sibling, 1 reply; 29+ messages in thread
From: Mel Gorman @ 2016-01-18 11:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven

On Thu, Jan 14, 2016 at 03:36:56PM -0800, Hugh Dickins wrote:
> > but at least for page_migration in
> > the KSM case (used by hard NUMA bindings, compaction and NUMA
> > balancing) it may be inevitable to send lots of IPIs if each
> > rmap_item->mm is active on a different CPU and there are lots of
> > CPUs. Even if we ignore the IPI delivery cost, we've still to walk the
> > whole KSM rmap list, so we can't allow millions or billions (ulimited)
>                                                               (unlimited)
> > number of entries in the KSM stable_node rmap_item lists.
> 
> You've got me fired up.  Mel's recent 72b252aed506 "mm: send one IPI per
> CPU to TLB flush all entries after unmapping pages" and nearby commits
> are very much to the point here; but because his first draft was unsafe
> in the page migration area, he dropped that, and ended up submitting
> for page reclaim alone.
> 

I didn't read too much into the history of this patch or the
implementation so take anything I say with a grain of salt.

In the page-migration case, the obvious beneficiaries of reduced IPI counts
are memory compaction and automatic NUMA balancing. For automatic NUMA
balancing, if a page is heavily shared then there is a possibility that
there is no optimal home node for the data and that no balancing should
take place. For compaction, it depends on whether it's for a THP allocation
or not -- a heavy amount of migration work is unlikely to be offset by THP
usage, although this is a matter of opinion.

In the case of memory compaction, it would make sense to limit the length
of the walk and abort if the mapcount is too high, particularly if the
compaction is for a THP allocation. In the case of THP, the cost of a long
rmap walk is not necessarily going to be offset by THP usage.


> That's the first low-hanging fruit: we should apply Mel's batching
> to page migration; and it's become clear enough in my mind, that I'm
> now impatient to try it myself (but maybe Mel will respond if he has
> a patch for it already).  If I can't post a patch for that early next
> week, someone else take over (or preempt me); yes, there's all kinds
> of other things I should be doing instead, but this is too tempting.
> 

I never followed up as I felt the benefit was marginal but I did not take
KSM into account.

That said, it occurs to me that a full rmap walk is not necessary in all
cases, and avoiding it may be a more obvious starting point. For page
reclaim, it only matters whether the page is mapped at least once.
page_referenced() does not register a ->done handler and the callers only
care about a positive count. One optimisation for reclaim would be to
return when referenced > 0 and be done with it. That would limit the
length of the walk in *some* cases and still activate the page as expected.
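
Against the 4.4-era page_referenced_one() that would amount to something
like this (sketch only, not a tested patch):

	if (referenced) {
		pra->referenced++;
		pra->vm_flags |= vma->vm_flags;
		/* reclaim only needs a positive count: returning
		 * anything other than SWAP_AGAIN stops the rmap walk */
		return SWAP_SUCCESS;
	}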

The full walk will still be necessary if the page is unused and is a reclaim
candidate, so there still may be grounds for limiting the length of the rmap
walk so the unmap does not take too long, but we'd avoid active pages quickly.

Hugh touched on this point and I mostly agree with him -- in some
instances of page migration, it would be ok to return when referenced >
arbitrary_threshold and just assume it's not worth the cost of migrating
when the mapcount is too high. This would be fine for compaction for THP at
least, as the cost of large rmap walks followed by IPIs is unlikely to be
offset by THP usage even if the allocation succeeds afterwards. It may be ok
for compaction triggered for high-order allocations. One obvious exception
would be page migration due to memory poisoning, where it does not matter
what the cost of migrating is if the memory is faulty.  Another obvious
exception would be mbind, where it may be better to stop sharing in the
cases that Hugh mentioned already.

There still would be cases where a full walk is necessary, so this idea does
not fix all the problems of concern, but it would help reclaim in the case
where there are many KSM pages that are in use in a relatively obvious
manner. More importantly, such a modification would benefit cases other than KSM.

I recognise that this is side-stepping the problem -- migration cost is
too high, and while Andrea's patch may reduce that cost, there are some
grounds for simply not trying to migrate such pages in the first place.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-18  8:14         ` Hugh Dickins
@ 2016-01-18 14:43           ` Arjan van de Ven
  0 siblings, 0 replies; 29+ messages in thread
From: Arjan van de Ven @ 2016-01-18 14:43 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Mel Gorman

>>
>> And the long hang do happen... once you start getting a bit of memory
>> pressure (say you go from 7000 to 7200 VMs and you only have memory for
>> 7150) then you are hitting the long delays *for every page* the VM
>> inspects, and it will
>
> I don't understand "*for every page*": why for *every* page?
> I won't dispute "for many pages, many more than is bearable".

For every page the VM inspects; the VM tries to free memory and keeps trying
to free stuff, but (I'm guessing here) skipping active pages, and to know and
clear active you need to walk the whole chain. For each page.

>> Now, you can make it 2x faster (reboot in 12 hours? ;-) ) but there's
>> really a much higher order reduction of the "long chain" problem needed...
>> I'm with Andrea that prevention of super long chains is the way to go, we
>> can argue about 250 or 500 or 1000. Numbers will speak there... but from a
>> KSM user perspective, at some point you reduced the cost of a page by 250x
>> or 500x or 1000x... it's hitting diminishing returns.
>
> I'm not for a moment denying that there's a problem to be solved,
> just questioning what's the right solution.
>
> The reclaim case you illustrate does not persuade me, I already suggested
> an easier way to handle that (don't waste time on pages of high mapcount).
>
> Or are you saying that in your usage, the majority of pages start out with
> high mapcount?  That would defeat my suggestion, I think,  But it's the
> compaction case I want to think more about, that may persuade me also.


Well, on most servers that host VMs, of the (say) 128GB to 240GB of RAM, all
but a few hundred MB is allocated to VMs, and VM memory is generally shared.
So yes, a big chunk of memory will have a high map count of some sort.


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-18  9:10       ` Hugh Dickins
  2016-01-18  9:45         ` Hugh Dickins
@ 2016-01-18 17:46         ` Andrea Arcangeli
  2016-03-17 21:34         ` Hugh Dickins
  2 siblings, 0 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2016-01-18 17:46 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Davidlohr Bueso, linux-mm, Petr Holasek, Andrew Morton,
	Arjan van de Ven, Mel Gorman

On Mon, Jan 18, 2016 at 01:10:42AM -0800, Hugh Dickins wrote:
> Puhleese.  Of course we have a bug with respect to KSM pages in
> migrate_pages(): I already said as much, though I used the example
> of mbind(), that being the one you had mentioned.  Just do something
> like mremap() does, a temporary ksm_madvise(,,,MADV_UNMERGEABLE,).
> 
> Or are you suggesting that when MPOL_MF_MOVE_ALL meets a KSM page,
> it is actually correct to move every page that shares the same data
> to the node of this process's choice?

So you think it's wrong that the migrate_pages() syscall can move KSM
pages too? If yes, why don't you simply add a check for PageKsm in
migrate_pages? Forbidding all KSM pages from being migrated would be a
simple, black-and-white fix for it.

I mean, either we allow all KSM pages to be migrated, or none. I don't
like that we allow some to be migrated if it looks like it will take
less than 60 seconds to do so... rmap_walks were always intended to be
atomic.

Even assuming migrate_pages should be fixed to prevent KSM pages from
being migrated, for me that was not a bug but a feature. It allowed me
to code the simplest reproducer to show how long it takes to do a
rmap_walk on a KSM page with high sharing.

Before I wrote this testcase it wasn't trivial at all to measure or
even reproduce the hang, no matter what you threw at it. In fact the
hang normally takes days or weeks to reproduce, but when it hits,
systems go down. The testcase I posted simplifies the reproduction
of the VM badness tremendously.

Page migration ultimately was my concern. Just as things stand today,
even the migrate_pages hammer didn't look entirely safe (no matter if
the user is an artificial test like mine that calls migrate_pages for
whatever reason, or simply a benchmark), but that's certainly not the
primary concern.

> MPOL_MF_MOVE_ALL was introduced some while before KSM: it makes sense
> for anon pages, and for file pages, but it was mere oversight that we
> did not stop it from behaving this way on KSM pages.
> 
> Even sillier now that we have Petr's merge_across_nodes settable to 0.

The merge_across_nodes=0 case must not break either when
migrate_pages moves a KSM page to a different node. Nothing wrong with
that; in fact it's a good stress test.

> Forget that above program, it's easily fixed (or rather, the kernel
> is easily fixed for it): please argue with a better example.

I don't think I can write a simpler or smaller testcase that shows
exactly how bad things can get with the rmap walks. If I abused a
"feature" in migrate_pages to achieve it, that still made my life
easier at reproducing the VM badness that has to be fixed somehow.

> Fair enough.  I'm not saying it's a big worry, just that the design
> would have been more elegant and appealing without this side to it.
> I wonder who's going to tune stable_node_chains_prune_millisecs :)

stable_node_chains_prune_millisecs can be set to any value and it
won't make much difference actually: it's a tradeoff between KSM CPU
usage and a slight delay in freeing the KSM metadata, but it's not
like it has a big impact on anything. This isn't metadata that amounts
to a significant amount of memory, nor will it grow fast. But it being
a fixed number, I made it sysfs configurable.

I have a much easier time setting the max sharing limit and a deep
garbage collection pass every couple of millisecs than dynamically
deciding the magic breakpoint page_mapcount number where
page_migration, try_to_unmap and page_referenced should bail out
before the system hangs... but if they happen to bail out too soon the
VM will be in a DoS.

> The VM gets slower and more complicated: the CPUs get faster and smarter.

Eheh, luckily it's not that bad ;). Your page_migration TLB flush
batching should provide a significant boost that may already allow
increasing the max sharing limit compared to the current value I set
in the patch.

Complexity increases, yes, but the VM gets faster, as your patch shows.

The testcase I posted is precisely the program that you should run to
test the effectiveness of your page_migration improvement. Perhaps you
can add a bit of printf so you won't have to use strace -tt :).

> I'll think about it more.

Back to this being a HW issue: as far as the VM is concerned, all
rmap_walks were intended to simulate having the accessed or dirty bit
in the page structure (kind of like s390?), not in the many pagetables
that map the page. We collapse all the pagetable bits into a single
per-physical-page bit. This is why they're atomic and we don't break
the loop, and my patch just allows this model to be retained. If we
start breaking the loop, and in turn break the atomicity of the
rmap_walks, we just increase the false positive OOM risk in some
unpredictable way.

Thanks,
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-18 11:01     ` Mel Gorman
@ 2016-01-18 22:19       ` Andrea Arcangeli
  2016-01-19 10:43         ` Mel Gorman
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2016-01-18 22:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven

Hello Mel,

On Mon, Jan 18, 2016 at 11:01:47AM +0000, Mel Gorman wrote:
> I didn't read too much into the history of this patch or the
> implementation so take anything I say with a grain of salt.
> 
> In the page-migration case, the obvious beneficiary of reduced IPI counts
> is memory compaction and automatic NUMA balancing. For automatic NUMA
> balancing, if a page is heavily shared then there is a possibility that
> there is no optimal home node for the data and that no balancing should
> take place. For compaction, it depends on whether it's for a THP allocation
> or not -- heavy amounts of migration work is unlikely to be offset by THP
> usage although this is a matter of opinion.
> 
> In the case of memory compaction, it would make sense to limit the length
> of the walk and abort if it the mapcount is too high particularly if the
> compaction is for a THP allocation. In the case of THP, the cost of a long
> rmap is not necessarily going to be offset by THP usage.

Agreed that it wouldn't be worth migrating a KSM page with massive
sharing just to allocate one more THP page and that limiting the
length of the rmap_walk would avoid the hangs for compaction.

The real problem is what happens after you stop migrating KSM pages.

If you limit the length of the rmap_walk with a "magical" value (that
I would have no clue how to set), you end up with compaction being
entirely disabled and wasting CPU as well, if you have a bit too many
KSM pages in the system with mapcount above the "magical" value.

If compaction can be disabled, all your precious memblock work becomes
useless.

After the memblocks become useless, the movable zone also becomes
useless (as all anonymous movable pages in the system can suddenly
become non-movable and go above the "magical" rmap_walk limit).

Then memory offlining stops working as well (it could take
months in the worst case?).

CMA stops working too, so certain drivers will stop working. How can it
be ok if you enable KSM and CMA entirely breaks? Because this is what
happens if you stop migrating KSM pages whenever the rmap_walk would
exceed the "magical" limit.

Adding a sharing limit in KSM instead allows keeping KSM turned on at
all times at 100% CPU load, and it guarantees that nothing KSM does
could ever interfere with any other VM feature. Everything is just
guaranteed to work at all times.

In fact, if we were not to add the sharing limit to KSM, I think we
would be better off considering KSM pages unmovable like slab caches
and dropping KSM page migration entirely, so that KSM pages are
restricted to the unmovable memblocks. At least
compaction/memory-offlining/CMA would still have a slight chance to
keep working despite KSM being enabled.

> That said, it occurs to me that a full rmap walk is not necessary in all
> cases and may be a more obvious starting point. For page reclaim, it only
> matters if it's mapped once.  page_referenced() does not register a ->done
> handle and the callers only care about a positive count. One optimisation
> for reclaim would be to return when referenced > 0 and be done with it. That
> would limit the length of the walk in *some* cases and still activate the
> page as expected.

Limiting the page_referenced() rmap walk length was my first idea as
well. We could also try to rotate the rmap_items list so the next loop
starts with a new batch of pagetables, and by retrying we would
eventually have a chance to clear the accessed bits for all
pagetables. The rmap_item->hlist is an hlist however, so rotation
isn't possible at this time, and we'd waste 8 bytes of metadata per
stable_node if we were to add it. Still, it would be trivial to add
the rotation.
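
Just to illustrate what I mean, here is a sketch (untested, and it
assumes we paid the extra 8 bytes and converted the stable_node rmap
list to a doubly linked list_head; rotate_rmap_items is a made-up
helper, not existing code):

#include <linux/list.h>

/*
 * Rotate the rmap_items of a stable_node so the next
 * page_referenced() pass starts clearing accessed bits from a
 * different batch of pagetables. Only doable with a list_head: the
 * current hlist has no cheap way to reach the tail.
 */
static void rotate_rmap_items(struct list_head *rmap_list)
{
	if (!list_empty(rmap_list) && !list_is_singular(rmap_list))
		list_rotate_left(rmap_list);	/* old head becomes the tail */
}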

However even the most benign change described above can result in a
false positive OOM. By the time you move to the next page, the
accessed bits are quickly set again for all the other pages you're not
scanning. If just a bit too many pages are like that, you won't be
clearing young bits or establishing swap entries fast enough, and
you'll trigger the OOM killer early despite lots of swap being
available. And that's in addition to the
compaction/memblock/offlining/CMA issues that also would have no
solution.

> There still would be cases where a full walk is necessary so this idea does
> not fix all problems of concern but it would help reclaim in the case where
> there are many KSM pages that are in use in a relatively obvious manner. More
> importantly, such a modification would benefit cases other than KSM.

What I'm afraid of is that limiting the rmap_walk length would risk
destabilizing the VM. It would exponentially increase the chance that
page_referenced() returns true at all times, in turn not really
solving much if too many KSM pages are shared above the "magical" rmap
walk length. I'm not sure it would benefit other cases either, as they
would end up in the same corner case.

If instead we enforce a KSM sharing limit, there's no point in taking
any more risk, and sticking to the current model of atomic rmap_walks
that simulate a physical page accessed bit certainly looks much safer
to me.

I evaluated those possible VM modifications before adding the sharing
limit to KSM and they didn't look definitive to me: there was no way I
could guarantee the VM wouldn't end up in weird corner cases unless I
capped the KSM sharing in the first place.

Something that I definitely like instead is the reduction of the IPIs
in the rmap_walks, I mean your work and Hugh's patch that extends it
to page migration. That looks just great and it has zero risk of
destabilizing the VM, but it's also orthogonal to the KSM sharing
limit.

Not clearing all accessed bits in page_referenced, or failing KSM
migration, depending on a magic rmap_walk limit is something I don't
like very much. Those can certainly hide the problem in some cases:
they would work absolutely great if there's just one KSM page in the
system over the "magical" limit, but they do have drawbacks and it
looks flakey if too many KSM pages are above the "magical" limit. The
KSM sharing limit value is way less magical than that: if it's set too
low you can only lose some memory, you won't risk VM malfunction.

Thanks,
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-18 22:19       ` Andrea Arcangeli
@ 2016-01-19 10:43         ` Mel Gorman
  0 siblings, 0 replies; 29+ messages in thread
From: Mel Gorman @ 2016-01-19 10:43 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven

On Mon, Jan 18, 2016 at 11:19:00PM +0100, Andrea Arcangeli wrote:
> Hello Mel,
> 
> On Mon, Jan 18, 2016 at 11:01:47AM +0000, Mel Gorman wrote:
> > I didn't read too much into the history of this patch or the
> > implementation so take anything I say with a grain of salt.
> > 
> > In the page-migration case, the obvious beneficiary of reduced IPI counts
> > is memory compaction and automatic NUMA balancing. For automatic NUMA
> > balancing, if a page is heavily shared then there is a possibility that
> > there is no optimal home node for the data and that no balancing should
> > take place. For compaction, it depends on whether it's for a THP allocation
> > or not -- heavy amounts of migration work is unlikely to be offset by THP
> > usage although this is a matter of opinion.
> > 
> > In the case of memory compaction, it would make sense to limit the length
> > of the walk and abort if it the mapcount is too high particularly if the
> > compaction is for a THP allocation. In the case of THP, the cost of a long
> > rmap is not necessarily going to be offset by THP usage.
> 
> Agreed that it wouldn't be worth migrating a KSM page with massive
> sharing just to allocate one more THP page and that limiting the
> length of the rmap_walk would avoid the hangs for compaction.
> 
> The real problem is what happens after you stop migrating KSM pages.
> 
> If you limit the length of the rmap_walk with a "magical" value (that
> I would have no clue how to set), you end up with compaction being
> entirely disabled and wasting CPU as well, if you have a bit too many
> KSM pages in the system with mapcount above the "magical" value.
> 
> If compaction can be disabled all your memblock precious work becomes
> useless.
> 
> After the memblocks become useless, the movable zone also becomes
> useless (as all anonymous movable pages in the system can suddenly
> becomes non movable and go above the "magical" rmap_walk limit).
> 
> Then the memory offlining stops to work as well (it could take
> months in the worst case?).
> 

The focus was primarily on reclaim. It was expected that memory
offline and memory poisoning would continue using the full rmap walk
as they have no other option.

> CMA stops working too so certain drivers will stop working. How can it
> be ok if you enable KSM and CMA entirely breaks? Because this is what
> happens if you stop migrating KSM pages if the rmap_walk would exceed
> the "magical" limit.
> 
> Adding a sharing limit in KSM instead allows to keep KSM turned on at
> all times at 100% CPU load and it guarantees nothing KSM does could
> ever interfere with all other VM features. Everything is just
> guaranteed to work at all times.
> 

I'm not disagreeing with you. However, limiting the rmap walk for reclaim
also benefits any workload with large amounts of shared pages.

> In fact if we were not to add the sharing limit to KSM, I think we
> would be better off to consider KSM pages unmovable like slab caches
> and drop the KSM migration entirely. So that KSM pages are restricted
> to those unmovable memblocks. At least compaction/memoryofflining/CMA
> would still have a slight chance to keep working despite KSM was enabled.
> 
> > That said, it occurs to me that a full rmap walk is not necessary in all
> > cases and may be a more obvious starting point. For page reclaim, it only
> > matters if it's mapped once.  page_referenced() does not register a ->done
> > handle and the callers only care about a positive count. One optimisation
> > for reclaim would be to return when referenced > 0 and be done with it. That
> > would limit the length of the walk in *some* cases and still activate the
> > page as expected.
> 
> Limiting the page_referenced() rmap walk length was my first idea as
> well. We could also try to rotate the rmap_items list so the next loop
> starts with a new batch of pagetables and by retrying we would have a
> chance to clear all accessed bits for all pagetables eventually. The
> rmap_item->hlist is an hlist however, so rotation isn't possible at
> this time and we'd waste 8 bytes per stable node metadata if we were
> to add it. Still it would be trivial to add the rotation.
> 
> However even the most benign change described above, can result in
> false positive OOM.

Scanning at max priority could force a full rmap walk to avoid
premature OOM. At that point, the idea would fall back to the full
cost we have today. It does not completely solve the problem you are
worried about; it mitigates it, and for cases other than KSM as well.

> > There still would be cases where a full walk is necessary so this idea does
> > not fix all problems of concern but it would help reclaim in the case where
> > there are many KSM pages that are in use in a relatively obvious manner. More
> > importantly, such a modification would benefit cases other than KSM.
> 
> What I'm afraid is that it would risk to destabilize the VM if we
> limit the rmap_walk length. That would exponentially increase the
> chance page_referenced() returns true at all times in turn not really
> solving much if too many KSM pages are shared above the "magical" rmap
> walk length. I'm not sure if that would benefit other cases either as
> they would end up in the same corner case.
> 

Eventually yes, albeit it gets deferred until near OOM time, and it
also covers the heavily shared page cases instead of just KSM.

> If instead we enforce a KSM sharing limit, there's no point in taking
> any more risk and sticking to the current model of atomic rmap_walks
> that simulates a physical page accessed bit, certainly looks much
> safer to me.
> 
> I evaluated those possible VM modifications before adding the sharing
> limit to KSM and they didn't look definitive to me and there was no
> way I could guarantee the VM not to end up in weird corner cases
> unless I limited the KSM sharing limit in the first place.
> 

By all means ignore the suggestion. It was intended to mitigate the
problem, to see if enforcing a KSM sharing limit is always necessary
or if it's just necessary to handle the case where the system is near
OOM or the migration is absolutely required.

> Something that instead I definitely like is the reduction of the IPIs
> in the rmap_walks, I mean your work and Hugh's patch that extends it
> to page_migration. That looks just great and it has zero risk to
> destabilize the VM, but it's also orthogonal to the KSM sharing limit.
> 

Agreed. Fewer IPIs on a system that still works properly is an
unconditional win.

> Not clearing all accessed bits in page_referenced or failing KSM
> migration depending on a magic rmap_walk limit I don't like very much.
> They can certainly hide the problem in some case: they would work
> absolutely great if there's just 1 KSM page in the system over the
> "magical" limit, but I don't think they've drawbacks and it looks
> flakey if too many KSM pages are above the "magical" limit. The KSM
> sharing limit value is way less magical than that: if it's set too low
> you can only lose some memory, you won't risk VM malfunction.
> 

Ok.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-01-18  9:10       ` Hugh Dickins
  2016-01-18  9:45         ` Hugh Dickins
  2016-01-18 17:46         ` Andrea Arcangeli
@ 2016-03-17 21:34         ` Hugh Dickins
  2016-03-17 21:50           ` Andrew Morton
  2016-03-18 16:27           ` Andrea Arcangeli
  2 siblings, 2 replies; 29+ messages in thread
From: Hugh Dickins @ 2016-03-17 21:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Davidlohr Bueso, linux-mm, Petr Holasek,
	Arjan van de Ven, Mel Gorman, Hugh Dickins

On Mon, 18 Jan 2016, Hugh Dickins wrote:
> On Sat, 16 Jan 2016, Andrea Arcangeli wrote:
> > Hello Hugh,
> > 
> > Thanks a lot for reviewing this.
> 
> And thanks for your thorough reply, though I take issue with some of it :)
> 
[...]
> 
> I'll think about it more.

Andrew, please ignore my reservations about Andrea's KSM max sharing
patch.  The diversions I created (IPIs etc), and my rash promise to
think about it more, are serving no purpose but to delay a fix to a
real problem.  Even if this fix is not quite what I dreamt of, it has
the great commendation of being self-contained within ksm.c, affecting
nothing else: so long as Andrea has time to support it, I think we're
good with it.  Let Mel or I come up with better when we've devised it:
but I doubt that will be soon (I got no further on what else to do
about the compaction-migration case).

Hugh


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-03-17 21:34         ` Hugh Dickins
@ 2016-03-17 21:50           ` Andrew Morton
  2016-03-18 16:27           ` Andrea Arcangeli
  1 sibling, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2016-03-17 21:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrea Arcangeli, Davidlohr Bueso, linux-mm, Petr Holasek,
	Arjan van de Ven, Mel Gorman

On Thu, 17 Mar 2016 14:34:23 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> On Mon, 18 Jan 2016, Hugh Dickins wrote:
> > On Sat, 16 Jan 2016, Andrea Arcangeli wrote:
> > > Hello Hugh,
> > > 
> > > Thanks a lot for reviewing this.
> > 
> > And thanks for your thorough reply, though I take issue with some of it :)
> > 
> [...]
> > 
> > I'll think about it more.
> 
> Andrew, please ignore my reservations about Andrea's KSM max sharing
> patch.  The diversions I created (IPIs etc), and my rash promise to
> think about it more, are serving no purpose but to delay a fix to a
> real problem.  Even if this fix is not quite what I dreamt of, it has
> the great commendation of being self-contained within ksm.c, affecting
> nothing else: so long as Andrea has time to support it, I think we're
> good with it.  Let Mel or I come up with better when we've devised it:
> but I doubt that will be soon (I got no further on what else to do
> about the compaction-migration case).

OK, thanks, I'll send it Linuswards next week.

http://ozlabs.org/~akpm/mmots/broken-out/ksm-introduce-ksm_max_page_sharing-per-page-deduplication-limit.patch
http://ozlabs.org/~akpm/mmots/broken-out/ksm-introduce-ksm_max_page_sharing-per-page-deduplication-limit-fix-2.patch
http://ozlabs.org/~akpm/mmots/broken-out/ksm-introduce-ksm_max_page_sharing-per-page-deduplication-limit-fix-3.patch


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-03-17 21:34         ` Hugh Dickins
  2016-03-17 21:50           ` Andrew Morton
@ 2016-03-18 16:27           ` Andrea Arcangeli
  1 sibling, 0 replies; 29+ messages in thread
From: Andrea Arcangeli @ 2016-03-18 16:27 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, Petr Holasek,
	Arjan van de Ven, Mel Gorman

Hello Hugh,

On Thu, Mar 17, 2016 at 02:34:23PM -0700, Hugh Dickins wrote:
> Andrew, please ignore my reservations about Andrea's KSM max sharing
> patch.  The diversions I created (IPIs etc), and my rash promise to
> think about it more, are serving no purpose but to delay a fix to a
> real problem.  Even if this fix is not quite what I dreamt of, it has
> the great commendation of being self-contained within ksm.c, affecting
> nothing else: so long as Andrea has time to support it, I think we're

I'll have time to support it if any problem arises, yes.

> good with it.  Let Mel or I come up with better when we've devised it:
> but I doubt that will be soon (I got no further on what else to do
> about the compaction-migration case).

Agreed, the page migration case is what made me lean towards this self
contained solution, and then it has the benefit of solving all other
cases at the same time.

On the plus side the rmap_walk can remain "atomic" this way, so it
may generally result in higher reliability, not having to break it in
the middle just because it's taking too long. This way there will be
one less unknown variable in the equation. We could still relax it in
the future if we need to, but we won't be forced to.

Thanks!
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2015-11-10 18:44 ` [PATCH 1/1] " Andrea Arcangeli
                     ` (2 preceding siblings ...)
  2016-01-14 23:36   ` Hugh Dickins
@ 2016-04-06 20:33   ` Rik van Riel
  2016-04-06 22:02     ` Andrea Arcangeli
  3 siblings, 1 reply; 29+ messages in thread
From: Rik van Riel @ 2016-04-06 20:33 UTC (permalink / raw)
  To: Andrea Arcangeli, Hugh Dickins, Davidlohr Bueso
  Cc: linux-mm, Petr Holasek, Andrew Morton, Arjan van de Ven

On Tue, 2015-11-10 at 19:44 +0100, Andrea Arcangeli wrote:
> Without a max deduplication limit for each KSM page, the list of the
> rmap_items associated to each stable_node can grow infinitely
> large.
> 
> During the rmap walk each entry can take up to ~10usec to process
> because of IPIs for the TLB flushing (both for the primary MMU and
> the
> secondary MMUs with the MMU notifier). With only 16GB of address
> space
> shared in the same KSM page, that would amount to dozens of seconds
> of
> kernel runtime.

Silly question, but could we fix this problem
by building up a bitmask of all CPUs that have
a page-with-high-mapcount mapped, and simply
send out a global TLB flush to those CPUs once
we have changed the page tables, instead of
sending out IPIs at every page table change?
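
Something along these lines, roughly (just a sketch of the idea, not
tested; note_mapping_cpus()/flush_high_mapcount_tlbs() are made-up
helpers, not an existing API):

#include <linux/cpumask.h>
#include <linux/mm_types.h>
#include <linux/smp.h>
#include <asm/tlbflush.h>

/* Accumulate the CPUs that may hold stale TLB entries for the page. */
static void note_mapping_cpus(struct cpumask *pending, struct mm_struct *mm)
{
	/* mm_cpumask() is the set of CPUs that ran this mm */
	cpumask_or(pending, pending, mm_cpumask(mm));
}

static void flush_tlb_local_func(void *info)
{
	local_flush_tlb();	/* flush this CPU's whole TLB */
}

/* After all the ptes were changed: one round of IPIs instead of one per pte. */
static void flush_high_mapcount_tlbs(struct cpumask *pending)
{
	if (!cpumask_empty(pending)) {
		on_each_cpu_mask(pending, flush_tlb_local_func, NULL, 1);
		cpumask_clear(pending);
	}
}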

-- 
All rights reversed


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-04-06 20:33   ` Rik van Riel
@ 2016-04-06 22:02     ` Andrea Arcangeli
  2016-09-21 15:12       ` Gavin Guo
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2016-04-06 22:02 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Hugh Dickins, Davidlohr Bueso, linux-mm, Petr Holasek,
	Andrew Morton, Arjan van de Ven

Hello Rik,

On Wed, Apr 06, 2016 at 04:33:49PM -0400, Rik van Riel wrote:
> On Tue, 2015-11-10 at 19:44 +0100, Andrea Arcangeli wrote:
> > Without a max deduplication limit for each KSM page, the list of the
> > rmap_items associated to each stable_node can grow infinitely
> > large.
> > 
> > During the rmap walk each entry can take up to ~10usec to process
> > because of IPIs for the TLB flushing (both for the primary MMU and
> > the
> > secondary MMUs with the MMU notifier). With only 16GB of address
> > space
> > shared in the same KSM page, that would amount to dozens of seconds
> > of
> > kernel runtime.
> 
> Silly question, but could we fix this problem
> by building up a bitmask of all CPUs that have
> a page-with-high-mapcount mapped, and simply
> send out a global TLB flush to those CPUs once
> we have changed the page tables, instead of
> sending out IPIs at every page table change?

That's a great idea indeed, but it's an orthogonal optimization. Hugh
already posted a patch adding TTU_BATCH_FLUSH to try_to_unmap in
migrate and then calling try_to_unmap_flush() at the end, which is
along the same lines as what you're suggesting. The problem is that we
still have millions of entries potentially present in those lists with
the current code; even a list walk without IPIs is prohibitive.

The only alternative is to make the rmap_walk non-atomic, i.e. break
it in the middle, because it's not just the cost of the IPIs that is
excessive. However doing that breaks all sorts of assumptions in the
VM and overall it will make it weaker: when we're OOM we're not sure
anymore if we have been aggressive enough in clearing referenced bits
if tons of KSM pages are slightly above the atomic-walk limit. Even
ignoring the VM behavior, page migration and in turn compaction and
memory offlining require scanning all entries in the list before we
can return to userland and remove the DIMM or let the "echo >
nr_hugepages" increase succeed, so all those features would become
unreliable and could incur enormous latencies.

Like Arjan mentioned, there's no significant downside in limiting the
"compression ratio" to x256 or x1024 or x2048 (depending on the sysctl
value), because the higher the limit the more we hit diminishing
returns.
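
To put rough numbers on it (back of the envelope, assuming 4KB pages
and the ~10usec per rmap entry figure quoted above):

- unlimited sharing, 16GB of address space merged into one KSM page:
  16GB / 4KB = ~4 million rmap_items -> ~40 seconds per rmap_walk
- max_page_sharing = 256:   256 * 10usec -> ~2.5 msec per rmap_walk
- max_page_sharing = 2048: 2048 * 10usec -> ~20 msec per rmap_walk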

On the design side I believe there's no other black and white
solution possible than this one, which solves all the problems with no
downside at all for the VM fast paths we care about the most.

On the implementation side if somebody can implement it better than I
did while still as optimal, so that the memory footprint of the KSM
metadata is unchanged (on 64bit), that would be welcome.

One thing that could be improved is adding proper defrag to increase
the average density to nearly match the sysctl value at all times, but
the heuristic I added (which tries to achieve the same objective by
picking the busiest stable_node_dup and putting it at the head of the
chain for the next merges) is working well too. There will be at least
2 entries for each stable_node_dup, so the worst case density is still
x2. Real defrag that modifies pagetables would be as costly as page
migration, while this costs almost nothing as it's run only once in a
while.
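
The heuristic is roughly the following (just a sketch to show the
idea; the struct layout and field names here are illustrative, not the
actual mm/ksm.c code):

#include <linux/list.h>

struct dup {
	struct hlist_node hlist;	/* linkage into the chain */
	int nr_rmap_items;		/* how many rmap_items map this page */
};

/*
 * Move the most filled, not yet full, dup to the head of the chain so
 * the next merges fill it up first instead of spreading new
 * rmap_items across half empty dups.
 */
static void promote_busiest_dup(struct hlist_head *chain, int max_sharing)
{
	struct dup *dup, *best = NULL;

	hlist_for_each_entry(dup, chain, hlist)
		if (dup->nr_rmap_items < max_sharing &&
		    (!best || dup->nr_rmap_items > best->nr_rmap_items))
			best = dup;

	if (best && chain->first != &best->hlist) {
		hlist_del(&best->hlist);
		hlist_add_head(&best->hlist, chain);
	}
}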

Thanks,
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-04-06 22:02     ` Andrea Arcangeli
@ 2016-09-21 15:12       ` Gavin Guo
  2016-09-21 15:34         ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Gavin Guo @ 2016-09-21 15:12 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Arjan van de Ven

Hi Andrea,

On Thu, Apr 7, 2016 at 6:02 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hello Rik,
>
> On Wed, Apr 06, 2016 at 04:33:49PM -0400, Rik van Riel wrote:
>> On Tue, 2015-11-10 at 19:44 +0100, Andrea Arcangeli wrote:
>> > Without a max deduplication limit for each KSM page, the list of the
>> > rmap_items associated to each stable_node can grow infinitely
>> > large.
>> >
>> > During the rmap walk each entry can take up to ~10usec to process
>> > because of IPIs for the TLB flushing (both for the primary MMU and
>> > the
>> > secondary MMUs with the MMU notifier). With only 16GB of address
>> > space
>> > shared in the same KSM page, that would amount to dozens of seconds
>> > of
>> > kernel runtime.
>>
>> Silly question, but could we fix this problem
>> by building up a bitmask of all CPUs that have
>> a page-with-high-mapcount mapped, and simply
>> send out a global TLB flush to those CPUs once
>> we have changed the page tables, instead of
>> sending out IPIs at every page table change?
>
> That's great idea indeed, but it's an orthogonal optimization. Hugh
> already posted a patch adding TTU_BATCH_FLUSH to try_to_unmap in
> migrate and then call try_to_unmap_flush() at the end which is on the
> same lines of you're suggesting. Problem is we still got millions of
> entries potentially present in those lists with the current code, even
> a list walk without IPI is prohibitive.
>
> The only alternative is to make rmap_walk non atomic, i.e. break it in
> the middle, because it's not just the cost of IPIs that is
> excessive. However doing that breaks all sort of assumptions in the VM
> and overall it will make it weaker, as when we're OOM we're not sure
> anymore if we have been aggressive enough in clearing referenced bits
> if tons of KSM pages are slightly above the atomic-walk-limit. Even
> ignoring the VM behavior, page migration and in turn compaction and
> memory offlining require scanning all entries in the list before we
> can return to userland and remove the DIMM or succeed the increase of
> echo > nr_hugepages, so all those features would become unreliable and
> they could incur in enormous latencies.
>
> Like Arjan mentioned, there's no significant downside in limiting the
> "compression ratio" to x256 or x1024 or x2048 (depending on the sysctl
> value) because the higher the limit the more we're hitting diminishing
> returns.
>
> On the design side I believe there's no other black and white possible
> solution than this one that solves all problems with no downside at
> all for the VM fast paths we care about the most.
>
> On the implementation side if somebody can implement it better than I
> did while still as optimal, so that the memory footprint of the KSM
> metadata is unchanged (on 64bit), that would be welcome.
>
> One thing that could be improved is adding proper defrag to increase
> the average density to nearly match the sysctl value at all times, but
> the heuristic I added (that tries to achieve the same objective by
> picking the busiest stable_node_dup and putting it in the head of the
> chain for the next merges) is working well too. There will be at least
> 2 entries for each stable_node_dup so the worst case density is still
> x2. Real defrag that modifies pagetables would be as costly as page
> migration, while this costs almost nothing as it's run once in a
> while.
>
> Thanks,
> Andrea
>

Recently, a similar bug was also observed under the numad process
with the v4.4 Ubuntu kernel as well as with the latest upstream
kernel, and I think this patch would be useful to mitigate the
symptom. I searched the mailing list and found that the patch was
never merged into the upstream kernel. Were there any problems that
caused the patch to be dropped?

The numad process tried to migrate a qemu process with 33GB of
memory. It eventually got stuck in the csd_lock_wait function, which
caused the qemu process to hang; the virtual machine showed high CPU
usage and hung as well. With KSM disabled, the symptom disappeared.

From the following backtrace, the RIP smp_call_function_single+219 is
actually in the csd_lock_wait function mentioned above, but the
compiler has optimized that call and it does not appear in the stack.

What happens here is that do_migrate_pages (frame #10) acquires the
mmap_sem semaphore that everything else is waiting for (and that
eventually produces the hang warnings), and it holds that semaphore
for the duration of the page migration.

crash> bt 2950
PID: 2950   TASK: ffff885f97745280  CPU: 49  COMMAND: "numad"
    [exception RIP: smp_call_function_single+219]
    RIP: ffffffff81103a0b  RSP: ffff885f8fb4fb28  RFLAGS: 00000202
    RAX: 0000000000000000  RBX: 0000000000000013  RCX: 0000000000000000
    RDX: 0000000000000003  RSI: 0000000000000100  RDI: 0000000000000286
    RBP: ffff885f8fb4fb70   R8: 0000000000000000   R9: 0000000000080000
    R10: 0000000000000000  R11: ffff883faf917c88  R12: ffffffff810725f0
    R13: 0000000000000013  R14: ffffffff810725f0  R15: ffff885f8fb4fbc8
    CS: 0010  SS: 0018
 #0 [ffff885f8fb4fb30] kvm_unmap_rmapp at ffffffffc01f1c3e [kvm]
 #1 [ffff885f8fb4fb78] smp_call_function_many at ffffffff81103db3
 #2 [ffff885f8fb4fbc0] native_flush_tlb_others at ffffffff8107279d
 #3 [ffff885f8fb4fc08] flush_tlb_page at ffffffff81072a95
 #4 [ffff885f8fb4fc30] ptep_clear_flush at ffffffff811d048e
 #5 [ffff885f8fb4fc60] try_to_unmap_one at ffffffff811cb1c7
 #6 [ffff885f8fb4fcd0] rmap_walk_ksm at ffffffff811e6f91
 #7 [ffff885f8fb4fd28] rmap_walk at ffffffff811cc1bf
 #8 [ffff885f8fb4fd80] try_to_unmap at ffffffff811cc46b
 #9 [ffff885f8fb4fdc8] migrate_pages at ffffffff811f26d8
#10 [ffff885f8fb4fe80] do_migrate_pages at ffffffff811e15f7
#11 [ffff885f8fb4fef8] sys_migrate_pages at ffffffff811e187d
#12 [ffff885f8fb4ff50] entry_SYSCALL_64_fastpath at ffffffff818244f2

After some investigation, I disassembled the coredump and finally
found that the stable_node->hlist is as long as 2306920 entries.

rmap_item list(stable_node->hlist):
stable_node: 0xffff881f836ba000 stable_node->hlist->first =
0xffff883f3e5746b0

struct hlist_head {
    [0] struct hlist_node *first;
}
struct hlist_node {
    [0] struct hlist_node *next;
    [8] struct hlist_node **pprev;
}

crash> list hlist_node.next 0xffff883f3e5746b0 > rmap_item.lst

$ wc -l rmap_item.lst
2306920 rmap_item.lst


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-09-21 15:12       ` Gavin Guo
@ 2016-09-21 15:34         ` Andrea Arcangeli
  2016-09-22 10:48           ` Gavin Guo
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2016-09-21 15:34 UTC (permalink / raw)
  To: Gavin Guo
  Cc: Rik van Riel, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Arjan van de Ven

Hello Gavin,

On Wed, Sep 21, 2016 at 11:12:19PM +0800, Gavin Guo wrote:
> Recently, a similar bug can also be observed under the numad process
> with the v4.4 Ubuntu kernel or the latest upstream kernel. However, I
> think the patch should be useful to mitigate the symptom. I tried to
> search the mailing list and found the patch finally didn't be merged
> into the upstream kernel. If there are any problems which drop the
> patch?

Zero known problems; in fact it's been running in production in both
RHEL7 and RHEL6 for a while, and the RHEL customers have not been
affected anymore for a while now.

It's a critical computational complexity fix, if using KSM in
enterprise production. Hugh already Acked it as well.

It's included in -mm and Andrew submitted it once upstream, but it
bounced, probably because it was not the right time in the merge
window cycle.

Or perhaps because it's complex, but I wouldn't know how to simplify
it, and there's no bug at all in the code.

I would suggest Andrew to send it once again when he feels it's a good
time to do so.

> The numad process tried to migrate a qemu process of 33GB memory.
> Finally, it stuck in the csd_lock_wait function which causes the qemu
> process hung and the virtual machine has high CPU usage and hung also.
> With KSM disabled, the symptom disappeared.

Until it's merged upstream you can cherrypick from my aa.git tree
these three commits:

https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=9384142e4ce830898abcefc4f0479c4533fa5bbc
https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=4b293be7e20c8e8731a4fdc3c3bf6047304d0cc8
https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=44c0d79c2c223c54ffe3fabc893963fc5963d611

They're in -mm too.

> What happens here is that do_migrate_pages (frame #10) acquires the
> mmap_sem semaphore that everything else is waiting for (and that
> eventually produce the hang warnings), and it holds that semaphore for
> the duration of the page migration.
> 
> crash> bt 2950
> PID: 2950   TASK: ffff885f97745280  CPU: 49  COMMAND: "numad"
>     [exception RIP: smp_call_function_single+219]
>     RIP: ffffffff81103a0b  RSP: ffff885f8fb4fb28  RFLAGS: 00000202
>     RAX: 0000000000000000  RBX: 0000000000000013  RCX: 0000000000000000
>     RDX: 0000000000000003  RSI: 0000000000000100  RDI: 0000000000000286
>     RBP: ffff885f8fb4fb70   R8: 0000000000000000   R9: 0000000000080000
>     R10: 0000000000000000  R11: ffff883faf917c88  R12: ffffffff810725f0
>     R13: 0000000000000013  R14: ffffffff810725f0  R15: ffff885f8fb4fbc8
>     CS: 0010  SS: 0018
>  #0 [ffff885f8fb4fb30] kvm_unmap_rmapp at ffffffffc01f1c3e [kvm]
>  #1 [ffff885f8fb4fb78] smp_call_function_many at ffffffff81103db3
>  #2 [ffff885f8fb4fbc0] native_flush_tlb_others at ffffffff8107279d
>  #3 [ffff885f8fb4fc08] flush_tlb_page at ffffffff81072a95
>  #4 [ffff885f8fb4fc30] ptep_clear_flush at ffffffff811d048e
>  #5 [ffff885f8fb4fc60] try_to_unmap_one at ffffffff811cb1c7
>  #6 [ffff885f8fb4fcd0] rmap_walk_ksm at ffffffff811e6f91
>  #7 [ffff885f8fb4fd28] rmap_walk at ffffffff811cc1bf
>  #8 [ffff885f8fb4fd80] try_to_unmap at ffffffff811cc46b
>  #9 [ffff885f8fb4fdc8] migrate_pages at ffffffff811f26d8
> #10 [ffff885f8fb4fe80] do_migrate_pages at ffffffff811e15f7
> #11 [ffff885f8fb4fef8] sys_migrate_pages at ffffffff811e187d
> #12 [ffff885f8fb4ff50] entry_SYSCALL_64_fastpath at ffffffff818244f2
> 
> After some investigations, I've tried to disassemble the coredump and
> finally find the stable_node->hlist is as long as 2306920 entries.

Yep, this is definitely getting fixed by the three commits above and
the problem is in rmap_walk_ksm like you found above. With that
applied you can't ever run into hangs anymore with KSM enabled, no
matter the workload and the amount of memory in guest and host.

numad isn't required to reproduce it, some swapping is enough.

It limits the de-duplication factor to 256, like a x256 compression,
and a x256 compression factor is clearly more than enough. So
effectively the overly long list you found gets hard-limited to 256
entries with my patch applied. The limit is configurable at runtime:

/* Maximum number of page slots sharing a stable node */
static int ksm_max_page_sharing = 256;

If you want to increase the limit (careful: that will increase
the rmap_walk_ksm computation time) you can echo $newsharinglimit >
/sys/kernel/mm/ksm/max_page_sharing.
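
To give a rough idea of the effect (assuming the ~10usec per rmap
entry figure from the patch description): the 2306920-entry hlist you
dumped means a single rmap_walk over that stable_node costs about 23
seconds, while capped at 256 entries it drops to roughly 2.5 msec.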

Hope this helps,
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-09-21 15:34         ` Andrea Arcangeli
@ 2016-09-22 10:48           ` Gavin Guo
  2016-10-28  6:26             ` Gavin Guo
  0 siblings, 1 reply; 29+ messages in thread
From: Gavin Guo @ 2016-09-22 10:48 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Arjan van de Ven

Hi Andrea,

On Wed, Sep 21, 2016 at 11:34 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> Hello Gavin,
>
> On Wed, Sep 21, 2016 at 11:12:19PM +0800, Gavin Guo wrote:
>> Recently, a similar bug can also be observed under the numad process
>> with the v4.4 Ubuntu kernel or the latest upstream kernel. However, I
>> think the patch should be useful to mitigate the symptom. I tried to
>> search the mailing list and found the patch finally didn't be merged
>> into the upstream kernel. If there are any problems which drop the
>> patch?
>
> Zero known problems, in fact it's running in production in both RHEL7
> and RHEL6 for a while. The RHEL customers are not affected anymore for
> a while now.
>
> It's a critical computational complexity fix, if using KSM in
> enterprise production. Hugh already Acked it as well.
>
> It's included in -mm and Andrew submitted it once upstream, but it
> bounced probably perhaps it was not the right time in the merge window
> cycle.
>
> Or perhaps because it's complex but I wouldn't know how to simplify it
> but there's no bug at all in the code.
>
> I would suggest Andrew to send it once again when he feels it's a good
> time to do so.
>
>> The numad process tried to migrate a qemu process of 33GB memory.
>> Finally, it stuck in the csd_lock_wait function which causes the qemu
>> process hung and the virtual machine has high CPU usage and hung also.
>> With KSM disabled, the symptom disappeared.
>
> Until it's merged upstream you can cherrypick from my aa.git tree
> these three commits:
>
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=9384142e4ce830898abcefc4f0479c4533fa5bbc
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=4b293be7e20c8e8731a4fdc3c3bf6047304d0cc8
> https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=44c0d79c2c223c54ffe3fabc893963fc5963d611
>
> They're in -mm too.
>
>> What happens here is that do_migrate_pages (frame #10) acquires the
>> mmap_sem semaphore that everything else is waiting for (and that
>> eventually produce the hang warnings), and it holds that semaphore for
>> the duration of the page migration.
>>
>> crash> bt 2950
>> PID: 2950   TASK: ffff885f97745280  CPU: 49  COMMAND: "numad"
>>     [exception RIP: smp_call_function_single+219]
>>     RIP: ffffffff81103a0b  RSP: ffff885f8fb4fb28  RFLAGS: 00000202
>>     RAX: 0000000000000000  RBX: 0000000000000013  RCX: 0000000000000000
>>     RDX: 0000000000000003  RSI: 0000000000000100  RDI: 0000000000000286
>>     RBP: ffff885f8fb4fb70   R8: 0000000000000000   R9: 0000000000080000
>>     R10: 0000000000000000  R11: ffff883faf917c88  R12: ffffffff810725f0
>>     R13: 0000000000000013  R14: ffffffff810725f0  R15: ffff885f8fb4fbc8
>>     CS: 0010  SS: 0018
>>  #0 [ffff885f8fb4fb30] kvm_unmap_rmapp at ffffffffc01f1c3e [kvm]
>>  #1 [ffff885f8fb4fb78] smp_call_function_many at ffffffff81103db3
>>  #2 [ffff885f8fb4fbc0] native_flush_tlb_others at ffffffff8107279d
>>  #3 [ffff885f8fb4fc08] flush_tlb_page at ffffffff81072a95
>>  #4 [ffff885f8fb4fc30] ptep_clear_flush at ffffffff811d048e
>>  #5 [ffff885f8fb4fc60] try_to_unmap_one at ffffffff811cb1c7
>>  #6 [ffff885f8fb4fcd0] rmap_walk_ksm at ffffffff811e6f91
>>  #7 [ffff885f8fb4fd28] rmap_walk at ffffffff811cc1bf
>>  #8 [ffff885f8fb4fd80] try_to_unmap at ffffffff811cc46b
>>  #9 [ffff885f8fb4fdc8] migrate_pages at ffffffff811f26d8
>> #10 [ffff885f8fb4fe80] do_migrate_pages at ffffffff811e15f7
>> #11 [ffff885f8fb4fef8] sys_migrate_pages at ffffffff811e187d
>> #12 [ffff885f8fb4ff50] entry_SYSCALL_64_fastpath at ffffffff818244f2
>>
>> After some investigations, I've tried to disassemble the coredump and
>> finally find the stable_node->hlist is as long as 2306920 entries.
>
> Yep, this is definitely getting fixed by the three commits above and
> the problem is in rmap_walk_ksm like you found above. With that
> applied you can't ever run into hangs anymore with KSM enabled, no
> matter the workload and the amount of memory in guest and host.
>
> numad isn't required to reproduce it, some swapping is enough.
>
> It limits the de-duplication factor to 256 times, like a x256 times
> compression, a x256 compression factor is clearly more than enough. So
> effectively the list you found that was too long, gets hard-limited to
> 256 entries with my patch applied. The limit is configurable at runtime:
>
> /* Maximum number of page slots sharing a stable node */
> static int ksm_max_page_sharing = 256;
>
> If you want to increase the limit (careful: that will increase
> the rmap_walk_ksm computation time) you can echo $newsharinglimit >
> /sys/kernel/mm/ksm/max_page_sharing.
>
> Hope this helps,
> Andrea

Thank you for the detailed explanation. I've cherry-picked these
patches and am now doing the verification. I'll get back to you if
there is any problem. Thanks!


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-09-22 10:48           ` Gavin Guo
@ 2016-10-28  6:26             ` Gavin Guo
  2016-10-28 18:31               ` Andrea Arcangeli
  0 siblings, 1 reply; 29+ messages in thread
From: Gavin Guo @ 2016-10-28  6:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Arjan van de Ven, Jay Vosburgh

Hi Andrea,

On Thu, Sep 22, 2016 at 6:48 PM, Gavin Guo <gavin.guo@canonical.com> wrote:
>
> Hi Andrea,
>
> On Wed, Sep 21, 2016 at 11:34 PM, Andrea Arcangeli <aarcange@redhat.com> wrote:
> > Hello Gavin,
> >
> > On Wed, Sep 21, 2016 at 11:12:19PM +0800, Gavin Guo wrote:
> >> Recently, a similar bug can also be observed under the numad process
> >> with the v4.4 Ubuntu kernel or the latest upstream kernel. However, I
> >> think the patch should be useful to mitigate the symptom. I tried to
> >> search the mailing list and found the patch finally didn't be merged
> >> into the upstream kernel. If there are any problems which drop the
> >> patch?
> >
> > Zero known problems, in fact it's running in production in both RHEL7
> > and RHEL6 for a while. The RHEL customers are not affected anymore for
> > a while now.
> >
> > It's a critical computational complexity fix, if using KSM in
> > enterprise production. Hugh already Acked it as well.
> >
> > It's included in -mm and Andrew submitted it once upstream, but it
> > bounced probably perhaps it was not the right time in the merge window
> > cycle.
> >
> > Or perhaps because it's complex but I wouldn't know how to simplify it
> > but there's no bug at all in the code.
> >
> > I would suggest Andrew to send it once again when he feels it's a good
> > time to do so.
> >
> >> The numad process tried to migrate a qemu process of 33GB memory.
> >> Finally, it stuck in the csd_lock_wait function which causes the qemu
> >> process hung and the virtual machine has high CPU usage and hung also.
> >> With KSM disabled, the symptom disappeared.
> >
> > Until it's merged upstream you can cherrypick from my aa.git tree
> > these three commits:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=9384142e4ce830898abcefc4f0479c4533fa5bbc
> > https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=4b293be7e20c8e8731a4fdc3c3bf6047304d0cc8
> > https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/commit/?id=44c0d79c2c223c54ffe3fabc893963fc5963d611
> >
> > They're in -mm too.
> >
> >> What happens here is that do_migrate_pages (frame #10) acquires the
> >> mmap_sem semaphore that everything else is waiting for (and that
> >> eventually produce the hang warnings), and it holds that semaphore for
> >> the duration of the page migration.
> >>
> >> crash> bt 2950
> >> PID: 2950   TASK: ffff885f97745280  CPU: 49  COMMAND: "numad"
> >>     [exception RIP: smp_call_function_single+219]
> >>     RIP: ffffffff81103a0b  RSP: ffff885f8fb4fb28  RFLAGS: 00000202
> >>     RAX: 0000000000000000  RBX: 0000000000000013  RCX: 0000000000000000
> >>     RDX: 0000000000000003  RSI: 0000000000000100  RDI: 0000000000000286
> >>     RBP: ffff885f8fb4fb70   R8: 0000000000000000   R9: 0000000000080000
> >>     R10: 0000000000000000  R11: ffff883faf917c88  R12: ffffffff810725f0
> >>     R13: 0000000000000013  R14: ffffffff810725f0  R15: ffff885f8fb4fbc8
> >>     CS: 0010  SS: 0018
> >>  #0 [ffff885f8fb4fb30] kvm_unmap_rmapp at ffffffffc01f1c3e [kvm]
> >>  #1 [ffff885f8fb4fb78] smp_call_function_many at ffffffff81103db3
> >>  #2 [ffff885f8fb4fbc0] native_flush_tlb_others at ffffffff8107279d
> >>  #3 [ffff885f8fb4fc08] flush_tlb_page at ffffffff81072a95
> >>  #4 [ffff885f8fb4fc30] ptep_clear_flush at ffffffff811d048e
> >>  #5 [ffff885f8fb4fc60] try_to_unmap_one at ffffffff811cb1c7
> >>  #6 [ffff885f8fb4fcd0] rmap_walk_ksm at ffffffff811e6f91
> >>  #7 [ffff885f8fb4fd28] rmap_walk at ffffffff811cc1bf
> >>  #8 [ffff885f8fb4fd80] try_to_unmap at ffffffff811cc46b
> >>  #9 [ffff885f8fb4fdc8] migrate_pages at ffffffff811f26d8
> >> #10 [ffff885f8fb4fe80] do_migrate_pages at ffffffff811e15f7
> >> #11 [ffff885f8fb4fef8] sys_migrate_pages at ffffffff811e187d
> >> #12 [ffff885f8fb4ff50] entry_SYSCALL_64_fastpath at ffffffff818244f2
> >>
> >> After some investigations, I've tried to disassemble the coredump and
> >> finally find the stable_node->hlist is as long as 2306920 entries.
> >
> > Yep, this is definitely getting fixed by the three commits above and
> > the problem is in rmap_walk_ksm like you found above. With that
> > applied you can't ever run into hangs anymore with KSM enabled, no
> > matter the workload and the amount of memory in guest and host.
> >
> > numad isn't required to reproduce it, some swapping is enough.
> >
> > It limits the de-duplication factor to 256 times, like a x256 times
> > compression, a x256 compression factor is clearly more than enough. So
> > effectively the list you found that was too long, gets hard-limited to
> > 256 entries with my patch applied. The limit is configurable at runtime:
> >
> > /* Maximum number of page slots sharing a stable node */
> > static int ksm_max_page_sharing = 256;
> >
> > If you want to increase the limit (careful: that will increase
> > the rmap_walk_ksm computation time) you can echo $newsharinglimit >
> > /sys/kernel/mm/ksm/max_page_sharing.
> >
> > Hope this helps,
> > Andrea
>
> Thank you for the detail explanation. I've cherry-picked these patches
> and now doing the verification. I'll get back to you if there is any
> problem. Thanks!

Andrea,

I have tried verifying these patches. However, the default
max_page_sharing of 256 still suffers from the hung task issue. I then
tried the following sequence of values to mitigate the symptom: as the
value is decreased, it takes more time to reproduce the symptom.
Finally, the value 8 was tried and I didn't continue with lower
values.

128 -> 64 -> 32 -> 16 -> 8

The crashdump has also been investigated.

stable_node: 0xffff880d36413040 stable_node->hlist->first = 0xffff880e4c9f4cf0
crash> list hlist_node.next 0xffff880e4c9f4cf0  > rmap_item.lst

$ wc -l rmap_item.lst
$ 8 rmap_item.lst

This shows that the list is actually reduced to 8 items. I wondered
if the loop is still consuming a lot of time and holding the mmap_sem
too long.


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-10-28  6:26             ` Gavin Guo
@ 2016-10-28 18:31               ` Andrea Arcangeli
  2017-04-20  3:14                 ` Gavin Guo
  0 siblings, 1 reply; 29+ messages in thread
From: Andrea Arcangeli @ 2016-10-28 18:31 UTC (permalink / raw)
  To: Gavin Guo
  Cc: Rik van Riel, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Arjan van de Ven, Jay Vosburgh

On Fri, Oct 28, 2016 at 02:26:03PM +0800, Gavin Guo wrote:
> I have tried verifying these patches. However, the default 256
> bytes max_page_sharing still suffers the hung task issue. Then, the
> following sequence has been tried to mitigate the symptom. When the
> value is decreased, it took more time to reproduce the symptom.
> Finally, the value 8 has been tried and I didn't continue with lower
> value.
> 
> 128 -> 64 -> 32 -> 16 -> 8
> 
> The crashdump has also been investigated.

You should try to get multiple sysrq+l too during the hang.

> stable_node: 0xffff880d36413040 stable_node->hlist->first = 0xffff880e4c9f4cf0
> crash> list hlist_node.next 0xffff880e4c9f4cf0  > rmap_item.lst
> 
> $ wc -l rmap_item.lst
> $ 8 rmap_item.lst
> 
> This shows that the list is actually reduced to 8 items. I wondered if the
> loop is still consuming a lot of time and hold the mmap_sem too long.

Even the default 256 would be enough (certainly with KVM, which
doesn't have a deep anon_vma interval tree).

Perhaps this is an app with a massively large anon_vma interval tree
that uses MADV_MERGEABLE and is not qemu/kvm? However then you'd run
into similar issues with anon page rmap walks, so KSM wouldn't be to
blame. The depth of the rmap_items multiplies the cost of the rbtree
walk 512 times, but still it shouldn't freeze for seconds.

The important thing here is that the app is in control of the max
depth of the anon_vma interval tree, while it's not in control of the
max depth of the rmap_item list; this is why it's fundamental that the
KSM rmap_item list is bounded to a max value, while the depth of the
interval tree is a secondary issue because userland has a chance to
optimize for it. If the app deep-forks and uses MADV_MERGEABLE, that
is possible to optimize in userland. But I guess the app that is using
MADV_MERGEABLE is qemu/kvm for you too, so it can't be a too long
interval tree. Furthermore, if the symptom still triggers as a long
hang even with an rmap_item depth of 8, and it just takes longer to
reach the hanging point, it may be something else.

I assume this is not an upstream kernel; can you reproduce on the
upstream kernel? Sorry, but I can't help you any further if this isn't
first verified on the upstream kernel.

Also, if you test on the upstream kernel you can leave the default
value of 256 and then use sysrq+l to get multiple dumps of what's
running on the CPUs. The crash dump is useful as well, but it's also
interesting to see what's running most frequently during the hang
(which isn't guaranteed to be shown by the exact point in time the
crash dump is taken). perf top -g may also help to see where most of
the CPU time is being burnt, if this is a computational complexity
issue inside the kernel.

Note the problem was reproduced and verified as fixed. It's quite
easy to reproduce: I used the migrate_pages syscall to do that, and
after deep KSM merging it takes several seconds in strace -tt, while
with the fix it stays in the order of milliseconds. The point is that
with deeper merging migrate_pages could take minutes in unkillable R
state (or during swapping), while with the KSMscale fix it gets capped
to milliseconds no matter what.

Thanks,
Andrea


* Re: [PATCH 1/1] ksm: introduce ksm_max_page_sharing per page deduplication limit
  2016-10-28 18:31               ` Andrea Arcangeli
@ 2017-04-20  3:14                 ` Gavin Guo
  0 siblings, 0 replies; 29+ messages in thread
From: Gavin Guo @ 2017-04-20  3:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Rik van Riel, Hugh Dickins, Davidlohr Bueso, linux-mm,
	Petr Holasek, Andrew Morton, Arjan van de Ven, Jay Vosburgh

On Sat, Oct 29, 2016 at 2:31 AM, Andrea Arcangeli <aarcange@redhat.com> wrote:
>
> On Fri, Oct 28, 2016 at 02:26:03PM +0800, Gavin Guo wrote:
> > I have tried verifying these patches. However, the default 256
> > bytes max_page_sharing still suffers the hung task issue. Then, the
> > following sequence has been tried to mitigate the symptom. When the
> > value is decreased, it took more time to reproduce the symptom.
> > Finally, the value 8 has been tried and I didn't continue with lower
> > value.
> >
> > 128 -> 64 -> 32 -> 16 -> 8
> >
> > The crashdump has also been investigated.
>
> You should try to get multiple sysrq+l too during the hang.
>
> > stable_node: 0xffff880d36413040 stable_node->hlist->first = 0xffff880e4c9f4cf0
> > crash> list hlist_node.next 0xffff880e4c9f4cf0  > rmap_item.lst
> >
> > $ wc -l rmap_item.lst
> > $ 8 rmap_item.lst
> >
> > This shows that the list is actually reduced to 8 items. I wondered if the
> > loop is still consuming a lot of time and hold the mmap_sem too long.
>
> Even the default 256 would be enough (certainly with KVM that doesn't
> have a deep anon_vma interval tree).
>
> Perhaps this is an app with a massively large anon_vma interval tree
> and uses MADV_MERGEABLE and not qemu/kvm? However then you'd run in
> similar issues with anon pages rmap walks so KSM wouldn't be to
> blame. The depth of the rmap_items multiplies the cost of the rbtree
> walk 512 times but still it shouldn't freeze for seconds.
>
> The important thing here is that the app is in control of the max
> depth of the anon_vma interval tree while it's not in control of the
> max depth of the rmap_item list, this is why it's fundamental that the
> KSM rmap_item list is bounded to a max value, while the depth of the
> interval tree is secondary issue because userland has a chance to
> optimize for it. If the app deep forks and uses MADV_MERGEABLE that is
> possible to optimize in userland. But I guess the app that is using
> MADV_MERGEABLE is qemu/kvm for you too so it can't be a too long
> interval tree. Furthermore if when the symptom triggers you still get
> a long hang even with rmap_item depth of 8 and it just takes longer
> time to reach the hanging point, it may be something else.
>
> I assume this is not an upstream kernel, can you reproduce on the
> upstream kernel? Sorry but I can't help you any further, if this isn't
> first verified on the upstream kernel.
>
> Also if you test on the upstream kernel you can leave the default
> value of 256 and then use sysrq+l to get multiple dumps of what's
> running in the CPUs. The crash dump is useful as well but it's also
> interesting to see what's running most frequently during the hang
> (which isn't guaranteed to be shown by the exact point in time the
> crash dump is being taken). perf top -g may also help if this is a
> computational complexity issue inside the kernel to see where most CPU
> is being burnt.
>
> Note the problem was reproduced and verified as fixed. It's quite easy
> to reproduce, I used migrate_pages syscall to do that, and after the
> deep KSM merging that takes several seconds in strace -tt, while with
> the fix it stays in the order of milliseconds. The point is that with
> deeper merging the migrate_pages could take minutes in unkillable R
> state (or during swapping), while with the KSMscale fix it gets capped
> to milliseconds no matter what.
>
> Thanks,
> Andrea

Andrea,

After spending some time, the problem was sorted out. Actually, there
were 2 bugs in a tangle; the other one has been identified and solved.

I've applied your patch to the Ubuntu 4.4 kernel and it works as
expected. Without the patch, the bug can be observed in the perf flame
graph[1]. With the patch, the unstable CPU load can no longer be
observed inside the virtual appliances monitor[2].

I have tested this patch for at least 3 weeks, and it's stable. This
patch should be merged into the upstream kernel.

Tested-by: Gavin Guo <gavin.guo@canonical.com>

[1]. http://kernel.ubuntu.com/~gavinguo/sf00131845/numa-131845.svg
[2]. http://kernel.ubuntu.com/~gavinguo/sf00131845/virtual_appliances_loading.png


end of thread, other threads:[~2017-04-20  3:14 UTC | newest]

Thread overview: 29+ messages
2015-11-10 18:44 RFC [PATCH 0/1] ksm: introduce ksm_max_page_sharing per page deduplication limit Andrea Arcangeli
2015-11-10 18:44 ` [PATCH 1/1] " Andrea Arcangeli
2015-12-09 16:19   ` Petr Holasek
2015-12-09 17:15     ` Andrea Arcangeli
2015-12-09 18:10       ` Andrea Arcangeli
2015-12-10 16:06         ` Petr Holasek
2015-12-11  0:31   ` Andrew Morton
2016-01-14 23:36   ` Hugh Dickins
2016-01-16 17:49     ` Andrea Arcangeli
2016-01-16 18:00       ` Arjan van de Ven
2016-01-18  8:14         ` Hugh Dickins
2016-01-18 14:43           ` Arjan van de Ven
2016-01-18  9:10       ` Hugh Dickins
2016-01-18  9:45         ` Hugh Dickins
2016-01-18 17:46         ` Andrea Arcangeli
2016-03-17 21:34         ` Hugh Dickins
2016-03-17 21:50           ` Andrew Morton
2016-03-18 16:27           ` Andrea Arcangeli
2016-01-18 11:01     ` Mel Gorman
2016-01-18 22:19       ` Andrea Arcangeli
2016-01-19 10:43         ` Mel Gorman
2016-04-06 20:33   ` Rik van Riel
2016-04-06 22:02     ` Andrea Arcangeli
2016-09-21 15:12       ` Gavin Guo
2016-09-21 15:34         ` Andrea Arcangeli
2016-09-22 10:48           ` Gavin Guo
2016-10-28  6:26             ` Gavin Guo
2016-10-28 18:31               ` Andrea Arcangeli
2017-04-20  3:14                 ` Gavin Guo
