linux-kernel.vger.kernel.org archive mirror
* [RFC 0/6] the big khugepaged redesign
@ 2015-02-23 12:58 Vlastimil Babka
  2015-02-23 12:58 ` [RFC 1/6] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

Recently, concerns were expressed (e.g. [1]) about whether the quite aggressive
THP allocation attempts on page faults are a good performance trade-off.

- THP allocations add to page fault latency, as high-order allocations are
  notoriously expensive. The page allocation slowpath now does extra checks for
  GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
  compaction for user page faults. But even async compaction can be expensive.
- During the first page fault in a 2MB range we cannot predict how much of the
  range will actually be accessed - we can theoretically waste as much as 511
  pages' worth of memory [2]. Or, the pages in the range might be accessed from
  CPUs on different NUMA nodes, and while the base pages could all be local, a
  THP could be remote to all but one CPU. The cost of remote accesses due to
  this false sharing would be higher than any savings on the TLB.
- The interaction with memcg is also problematic [1].

Now I don't have any hard data to show how big these problems are, and I
expect we will discuss this at LSF/MM (and hope somebody has such data [3]).
But it is certain that e.g. SAP recommends disabling THP [4] for their
applications for performance reasons.

One might think that instead of fully disabling THPs, it should be possible to
only disable (or make less aggressive, or limit to MADV_HUGEPAGE regions) THPs
for page faults and leave the collapsing up to khugepaged, which would hide the
latencies and allow better decisions based on how many base pages were faulted
in and from which nodes. However, a closer look gives the impression that
khugepaged was meant rather as a rarely needed fallback for cases where the
THP page fault fails due to e.g. low memory. There are some tunables under
/sys/kernel/mm/transparent_hugepage/, but they do not seem sufficient for
moving the bulk of the THP work to khugepaged as it is.

- setting "defrag" to "madvise" or "never", while leaving khugepaged/defrag=1
  will result in very lightweight THP allocation attempts during page faults.
  This is nice and solves the latency problem, but not the other problems
  described above. It doesn't seem possible to disable page fault THP's
  completely without also disabling khugepaged.
- even if it was possible, the default settings for khugepaged are to scan up
  to 8 PMD's and collapse up to 1 THP per 10 seconds. That's probably too slow
  for some workloads and machines, but if one was to set this to be more
  aggressive, it would become quite inefficient. Khugepaged has a single global
  list of mm's to scan, which may include lot of tasks where scanning won't yield
  anything. It should rather focus on tasks that are actually running (and thus
  could benefit) and where collapses were recently successful. The activity
  should also be ideally accounted to the task that benefits from it.
- scanning on NUMA systems will proceed even when actual THP allocations fail,
  e.g. during memory pressure. In such case it should be better to save the
  scanning time until memory is available.
- there were some limitations on which PMD's khugepaged can collapse. Thanks to
  Ebru's recent patches, this should be fixed soon. With the
  khugepaged/max_ptes_none tunable, one can limit the potential memory wasted.
  But limiting NUMA false sharing is performed just in zone reclaim mode and
  could be made stricter.

This RFC patchset doesn't try to solve everything mentioned above - the full
motivation is included for the bigger picture discussion, including at LSF/MM.
The main part of this patchset is the move of collapse scanning from
khugepaged to the task_work context. This has already been discussed as a good
idea, and an RFC was posted by Alex Thorlton last October [5]. In that
prototype, scanning was driven from __khugepaged_enter(), which is called
on events such as vma_merge, fork, and THP page fault attempts, i.e. events
that are not exactly periodic. The main difference in my patchset is that it
is modeled after the automatic NUMA balancing scanning, i.e. driven from the
scheduler's task_tick_fair(). The second difference is that khugepaged is not
disabled entirely, but repurposed for the costly hugepage allocations. There is
a nodemask indicating which nodes should have hugepages easily available.
The idea is that hugepage allocation attempts from the process context
(either page fault or the task_work collapsing) do not attempt
reclaim/compaction, and on failure clear the nodemask bit and wake up
khugepaged to do the hard work and flip the bit back on. If it appears that
there are no hugepages available, attempts to page fault THP or scan are
suspended.
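
To make the intended coordination concrete, here is a rough sketch of the
process-context side (cheap_thp_alloc() is a made-up helper name for
illustration only; thp_avail_nodes and khugepaged_wait correspond to what the
patches below actually use):

/*
 * Sketch only - not the exact patch code.
 */
static struct page *cheap_thp_alloc(gfp_t gfp, int nid)
{
	struct page *hpage;

	if (!node_isset(nid, thp_avail_nodes))
		return NULL;	/* THP attempts are suspended for this node */

	/* gfp is assumed to be lightweight, i.e. no reclaim/compaction */
	hpage = alloc_pages_exact_node(nid, gfp, HPAGE_PMD_ORDER);
	if (!hpage) {
		node_clear(nid, thp_avail_nodes);
		/* let khugepaged do the hard work and flip the bit back on */
		wake_up_interruptible(&khugepaged_wait);
	}
	return hpage;
}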

I have done only light testing so far to check that it works as intended, not
to prove it's "better" than the current state. I wanted to post the RFC before
LSF/MM.

There are known limitations and TODO/FIXME's in the code, for example:
- the scanning period doesn't yet adjust itself based on recent collapse
  success/failure. The idea is that it will double itself if the last full
  mm scan yielded no collapse (a hypothetical sketch follows after this list).
- some user-visible counters for the task scanning activity should be added
- the change in THP allocation pressure can have hard-to-predict outcomes
  on fragmentation and on the periods of deferred compaction.
- Documentation should be updated
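
For the first item, a hypothetical sketch of what the adjustment could look
like, building on the per-task thp_scan_period field added in patch 4
(thp_adjust_scan_period() and THP_SCAN_PERIOD_MAX are invented names, not part
of this series):

#define THP_SCAN_PERIOD_MAX	60000U	/* arbitrary cap, in msecs */

static void thp_adjust_scan_period(struct task_struct *p, bool collapsed)
{
	if (collapsed)	/* reset to the base period after a successful collapse */
		p->thp_scan_period = khugepaged_scan_sleep_millisecs;
	else		/* back off when a full mm scan yielded nothing */
		p->thp_scan_period = min(p->thp_scan_period * 2,
					 THP_SCAN_PERIOD_MAX);
}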

More stuff is not decided yet:
- moving from khugepaged to task context means the hugepage allocations for
  collapsing no longer use sync compaction. But should we also change the
  "defrag" default from "always" to e.g. "madvise", which results in
  allocations without __GFP_WAIT, to further reduce the latencies perceived
  from process context?
- Would it make sense to have one khugepaged instance per memory node, bound
  to the corresponding CPUs? It would then touch local memory e.g. during
  compaction migrations, and the khugepaged sleeps and wakeups would be more
  fine-grained.
- should we keep using the tunables under /sys/.../khugepaged for activity
  that is mostly no longer performed by khugepaged itself, when e.g. NUMA
  balancing uses sysctl instead?

The stuff not touched by this patchset (nor decided):
- should we allow the user to disable THP page faults completely (or restrict
  them to MADV_HUGEPAGE vma's?), or just assume that max_ptes_none < 511 means
  they should be disabled, because we cannot know how much of the 2MB the
  process would fault in?
- do we want to further limit collapses when base pages come from different
  NUMA nodes? Another tunable (saying that a minimum of X ptes should come
  from a single node), or just hard-require that all are from the same node?
  A hypothetical sketch of such a check follows below.
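
For the second question, a hypothetical check along these lines could sit next
to khugepaged_find_target_node() (max_ptes_other_node and thp_nodes_too_mixed()
are invented names; nothing like this is implemented in this series):

static bool thp_nodes_too_mixed(int *node_load, int target_node,
				unsigned int max_ptes_other_node)
{
	unsigned int other = 0;
	int nid;

	/* count base pages that sit on nodes other than the collapse target */
	for (nid = 0; nid < MAX_NUMNODES; nid++)
		if (nid != target_node)
			other += node_load[nid];

	return other > max_ptes_other_node;
}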

The patchset was lightly tested on v3.19 plus two cherry-picked patches,
077fcf116c8c and be97a41b291, and rebased to 4.0-rc1 before sending.

[1] http://marc.info/?l=linux-mm&m=142056088232484&w=2
[2] http://www.spinics.net/lists/kernel/msg1928252.html
[3] http://www.spinics.net/lists/linux-mm/msg84023.html
[4] http://scn.sap.com/people/markmumy/blog/2014/05/22/sap-iq-and-linux-hugepagestransparent-hugepages
[5] https://lkml.org/lkml/2014/10/22/931

Vlastimil Babka (6):
  mm, thp: stop preallocating hugepages in khugepaged
  mm, thp: make khugepaged check for THP allocability before scanning
  mm, thp: try fault allocations only if we expect them to succeed
  mm, thp: move collapsing from khugepaged to task_work context
  mm, thp: wakeup khugepaged when THP allocation fails
  mm, thp: remove no longer needed khugepaged code

 include/linux/khugepaged.h |  19 +-
 include/linux/mm_types.h   |   4 +
 include/linux/sched.h      |   5 +
 kernel/fork.c              |   1 -
 kernel/sched/core.c        |  12 +
 kernel/sched/fair.c        | 124 ++++++++-
 mm/huge_memory.c           | 628 ++++++++++++++-------------------------------
 7 files changed, 339 insertions(+), 454 deletions(-)

-- 
2.1.4



* [RFC 1/6] mm, thp: stop preallocating hugepages in khugepaged
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
@ 2015-02-23 12:58 ` Vlastimil Babka
  2015-02-23 12:58 ` [RFC 2/6] mm, thp: make khugepaged check for THP allocability before scanning Vlastimil Babka
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

Khugepaged tries to preallocate a hugepage before scanning for THP collapse
candidates. If the preallocation fails, scanning is not attempted. This makes
sense, but it is restricted to !NUMA configurations, where there is no need to
predict which node to preallocate on.

Besides the !NUMA restriction, the preallocated page may also end up unused
and put back when no collapse candidate is found. I have observed the
thp_collapse_alloc vmstat counter reach 3+ times the value of the counter
of actually collapsed pages in /sys/.../khugepaged/pages_collapsed. On the
other hand, the periodic hugepage allocation attempts involving sync
compaction can be beneficial for the antifragmentation mechanism, but that is
harder to evaluate.

The following patch will introduce per-node THP availability tracking, which
has more benefits than the current preallocation and also works with
CONFIG_NUMA. We can therefore remove the preallocation, which also allows a
cleanup of the functions involved in khugepaged allocations. Another small
benefit of the patch is that NUMA configurations can now reuse an allocated
hugepage for another collapse attempt, if the previous one was for the same
node and failed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 136 +++++++++++++++++++------------------------------------
 1 file changed, 46 insertions(+), 90 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc00c8c..44fecfc4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -756,9 +756,9 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 	return 0;
 }
 
-static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
+static inline gfp_t alloc_hugepage_gfpmask(int defrag)
 {
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT));
 }
 
 /* Caller must hold page table lock. */
@@ -816,7 +816,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		return 0;
 	}
-	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
 	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
@@ -1108,7 +1108,7 @@ alloc:
 	    !transparent_hugepage_debug_cow()) {
 		gfp_t gfp;
 
-		gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma), 0);
+		gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
 		new_page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 	} else
 		new_page = NULL;
@@ -2289,40 +2289,44 @@ static int khugepaged_find_target_node(void)
 	return target_node;
 }
 
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
+static inline struct page *alloc_hugepage_node(gfp_t gfp, int node)
 {
-	if (IS_ERR(*hpage)) {
-		if (!*wait)
-			return false;
-
-		*wait = false;
-		*hpage = NULL;
-		khugepaged_alloc_sleep();
-	} else if (*hpage) {
-		put_page(*hpage);
-		*hpage = NULL;
-	}
+	return alloc_pages_exact_node(node, gfp | __GFP_OTHER_NODE,
+							HPAGE_PMD_ORDER);
+}
+#else
+static int khugepaged_find_target_node(void)
+{
+	return 0;
+}
 
-	return true;
+static inline struct page *alloc_hugepage_node(gfp_t gfp, int node)
+{
+	return alloc_pages(gfp, HPAGE_PMD_ORDER);
 }
+#endif
 
 static struct page
-*khugepaged_alloc_page(struct page **hpage, struct mm_struct *mm,
-		       struct vm_area_struct *vma, unsigned long address,
-		       int node)
+*khugepaged_alloc_page(struct page **hpage, int node)
 {
-	VM_BUG_ON_PAGE(*hpage, *hpage);
+	gfp_t gfp;
 
 	/*
-	 * Before allocating the hugepage, release the mmap_sem read lock.
-	 * The allocation can take potentially a long time if it involves
-	 * sync compaction, and we do not need to hold the mmap_sem during
-	 * that. We will recheck the vma after taking it again in write mode.
+	 * If we allocated a hugepage previously and failed to collapse, reuse
+	 * the page, unless it's on different NUMA node.
 	 */
-	up_read(&mm->mmap_sem);
+	if (!IS_ERR_OR_NULL(*hpage)) {
+		if (IS_ENABLED(CONFIG_NUMA) && page_to_nid(*hpage) != node) {
+			put_page(*hpage);
+			*hpage = NULL;
+		} else {
+			return *hpage;
+		}
+	}
+
+	gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
+	*hpage = alloc_hugepage_node(gfp, node);
 
-	*hpage = alloc_pages_exact_node(node, alloc_hugepage_gfpmask(
-		khugepaged_defrag(), __GFP_OTHER_NODE), HPAGE_PMD_ORDER);
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
@@ -2332,59 +2336,6 @@ static struct page
 	count_vm_event(THP_COLLAPSE_ALLOC);
 	return *hpage;
 }
-#else
-static int khugepaged_find_target_node(void)
-{
-	return 0;
-}
-
-static inline struct page *alloc_hugepage(int defrag)
-{
-	return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
-			   HPAGE_PMD_ORDER);
-}
-
-static struct page *khugepaged_alloc_hugepage(bool *wait)
-{
-	struct page *hpage;
-
-	do {
-		hpage = alloc_hugepage(khugepaged_defrag());
-		if (!hpage) {
-			count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
-			if (!*wait)
-				return NULL;
-
-			*wait = false;
-			khugepaged_alloc_sleep();
-		} else
-			count_vm_event(THP_COLLAPSE_ALLOC);
-	} while (unlikely(!hpage) && likely(khugepaged_enabled()));
-
-	return hpage;
-}
-
-static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
-{
-	if (!*hpage)
-		*hpage = khugepaged_alloc_hugepage(wait);
-
-	if (unlikely(!*hpage))
-		return false;
-
-	return true;
-}
-
-static struct page
-*khugepaged_alloc_page(struct page **hpage, struct mm_struct *mm,
-		       struct vm_area_struct *vma, unsigned long address,
-		       int node)
-{
-	up_read(&mm->mmap_sem);
-	VM_BUG_ON(!*hpage);
-	return  *hpage;
-}
-#endif
 
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
@@ -2419,8 +2370,14 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
-	/* release the mmap_sem read lock. */
-	new_page = khugepaged_alloc_page(hpage, mm, vma, address, node);
+	/*
+	 * Before allocating the hugepage, release the mmap_sem read lock.
+	 * The allocation can take potentially a long time if it involves
+	 * sync compaction, and we do not need to hold the mmap_sem during
+	 * that. We will recheck the vma after taking it again in write mode.
+	 */
+	up_read(&mm->mmap_sem);
+	new_page = khugepaged_alloc_page(hpage, node);
 	if (!new_page)
 		return;
 
@@ -2754,15 +2711,9 @@ static void khugepaged_do_scan(void)
 {
 	struct page *hpage = NULL;
 	unsigned int progress = 0, pass_through_head = 0;
-	unsigned int pages = khugepaged_pages_to_scan;
-	bool wait = true;
-
-	barrier(); /* write khugepaged_pages_to_scan to local stack */
+	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 
 	while (progress < pages) {
-		if (!khugepaged_prealloc_page(&hpage, &wait))
-			break;
-
 		cond_resched();
 
 		if (unlikely(kthread_should_stop() || freezing(current)))
@@ -2778,6 +2729,11 @@ static void khugepaged_do_scan(void)
 		else
 			progress = pages;
 		spin_unlock(&khugepaged_mm_lock);
+
+		if (IS_ERR(hpage)) {
+			khugepaged_alloc_sleep();
+			break;
+		}
 	}
 
 	if (!IS_ERR_OR_NULL(hpage))
-- 
2.1.4



* [RFC 2/6] mm, thp: make khugepaged check for THP allocability before scanning
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
  2015-02-23 12:58 ` [RFC 1/6] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
@ 2015-02-23 12:58 ` Vlastimil Babka
  2015-02-23 12:58 ` [RFC 3/6] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

Khugepaged could be scanning for collapse candidates uselessly if it cannot
allocate a hugepage in the end. The hugepage preallocation mechanism prevented
this, but only for !NUMA configurations. It was removed by the previous patch,
and this patch replaces it with a more generic mechanism.

The patch introduces a thp_avail_nodes nodemask, which initially assumes that
a hugepage can be allocated on any node. Whenever khugepaged fails to allocate
a hugepage, it clears the corresponding node bit. Before scanning for collapse
candidates, it tries to allocate a hugepage on each online node with the bit
cleared, and sets the bit back on success. It tries to hold on to the hugepage
if it doesn't hold any other one yet. The assumption is that even if the
hugepage is freed back, it should be possible to allocate it in the near
future without further reclaim and compaction attempts.

During the scanning, khugepaged avoids collapsing on nodes with the bit
cleared, as early as possible. If no nodes have hugepages available, scanning
is skipped altogether.

During testing, the patch did not show much difference in preventing
thp_collapse_failed events from khugepaged, but this can be attributed to sync
compaction, which only khugepaged is allowed to use, and which is heavyweight
enough to succeed fairly often nowadays. However, with the plan to convert THP
collapsing to task_work context, this patch is a preparation to avoid useless
scanning and/or heavyweight THP allocations in that context. A later patch
also extends the THP availability check to page fault context.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 56 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 55 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44fecfc4..55846b8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -101,7 +101,7 @@ struct khugepaged_scan {
 static struct khugepaged_scan khugepaged_scan = {
 	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
 };
-
+static nodemask_t thp_avail_nodes = NODE_MASK_ALL;
 
 static int set_recommended_min_free_kbytes(void)
 {
@@ -2244,6 +2244,14 @@ static bool khugepaged_scan_abort(int nid)
 	int i;
 
 	/*
+	 * If it's clear that we are going to select a node where THP
+	 * allocation is unlikely to succeed, abort
+	 */
+	if (khugepaged_node_load[nid] == (HPAGE_PMD_NR) / 2 &&
+				!node_isset(nid, thp_avail_nodes))
+		return true;
+
+	/*
 	 * If zone_reclaim_mode is disabled, then no extra effort is made to
 	 * allocate memory locally.
 	 */
@@ -2330,6 +2338,7 @@ static struct page
 	if (unlikely(!*hpage)) {
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
+		node_clear(node, thp_avail_nodes);
 		return NULL;
 	}
 
@@ -2337,6 +2346,42 @@ static struct page
 	return *hpage;
 }
 
+/* Return true, if THP should be allocatable on at least one node */
+static bool khugepaged_check_nodes(struct page **hpage)
+{
+	bool ret = false;
+	int nid;
+	struct page *newpage = NULL;
+	gfp_t gfp = alloc_hugepage_gfpmask(khugepaged_defrag());
+
+	for_each_online_node(nid) {
+		if (node_isset(nid, thp_avail_nodes)) {
+			ret = true;
+			continue;
+		}
+
+		newpage = alloc_hugepage_node(gfp, nid);
+
+		if (newpage) {
+			node_set(nid, thp_avail_nodes);
+			ret = true;
+			/*
+			 * Heuristic - try to hold on to the page for collapse
+			 * scanning, if we don't hold any yet.
+			 */
+			if (IS_ERR_OR_NULL(*hpage)) {
+				*hpage = newpage;
+				//FIXME: should we count all/no allocations?
+				count_vm_event(THP_COLLAPSE_ALLOC);
+			} else {
+				put_page(newpage);
+			}
+		}
+	}
+
+	return ret;
+}
+
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
@@ -2557,6 +2602,10 @@ out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
 		node = khugepaged_find_target_node();
+		if (!node_isset(node, thp_avail_nodes)) {
+			ret = 0;
+			goto out;
+		}
 		/* collapse_huge_page will return with the mmap_sem released */
 		collapse_huge_page(mm, address, hpage, vma, node);
 	}
@@ -2713,6 +2762,11 @@ static void khugepaged_do_scan(void)
 	unsigned int progress = 0, pass_through_head = 0;
 	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
 
+	if (!khugepaged_check_nodes(&hpage)) {
+		khugepaged_alloc_sleep();
+		return;
+	}
+
 	while (progress < pages) {
 		cond_resched();
 
-- 
2.1.4



* [RFC 3/6] mm, thp: try fault allocations only if we expect them to succeed
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
  2015-02-23 12:58 ` [RFC 1/6] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
  2015-02-23 12:58 ` [RFC 2/6] mm, thp: make khugepaged check for THP allocability before scanning Vlastimil Babka
@ 2015-02-23 12:58 ` Vlastimil Babka
  2015-02-23 12:58 ` [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context Vlastimil Babka
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

Since we check THP availability for khugepaged THP collapses, we can also use
it for page fault THP allocations. If khugepaged with its sync compaction is
not able to allocate a hugepage, then it's unlikely that the less involved
attempt on page fault would succeed.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55846b8..1eec1a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -761,6 +761,32 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag)
 	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT));
 }
 
+//TODO: inline? check bloat-o-meter
+static inline struct page *
+fault_alloc_hugepage(struct vm_area_struct *vma, unsigned long haddr)
+{
+	struct page *hpage;
+	gfp_t gfp;
+	int nid;
+
+	nid = numa_node_id();
+	/*
+	 * This check is not exact for interleave policy, but we can leave such
+	 * cases to later scanning.
+	 * TODO: should VM_HUGEPAGE madvised vma's proceed regardless of the check?
+	 */
+	if (!node_isset(nid, thp_avail_nodes))
+		return NULL;
+
+	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
+	hpage = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+
+	if (!hpage)
+		node_clear(nid, thp_avail_nodes);
+
+	return hpage;
+}
+
 /* Caller must hold page table lock. */
 static bool set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
@@ -781,7 +807,6 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			       unsigned long address, pmd_t *pmd,
 			       unsigned int flags)
 {
-	gfp_t gfp;
 	struct page *page;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 
@@ -816,8 +841,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		return 0;
 	}
-	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
-	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+	page = fault_alloc_hugepage(vma, haddr);
 	if (unlikely(!page)) {
 		count_vm_event(THP_FAULT_FALLBACK);
 		return VM_FAULT_FALLBACK;
@@ -1105,12 +1129,9 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_unlock(ptl);
 alloc:
 	if (transparent_hugepage_enabled(vma) &&
-	    !transparent_hugepage_debug_cow()) {
-		gfp_t gfp;
-
-		gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
-		new_page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
-	} else
+	    !transparent_hugepage_debug_cow())
+		new_page = fault_alloc_hugepage(vma, haddr);
+	else
 		new_page = NULL;
 
 	if (unlikely(!new_page)) {
-- 
2.1.4



* [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
                   ` (2 preceding siblings ...)
  2015-02-23 12:58 ` [RFC 3/6] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
@ 2015-02-23 12:58 ` Vlastimil Babka
  2015-02-23 14:25   ` Peter Zijlstra
  2015-02-23 12:58 ` [RFC 5/6] mm, thp: wakeup khugepaged when THP allocation fails Vlastimil Babka
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

Moving the THP scanning and collapsing work to task_work context allows us to
balance and account for the effort per task, and to get rid of the mm_slot
infrastructure needed for khugepaged, among other things.

This patch implements the scanning from task_work context by essentially
copying the way the automatic NUMA balancing scanning is performed. It's
currently missing some details, such as automatically adjusting the delay
between scan attempts based on recent collapse success rates.

After this patch, khugepaged is left to perform just the expensive hugepage
allocation attempts, which could easily offset the benefits of THP for the
process if it were to perform them in its own context. The allocation attempts
from process context do not use sync compaction, and the previously introduced
per-node hugepage availability tracking should further reduce failed collapse
attempts. The next patch will improve the coordination between collapsers and
khugepaged.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/khugepaged.h |   5 +
 include/linux/mm_types.h   |   4 +
 include/linux/sched.h      |   5 +
 kernel/sched/core.c        |  12 +++
 kernel/sched/fair.c        | 124 ++++++++++++++++++++++++-
 mm/huge_memory.c           | 225 +++++++++++++--------------------------------
 6 files changed, 210 insertions(+), 165 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index eeb3079..51b2cc5 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -4,10 +4,15 @@
 #include <linux/sched.h> /* MMF_VM_HUGEPAGE */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern unsigned int khugepaged_pages_to_scan;
+extern unsigned int khugepaged_scan_sleep_millisecs;
 extern int __khugepaged_enter(struct mm_struct *mm);
 extern void __khugepaged_exit(struct mm_struct *mm);
 extern int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 				      unsigned long vm_flags);
+extern bool khugepaged_scan_mm(struct mm_struct *mm,
+			       unsigned long *start,
+			       long pages);
 
 #define khugepaged_enabled()					       \
 	(transparent_hugepage_flags &				       \
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 199a03a..b3587e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -451,6 +451,10 @@ struct mm_struct {
 	/* numa_scan_seq prevents two threads setting pte_numa */
 	int numa_scan_seq;
 #endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long thp_next_scan;
+	unsigned long thp_scan_address;
+#endif
 #if defined(CONFIG_NUMA_BALANCING) || defined(CONFIG_COMPACTION)
 	/*
 	 * An operation with batched TLB flushing is going on. Anything that
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432..22a59fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1633,6 +1633,11 @@ struct task_struct {
 
 	unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	u64 thp_scan_last;
+	unsigned int thp_scan_period;
+	struct callback_head thp_work;
+#endif
 
 	struct rcu_head rcu;
 
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e..9389d13 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -32,6 +32,7 @@
 #include <linux/init.h>
 #include <linux/uaccess.h>
 #include <linux/highmem.h>
+#include <linux/khugepaged.h>
 #include <asm/mmu_context.h>
 #include <linux/interrupt.h>
 #include <linux/capability.h>
@@ -1823,6 +1824,17 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
 
 	p->numa_group = NULL;
 #endif /* CONFIG_NUMA_BALANCING */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+		//TODO: have separate initial delay like NUMA_BALANCING?
+		p->mm->thp_next_scan = jiffies +
+					khugepaged_scan_sleep_millisecs;
+		p->mm->thp_scan_address = 0;
+	}
+	p->thp_scan_last = 0ULL;
+	p->thp_scan_period = khugepaged_scan_sleep_millisecs; //TODO: ditto
+	p->thp_work.next = &p->thp_work;
+#endif
 }
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3..551cbde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -30,6 +30,7 @@
 #include <linux/mempolicy.h>
 #include <linux/migrate.h>
 #include <linux/task_work.h>
+#include <linux/khugepaged.h>
 
 #include <trace/events/sched.h>
 
@@ -2220,7 +2221,7 @@ out:
 /*
  * Drive the periodic memory faults..
  */
-void task_tick_numa(struct rq *rq, struct task_struct *curr)
+static bool task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
 	struct callback_head *work = &curr->numa_work;
 	u64 period, now;
@@ -2229,7 +2230,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	 * We don't care about NUMA placement if we don't have memory.
 	 */
 	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
-		return;
+		return false;
 
 	/*
 	 * Using runtime rather than walltime has the dual advantage that
@@ -2248,12 +2249,15 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
 			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
 			task_work_add(curr, work, true);
+			return true;
 		}
 	}
+	return false;
 }
 #else
-static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+static bool task_tick_numa(struct rq *rq, struct task_struct *curr)
 {
+	return false;
 }
 
 static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
@@ -2265,6 +2269,109 @@ static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * Entry point for THP collapse scanning
+ */
+void task_thp_work(struct callback_head *work)
+{
+	unsigned long now = jiffies;
+	struct task_struct *p = current;
+	unsigned long current_scan, next_scan;
+	struct mm_struct *mm = current->mm;
+	unsigned long start;
+	long pages;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, thp_work));
+
+	work->next = work; /* allows the work item to be scheduled again */
+	/*
+	 * Who cares about THP's when they're dying.
+	 *
+	 * NOTE: make sure not to dereference p->mm before this check,
+	 * exit_task_work() happens _after_ exit_mm() so we could be called
+	 * without p->mm even though we still had it when we enqueued this
+	 * work.
+	 */
+	if (p->flags & PF_EXITING)
+		return;
+
+	//TODO: separate initial delay like NUMA_BALANCING has?
+	if (!mm->thp_next_scan) {
+		mm->thp_next_scan = now +
+			msecs_to_jiffies(khugepaged_scan_sleep_millisecs);
+	}
+
+	//TODO automatic tuning of scan frequency?
+	current_scan = mm->thp_next_scan;
+
+	/*
+	 * Set the moment of the next THP scan. This should generally prevent
+	 * another thread from executing task_thp_work at the same time as us,
+	 * but it's not guaranteed. It's not a safety issue though, just
+	 * efficiency.
+	 */
+	if (time_before(now, current_scan))
+		return;
+
+	next_scan = now + msecs_to_jiffies(p->thp_scan_period);
+	if (cmpxchg(&mm->thp_next_scan, current_scan, next_scan)
+							!= current_scan)
+		return;
+
+	/*
+	 * Delay this task enough that another task of this mm will likely win
+	 * the next time around.
+	 */
+	p->thp_scan_last += 2*TICK_NSEC;
+
+	start = mm->thp_scan_address;
+	pages = khugepaged_pages_to_scan;
+
+	khugepaged_scan_mm(mm, &start, pages);
+
+	mm->thp_scan_address = start;
+}
+/*
+ * Drive the periodic scanning for THP collapses
+ */
+void task_tick_thp(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->thp_work;
+	u64 period, now;
+
+	/*
+	 * We don't care about THP collapses if we don't have memory.
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+		return;
+
+	/* We don't care about an mm with no eligible VMAs */
+	if (!test_bit(MMF_VM_HUGEPAGE, &curr->mm->flags))
+		return;
+
+	/*
+	 * Using runtime rather than walltime has the dual advantage that
+	 * we (mostly) drive the scanning from busy threads and that the
+	 * task needs to have done some actual work before we bother with
+	 * THP collapses.
+	 */
+	now = curr->se.sum_exec_runtime;
+	period = (u64)curr->thp_scan_period * NSEC_PER_MSEC;
+
+	if (now - curr->thp_scan_last > period) {
+		if (!curr->thp_scan_last)
+			curr->thp_scan_period = khugepaged_scan_sleep_millisecs;
+		curr->thp_scan_last += period;
+
+		if (!time_before(jiffies, curr->mm->thp_next_scan)) {
+			init_task_work(work, task_thp_work); /* TODO: move this into sched_fork() */
+			task_work_add(curr, work, true);
+		}
+	}
+}
+#endif
+
 static void
 account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
@@ -7713,8 +7820,15 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 		entity_tick(cfs_rq, se, queued);
 	}
 
-	if (numabalancing_enabled)
-		task_tick_numa(rq, curr);
+	/*
+	 * For latency considerations, don't schedule the THP work together
+	 * with NUMA work. NUMA has higher priority, assuming remote accesses
+	 * have worse penalty than TLB misses.
+	 */
+	if (!(numabalancing_enabled && task_tick_numa(rq, curr))
+						&& khugepaged_enabled())
+		task_tick_thp(rq, curr);
+
 
 	update_rq_runnable_avg(rq, 1);
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1eec1a6..1c92edc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -48,10 +48,10 @@ unsigned long transparent_hugepage_flags __read_mostly =
 	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
 
 /* default scan 8*512 pte (or vmas) every 30 second */
-static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
+unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
 static unsigned int khugepaged_pages_collapsed;
 static unsigned int khugepaged_full_scans;
-static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
+unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 /* during fragmentation poll the hugepage allocator once every minute */
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
@@ -2258,9 +2258,7 @@ static void khugepaged_alloc_sleep(void)
 			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
 }
 
-static int khugepaged_node_load[MAX_NUMNODES];
-
-static bool khugepaged_scan_abort(int nid)
+static bool khugepaged_scan_abort(int nid, int *node_load)
 {
 	int i;
 
@@ -2268,7 +2266,7 @@ static bool khugepaged_scan_abort(int nid)
 	 * If it's clear that we are going to select a node where THP
 	 * allocation is unlikely to succeed, abort
 	 */
-	if (khugepaged_node_load[nid] == (HPAGE_PMD_NR) / 2 &&
+	if (node_load[nid] == (HPAGE_PMD_NR) / 2 &&
 				!node_isset(nid, thp_avail_nodes))
 		return true;
 
@@ -2280,11 +2278,11 @@ static bool khugepaged_scan_abort(int nid)
 		return false;
 
 	/* If there is a count for this node already, it must be acceptable */
-	if (khugepaged_node_load[nid])
+	if (node_load[nid])
 		return false;
 
 	for (i = 0; i < MAX_NUMNODES; i++) {
-		if (!khugepaged_node_load[i])
+		if (!node_load[i])
 			continue;
 		if (node_distance(nid, i) > RECLAIM_DISTANCE)
 			return true;
@@ -2293,15 +2291,15 @@ static bool khugepaged_scan_abort(int nid)
 }
 
 #ifdef CONFIG_NUMA
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(int *node_load)
 {
 	static int last_khugepaged_target_node = NUMA_NO_NODE;
 	int nid, target_node = 0, max_value = 0;
 
 	/* find first node with max normal pages hit */
 	for (nid = 0; nid < MAX_NUMNODES; nid++)
-		if (khugepaged_node_load[nid] > max_value) {
-			max_value = khugepaged_node_load[nid];
+		if (node_load[nid] > max_value) {
+			max_value = node_load[nid];
 			target_node = nid;
 		}
 
@@ -2309,7 +2307,7 @@ static int khugepaged_find_target_node(void)
 	if (target_node <= last_khugepaged_target_node)
 		for (nid = last_khugepaged_target_node + 1; nid < MAX_NUMNODES;
 				nid++)
-			if (max_value == khugepaged_node_load[nid]) {
+			if (max_value == node_load[nid]) {
 				target_node = nid;
 				break;
 			}
@@ -2324,7 +2322,7 @@ static inline struct page *alloc_hugepage_node(gfp_t gfp, int node)
 							HPAGE_PMD_ORDER);
 }
 #else
-static int khugepaged_find_target_node(void)
+static int khugepaged_find_target_node(int *node_load)
 {
 	return 0;
 }
@@ -2368,7 +2366,7 @@ static struct page
 }
 
 /* Return true, if THP should be allocatable on at least one node */
-static bool khugepaged_check_nodes(struct page **hpage)
+static bool khugepaged_check_nodes(void)
 {
 	bool ret = false;
 	int nid;
@@ -2386,18 +2384,10 @@ static bool khugepaged_check_nodes(struct page **hpage)
 		if (newpage) {
 			node_set(nid, thp_avail_nodes);
 			ret = true;
-			/*
-			 * Heuristic - try to hold on to the page for collapse
-			 * scanning, if we don't hold any yet.
-			 */
-			if (IS_ERR_OR_NULL(*hpage)) {
-				*hpage = newpage;
-			//FIXME: should we count all/no allocations?
-				count_vm_event(THP_COLLAPSE_ALLOC);
-			} else {
-				put_page(newpage);
-			}
+			put_page(newpage);
 		}
+		if (unlikely(kthread_should_stop() || freezing(current)))
+			break;
 	}
 
 	return ret;
@@ -2544,6 +2534,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 
 	*hpage = NULL;
 
+	//FIXME: this is racy
 	khugepaged_pages_collapsed++;
 out_up_write:
 	up_write(&mm->mmap_sem);
@@ -2557,7 +2548,8 @@ out:
 static int khugepaged_scan_pmd(struct mm_struct *mm,
 			       struct vm_area_struct *vma,
 			       unsigned long address,
-			       struct page **hpage)
+			       struct page **hpage,
+			       int *node_load)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -2574,7 +2566,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 	if (!pmd)
 		goto out;
 
-	memset(khugepaged_node_load, 0, sizeof(khugepaged_node_load));
+	memset(node_load, 0, sizeof(int) * MAX_NUMNODES);
 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	for (_address = address, _pte = pte; _pte < pte+HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
@@ -2595,14 +2587,14 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 			goto out_unmap;
 		/*
 		 * Record which node the original page is from and save this
-		 * information to khugepaged_node_load[].
+		 * information to node_load[].
 		 * Khupaged will allocate hugepage from the node has the max
 		 * hit record.
 		 */
 		node = page_to_nid(page);
-		if (khugepaged_scan_abort(node))
+		if (khugepaged_scan_abort(node, node_load))
 			goto out_unmap;
-		khugepaged_node_load[node]++;
+		node_load[node]++;
 		VM_BUG_ON_PAGE(PageCompound(page), page);
 		if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
 			goto out_unmap;
@@ -2622,7 +2614,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 out_unmap:
 	pte_unmap_unlock(pte, ptl);
 	if (ret) {
-		node = khugepaged_find_target_node();
+		node = khugepaged_find_target_node(node_load);
 		if (!node_isset(node, thp_avail_nodes)) {
 			ret = 0;
 			goto out;
@@ -2657,112 +2649,61 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 	}
 }
 
-static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
-					    struct page **hpage)
-	__releases(&khugepaged_mm_lock)
-	__acquires(&khugepaged_mm_lock)
+bool khugepaged_scan_mm(struct mm_struct *mm, unsigned long *start, long pages)
 {
-	struct mm_slot *mm_slot;
-	struct mm_struct *mm;
 	struct vm_area_struct *vma;
-	int progress = 0;
-
-	VM_BUG_ON(!pages);
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+	struct page *hpage = NULL;
+	int ret;
+	int *node_load;
 
-	if (khugepaged_scan.mm_slot)
-		mm_slot = khugepaged_scan.mm_slot;
-	else {
-		mm_slot = list_entry(khugepaged_scan.mm_head.next,
-				     struct mm_slot, mm_node);
-		khugepaged_scan.address = 0;
-		khugepaged_scan.mm_slot = mm_slot;
-	}
-	spin_unlock(&khugepaged_mm_lock);
+	//TODO: #ifdef this for NUMA only
+	node_load = kmalloc(sizeof(int) * MAX_NUMNODES,
+						GFP_KERNEL | GFP_NOWAIT);
+	if (!node_load)
+		return false;
 
-	mm = mm_slot->mm;
 	down_read(&mm->mmap_sem);
-	if (unlikely(khugepaged_test_exit(mm)))
-		vma = NULL;
-	else
-		vma = find_vma(mm, khugepaged_scan.address);
-
-	progress++;
+	vma = find_vma(mm, *start);
 	for (; vma; vma = vma->vm_next) {
 		unsigned long hstart, hend;
 
-		cond_resched();
-		if (unlikely(khugepaged_test_exit(mm))) {
-			progress++;
-			break;
-		}
-		if (!hugepage_vma_check(vma)) {
-skip:
-			progress++;
+		if (!hugepage_vma_check(vma))
 			continue;
-		}
-		hstart = (vma->vm_start + ~HPAGE_PMD_MASK) & HPAGE_PMD_MASK;
+
+		hstart = ALIGN(vma->vm_start, HPAGE_PMD_SIZE);
 		hend = vma->vm_end & HPAGE_PMD_MASK;
+
 		if (hstart >= hend)
-			goto skip;
-		if (khugepaged_scan.address > hend)
-			goto skip;
-		if (khugepaged_scan.address < hstart)
-			khugepaged_scan.address = hstart;
-		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
-
-		while (khugepaged_scan.address < hend) {
-			int ret;
-			cond_resched();
-			if (unlikely(khugepaged_test_exit(mm)))
-				goto breakouterloop;
-
-			VM_BUG_ON(khugepaged_scan.address < hstart ||
-				  khugepaged_scan.address + HPAGE_PMD_SIZE >
-				  hend);
-			ret = khugepaged_scan_pmd(mm, vma,
-						  khugepaged_scan.address,
-						  hpage);
-			/* move to next address */
-			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			progress += HPAGE_PMD_NR;
+			continue;
+		if (*start < hstart)
+			*start = hstart;
+		VM_BUG_ON(*start & ~HPAGE_PMD_MASK);
+
+		while (*start < hend) {
+			ret = khugepaged_scan_pmd(mm, vma, *start, &hpage,
+								node_load);
+			*start += HPAGE_PMD_SIZE;
+			pages -= HPAGE_PMD_NR;
+
 			if (ret)
-				/* we released mmap_sem so break loop */
-				goto breakouterloop_mmap_sem;
-			if (progress >= pages)
-				goto breakouterloop;
+				goto out;
+
+			if (pages <= 0)
+				goto out_unlock;
 		}
 	}
-breakouterloop:
-	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
-breakouterloop_mmap_sem:
+out_unlock:
+	up_read(&mm->mmap_sem);
+out:
+	if (!vma)
+		*start = 0;
 
-	spin_lock(&khugepaged_mm_lock);
-	VM_BUG_ON(khugepaged_scan.mm_slot != mm_slot);
-	/*
-	 * Release the current mm_slot if this mm is about to die, or
-	 * if we scanned all vmas of this mm.
-	 */
-	if (khugepaged_test_exit(mm) || !vma) {
-		/*
-		 * Make sure that if mm_users is reaching zero while
-		 * khugepaged runs here, khugepaged_exit will find
-		 * mm_slot not pointing to the exiting mm.
-		 */
-		if (mm_slot->mm_node.next != &khugepaged_scan.mm_head) {
-			khugepaged_scan.mm_slot = list_entry(
-				mm_slot->mm_node.next,
-				struct mm_slot, mm_node);
-			khugepaged_scan.address = 0;
-		} else {
-			khugepaged_scan.mm_slot = NULL;
-			khugepaged_full_scans++;
-		}
+	if (!IS_ERR_OR_NULL(hpage))
+		put_page(hpage);
 
-		collect_mm_slot(mm_slot);
-	}
+	kfree(node_load);
 
-	return progress;
+	return true;
 }
 
 static int khugepaged_has_work(void)
@@ -2777,44 +2718,6 @@ static int khugepaged_wait_event(void)
 		kthread_should_stop();
 }
 
-static void khugepaged_do_scan(void)
-{
-	struct page *hpage = NULL;
-	unsigned int progress = 0, pass_through_head = 0;
-	unsigned int pages = READ_ONCE(khugepaged_pages_to_scan);
-
-	if (!khugepaged_check_nodes(&hpage)) {
-		khugepaged_alloc_sleep();
-		return;
-	}
-
-	while (progress < pages) {
-		cond_resched();
-
-		if (unlikely(kthread_should_stop() || freezing(current)))
-			break;
-
-		spin_lock(&khugepaged_mm_lock);
-		if (!khugepaged_scan.mm_slot)
-			pass_through_head++;
-		if (khugepaged_has_work() &&
-		    pass_through_head < 2)
-			progress += khugepaged_scan_mm_slot(pages - progress,
-							    &hpage);
-		else
-			progress = pages;
-		spin_unlock(&khugepaged_mm_lock);
-
-		if (IS_ERR(hpage)) {
-			khugepaged_alloc_sleep();
-			break;
-		}
-	}
-
-	if (!IS_ERR_OR_NULL(hpage))
-		put_page(hpage);
-}
-
 static void khugepaged_wait_work(void)
 {
 	try_to_freeze();
@@ -2841,8 +2744,10 @@ static int khugepaged(void *none)
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		khugepaged_do_scan();
-		khugepaged_wait_work();
+		if (khugepaged_check_nodes())
+			khugepaged_wait_work();
+		else
+			khugepaged_alloc_sleep();
 	}
 
 	spin_lock(&khugepaged_mm_lock);
-- 
2.1.4



* [RFC 5/6] mm, thp: wakeup khugepaged when THP allocation fails
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
                   ` (3 preceding siblings ...)
  2015-02-23 12:58 ` [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context Vlastimil Babka
@ 2015-02-23 12:58 ` Vlastimil Babka
  2015-02-23 12:58 ` [RFC 6/6] mm, thp: remove no longer needed khugepaged code Vlastimil Babka
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

The previous patch took the THP collapse scanning away from khugepaged,
leaving it only to maintain the thp_avail_nodes nodemask through heavyweight
attempts to make a hugepage available on nodes where it could not be allocated
from process context, either through page fault or through the collapse
scanning.

This patch improves the coordination between failed THP allocations and
khugepaged by wakeups, repurposing the khugepaged_wait infrastructure.
Instead of periodically sleeping and checking for work, khugepaged will now
sleep at least alloc_sleep_millisecs after its last allocation attempt in
order to prevent excessive activity, and then respond to a failed THP
allocation immediately through khugepaged_wait.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/huge_memory.c | 77 ++++++++++++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 33 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1c92edc..9172c7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -158,9 +158,6 @@ static int start_khugepaged(void)
 			khugepaged_thread = NULL;
 		}
 
-		if (!list_empty(&khugepaged_scan.mm_head))
-			wake_up_interruptible(&khugepaged_wait);
-
 		set_recommended_min_free_kbytes();
 	} else if (khugepaged_thread) {
 		kthread_stop(khugepaged_thread);
@@ -430,7 +427,6 @@ static ssize_t scan_sleep_millisecs_store(struct kobject *kobj,
 		return -EINVAL;
 
 	khugepaged_scan_sleep_millisecs = msecs;
-	wake_up_interruptible(&khugepaged_wait);
 
 	return count;
 }
@@ -781,8 +777,10 @@ fault_alloc_hugepage(struct vm_area_struct *vma, unsigned long haddr)
 	gfp = alloc_hugepage_gfpmask(transparent_hugepage_defrag(vma));
 	hpage = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
 
-	if (!hpage)
+	if (!hpage) {
 		node_clear(nid, thp_avail_nodes);
+		wake_up_interruptible(&khugepaged_wait);
+	}
 
 	return hpage;
 }
@@ -2054,8 +2052,6 @@ int __khugepaged_enter(struct mm_struct *mm)
 	spin_unlock(&khugepaged_mm_lock);
 
 	atomic_inc(&mm->mm_count);
-	if (wakeup)
-		wake_up_interruptible(&khugepaged_wait);
 
 	return 0;
 }
@@ -2252,12 +2248,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
 	}
 }
 
-static void khugepaged_alloc_sleep(void)
-{
-	wait_event_freezable_timeout(khugepaged_wait, false,
-			msecs_to_jiffies(khugepaged_alloc_sleep_millisecs));
-}
-
 static bool khugepaged_scan_abort(int nid, int *node_load)
 {
 	int i;
@@ -2358,6 +2348,7 @@ static struct page
 		count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
 		*hpage = ERR_PTR(-ENOMEM);
 		node_clear(node, thp_avail_nodes);
+		wake_up_interruptible(&khugepaged_wait);
 		return NULL;
 	}
 
@@ -2365,7 +2356,7 @@ static struct page
 	return *hpage;
 }
 
-/* Return true, if THP should be allocatable on at least one node */
+/* Return true if we tried to allocate on at least one node */
 static bool khugepaged_check_nodes(void)
 {
 	bool ret = false;
@@ -2375,15 +2366,14 @@ static bool khugepaged_check_nodes(void)
 
 	for_each_online_node(nid) {
 		if (node_isset(nid, thp_avail_nodes)) {
-			ret = true;
 			continue;
 		}
 
 		newpage = alloc_hugepage_node(gfp, nid);
+		ret = true;
 
 		if (newpage) {
 			node_set(nid, thp_avail_nodes);
-			ret = true;
 			put_page(newpage);
 		}
 		if (unlikely(kthread_should_stop() || freezing(current)))
@@ -2393,6 +2383,19 @@ static bool khugepaged_check_nodes(void)
 	return ret;
 }
 
+/* Return true if hugepages are available on at least one node */
+static bool check_thp_avail(void)
+{
+	int nid;
+
+	for_each_online_node(nid) {
+		if (node_isset(nid, thp_avail_nodes))
+			return true;
+	}
+
+	return false;
+}
+
 static bool hugepage_vma_check(struct vm_area_struct *vma)
 {
 	if ((!(vma->vm_flags & VM_HUGEPAGE) && !khugepaged_always()) ||
@@ -2656,6 +2659,9 @@ bool khugepaged_scan_mm(struct mm_struct *mm, unsigned long *start, long pages)
 	int ret;
 	int *node_load;
 
+	if (!check_thp_avail())
+		return false;
+
 	//TODO: #ifdef this for NUMA only
 	node_load = kmalloc(sizeof(int) * MAX_NUMNODES,
 						GFP_KERNEL | GFP_NOWAIT);
@@ -2706,30 +2712,36 @@ out:
 	return true;
 }
 
-static int khugepaged_has_work(void)
+static bool khugepaged_has_work(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) &&
-		khugepaged_enabled();
+	int nid;
+
+	for_each_online_node(nid) {
+		if (!node_isset(nid, thp_avail_nodes))
+			return true;
+	}
+
+	return false;
 }
 
-static int khugepaged_wait_event(void)
+static bool khugepaged_wait_event(void)
 {
-	return !list_empty(&khugepaged_scan.mm_head) ||
-		kthread_should_stop();
+	return khugepaged_has_work() || kthread_should_stop();
 }
 
-static void khugepaged_wait_work(void)
+static void khugepaged_wait_work(bool did_alloc)
 {
+	unsigned int msec_sleep;
+
 	try_to_freeze();
 
-	if (khugepaged_has_work()) {
-		if (!khugepaged_scan_sleep_millisecs)
-			return;
+	if (did_alloc) {
+		msec_sleep = READ_ONCE(khugepaged_alloc_sleep_millisecs);
 
-		wait_event_freezable_timeout(khugepaged_wait,
+		if (msec_sleep)
+			wait_event_freezable_timeout(khugepaged_wait,
 					     kthread_should_stop(),
-			msecs_to_jiffies(khugepaged_scan_sleep_millisecs));
-		return;
+						msecs_to_jiffies(msec_sleep));
 	}
 
 	if (khugepaged_enabled())
@@ -2739,15 +2751,14 @@ static void khugepaged_wait_work(void)
 static int khugepaged(void *none)
 {
 	struct mm_slot *mm_slot;
+	bool did_alloc;
 
 	set_freezable();
 	set_user_nice(current, MAX_NICE);
 
 	while (!kthread_should_stop()) {
-		if (khugepaged_check_nodes())
-			khugepaged_wait_work();
-		else
-			khugepaged_alloc_sleep();
+		did_alloc = khugepaged_check_nodes();
+		khugepaged_wait_work(did_alloc);
 	}
 
 	spin_lock(&khugepaged_mm_lock);
-- 
2.1.4



* [RFC 6/6] mm, thp: remove no longer needed khugepaged code
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
                   ` (4 preceding siblings ...)
  2015-02-23 12:58 ` [RFC 5/6] mm, thp: wakeup khugepaged when THP allocation fails Vlastimil Babka
@ 2015-02-23 12:58 ` Vlastimil Babka
  2015-02-23 21:03 ` [RFC 0/6] the big khugepaged redesign Andi Kleen
  2015-02-23 22:46 ` Davidlohr Bueso
  7 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-23 12:58 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Vlastimil Babka

With collapse scanning moved to processes, we can remove a lot of code from
khugepaged, mostly related to the maintenance of mm_slots, which khugepaged
used for tracking which mm's to scan.

We keep the hooks for vma operations such as khugepaged_enter() only to set
the MMF_VM_HUGEPAGE bit, which enables the scanning for the given mm.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/khugepaged.h |  14 +---
 kernel/fork.c              |   1 -
 mm/huge_memory.c           | 193 +--------------------------------------------
 3 files changed, 3 insertions(+), 205 deletions(-)

diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
index 51b2cc5..5af0f35 100644
--- a/include/linux/khugepaged.h
+++ b/include/linux/khugepaged.h
@@ -31,16 +31,10 @@ extern bool khugepaged_scan_mm(struct mm_struct *mm,
 static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	if (test_bit(MMF_VM_HUGEPAGE, &oldmm->flags))
-		return __khugepaged_enter(mm);
+		set_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	return 0;
 }
 
-static inline void khugepaged_exit(struct mm_struct *mm)
-{
-	if (test_bit(MMF_VM_HUGEPAGE, &mm->flags))
-		__khugepaged_exit(mm);
-}
-
 static inline int khugepaged_enter(struct vm_area_struct *vma,
 				   unsigned long vm_flags)
 {
@@ -48,8 +42,7 @@ static inline int khugepaged_enter(struct vm_area_struct *vma,
 		if ((khugepaged_always() ||
 		     (khugepaged_req_madv() && (vm_flags & VM_HUGEPAGE))) &&
 		    !(vm_flags & VM_NOHUGEPAGE))
-			if (__khugepaged_enter(vma->vm_mm))
-				return -ENOMEM;
+			set_bit(MMF_VM_HUGEPAGE, &vma->vm_mm->flags);
 	return 0;
 }
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
@@ -57,9 +50,6 @@ static inline int khugepaged_fork(struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	return 0;
 }
-static inline void khugepaged_exit(struct mm_struct *mm)
-{
-}
 static inline int khugepaged_enter(struct vm_area_struct *vma,
 				   unsigned long vm_flags)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index cf65139..5541a9f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -659,7 +659,6 @@ void mmput(struct mm_struct *mm)
 		uprobe_clear_state(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
-		khugepaged_exit(mm); /* must run before exit_mmap */
 		exit_mmap(mm);
 		set_mm_exe_file(mm, NULL);
 		if (!list_empty(&mm->mmlist)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9172c7f..f497e6b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -56,7 +56,6 @@ unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
 static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
 static struct task_struct *khugepaged_thread __read_mostly;
 static DEFINE_MUTEX(khugepaged_mutex);
-static DEFINE_SPINLOCK(khugepaged_mm_lock);
 static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
 /*
  * default collapse hugepages if there is at least one pte mapped like
@@ -66,41 +65,7 @@ static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
 static unsigned int khugepaged_max_ptes_none __read_mostly = HPAGE_PMD_NR-1;
 
 static int khugepaged(void *none);
-static int khugepaged_slab_init(void);
 
-#define MM_SLOTS_HASH_BITS 10
-static __read_mostly DEFINE_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
-
-static struct kmem_cache *mm_slot_cache __read_mostly;
-
-/**
- * struct mm_slot - hash lookup from mm to mm_slot
- * @hash: hash collision list
- * @mm_node: khugepaged scan list headed in khugepaged_scan.mm_head
- * @mm: the mm that this information is valid for
- */
-struct mm_slot {
-	struct hlist_node hash;
-	struct list_head mm_node;
-	struct mm_struct *mm;
-};
-
-/**
- * struct khugepaged_scan - cursor for scanning
- * @mm_head: the head of the mm list to scan
- * @mm_slot: the current mm_slot we are scanning
- * @address: the next address inside that to be scanned
- *
- * There is only the one khugepaged_scan instance of this cursor structure.
- */
-struct khugepaged_scan {
-	struct list_head mm_head;
-	struct mm_slot *mm_slot;
-	unsigned long address;
-};
-static struct khugepaged_scan khugepaged_scan = {
-	.mm_head = LIST_HEAD_INIT(khugepaged_scan.mm_head),
-};
 static nodemask_t thp_avail_nodes = NODE_MASK_ALL;
 
 static int set_recommended_min_free_kbytes(void)
@@ -601,21 +566,12 @@ delete_obj:
 	return err;
 }
 
-static void __init hugepage_exit_sysfs(struct kobject *hugepage_kobj)
-{
-	sysfs_remove_group(hugepage_kobj, &khugepaged_attr_group);
-	sysfs_remove_group(hugepage_kobj, &hugepage_attr_group);
-	kobject_put(hugepage_kobj);
-}
 #else
 static inline int hugepage_init_sysfs(struct kobject **hugepage_kobj)
 {
 	return 0;
 }
 
-static inline void hugepage_exit_sysfs(struct kobject *hugepage_kobj)
-{
-}
 #endif /* CONFIG_SYSFS */
 
 static int __init hugepage_init(void)
@@ -632,10 +588,6 @@ static int __init hugepage_init(void)
 	if (err)
 		return err;
 
-	err = khugepaged_slab_init();
-	if (err)
-		goto out;
-
 	register_shrinker(&huge_zero_page_shrinker);
 
 	/*
@@ -649,9 +601,6 @@ static int __init hugepage_init(void)
 	start_khugepaged();
 
 	return 0;
-out:
-	hugepage_exit_sysfs(hugepage_kobj);
-	return err;
 }
 subsys_initcall(hugepage_init);
 
@@ -1979,83 +1928,6 @@ int hugepage_madvise(struct vm_area_struct *vma,
 	return 0;
 }
 
-static int __init khugepaged_slab_init(void)
-{
-	mm_slot_cache = kmem_cache_create("khugepaged_mm_slot",
-					  sizeof(struct mm_slot),
-					  __alignof__(struct mm_slot), 0, NULL);
-	if (!mm_slot_cache)
-		return -ENOMEM;
-
-	return 0;
-}
-
-static inline struct mm_slot *alloc_mm_slot(void)
-{
-	if (!mm_slot_cache)	/* initialization failed */
-		return NULL;
-	return kmem_cache_zalloc(mm_slot_cache, GFP_KERNEL);
-}
-
-static inline void free_mm_slot(struct mm_slot *mm_slot)
-{
-	kmem_cache_free(mm_slot_cache, mm_slot);
-}
-
-static struct mm_slot *get_mm_slot(struct mm_struct *mm)
-{
-	struct mm_slot *mm_slot;
-
-	hash_for_each_possible(mm_slots_hash, mm_slot, hash, (unsigned long)mm)
-		if (mm == mm_slot->mm)
-			return mm_slot;
-
-	return NULL;
-}
-
-static void insert_to_mm_slots_hash(struct mm_struct *mm,
-				    struct mm_slot *mm_slot)
-{
-	mm_slot->mm = mm;
-	hash_add(mm_slots_hash, &mm_slot->hash, (long)mm);
-}
-
-static inline int khugepaged_test_exit(struct mm_struct *mm)
-{
-	return atomic_read(&mm->mm_users) == 0;
-}
-
-int __khugepaged_enter(struct mm_struct *mm)
-{
-	struct mm_slot *mm_slot;
-	int wakeup;
-
-	mm_slot = alloc_mm_slot();
-	if (!mm_slot)
-		return -ENOMEM;
-
-	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
-	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
-		free_mm_slot(mm_slot);
-		return 0;
-	}
-
-	spin_lock(&khugepaged_mm_lock);
-	insert_to_mm_slots_hash(mm, mm_slot);
-	/*
-	 * Insert just behind the scanning cursor, to let the area settle
-	 * down a little.
-	 */
-	wakeup = list_empty(&khugepaged_scan.mm_head);
-	list_add_tail(&mm_slot->mm_node, &khugepaged_scan.mm_head);
-	spin_unlock(&khugepaged_mm_lock);
-
-	atomic_inc(&mm->mm_count);
-
-	return 0;
-}
-
 int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 			       unsigned long vm_flags)
 {
@@ -2077,38 +1949,6 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	return 0;
 }
 
-void __khugepaged_exit(struct mm_struct *mm)
-{
-	struct mm_slot *mm_slot;
-	int free = 0;
-
-	spin_lock(&khugepaged_mm_lock);
-	mm_slot = get_mm_slot(mm);
-	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-		free = 1;
-	}
-	spin_unlock(&khugepaged_mm_lock);
-
-	if (free) {
-		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	} else if (mm_slot) {
-		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_sem.
-		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
-}
-
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
@@ -2450,8 +2290,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * handled by the anon_vma lock + PG_lock.
 	 */
 	down_write(&mm->mmap_sem);
-	if (unlikely(khugepaged_test_exit(mm)))
-		goto out;
+	VM_BUG_ON(atomic_read(&mm->mm_users) == 0);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -2629,29 +2468,6 @@ out:
 	return ret;
 }
 
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
-	struct mm_struct *mm = mm_slot->mm;
-
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
-	if (khugepaged_test_exit(mm)) {
-		/* free mm_slot */
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-
-		/*
-		 * Not strictly needed because the mm exited already.
-		 *
-		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		 */
-
-		/* khugepaged_mm_lock actually not necessary for the below */
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	}
-}
-
 bool khugepaged_scan_mm(struct mm_struct *mm, unsigned long *start, long pages)
 {
 	struct vm_area_struct *vma;
@@ -2750,7 +2566,6 @@ static void khugepaged_wait_work(bool did_alloc)
 
 static int khugepaged(void *none)
 {
-	struct mm_slot *mm_slot;
 	bool did_alloc;
 
 	set_freezable();
@@ -2761,12 +2576,6 @@ static int khugepaged(void *none)
 		khugepaged_wait_work(did_alloc);
 	}
 
-	spin_lock(&khugepaged_mm_lock);
-	mm_slot = khugepaged_scan.mm_slot;
-	khugepaged_scan.mm_slot = NULL;
-	if (mm_slot)
-		collect_mm_slot(mm_slot);
-	spin_unlock(&khugepaged_mm_lock);
 	return 0;
 }
 
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context
  2015-02-23 12:58 ` [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context Vlastimil Babka
@ 2015-02-23 14:25   ` Peter Zijlstra
  0 siblings, 0 replies; 23+ messages in thread
From: Peter Zijlstra @ 2015-02-23 14:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Ingo Molnar

On Mon, Feb 23, 2015 at 01:58:40PM +0100, Vlastimil Babka wrote:
> @@ -7713,8 +7820,15 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  		entity_tick(cfs_rq, se, queued);
>  	}
>  
> -	if (numabalancing_enabled)
> -		task_tick_numa(rq, curr);
> +	/*
> +	 * For latency considerations, don't schedule the THP work together
> +	 * with NUMA work. NUMA has higher priority, assuming remote accesses
> +	 * have worse penalty than TLB misses.
> +	 */
> +	if (!(numabalancing_enabled && task_tick_numa(rq, curr))
> +						&& khugepaged_enabled())
> +		task_tick_thp(rq, curr);
> +
>  
>  	update_rq_runnable_avg(rq, 1);
>  }

That's a bit yucky; I think there's no problem moving that
update_rq_runnable_avg() thing up a bit, which would get you:

static void task_tick_fair(..)
{

	...

	update_rq_runnable_avg();

	if (numabalancing_enabled && task_tick_numa(rq, curr))
		return;

	if (khugepaged_enabled() && task_tick_thp(rq, curr))
		return;
}

Clearly the return on that second conditional is a tad pointless, but
OCD :-)

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
                   ` (5 preceding siblings ...)
  2015-02-23 12:58 ` [RFC 6/6] mm, thp: remove no longer needed khugepaged code Vlastimil Babka
@ 2015-02-23 21:03 ` Andi Kleen
  2015-02-23 22:46 ` Davidlohr Bueso
  7 siblings, 0 replies; 23+ messages in thread
From: Andi Kleen @ 2015-02-23 21:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

Vlastimil Babka <vbabka@suse.cz> writes:

> This has been already discussed as a good
> idea and a RFC has been posted by Alex Thorlton last October [5].

In my opinion it's a very bad idea. It heavily penalizes the
single-threaded application case, which is quite important. And it
would likely lead to even larger latencies on the application
base, even for the multithreaded case, as there is no good way
anymore to hide blocking latencies in the process.

The current single-threaded khugepaged has various issues, but this
would just make it much worse.

IMHO it's useless to do much here without first collecting a lot of
data to identify the actual problems. Making changes without any
analysis first seems totally backwards.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
                   ` (6 preceding siblings ...)
  2015-02-23 21:03 ` [RFC 0/6] the big khugepaged redesign Andi Kleen
@ 2015-02-23 22:46 ` Davidlohr Bueso
  2015-02-23 22:56   ` Andrew Morton
  2015-03-09  3:17   ` Vlastimil Babka
  7 siblings, 2 replies; 23+ messages in thread
From: Davidlohr Bueso @ 2015-02-23 22:46 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
> THP allocation attempts on page faults are a good performance trade-off.
> 
> - THP allocations add to page fault latency, as high-order allocations are
>   notoriously expensive. Page allocation slowpath now does extra checks for
>   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>   compaction for user page faults. But even async compaction can be expensive.
> - During the first page fault in a 2MB range we cannot predict how much of the
>   range will be actually accessed - we can theoretically waste as much as 511
>   worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>   from different NUMA nodes and while base pages could be all local, THP could
>   be remote to all but one CPU. The cost of remote accesses due to this false
>   sharing would be higher than any savings on the TLB.
> - The interaction with memcg are also problematic [1].
> 
> Now I don't have any hard data to show how big these problems are, and I
> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
> for performance reasons.

There are plenty of examples of this, e.g. for Oracle:

https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
http://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64.php

Thanks,
Davidlohr


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-23 22:46 ` Davidlohr Bueso
@ 2015-02-23 22:56   ` Andrew Morton
  2015-02-23 22:58     ` Sasha Levin
  2015-02-24 10:32     ` Vlastimil Babka
  2015-03-09  3:17   ` Vlastimil Babka
  1 sibling, 2 replies; 23+ messages in thread
From: Andrew Morton @ 2015-02-23 22:56 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso <dave@stgolabs.net> wrote:

> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
> > Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
> > THP allocation attempts on page faults are a good performance trade-off.
> > 
> > - THP allocations add to page fault latency, as high-order allocations are
> >   notoriously expensive. Page allocation slowpath now does extra checks for
> >   GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
> >   compaction for user page faults. But even async compaction can be expensive.
> > - During the first page fault in a 2MB range we cannot predict how much of the
> >   range will be actually accessed - we can theoretically waste as much as 511
> >   worth of pages [2]. Or, the pages in the range might be accessed from CPUs
> >   from different NUMA nodes and while base pages could be all local, THP could
> >   be remote to all but one CPU. The cost of remote accesses due to this false
> >   sharing would be higher than any savings on the TLB.
> > - The interaction with memcg are also problematic [1].
> > 
> > Now I don't have any hard data to show how big these problems are, and I
> > expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
> > But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
> > for performance reasons.
> 
> There are plenty of examples of this, ie for Oracle:
> 
> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge

hm, five months ago and I don't recall seeing any followup to this. 
Does anyone know what's happening?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-23 22:56   ` Andrew Morton
@ 2015-02-23 22:58     ` Sasha Levin
  2015-02-24 10:32     ` Vlastimil Babka
  1 sibling, 0 replies; 23+ messages in thread
From: Sasha Levin @ 2015-02-23 22:58 UTC (permalink / raw)
  To: Andrew Morton, Davidlohr Bueso
  Cc: Vlastimil Babka, linux-mm, linux-kernel, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

On 02/23/2015 05:56 PM, Andrew Morton wrote:
>>> Now I don't have any hard data to show how big these problems are, and I
>>> > > expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>>> > > But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>>> > > for performance reasons.
>> > 
>> > There are plenty of examples of this, ie for Oracle:
>> > 
>> > https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
> hm, five months ago and I don't recall seeing any followup to this. 
> Does anyone know what's happening?

I'll dig it up.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-23 22:56   ` Andrew Morton
  2015-02-23 22:58     ` Sasha Levin
@ 2015-02-24 10:32     ` Vlastimil Babka
  2015-02-24 11:24       ` Andrea Arcangeli
  2015-03-05 16:30       ` Vlastimil Babka
  1 sibling, 2 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-24 10:32 UTC (permalink / raw)
  To: Andrew Morton, Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar

On 02/23/2015 11:56 PM, Andrew Morton wrote:
> On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso <dave@stgolabs.net> wrote:
>
>> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
>>> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
>>> THP allocation attempts on page faults are a good performance trade-off.
>>>
>>> - THP allocations add to page fault latency, as high-order allocations are
>>>    notoriously expensive. Page allocation slowpath now does extra checks for
>>>    GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>>>    compaction for user page faults. But even async compaction can be expensive.
>>> - During the first page fault in a 2MB range we cannot predict how much of the
>>>    range will be actually accessed - we can theoretically waste as much as 511
>>>    worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>>>    from different NUMA nodes and while base pages could be all local, THP could
>>>    be remote to all but one CPU. The cost of remote accesses due to this false
>>>    sharing would be higher than any savings on the TLB.
>>> - The interaction with memcg are also problematic [1].
>>>
>>> Now I don't have any hard data to show how big these problems are, and I
>>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>>> for performance reasons.
>>
>> There are plenty of examples of this, ie for Oracle:
>>
>> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
>
> hm, five months ago and I don't recall seeing any followup to this.

Actually it's a year and five months, but nevertheless...

> Does anyone know what's happening?

I would suspect mmap_sem being held during the whole THP page fault
(including the needed reclaim and compaction), which I forgot to mention
in the first e-mail - it's not just the page fault latency that is the
problem, but also potentially holding back other processes, which is why
we should allow shifting from THP page faults to deferred collapsing.
Although the attempts at opportunistic page faults without mmap_sem
would also help in this particular case.

Khugepaged also used to hold mmap_sem (for read) during the allocation
attempt, but that has been fixed since then. It could also be zone
lru_lock pressure.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-24 10:32     ` Vlastimil Babka
@ 2015-02-24 11:24       ` Andrea Arcangeli
  2015-02-24 11:45         ` Andrea Arcangeli
  2015-02-25 12:42         ` Vlastimil Babka
  2015-03-05 16:30       ` Vlastimil Babka
  1 sibling, 2 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2015-02-24 11:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

Hi everyone,

On Tue, Feb 24, 2015 at 11:32:30AM +0100, Vlastimil Babka wrote:
> I would suspect mmap_sem being held during whole THP page fault 
> (including the needed reclaim and compaction), which I forgot to mention 
> in the first e-mail - it's not just the problem page fault latency, but 
> also potentially holding back other processes, why we should allow 
> shifting from THP page faults to deferred collapsing.
> Although the attempts for opportunistic page faults without mmap_sem 
> would also help in this particular case.
> 
> Khugepaged also used to hold mmap_sem (for read) during the allocation 
> attempt, but that was fixed since then. It could be also zone lru_lock 
> pressure.

I'm traveling and I didn't have much time to read the code yet, but if
I understood the proposal well, I have some doubt that boosting
khugepaged CPU utilization is going to provide a better universal
trade-off. I think the low-overhead background scan is the safer default.

If we want to do more async background work and less "synchronous work
at fault time", what may be more interesting is to generate
transparent hugepages in the background and possibly not invoke
compaction (or much compaction) during page faults.

I'd rather move compaction to a background kernel thread, and invoke
compaction synchronously only in khugepaged. I like it more, if nothing
else, because it is a kind of background load that can come to a full
stop once enough THPs have been created. Unlike khugepaged, which can
never stop scanning and so had better be a lightweight kind of
background load, as it'd be running all the time.

Creating THPs through khugepaged is much more expensive than creating
them on page faults: khugepaged needs to halt the userland access to
the range once more and it has to copy the 2MB.

Overall I agree with Andi that we need more data collected for various
workloads before embarking on big changes, at least so we can prove
the changes to be beneficial to those workloads.

I would advise not to make changes for apps that are already the
biggest users ever of hugetlbfs (like Oracle). Those are already
optimized by other means. The THP targets are apps that see several
benefits in not ever using hugetlbfs, i.e. apps with more dynamic
workloads that don't fit well with NUMA hard pinning with numactl or
other static placements of memory and CPU.

There are also other corner cases to optimize that have nothing to do
with khugepaged or compaction: for example, redis has issues in the
way it forks() and then uses the child's memory as a snapshot while the
parent keeps running and writing to the memory. If THP is enabled, the
parent that writes to the memory will allocate and copy 2MB objects
instead of 4k objects. That means more memory utilization, but the
bigger problem is those copy_users of 2MB instead of 4k hurting
the parent's runtime.

For redis we need something more fine-grained than MADV_NOHUGEPAGE. It
needs a MADV_COW_NOHUGEPAGE (please think of a better name) that will
only prevent THP creation during COW faults but still maximize THP
utilization for every other case. Once such a madvise becomes
available, redis will run faster with THP enabled (currently redis
recommends THP disabled because of the higher latencies of the 2MB COW
faults while the child process is snapshotting). When the snapshot is
finished and the child quits, khugepaged will recreate THPs for those
fragmented COWs.
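
To make the proposed interface concrete, a minimal usage sketch from the
application side; MADV_COW_NOHUGEPAGE is only the placeholder name
suggested above and does not exist as a real madvise flag, so treat this
purely as an illustration:

#include <sys/mman.h>

/*
 * Hypothetical: ask the kernel to use base pages for COW faults in this
 * range (so the snapshotting parent copies 4k at a time), while leaving
 * khugepaged free to re-collapse the range into THPs later.
 * MADV_COW_NOHUGEPAGE is the name proposed in this thread, not a flag
 * defined by any kernel.
 */
static int snapshot_friendly_heap(void *heap, size_t len)
{
	return madvise(heap, len, MADV_COW_NOHUGEPAGE);
}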

OTOH redis could also use userfaultfd to do the snapshotting and avoid
the fork in the first place, once I add a UFFDIO_WP ioctl to mark and
unmark the memory wrprotected without altering the vma, while catching
the faults with read or POLLIN on the ufd to copy the memory off before
removing the wrprotection. The real problem in fully implementing
UFFDIO_WP will be the swapcache and swapouts: swap entries have no
wrprotection bit to know whether to fire wrprotected userfaults on
write faults when the range is registered with uffdio_register.mode &
UFFDIO_REGISTER_MODE_WP. So far I have only implemented the
UFFDIO_REGISTER_MODE_MISSING tracking mode in full, so I didn't need to
attack the wrprotected swap entry problem, but the new userfaultfd API
is already ready to implement all write protection (or any other
faulting reason) as well, and it can be incrementally extended to
different memory types (tmpfs etc.) without backwards compatibility
issues.
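
As a rough illustration of that flow from userspace: none of the
write-protect pieces existed when this was written, and the names below
(UFFDIO_REGISTER_MODE_WP, UFFDIO_WRITEPROTECT) are the ones that landed
in much later kernels, so this is only a sketch of the intended API
shape, not something that works on a 2015 kernel:

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Sketch: write-protect a region via userfaultfd instead of fork(), so a
 * snapshotter can copy pages off on demand.  Error handling omitted;
 * requires a kernel that implements userfaultfd write-protect mode.
 */
static int wp_snapshot_setup(void *area, size_t len)
{
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	struct uffdio_writeprotect wp;

	ioctl(uffd, UFFDIO_API, &api);

	/* Track write faults (not missing pages) in the snapshot range. */
	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)area;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_WP;
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/* Arm the write protection; the snapshotting thread then read()s
	 * uffd_msg events from uffd, copies each faulting page away, and
	 * clears the protection for that page with mode = 0. */
	memset(&wp, 0, sizeof(wp));
	wp.range.start = (unsigned long)area;
	wp.range.len = len;
	wp.mode = UFFDIO_WRITEPROTECT_MODE_WP;
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);

	return uffd;
}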

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-24 11:24       ` Andrea Arcangeli
@ 2015-02-24 11:45         ` Andrea Arcangeli
  2015-02-25 12:42         ` Vlastimil Babka
  1 sibling, 0 replies; 23+ messages in thread
From: Andrea Arcangeli @ 2015-02-24 11:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

On Tue, Feb 24, 2015 at 12:24:12PM +0100, Andrea Arcangeli wrote:
> I would advise not to make changes for app that are already the
> biggest users ever of hugetlbfs (like Oracle). Those already are
> optimized by other means. THP target are apps that have several

Before somebody risks misunderstanding, perhaps I should clarify
further: what I meant is that if the khugepaged boost helps Oracle or
other heavy users of hugetlbfs but it _hurts_ everything else, as I'd
guess, I'd advise against it. If an app can deal with hugetlbfs, it's
much simpler to optimize by other means and it's not the primary target
of THP, so the priority for the THP default behavior should be biased
towards those apps that can't easily fit into hugetlbfs and NUMA
hard-pinning static placement models.

Of course it'd be perfectly fine to make THP changes that help even
the biggest hugetlbfs users out there, as long as these changes don't
hurt all other normal use cases (where THP is always guaranteed to
provide a significant performance boost if enabled). Chances are the
benchmarks are also comparing "hugetlbfs+THP" vs "hugetlbfs" without
THP, and not "nothing" vs "THP".

Clearly I'd like to optimize for all apps including the biggest
hugetlbfs users, and this is why I'd like to optimize redis as well,
considering it's simple enough to do it with just one madvise to
change the behavior of COW faults and it'd be guaranteed not to hurt
any other common usage. If we were to instead change the default
behavior of COW faults, we'd first need to collect data for a variety
of apps, and personally I doubt such a change would be a good universal
tradeoff, while it's perfectly fine as an opt-in behavioral change
through madvise.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-24 11:24       ` Andrea Arcangeli
  2015-02-24 11:45         ` Andrea Arcangeli
@ 2015-02-25 12:42         ` Vlastimil Babka
  1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-02-25 12:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

On 02/24/2015 12:24 PM, Andrea Arcangeli wrote:
> Hi everyone,

Hi,

> On Tue, Feb 24, 2015 at 11:32:30AM +0100, Vlastimil Babka wrote:
>> I would suspect mmap_sem being held during whole THP page fault
>> (including the needed reclaim and compaction), which I forgot to mention
>> in the first e-mail - it's not just the problem page fault latency, but
>> also potentially holding back other processes, why we should allow
>> shifting from THP page faults to deferred collapsing.
>> Although the attempts for opportunistic page faults without mmap_sem
>> would also help in this particular case.
>>
>> Khugepaged also used to hold mmap_sem (for read) during the allocation
>> attempt, but that was fixed since then. It could be also zone lru_lock
>> pressure.
>
> I'm traveling and I didn't have much time to read the code yet but if
> I understood well the proposal, I've some doubt boosting khugepaged
> CPU utilization is going to provide a better universal trade off. I
> think the low overhead background scan is safer default.

Making the background scanning more efficient should be a win in any case.

> If we want to do more async background work and less "synchronous work
> at fault time", what may be more interesting is to generate
> transparent hugepages in the background and possibly not to invoke
> compaction (or much compaction) in the page faults.

Steps in that direction are in fact part of the patchset :)

> I'd rather move compaction to a background kernel thread, and to
> invoke compaction synchronously only in khugepaged. I like it more if
> nothing else because it is a kind of background load that can come to
> a full stop, once enough THP have been created.

Yes, we agree here.

> Unlike khugepaged that
> can never stop to scan and it better be lightweight kind of background
> load, as it'd be running all the time.

IMHO it doesn't hurt if the scanning can focus on mm's where it's more 
likely to succeed, and tune its activity according to how successful it 
is. Then you don't need to achieve the "lightweightness" by setting the 
existing tunables to very long sleeps and very short scans, which 
increases the delay until the good collapse candidates are actually 
found by khugepaged.

> Creating THP through khugepaged is much more expensive than creating
> them on page faults. khugepaged will need to halt the userland access
> on the range once more and it'll have to copy the 2MB.

Well, Mel also suggested another thing that I didn't mention yet -
in-place collapsing, where the base pages would be allocated on page
faults in such a layout as to allow later collapse without the copying.
I think that Kiryl's refcounting changes could potentially allow this by
allocating a hugepage, but mapping it using pte's, so it could still be
tracked which pages are actually accessed, and from which nodes. If
after some time it looks like a good candidate, just switch it to a pmd;
otherwise break the hugepage and free the unused base pages.

> Overall I agree with Andi we need more data collected for various
> workloads before embarking into big changes, at least so we can proof
> the changes to be beneficial to those workloads.

OK. I mainly wanted to stir some discussion at this point.

> I would advise not to make changes for app that are already the
> biggest users ever of hugetlbfs (like Oracle). Those already are
> optimized by other means. THP target are apps that have several
> benefit in not ever using hugetlbfs, so apps that are more dynamic
> workloads that don't fit well with NUMA hard pinning with numactl or
> other static placements of memory and CPU.
>
> There are also other corner cases to optimize, that have nothing to do
> with khugepaged nor compaction: for example redis has issues in the
> way it forks() and then uses the child memory as a snapshot while the
> parent keeps running and writing to the memory. If THP is enabled, the
> parent that writes to the memory will allocate and copy 2MB objects
> instead of 4k objects. That means more memory utilization but
> especially the problem are those copy_user of 2MB instead of 4k hurting
> the parent runtime.
>
> For redis we need a more finegrined thing than MADV_NOHUGEPAGE. It
> needs a MADV_COW_NOHUGEPAGE (please think at a better name) that will
> only prevent THP creation during COW faults but still maximize THP
> utilization for every other case. Once such a madvise will become
> available, redis will run faster with THP enabled (currently redis
> recommends THP disabled because of the higher latencies in the 2MB COW
> faults while the child process is snapshotting). When the snapshot is
> finished and the child quits, khugepaged will recreate THP for those
> fragmented cows.

Hm, sounds like Kiryl's patchset could also help here? In the parent,
split only the pmd and do COW on 4k pages, while the child keeps the
whole THP. Later khugepaged can recreate the THP for the parent, as you
say. That should be a better default behavior than the current 2MB
copies, and not just for redis? And no new madvise needed. Or maybe with
MADV_HUGEPAGE you can assume that the caller does want the 2MB COW
behavior?


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-24 10:32     ` Vlastimil Babka
  2015-02-24 11:24       ` Andrea Arcangeli
@ 2015-03-05 16:30       ` Vlastimil Babka
  2015-03-05 16:52         ` Andres Freund
  2015-03-06  0:21         ` Andres Freund
  1 sibling, 2 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-03-05 16:30 UTC (permalink / raw)
  To: Andrew Morton, Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Hugh Dickins, Andrea Arcangeli,
	Kirill A. Shutemov, Rik van Riel, Mel Gorman, Michal Hocko,
	Ebru Akagunduz, Alex Thorlton, David Rientjes, Peter Zijlstra,
	Ingo Molnar, Andres Freund, Robert Haas, Josh Berkus

On 02/24/2015 11:32 AM, Vlastimil Babka wrote:
> On 02/23/2015 11:56 PM, Andrew Morton wrote:
>> On Mon, 23 Feb 2015 14:46:43 -0800 Davidlohr Bueso <dave@stgolabs.net> wrote:
>>
>>> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
>>>> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
>>>> THP allocation attempts on page faults are a good performance trade-off.
>>>>
>>>> - THP allocations add to page fault latency, as high-order allocations are
>>>>    notoriously expensive. Page allocation slowpath now does extra checks for
>>>>    GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>>>>    compaction for user page faults. But even async compaction can be expensive.
>>>> - During the first page fault in a 2MB range we cannot predict how much of the
>>>>    range will be actually accessed - we can theoretically waste as much as 511
>>>>    worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>>>>    from different NUMA nodes and while base pages could be all local, THP could
>>>>    be remote to all but one CPU. The cost of remote accesses due to this false
>>>>    sharing would be higher than any savings on the TLB.
>>>> - The interaction with memcg are also problematic [1].
>>>>
>>>> Now I don't have any hard data to show how big these problems are, and I
>>>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>>>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>>>> for performance reasons.
>>>
>>> There are plenty of examples of this, ie for Oracle:
>>>
>>> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
>>
>> hm, five months ago and I don't recall seeing any followup to this.
> 
> Actually it's year + five months, but nevertheless...
> 
>> Does anyone know what's happening?

So I think that post was actually about THP support enabled in .config slowing
down hugetlbfs, and I found a followup post here
https://blogs.oracle.com/linuxkernel/entry/performance_impact_of_transparent_huge -
that was after all solved in 3.12. Sasha also mentioned that the split PTL
patchset helped as well, the degradation in IOPS due to THP being enabled is now
limited to 5%, and possibly the refcounting redesign could help further.

That however means the workload is based on hugetlbfs and shouldn't trigger THP
page fault activity, which is the aim of this patchset. Some more googling made
me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
patchset should help, but I obviously won't be able to measure this before LSF/MM...

I'm CCing the psql guys from last year LSF/MM - do you have any insight about
psql performance with THPs enabled/disabled on recent kernels, where e.g.
compaction is no longer synchronous for THP page faults?

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-03-05 16:30       ` Vlastimil Babka
@ 2015-03-05 16:52         ` Andres Freund
  2015-03-05 17:01           ` Vlastimil Babka
  2015-03-06  0:21         ` Andres Freund
  1 sibling, 1 reply; 23+ messages in thread
From: Andres Freund @ 2015-03-05 16:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Mel Gorman, Michal Hocko, Ebru Akagunduz, Alex Thorlton,
	David Rientjes, Peter Zijlstra, Ingo Molnar, Robert Haas,
	Josh Berkus

Hi,

On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
> That however means the workload is based on hugetlbfs and shouldn't trigger THP
> page fault activity, which is the aim of this patchset. Some more googling made
> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
> patchset should help, but I obviously won't be able to measure this before LSF/MM...
> 
> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
> psql performance with THPs enabled/disabled on recent kernels, where e.g.
> compaction is no longer synchronous for THP page faults?

What exactly counts as "recent" in this context? Most of the bigger
installations where we found THP to be absolutely prohibitive (slowdowns
of an order of magnitude, huge latency spikes) unfortunately run
quite old kernels...  I guess 3.11 does *not* count :/? That one is a
bigger machine where I could relatively quickly re-enable THP to check
whether it's still bad. I might be able to get it rebooted
onto a newer kernel, will ask.

Greetings,

Andres Freund

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-03-05 16:52         ` Andres Freund
@ 2015-03-05 17:01           ` Vlastimil Babka
  2015-03-05 17:07             ` Andres Freund
  0 siblings, 1 reply; 23+ messages in thread
From: Vlastimil Babka @ 2015-03-05 17:01 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Mel Gorman, Michal Hocko, Ebru Akagunduz, Alex Thorlton,
	David Rientjes, Peter Zijlstra, Ingo Molnar, Robert Haas,
	Josh Berkus

On 03/05/2015 05:52 PM, Andres Freund wrote:
> Hi,
> 
> On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
>> That however means the workload is based on hugetlbfs and shouldn't trigger THP
>> page fault activity, which is the aim of this patchset. Some more googling made
>> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
>> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
>> patchset should help, but I obviously won't be able to measure this before LSF/MM...
>> 
>> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
>> psql performance with THPs enabled/disabled on recent kernels, where e.g.
>> compaction is no longer synchronous for THP page faults?
> 
> What exactly counts as "recent" in this context? Most of the bigger
> installations where we found THP to be absolutely prohibitive (slowdowns
> on the order of a magnitude, huge latency spikes) unfortunately run
> quite old kernels...  I guess 3.11 does *not* count :/? That'd be a

Yeah, that's too old :/ 3.17 has patches to make compaction less aggressive on
THP page faults, and 3.18 prevents khugepaged from holding mmap_sem during
compaction, which could also be relevant.

> bigger machine where I could relatively quickly reenable THP to check
> whether it's still bad. I might be able to trigger it to be rebooted
> onto a newer kernel, will ask.

Thanks, that would be great if you could do that.
I also noticed that you now support hugetlbfs. That could also be an
interesting data point, whether the hugetlbfs usage helped because the THP
code wouldn't trigger.

Vlastimil

> Greetings,
> 
> Andres Freund
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-03-05 17:01           ` Vlastimil Babka
@ 2015-03-05 17:07             ` Andres Freund
  0 siblings, 0 replies; 23+ messages in thread
From: Andres Freund @ 2015-03-05 17:07 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Mel Gorman, Michal Hocko, Ebru Akagunduz, Alex Thorlton,
	David Rientjes, Peter Zijlstra, Ingo Molnar, Robert Haas,
	Josh Berkus

On 2015-03-05 18:01:08 +0100, Vlastimil Babka wrote:
> On 03/05/2015 05:52 PM, Andres Freund wrote:
> > What exactly counts as "recent" in this context? Most of the bigger
> > installations where we found THP to be absolutely prohibitive (slowdowns
> > on the order of a magnitude, huge latency spikes) unfortunately run
> > quite old kernels...  I guess 3.11 does *not* count :/? That'd be a
> 
> Yeah that's too old :/

Guessed so.

> I also noticed that you now support hugetlbfs. That could be also interesting
> data point, if the hugetlbfs usage helped because THP code wouldn't
> trigger.

Well, mmap(MAP_HUGETLB), but yeah.

Will let you know once I know whether it's possible to get a newer kernel.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-03-05 16:30       ` Vlastimil Babka
  2015-03-05 16:52         ` Andres Freund
@ 2015-03-06  0:21         ` Andres Freund
  2015-03-06  7:50           ` Vlastimil Babka
  1 sibling, 1 reply; 23+ messages in thread
From: Andres Freund @ 2015-03-06  0:21 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Mel Gorman, Michal Hocko, Ebru Akagunduz, Alex Thorlton,
	David Rientjes, Peter Zijlstra, Ingo Molnar, Robert Haas,
	Josh Berkus

Long mail ahead, sorry for that.

TL;DR: THP is still noticeable, but not nearly as bad.

On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
> That however means the workload is based on hugetlbfs and shouldn't trigger THP
> page fault activity, which is the aim of this patchset. Some more googling made
> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
> patchset should help, but I obviously won't be able to measure this before LSF/MM...

Just as a reference, this is how some of the more extreme profiles looked
in the past:

>     96.50%    postmaster  [kernel.kallsyms]         [k] _spin_lock_irq
>               |
>               --- _spin_lock_irq
>                  |
>                  |--99.87%-- compact_zone
>                  |          compact_zone_order
>                  |          try_to_compact_pages
>                  |          __alloc_pages_nodemask
>                  |          alloc_pages_vma
>                  |          do_huge_pmd_anonymous_page
>                  |          handle_mm_fault
>                  |          __do_page_fault
>                  |          do_page_fault
>                  |          page_fault
>                  |          0x631d98
>                   --0.13%-- [...]

That specific profile is from a rather old kernel as you probably
recognize.

> I'm CCing the psql guys from last year LSF/MM - do you have any insight about
> psql performance with THPs enabled/disabled on recent kernels, where e.g.
> compaction is no longer synchronous for THP page faults?

So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
RAM.

First off: it's noticeably harder to trigger problems than it used to
be. But I can still trigger various problems that are much worse with
THP enabled than without.

There seem to be various different bottlenecks; I can get somewhat
different profiles.

In a somewhat artificial workload that tries to simulate what I've seen
trigger the problem at a customer, I can quite easily trigger large
differences between THP=enable and THP=never.  There are two types of
tasks running, one purely OLTP, another doing somewhat more complex
statements that require a fair amount of process-local memory.

(ignore the absolute numbers for progress, I just waited for somewhat
stable results while doing other stuff)

THP off:
Task 1 solo:
progress: 200.0 s, 391442.0 tps, 0.654 ms lat
progress: 201.0 s, 394816.1 tps, 0.683 ms lat
progress: 202.0 s, 409722.5 tps, 0.625 ms lat
progress: 203.0 s, 384794.9 tps, 0.665 ms lat

combined:
Task 1:
progress: 144.0 s, 25430.4 tps, 10.067 ms lat
progress: 145.0 s, 22260.3 tps, 11.500 ms lat
progress: 146.0 s, 24089.9 tps, 10.627 ms lat
progress: 147.0 s, 25888.8 tps, 9.888 ms lat

Task 2:
progress: 24.4 s, 30.0 tps, 2134.043 ms lat
progress: 26.5 s, 29.8 tps, 2150.487 ms lat
progress: 28.4 s, 29.7 tps, 2151.557 ms lat
progress: 30.4 s, 28.5 tps, 2245.304 ms lat

flat profile:
     6.07%      postgres  postgres            [.] heap_form_minimal_tuple
     4.36%      postgres  postgres            [.] heap_fill_tuple
     4.22%      postgres  postgres            [.] ExecStoreMinimalTuple
     4.11%      postgres  postgres            [.] AllocSetAlloc
     3.97%      postgres  postgres            [.] advance_aggregates
     3.94%      postgres  postgres            [.] advance_transition_function
     3.94%      postgres  postgres            [.] ExecMakeTableFunctionResult
     3.33%      postgres  postgres            [.] heap_compute_data_size
     3.30%      postgres  postgres            [.] MemoryContextReset
     3.28%      postgres  postgres            [.] ExecScan
     3.04%      postgres  postgres            [.] ExecProject
     2.96%      postgres  postgres            [.] generate_series_step_int4
     2.94%      postgres  [kernel.kallsyms]   [k] clear_page_c

(i.e. most of it postgres, cache miss bound)

THP on:
Task 1 solo:
progress: 140.0 s, 390458.1 tps, 0.656 ms lat
progress: 141.0 s, 391174.2 tps, 0.654 ms lat
progress: 142.0 s, 394828.8 tps, 0.648 ms lat
progress: 143.0 s, 398156.2 tps, 0.643 ms lat

Task 1:
progress: 179.0 s, 23963.1 tps, 10.683 ms lat
progress: 180.0 s, 22712.9 tps, 11.271 ms lat
progress: 181.0 s, 21211.4 tps, 12.069 ms lat
progress: 182.0 s, 23207.8 tps, 11.031 ms lat

Task 2:
progress: 28.2 s, 19.1 tps, 3349.747 ms lat
progress: 31.0 s, 19.8 tps, 3230.589 ms lat
progress: 34.3 s, 21.5 tps, 2979.113 ms lat
progress: 37.4 s, 20.9 tps, 3055.143 ms lat

flat profile:
    21.36%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
     4.93%      postgres  postgres            [.] ExecStoreMinimalTuple
     4.02%      postgres  postgres            [.] heap_form_minimal_tuple
     3.55%      postgres  [kernel.kallsyms]   [k] clear_page_c
     2.85%      postgres  postgres            [.] heap_fill_tuple
     2.60%      postgres  postgres            [.] ExecMakeTableFunctionResult
     2.57%      postgres  postgres            [.] AllocSetAlloc
     2.44%      postgres  postgres            [.] advance_transition_function
     2.43%      postgres  postgres            [.] generate_series_step_int4

callgraph:
    18.23%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
                |
                --- pageblock_pfn_to_page
                   |
                   |--99.05%-- isolate_migratepages
                   |          compact_zone
                   |          compact_zone_order
                   |          try_to_compact_pages
                   |          __alloc_pages_direct_compact
                   |          __alloc_pages_nodemask
                   |          alloc_pages_vma
                   |          do_huge_pmd_anonymous_page
                   |          __handle_mm_fault
                   |          handle_mm_fault
                   |          __do_page_fault
                   |          do_page_fault
                   |          page_fault
....
                   |
                    --0.95%-- compact_zone
                              compact_zone_order
                              try_to_compact_pages
                              __alloc_pages_direct_compact
                              __alloc_pages_nodemask
                              alloc_pages_vma
                              do_huge_pmd_anonymous_page
                              __handle_mm_fault
                              handle_mm_fault
                              __do_page_fault
     4.98%      postgres  postgres            [.] ExecStoreMinimalTuple
                |
     4.20%      postgres  postgres            [.] heap_form_minimal_tuple
                |
     3.69%      postgres  [kernel.kallsyms]   [k] clear_page_c
                |
                --- clear_page_c
                   |
                   |--58.89%-- __do_huge_pmd_anonymous_page
                   |          do_huge_pmd_anonymous_page
                   |          __handle_mm_fault
                   |          handle_mm_fault
                   |          __do_page_fault
                   |          do_page_fault
                   |          page_fault

As you can see THP on/off makes a noticeable difference, especially for
Task 2. Compaction suddenly takes a significant amount of time. But:
It's a relatively gradual slowdown, at pretty extreme concurrency. So
I'm pretty happy already.


In the workload tested here most non-shared allocations are short
lived, so it's not surprising that it's not worth compacting pages. I do
wonder whether it'd be possible to keep some running statistics about
whether THP is worthwhile or not.


This is just one workload, and I saw some different profiles while
playing around. But I've already invested more time in this today than I
should have... :)


BTW, parallel process exits with large shared mappings aren't
particularly fun:

    80.09%      postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
                |
                --- _raw_spin_lock_irqsave
                   |
                   |--99.97%-- pagevec_lru_move_fn
                   |          |
                   |          |--65.51%-- activate_page
                   |          |          mark_page_accessed.part.23
                   |          |          mark_page_accessed
                   |          |          zap_pte_range
                   |          |          unmap_page_range
                   |          |          unmap_single_vma
                   |          |          unmap_vmas
                   |          |          exit_mmap
                   |          |          mmput.part.27
                   |          |          mmput
                   |          |          exit_mm
                   |          |          do_exit
                   |          |          do_group_exit
                   |          |          sys_exit_group
                   |          |          system_call_fastpath
                   |          |
                   |           --34.49%-- lru_add_drain_cpu
                   |                     lru_add_drain
                   |                     free_pages_and_swap_cache
                   |                     tlb_flush_mmu_free
                   |                     zap_pte_range
                   |                     unmap_page_range
                   |                     unmap_single_vma
                   |                     unmap_vmas
                   |                     exit_mmap
                   |                     mmput.part.27
                   |                     mmput
                   |                     exit_mm
                   |                     do_exit
                   |                     do_group_exit
                   |                     sys_exit_group
                   |                     system_call_fastpath
                    --0.03%-- [...]

     9.75%      postgres  [kernel.kallsyms]  [k] zap_pte_range
                |
                --- zap_pte_range
                    unmap_page_range
                    unmap_single_vma
                    unmap_vmas
                    exit_mmap
                    mmput.part.27
                    mmput
                    exit_mm
                    do_exit
                    do_group_exit
                    sys_exit_group
                    system_call_fastpath

     1.93%      postgres  [kernel.kallsyms]  [k] release_pages
                |
                --- release_pages
                   |
                   |--77.09%-- free_pages_and_swap_cache
                   |          tlb_flush_mmu_free
                   |          zap_pte_range
                   |          unmap_page_range
                   |          unmap_single_vma
                   |          unmap_vmas
                   |          exit_mmap
                   |          mmput.part.27
                   |          mmput
                   |          exit_mm
                   |          do_exit
                   |          do_group_exit
                   |          sys_exit_group
                   |          system_call_fastpath
                   |
                   |--22.64%-- pagevec_lru_move_fn
                   |          |
                   |          |--63.88%-- activate_page
                   |          |          mark_page_accessed.part.23
                   |          |          mark_page_accessed
                   |          |          zap_pte_range
                   |          |          unmap_page_range
                   |          |          unmap_single_vma
                   |          |          unmap_vmas
                   |          |          exit_mmap
                   |          |          mmput.part.27
                   |          |          mmput
                   |          |          exit_mm
                   |          |          do_exit
                   |          |          do_group_exit
                   |          |          sys_exit_group
                   |          |          system_call_fastpath
                   |          |
                   |           --36.12%-- lru_add_drain_cpu
                   |                     lru_add_drain
                   |                     free_pages_and_swap_cache
                   |                     tlb_flush_mmu_free
                   |                     zap_pte_range
                   |                     unmap_page_range
                   |                     unmap_single_vma
                   |                     unmap_vmas
                   |                     exit_mmap
                   |                     mmput.part.27
                   |                     mmput
                   |                     exit_mm
                   |                     do_exit
                   |                     do_group_exit
                   |                     sys_exit_group
                   |                     system_call_fastpath
                    --0.27%-- [...]

     1.91%      postgres  [kernel.kallsyms]  [k] page_remove_file_rmap
                |
                --- page_remove_file_rmap
                   |
                   |--98.18%-- page_remove_rmap
                   |          zap_pte_range
                   |          unmap_page_range
                   |          unmap_single_vma
                   |          unmap_vmas
                   |          exit_mmap
                   |          mmput.part.27
                   |          mmput
                   |          exit_mm
                   |          do_exit
                   |          do_group_exit
                   |          sys_exit_group
                   |          system_call_fastpath
                   |
                    --1.82%-- zap_pte_range
                              unmap_page_range
                              unmap_single_vma
                              unmap_vmas
                              exit_mmap
                              mmput.part.27
                              mmput
                              exit_mm
                              do_exit
                              do_group_exit
                              sys_exit_group
                              system_call_fastpath



Greetings,

Andres Freund

--
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [RFC 0/6] the big khugepaged redesign
  2015-03-06  0:21         ` Andres Freund
@ 2015-03-06  7:50           ` Vlastimil Babka
  0 siblings, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-03-06  7:50 UTC (permalink / raw)
  To: Andres Freund
  Cc: Andrew Morton, Davidlohr Bueso, linux-mm, linux-kernel,
	Hugh Dickins, Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel,
	Mel Gorman, Michal Hocko, Ebru Akagunduz, Alex Thorlton,
	David Rientjes, Peter Zijlstra, Ingo Molnar, Robert Haas,
	Josh Berkus

On 03/06/2015 01:21 AM, Andres Freund wrote:
> Long mail ahead, sorry for that.

No problem, thanks a lot!

> TL;DR: THP is still noticeable, but not nearly as bad.
> 
> On 2015-03-05 17:30:16 +0100, Vlastimil Babka wrote:
>> That however means the workload is based on hugetlbfs and shouldn't trigger THP
>> page fault activity, which is the aim of this patchset. Some more googling made
>> me recall that last LSF/MM, postgresql people mentioned THP issues and pointed
>> at compaction. See http://lwn.net/Articles/591723/ That's exactly where this
>> patchset should help, but I obviously won't be able to measure this before LSF/MM...
> 
> Just as a reference, this is how some of the more extreme profiles looked
> in the past:
> 
>>     96.50%    postmaster  [kernel.kallsyms]         [k] _spin_lock_irq
>>               |
>>               --- _spin_lock_irq
>>                  |
>>                  |--99.87%-- compact_zone
>>                  |          compact_zone_order
>>                  |          try_to_compact_pages
>>                  |          __alloc_pages_nodemask
>>                  |          alloc_pages_vma
>>                  |          do_huge_pmd_anonymous_page
>>                  |          handle_mm_fault
>>                  |          __do_page_fault
>>                  |          do_page_fault
>>                  |          page_fault
>>                  |          0x631d98
>>                   --0.13%-- [...]
> 
> That specific profile is from a rather old kernel as you probably
> recognize.

Yeah, sounds like synchronous compaction before it was forbidden for THP page
faults...

>> I'm CCing the psql guys from last year's LSF/MM - do you have any insight about
>> psql performance with THPs enabled/disabled on recent kernels, where e.g.
>> compaction is no longer synchronous for THP page faults?
> 
> So, I've managed to get a machine upgraded to 3.19. 4 x E5-4620, 256GB
> RAM.
> 
> First off: it's noticeably harder to trigger problems than it used to
> be. But I can still trigger various problems that are much worse with
> THP enabled than without.
> 
> There seem to be various different bottlenecks; I can get somewhat
> different profiles.
> 
> In a somewhat artificial workload that tries to simulate what I've seen
> trigger the problem at a customer, I can quite easily trigger large
> differences between THP=enable and THP=never.  There are two types of
> tasks running: one purely OLTP, the other doing somewhat more complex
> statements that require a fair amount of process-local memory.
> 
> (ignore the absolute numbers for progress, I just waited for somewhat
> stable results while doing other stuff)
> 
> THP off:
> Task 1 solo:
> progress: 200.0 s, 391442.0 tps, 0.654 ms lat
> progress: 201.0 s, 394816.1 tps, 0.683 ms lat
> progress: 202.0 s, 409722.5 tps, 0.625 ms lat
> progress: 203.0 s, 384794.9 tps, 0.665 ms lat
> 
> combined:
> Task 1:
> progress: 144.0 s, 25430.4 tps, 10.067 ms lat
> progress: 145.0 s, 22260.3 tps, 11.500 ms lat
> progress: 146.0 s, 24089.9 tps, 10.627 ms lat
> progress: 147.0 s, 25888.8 tps, 9.888 ms lat
> 
> Task 2:
> progress: 24.4 s, 30.0 tps, 2134.043 ms lat
> progress: 26.5 s, 29.8 tps, 2150.487 ms lat
> progress: 28.4 s, 29.7 tps, 2151.557 ms lat
> progress: 30.4 s, 28.5 tps, 2245.304 ms lat
> 
> flat profile:
>      6.07%      postgres  postgres            [.] heap_form_minimal_tuple
>      4.36%      postgres  postgres            [.] heap_fill_tuple
>      4.22%      postgres  postgres            [.] ExecStoreMinimalTuple
>      4.11%      postgres  postgres            [.] AllocSetAlloc
>      3.97%      postgres  postgres            [.] advance_aggregates
>      3.94%      postgres  postgres            [.] advance_transition_function
>      3.94%      postgres  postgres            [.] ExecMakeTableFunctionResult
>      3.33%      postgres  postgres            [.] heap_compute_data_size
>      3.30%      postgres  postgres            [.] MemoryContextReset
>      3.28%      postgres  postgres            [.] ExecScan
>      3.04%      postgres  postgres            [.] ExecProject
>      2.96%      postgres  postgres            [.] generate_series_step_int4
>      2.94%      postgres  [kernel.kallsyms]   [k] clear_page_c
> 
> (i.e. most of it postgres, cache miss bound)
> 
> THP on:
> Task 1 solo:
> progress: 140.0 s, 390458.1 tps, 0.656 ms lat
> progress: 141.0 s, 391174.2 tps, 0.654 ms lat
> progress: 142.0 s, 394828.8 tps, 0.648 ms lat
> progress: 143.0 s, 398156.2 tps, 0.643 ms lat
> 
> Task 1:
> progress: 179.0 s, 23963.1 tps, 10.683 ms lat
> progress: 180.0 s, 22712.9 tps, 11.271 ms lat
> progress: 181.0 s, 21211.4 tps, 12.069 ms lat
> progress: 182.0 s, 23207.8 tps, 11.031 ms lat
> 
> Task 2:
> progress: 28.2 s, 19.1 tps, 3349.747 ms lat
> progress: 31.0 s, 19.8 tps, 3230.589 ms lat
> progress: 34.3 s, 21.5 tps, 2979.113 ms lat
> progress: 37.4 s, 20.9 tps, 3055.143 ms lat

So that's roughly a third lower tps for task 2? Not very nice...

> flat profile:
>     21.36%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page

Interesting. This function shouldn't be heavyweight, although cache misses are
certainly possible. It's only called once per pageblock, so for it to be this
prominent, the pageblocks are probably marked as unsuitable and the scanner just
skips over them uselessly. Compaction doesn't become deferred, since that only
happens for synchronous compaction, and this is probably doing just a lot of
asynchronous ones.
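
(Below is a toy userspace model of that scanning pattern, purely to illustrate
the point; it is not kernel code, and the zone size and suitability ratio are
made-up numbers:)

#include <stdbool.h>
#include <stdio.h>

#define PAGEBLOCK_PAGES 512UL          /* 2MB pageblock / 4kB pages */
#define ZONE_PAGES      (1UL << 22)    /* a 16GB zone in this model */

/* stand-in for the per-pageblock suitability/skip checks */
static bool pageblock_suitable(unsigned long pfn)
{
	return (pfn / PAGEBLOCK_PAGES) % 1000 == 0;   /* ~0.1% suitable */
}

int main(void)
{
	unsigned long pfn, lookups = 0, scanned = 0;

	for (pfn = 0; pfn < ZONE_PAGES; pfn += PAGEBLOCK_PAGES) {
		lookups++;                    /* one pfn->page lookup per pageblock */
		if (!pageblock_suitable(pfn))
			continue;             /* skipped without doing real work */
		scanned++;                    /* real isolation work would go here */
	}
	printf("%lu pageblock lookups, %lu pageblocks actually scanned\n",
	       lookups, scanned);
	return 0;
}

When almost nothing is suitable, the per-pageblock lookup is nearly all the
scanner does, so even a cheap helper can dominate the profile.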

I wonder what the /proc/vmstat counters for compaction and THP fault successes
look like here...
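
(For reference, a minimal sketch for snapshotting those counters; it just
filters /proc/vmstat for the compact_* and thp_* lines:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* print only the compaction and THP counters */
		if (!strncmp(line, "compact_", 8) || !strncmp(line, "thp_", 4))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

A plain grep on those two prefixes before and after a run would of course do
just as well.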

>      4.93%      postgres  postgres            [.] ExecStoreMinimalTuple
>      4.02%      postgres  postgres            [.] heap_form_minimal_tuple
>      3.55%      postgres  [kernel.kallsyms]   [k] clear_page_c
>      2.85%      postgres  postgres            [.] heap_fill_tuple
>      2.60%      postgres  postgres            [.] ExecMakeTableFunctionResult
>      2.57%      postgres  postgres            [.] AllocSetAlloc
>      2.44%      postgres  postgres            [.] advance_transition_function
>      2.43%      postgres  postgres            [.] generate_series_step_int4
> 
> callgraph:
>     18.23%      postgres  [kernel.kallsyms]   [k] pageblock_pfn_to_page
>                 |
>                 --- pageblock_pfn_to_page
>                    |
>                    |--99.05%-- isolate_migratepages
>                    |          compact_zone
>                    |          compact_zone_order
>                    |          try_to_compact_pages
>                    |          __alloc_pages_direct_compact
>                    |          __alloc_pages_nodemask
>                    |          alloc_pages_vma
>                    |          do_huge_pmd_anonymous_page
>                    |          __handle_mm_fault
>                    |          handle_mm_fault
>                    |          __do_page_fault
>                    |          do_page_fault
>                    |          page_fault
> ....
>                    |
>                     --0.95%-- compact_zone
>                               compact_zone_order
>                               try_to_compact_pages
>                               __alloc_pages_direct_compact
>                               __alloc_pages_nodemask
>                               alloc_pages_vma
>                               do_huge_pmd_anonymous_page
>                               __handle_mm_fault
>                               handle_mm_fault
>                               __do_page_fault
>      4.98%      postgres  postgres            [.] ExecStoreMinimalTuple
>                 |
>      4.20%      postgres  postgres            [.] heap_form_minimal_tuple
>                 |
>      3.69%      postgres  [kernel.kallsyms]   [k] clear_page_c
>                 |
>                 --- clear_page_c
>                    |
>                    |--58.89%-- __do_huge_pmd_anonymous_page
>                    |          do_huge_pmd_anonymous_page
>                    |          __handle_mm_fault
>                    |          handle_mm_fault
>                    |          __do_page_fault
>                    |          do_page_fault
>                    |          page_fault
> 
> As you can see THP on/off makes a noticeable difference, especially for
> Task 2. Compaction suddenly takes a significant amount of time. But:
> It's a relatively gradual slowdown, at pretty extreme concurrency. So
> I'm pretty happy already.
> 
> 
> In the workload tested here most non-shared allocations are short
> lived. So it's not surprising that it's not worth compacting pages. I do
> wonder whether it'd be possible to keep some running statistics about
> THP being worthwhile or not.

My goal was to be more conservative and collapse mostly in khugepaged instead
of allocating at page fault time. But maybe some running per-thread statistics
of hugepage lifetime could work too...
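
(Purely to illustrate what such a statistic could look like, a hypothetical
sketch; none of these names or structures exist in the kernel, and the decay
factor and threshold are arbitrary:)

#include <stdbool.h>
#include <stdio.h>

struct thp_stats {
	unsigned long avg_lifetime_ms;	/* decaying average of THP lifetimes */
};

/* record how long a THP survived before being split or freed */
static void thp_record_lifetime(struct thp_stats *s, unsigned long ms)
{
	/* new average = 7/8 old + 1/8 new sample */
	s->avg_lifetime_ms = (s->avg_lifetime_ms * 7 + ms) / 8;
}

/* only keep trying huge allocations if past THPs lived long enough */
static bool thp_looks_worthwhile(const struct thp_stats *s,
				 unsigned long threshold_ms)
{
	return s->avg_lifetime_ms >= threshold_ms;
}

int main(void)
{
	struct thp_stats s = { .avg_lifetime_ms = 0 };
	unsigned long samples[] = { 5, 10, 8, 12, 4 };	/* short-lived THPs */
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		thp_record_lifetime(&s, samples[i]);

	printf("average lifetime %lums, worthwhile: %d\n",
	       s.avg_lifetime_ms, thp_looks_worthwhile(&s, 100));
	return 0;
}

The idea being that a task whose huge pages keep getting freed or split shortly
after allocation would stop being offered aggressive huge allocations.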

> This is just one workload, and I saw some different profiles while
> playing around. But I've already invested more time in this today than I
> should have... :)

Again, thanks a lot! If you find some more time, could you please also quickly
check how this workload behaves when THPs are enabled but page fault
compaction is disabled completely with:

echo never > /sys/kernel/mm/transparent_hugepage/defrag

After LSF/MM I might be interested in how to reproduce this locally to use as a
testcase...

> BTW, parallel process exits with large shared mappings aren't
> particularly fun:
> 
>     80.09%      postgres  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
>                 |
>                 --- _raw_spin_lock_irqsave
>                    |
>                    |--99.97%-- pagevec_lru_move_fn
>                    |          |
>                    |          |--65.51%-- activate_page

Hm, at first sight it seems odd that activating pages would be useful when they
are being unmapped. But I'm not that familiar with this area...

>                    |          |          mark_page_accessed.part.23
>                    |          |          mark_page_accessed
>                    |          |          zap_pte_range
>                    |          |          unmap_page_range
>                    |          |          unmap_single_vma
>                    |          |          unmap_vmas
>                    |          |          exit_mmap
>                    |          |          mmput.part.27
>                    |          |          mmput
>                    |          |          exit_mm
>                    |          |          do_exit
>                    |          |          do_group_exit
>                    |          |          sys_exit_group
>                    |          |          system_call_fastpath
>                    |          |
>                    |           --34.49%-- lru_add_drain_cpu
>                    |                     lru_add_drain
>                    |                     free_pages_and_swap_cache
>                    |                     tlb_flush_mmu_free
>                    |                     zap_pte_range
>                    |                     unmap_page_range
>                    |                     unmap_single_vma
>                    |                     unmap_vmas
>                    |                     exit_mmap
>                    |                     mmput.part.27
>                    |                     mmput
>                    |                     exit_mm
>                    |                     do_exit
>                    |                     do_group_exit
>                    |                     sys_exit_group
>                    |                     system_call_fastpath
>                     --0.03%-- [...]
> 
>      9.75%      postgres  [kernel.kallsyms]  [k] zap_pte_range
>                 |
>                 --- zap_pte_range
>                     unmap_page_range
>                     unmap_single_vma
>                     unmap_vmas
>                     exit_mmap
>                     mmput.part.27
>                     mmput
>                     exit_mm
>                     do_exit
>                     do_group_exit
>                     sys_exit_group
>                     system_call_fastpath
> 
>      1.93%      postgres  [kernel.kallsyms]  [k] release_pages
>                 |
>                 --- release_pages
>                    |
>                    |--77.09%-- free_pages_and_swap_cache
>                    |          tlb_flush_mmu_free
>                    |          zap_pte_range
>                    |          unmap_page_range
>                    |          unmap_single_vma
>                    |          unmap_vmas
>                    |          exit_mmap
>                    |          mmput.part.27
>                    |          mmput
>                    |          exit_mm
>                    |          do_exit
>                    |          do_group_exit
>                    |          sys_exit_group
>                    |          system_call_fastpath
>                    |
>                    |--22.64%-- pagevec_lru_move_fn
>                    |          |
>                    |          |--63.88%-- activate_page
>                    |          |          mark_page_accessed.part.23
>                    |          |          mark_page_accessed
>                    |          |          zap_pte_range
>                    |          |          unmap_page_range
>                    |          |          unmap_single_vma
>                    |          |          unmap_vmas
>                    |          |          exit_mmap
>                    |          |          mmput.part.27
>                    |          |          mmput
>                    |          |          exit_mm
>                    |          |          do_exit
>                    |          |          do_group_exit
>                    |          |          sys_exit_group
>                    |          |          system_call_fastpath
>                    |          |
>                    |           --36.12%-- lru_add_drain_cpu
>                    |                     lru_add_drain
>                    |                     free_pages_and_swap_cache
>                    |                     tlb_flush_mmu_free
>                    |                     zap_pte_range
>                    |                     unmap_page_range
>                    |                     unmap_single_vma
>                    |                     unmap_vmas
>                    |                     exit_mmap
>                    |                     mmput.part.27
>                    |                     mmput
>                    |                     exit_mm
>                    |                     do_exit
>                    |                     do_group_exit
>                    |                     sys_exit_group
>                    |                     system_call_fastpath
>                     --0.27%-- [...]
> 
>      1.91%      postgres  [kernel.kallsyms]  [k] page_remove_file_rmap
>                 |
>                 --- page_remove_file_rmap
>                    |
>                    |--98.18%-- page_remove_rmap
>                    |          zap_pte_range
>                    |          unmap_page_range
>                    |          unmap_single_vma
>                    |          unmap_vmas
>                    |          exit_mmap
>                    |          mmput.part.27
>                    |          mmput
>                    |          exit_mm
>                    |          do_exit
>                    |          do_group_exit
>                    |          sys_exit_group
>                    |          system_call_fastpath
>                    |
>                     --1.82%-- zap_pte_range
>                               unmap_page_range
>                               unmap_single_vma
>                               unmap_vmas
>                               exit_mmap
>                               mmput.part.27
>                               mmput
>                               exit_mm
>                               do_exit
>                               do_group_exit
>                               sys_exit_group
>                               system_call_fastpath
> 
> 
> 
> Greetings,
> 
> Andres Freund
> 
> --
>  Andres Freund	                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
> 



* Re: [RFC 0/6] the big khugepaged redesign
  2015-02-23 22:46 ` Davidlohr Bueso
  2015-02-23 22:56   ` Andrew Morton
@ 2015-03-09  3:17   ` Vlastimil Babka
  1 sibling, 0 replies; 23+ messages in thread
From: Vlastimil Babka @ 2015-03-09  3:17 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: linux-mm, linux-kernel, Andrew Morton, Hugh Dickins,
	Andrea Arcangeli, Kirill A. Shutemov, Rik van Riel, Mel Gorman,
	Michal Hocko, Ebru Akagunduz, Alex Thorlton, David Rientjes,
	Peter Zijlstra, Ingo Molnar

On 02/23/2015 11:46 PM, Davidlohr Bueso wrote:
> On Mon, 2015-02-23 at 13:58 +0100, Vlastimil Babka wrote:
>> Recently, there was concern expressed (e.g. [1]) whether the quite aggressive
>> THP allocation attempts on page faults are a good performance trade-off.
>>
>> - THP allocations add to page fault latency, as high-order allocations are
>>    notoriously expensive. Page allocation slowpath now does extra checks for
>>    GFP_TRANSHUGE && !PF_KTHREAD to avoid the more expensive synchronous
>>    compaction for user page faults. But even async compaction can be expensive.
>> - During the first page fault in a 2MB range we cannot predict how much of the
>>    range will be actually accessed - we can theoretically waste as much as 511
>>    worth of pages [2]. Or, the pages in the range might be accessed from CPUs
>>    from different NUMA nodes and while base pages could be all local, THP could
>>    be remote to all but one CPU. The cost of remote accesses due to this false
>>    sharing would be higher than any savings on the TLB.
>> - The interaction with memcg are also problematic [1].
>>
>> Now I don't have any hard data to show how big these problems are, and I
>> expect we will discuss this on LSF/MM (and hope somebody has such data [3]).
>> But it's certain that e.g. SAP recommends to disable THPs [4] for their apps
>> for performance reasons.
>
> There are plenty of examples of this, ie for Oracle:
>
> https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
> http://oracle-base.com/articles/linux/configuring-huge-pages-for-oracle-on-linux-64.php

Just stumbled upon more references when catching up on LWN:

http://lwn.net/Articles/634797/



end of thread

Thread overview: 23+ messages
2015-02-23 12:58 [RFC 0/6] the big khugepaged redesign Vlastimil Babka
2015-02-23 12:58 ` [RFC 1/6] mm, thp: stop preallocating hugepages in khugepaged Vlastimil Babka
2015-02-23 12:58 ` [RFC 2/6] mm, thp: make khugepaged check for THP allocability before scanning Vlastimil Babka
2015-02-23 12:58 ` [RFC 3/6] mm, thp: try fault allocations only if we expect them to succeed Vlastimil Babka
2015-02-23 12:58 ` [RFC 4/6] mm, thp: move collapsing from khugepaged to task_work context Vlastimil Babka
2015-02-23 14:25   ` Peter Zijlstra
2015-02-23 12:58 ` [RFC 5/6] mm, thp: wakeup khugepaged when THP allocation fails Vlastimil Babka
2015-02-23 12:58 ` [RFC 6/6] mm, thp: remove no longer needed khugepaged code Vlastimil Babka
2015-02-23 21:03 ` [RFC 0/6] the big khugepaged redesign Andi Kleen
2015-02-23 22:46 ` Davidlohr Bueso
2015-02-23 22:56   ` Andrew Morton
2015-02-23 22:58     ` Sasha Levin
2015-02-24 10:32     ` Vlastimil Babka
2015-02-24 11:24       ` Andrea Arcangeli
2015-02-24 11:45         ` Andrea Arcangeli
2015-02-25 12:42         ` Vlastimil Babka
2015-03-05 16:30       ` Vlastimil Babka
2015-03-05 16:52         ` Andres Freund
2015-03-05 17:01           ` Vlastimil Babka
2015-03-05 17:07             ` Andres Freund
2015-03-06  0:21         ` Andres Freund
2015-03-06  7:50           ` Vlastimil Babka
2015-03-09  3:17   ` Vlastimil Babka
