* [RFC PATCH 0/2] fix unbounded too_many_isolated
From: Michal Hocko @ 2017-01-18 13:44 UTC
  To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Tetsuo Handa, LKML

Hi,
this is based on top of [1]. The first patch continues in the direction
of moving some decisions to zones rather than nodes. In this case it is
the NR_ISOLATED* counters, which I believe need to be zone-aware as
well. See patch 1 for more details on why.

The second patch builds on top of that and tries to address a problem
which Tetsuo has reported several times already. In the current
implementation we can loop deep in the reclaim path without any
effective way out to re-evaluate our decisions about the reclaim
retries. Patch 2 says more about this, but in principle the retry logic
should live as high in the allocator call chain as possible, so we
should get rid of any unbounded retry loops inside the reclaim path.
This is what the patch does.
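
To make that direction concrete, patch 2 essentially turns the
unbounded throttle inside reclaim into a single bounded stall. A
condensed illustration only, not the literal hunk (see patch 2 for the
real code):

	/* before: can loop for ever waiting for isolated pages to drop */
	while (too_many_isolated(pgdat, lru, sc))
		congestion_wait(BLK_RW_ASYNC, HZ/10);

	/*
	 * after: stall at most once, then back off with no progress and
	 * let should_reclaim_retry in the page allocator re-evaluate
	 */
	if (too_many_isolated(pgdat, lru, sc)) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		if (too_many_isolated(pgdat, lru, sc))
			return 0;
	}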

I am sending this as an RFC because I am not yet sure this is the best
way forward. My testing shows that the system behaves sanely.

Thoughts, comments?

[1] http://lkml.kernel.org/r/20170117103702.28542-1-mhocko@kernel.org

Michal Hocko (2):
      mm, vmscan: account the number of isolated pages per zone
      mm, vmscan: do not loop on too_many_isolated for ever

 include/linux/mmzone.h |  4 +--
 mm/compaction.c        | 16 ++++-----
 mm/khugepaged.c        |  4 +--
 mm/memory_hotplug.c    |  2 +-
 mm/migrate.c           |  4 +--
 mm/page_alloc.c        | 14 ++++----
 mm/vmscan.c            | 93 ++++++++++++++++++++++++++++++++------------------
 mm/vmstat.c            |  4 +--
 8 files changed, 82 insertions(+), 59 deletions(-)

* [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
From: Michal Hocko @ 2017-01-18 13:44 UTC
  To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Commit 599d0c954f91 ("mm, vmscan: move LRU lists to node") moved the
NR_ISOLATED* counters from zones to nodes. This is not the best fit,
especially for systems with a highmem/lowmem split, because heavy memory
pressure on the highmem zone might block lowmem requests from making
progress. Conversely, we might allow reclaim of the lowmem zone even
though too many pages are already isolated from the eligible zones,
just because highmem pages easily bias too_many_isolated to say no.

Fix these potential issues by moving the isolated stats back to zones
and teaching too_many_isolated to consider only eligible zones. Per-zone
isolation counters are a bit tricky with node reclaim because we have
to track each page separately.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  4 ++--
 mm/compaction.c        | 16 +++++++-------
 mm/khugepaged.c        |  4 ++--
 mm/memory_hotplug.c    |  2 +-
 mm/migrate.c           |  4 ++--
 mm/page_alloc.c        | 14 ++++++------
 mm/vmscan.c            | 58 ++++++++++++++++++++++++++++----------------------
 mm/vmstat.c            |  4 ++--
 8 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..100e7f37b7dc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -119,6 +119,8 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ZONE_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
@@ -148,8 +150,6 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..f84104217887 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -639,12 +639,12 @@ static bool too_many_isolated(struct zone *zone)
 {
 	unsigned long active, inactive, isolated;
 
-	inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-			node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
-	active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
-			node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
-	isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
-			node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
+	inactive = zone_page_state(zone, NR_ZONE_INACTIVE_FILE) +
+			zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
+	active = zone_page_state(zone, NR_ZONE_ACTIVE_FILE) +
+			zone_page_state(zone, NR_ZONE_ACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ZONE_ISOLATED_FILE) +
+			zone_page_state(zone, NR_ZONE_ISOLATED_ANON);
 
 	return isolated > (inactive + active) / 2;
 }
@@ -857,8 +857,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
+		inc_zone_page_state(page,
+				NR_ZONE_ISOLATED_ANON + page_is_file_cache(page));
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c308e3..8e692b683cac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -482,7 +482,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
-	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -578,7 +578,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		/* 0 stands for page_is_file_cache(page) == false */
-		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d47b186892b4..8b88dd63bf3d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1616,7 +1616,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			put_page(page);
 			list_add_tail(&page->lru, &source);
 			move_pages--;
-			inc_node_page_state(page, NR_ISOLATED_ANON +
+			inc_zone_page_state(page, NR_ZONE_ISOLATED_ANON +
 					    page_is_file_cache(page));
 
 		} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 87f4d0f81819..e5589dee3022 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,7 +184,7 @@ void putback_movable_pages(struct list_head *l)
 			put_page(page);
 		} else {
 			putback_lru_page(page);
-			dec_node_page_state(page, NR_ISOLATED_ANON +
+			dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON +
 					page_is_file_cache(page));
 		}
 	}
@@ -1130,7 +1130,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * as __PageMovable
 		 */
 		if (likely(!__PageMovable(page)))
-			dec_node_page_state(page, NR_ISOLATED_ANON +
+			dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON +
 					page_is_file_cache(page));
 	}
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ff25883c172..997c9bfdf9e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4318,18 +4318,16 @@ void show_free_areas(unsigned int filter)
 			free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
 	}
 
-	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
-		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
+	printk("active_anon:%lu inactive_anon:%lu\n"
+		" active_file:%lu inactive_file:%lu\n"
 		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
-		global_node_page_state(NR_ISOLATED_ANON),
 		global_node_page_state(NR_ACTIVE_FILE),
 		global_node_page_state(NR_INACTIVE_FILE),
-		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
@@ -4351,8 +4349,6 @@ void show_free_areas(unsigned int filter)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
-			" isolated(anon):%lukB"
-			" isolated(file):%lukB"
 			" mapped:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
@@ -4373,8 +4369,6 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
-			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
-			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			K(node_page_state(pgdat, NR_FILE_DIRTY)),
 			K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -4410,8 +4404,10 @@ void show_free_areas(unsigned int filter)
 			" high:%lukB"
 			" active_anon:%lukB"
 			" inactive_anon:%lukB"
+			" isolated_anon:%lukB"
 			" active_file:%lukB"
 			" inactive_file:%lukB"
+			" isolated_file:%lukB"
 			" unevictable:%lukB"
 			" writepending:%lukB"
 			" present:%lukB"
@@ -4433,8 +4429,10 @@ void show_free_areas(unsigned int filter)
 			K(high_wmark_pages(zone)),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_ISOLATED_ANON)),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_ISOLATED_FILE)),
 			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
 			K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
 			K(zone->present_pages),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f3255702f3df..4b1ed1b1f1db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -216,14 +216,13 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
 
+	/* TODO can we live without NR_*ISOLATED*? */
 	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
+	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE);
 
 	if (get_nr_swap_pages() > 0)
 		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
-		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
-		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
+		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON);
 
 	return nr;
 }
@@ -1245,8 +1244,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					 * increment nr_reclaimed here (and
 					 * leave it off the LRU).
 					 */
-					nr_reclaimed++;
-					continue;
+					goto drop_isolated;
 				}
 			}
 		}
@@ -1267,13 +1265,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (ret == SWAP_LZFREE)
 			count_vm_event(PGLAZYFREED);
 
-		nr_reclaimed++;
-
 		/*
 		 * Is there need to periodically free_page_list? It would
 		 * appear not as the counts should be low
 		 */
 		list_add(&page->lru, &free_pages);
+drop_isolated:
+		nr_reclaimed++;
+		mod_zone_page_state(page_zone(page),
+				NR_ZONE_ISOLATED_ANON + page_is_file_cache(page),
+				-hpage_nr_pages(page));
 		continue;
 
 cull_mlocked:
@@ -1340,7 +1341,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS, NULL, true);
 	list_splice(&clean_pages, page_list);
-	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
 
@@ -1433,6 +1433,9 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 			continue;
 
 		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
+		mod_zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
+				NR_ZONE_ISOLATED_ANON + !!is_file_lru(lru),
+				nr_zone_taken[zid]);
 #ifdef CONFIG_MEMCG
 		mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
 #endif
@@ -1603,10 +1606,11 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static int too_many_isolated(struct pglist_data *pgdat, enum lru_list lru,
 		struct scan_control *sc)
 {
-	unsigned long inactive, isolated;
+	unsigned long inactive = 0, isolated = 0;
+	int zid;
 
 	if (current_is_kswapd())
 		return 0;
@@ -1614,12 +1618,12 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if (!sane_reclaim(sc))
 		return 0;
 
-	if (file) {
-		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
-		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
-	} else {
-		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
-		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
+	for (zid = 0; zid <= sc->reclaim_idx; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+
+		inactive += zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE + lru);
+		isolated += zone_page_state_snapshot(zone,
+				NR_ZONE_ISOLATED_ANON + !!is_file_lru(lru));
 	}
 
 	/*
@@ -1649,6 +1653,11 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
+
+		mod_zone_page_state(page_zone(page),
+				NR_ZONE_ISOLATED_ANON + !!page_is_file_cache(page),
+				-hpage_nr_pages(page));
+
 		if (unlikely(!page_evictable(page))) {
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
@@ -1719,7 +1728,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
+	while (unlikely(too_many_isolated(pgdat, lru, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
@@ -1739,7 +1748,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
@@ -1768,8 +1776,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	putback_inactive_pages(lruvec, &page_list);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
@@ -1939,7 +1945,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc))
@@ -1955,7 +1960,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 		if (unlikely(!page_evictable(page))) {
 			putback_lru_page(page);
-			continue;
+			goto drop_isolated;
 		}
 
 		if (unlikely(buffer_heads_over_limit)) {
@@ -1980,12 +1985,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			 */
 			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
 				list_add(&page->lru, &l_active);
-				continue;
+				goto drop_isolated;
 			}
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
 		list_add(&page->lru, &l_inactive);
+drop_isolated:
+		mod_zone_page_state(page_zone(page),
+				NR_ZONE_ISOLATED_ANON + !!is_file_lru(lru),
+				-hpage_nr_pages(page));
 	}
 
 	/*
@@ -2002,7 +2011,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 	nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_hold);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bed3c3845936..059c29d14d23 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,8 @@ const char * const vmstat_text[] = {
 	"nr_zone_inactive_file",
 	"nr_zone_active_file",
 	"nr_zone_unevictable",
+	"nr_zone_anon_isolated",
+	"nr_zone_file_isolated",
 	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
@@ -952,8 +954,6 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
-	"nr_isolated_anon",
-	"nr_isolated_file",
 	"nr_pages_scanned",
 	"workingset_refault",
 	"workingset_activate",
-- 
2.11.0
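
For testing, the two new counters are exported through /proc/vmstat
(and per zone in /proc/zoneinfo) under the names added to vmstat_text
above. A minimal userspace helper to watch them (an untested sketch,
not part of the patch):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	/* the two zone_stat_item counters added by this patch */
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "nr_zone_anon_isolated", 21) ||
		    !strncmp(line, "nr_zone_file_isolated", 21))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}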

* [RFC PATCH 2/2] mm, vmscan: do not loop on too_many_isolated for ever
From: Michal Hocko @ 2017-01-18 13:44 UTC
  To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
the too_many_isolated loop essentially forever, because the last few
pages on the LRU lists are isolated by kswapd, which is stuck on fs
locks while doing the pageout. This in turn means that there is nobody
to actually trigger the OOM killer and the system is basically unusable.

too_many_isolated was introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature OOM killer invocations, because back then a lack of reclaim
progress could indeed trigger the OOM killer too early. But since the
OOM detection rework in 0a0337e0d1d1 ("mm, oom: rework oom detection"),
the allocation/reclaim retry loop considers all the reclaimable pages,
including those which are isolated - see 9f6c399ddc36 ("mm, vmscan:
consider isolated pages in zone_reclaimable_pages") - so we can loosen
the direct reclaim throttling and instead rely on the
should_reclaim_retry logic, which is the proper layer to control how to
throttle and retry reclaim attempts.

Move the too_many_isolated check out of shrink_inactive_list because
the active list might theoretically see too many isolated pages as
well.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4b1ed1b1f1db..9f6be3b10ff0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -204,10 +204,12 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
 	unsigned long nr;
 
 	nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) +
-		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE);
+		zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE) +
+		zone_page_state_snapshot(zone, NR_ZONE_ISOLATED_FILE);
 	if (get_nr_swap_pages() > 0)
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
-			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
+			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON) +
+			zone_page_state_snapshot(zone, NR_ZONE_ISOLATED_ANON);
 
 	return nr;
 }
@@ -1728,14 +1730,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(pgdat, lru, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
-
-		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
-	}
-
 	lru_add_drain();
 
 	if (!sc->may_unmap)
@@ -2083,6 +2077,29 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file,
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
 				 struct lruvec *lruvec, struct scan_control *sc)
 {
+	int stalled = false;
+
+	/* We are about to die and free our memory. Return now. */
+	if (fatal_signal_pending(current))
+		return SWAP_CLUSTER_MAX;
+
+	/*
+	 * throttle direct reclaimers but do not loop for ever. We rely
+	 * on should_reclaim_retry to not allow pre-mature OOM when
+	 * there are too many pages under reclaim.
+	 */
+	while (too_many_isolated(lruvec_pgdat(lruvec), lru, sc)) {
+		if (stalled)
+			return 0;
+
+		/*
+		 * TODO we should wait on a different event here - do the wake up
+		 * after we decrement NR_ZONE_ISOLATED_*
+		 */
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		stalled = true;
+	}
+
 	if (is_active_lru(lru)) {
 		if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true))
 			shrink_active_list(nr_to_scan, lruvec, sc, lru);
-- 
2.11.0

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
From: Mel Gorman @ 2017-01-18 14:46 UTC
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko

On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Commit 599d0c954f91 ("mm, vmscan: move LRU lists to node") moved the
> NR_ISOLATED* counters from zones to nodes. This is not the best fit,
> especially for systems with a highmem/lowmem split, because heavy memory
> pressure on the highmem zone might block lowmem requests from making
> progress. Conversely, we might allow reclaim of the lowmem zone even
> though too many pages are already isolated from the eligible zones,
> just because highmem pages easily bias too_many_isolated to say no.
> 
> Fix these potential issues by moving the isolated stats back to zones
> and teaching too_many_isolated to consider only eligible zones. Per-zone
> isolation counters are a bit tricky with node reclaim because we have
> to track each page separately.
> 

I'm quite unhappy with this. Each move back increases the cache footprint
because of the counters but it's not clear at all this patch actually
helps anything.

Heavy memory pressure on highmem should be spread across the whole node as
we are no longer applying the fair zone allocation policy. The processes
with highmem requirements will be reclaiming from all zones and when it
finishes, it's possible that a lowmem-specific request will be clear to make
progress. It's all the same LRU so if there are too many pages isolated,
it makes sense to wait regardless of the allocation request.

More importantly, this patch may make things worse and delay reclaim. If
this patch allowed a lowmem request to make progress that would have
previously stalled, it's going to spend time skipping pages in the LRU
instead of letting kswapd and the highmem pressured processes make progress.

-- 
Mel Gorman
SUSE Labs

* Re: [RFC PATCH 2/2] mm, vmscan: do not loop on too_many_isolated for ever
From: Mel Gorman @ 2017-01-18 14:50 UTC
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko

On Wed, Jan 18, 2017 at 02:44:53PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
> the too_many_isolated loop essentially forever, because the last few
> pages on the LRU lists are isolated by kswapd, which is stuck on fs
> locks while doing the pageout. This in turn means that there is nobody
> to actually trigger the OOM killer and the system is basically unusable.
> 
> too_many_isolated was introduced by 35cd78156c49 ("vmscan: throttle
> direct reclaim when too many pages are isolated already") to prevent
> premature OOM killer invocations, because back then a lack of reclaim
> progress could indeed trigger the OOM killer too early. But since the
> OOM detection rework in 0a0337e0d1d1 ("mm, oom: rework oom detection"),
> the allocation/reclaim retry loop considers all the reclaimable pages,
> including those which are isolated - see 9f6c399ddc36 ("mm, vmscan:
> consider isolated pages in zone_reclaimable_pages") - so we can loosen
> the direct reclaim throttling and instead rely on the
> should_reclaim_retry logic, which is the proper layer to control how to
> throttle and retry reclaim attempts.
> 
> Move the too_many_isolated check out of shrink_inactive_list because
> the active list might theoretically see too many isolated pages as
> well.
> 

No major objections in general. It's a bit odd you have a while loop for
something that will only loop once.

As for the TODO, one approach would be to use a waitqueue when too many
pages are isolated, and wake waiters one at a time once the isolated
page count drops back below the threshold.
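
Roughly like the following, as it could sit in mm/vmscan.c (a sketch
only; isolated_wait and the helper names are invented here, and the
wake-up would have to be called from every NR_ZONE_ISOLATED_*
decrement):

static DECLARE_WAIT_QUEUE_HEAD(isolated_wait);

static void throttle_isolated(struct pglist_data *pgdat,
			      enum lru_list lru, struct scan_control *sc)
{
	DEFINE_WAIT(wait);

	while (too_many_isolated(pgdat, lru, sc)) {
		/* exclusive wait so wake_up() wakes one waiter at a time */
		prepare_to_wait_exclusive(&isolated_wait, &wait,
					  TASK_KILLABLE);
		if (too_many_isolated(pgdat, lru, sc))
			schedule();
		finish_wait(&isolated_wait, &wait);
		if (fatal_signal_pending(current))
			break;
	}
}

/* call after every NR_ZONE_ISOLATED_* decrement */
static void isolated_pages_dropped(void)
{
	if (waitqueue_active(&isolated_wait))
		wake_up(&isolated_wait);
}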

-- 
Mel Gorman
SUSE Labs

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
From: Michal Hocko @ 2017-01-18 15:15 UTC
  To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Commit 599d0c954f91 ("mm, vmscan: move LRU lists to node") moved the
> > NR_ISOLATED* counters from zones to nodes. This is not the best fit,
> > especially for systems with a highmem/lowmem split, because heavy memory
> > pressure on the highmem zone might block lowmem requests from making
> > progress. Conversely, we might allow reclaim of the lowmem zone even
> > though too many pages are already isolated from the eligible zones,
> > just because highmem pages easily bias too_many_isolated to say no.
> > 
> > Fix these potential issues by moving the isolated stats back to zones
> > and teaching too_many_isolated to consider only eligible zones. Per-zone
> > isolation counters are a bit tricky with node reclaim because we have
> > to track each page separately.
> > 
> 
> I'm quite unhappy with this. Each move back increases the cache footprint
> because of the counters

Why would per-zone counters cause an increased cache footprint?

> but it's not clear at all this patch actually helps anything.

Yes, I cannot prove any real issue so far. The main motivation was
patch 2, which needs the per-zone accounting for the retry logic
(should_reclaim_retry). I spotted the too_many_isolated issues along
the way.

> Heavy memory pressure on highmem should be spread across the whole node as
> we are no longer applying the fair zone allocation policy. The processes
> with highmem requirements will be reclaiming from all zones and when it
> finishes, it's possible that a lowmem-specific request will be clear to make
> progress. It's all the same LRU so if there are too many pages isolated,
> it makes sense to wait regardless of the allocation request.

This is true but I am not sure how it is related to the patch. If we
have heavy highmem memory pressure then we will throttle based on the
pages isolated from the respective zones. So if there is lowmem
pressure at the same time then we throttle it only when we need to.

Also consider that lowmem throttling in too_many_isolated has only a
small chance of ever working with the node counters, because highmem >>
lowmem in many/most configurations (e.g. on a 32b x86 machine with 4GB
of RAM, lowmem is ~896MB while highmem is ~3GB, so highmem LRU pages
dominate the node-wide numbers).

> More importantly, this patch may make things worse and delay reclaim. If
> this patch allowed a lowmem request to make progress that would have
> previously stalled, it's going to spend time skipping pages in the LRU
> instead of letting kswapd and the highmem pressured processes make progress.

I am not sure I understand this part. Say that we have highmem pressure
which isolates too many pages from the LRU. A lowmem request would
previously stall regardless of where those pages came from. With this
patch it stalls only when we have isolated too many pages from the
eligible zones. So, assuming lowmem is not under pressure, why should
we stall? And why would it delay reclaim? Whoever wants to make
progress on that zone has to iterate and potentially skip many pages
anyway.
-- 
Michal Hocko
SUSE Labs

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-18 15:15       ` Michal Hocko
@ 2017-01-18 15:54         ` Mel Gorman
  -1 siblings, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2017-01-18 15:54 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > especially for systems with high/lowmem because a heavy memory pressure
> > > on the highmem zone might block lowmem requests from making progress. Or
> > > we might allow reclaiming the lowmem zone even though there are too many
> > > pages already isolated from the eligible zones just because highmem
> > > pages will easily bias too_many_isolated to say no.
> > > 
> > > Fix these potential issues by moving isolated stats back to zones and
> > > teach too_many_isolated to consider only eligible zones. Per zone
> > > isolation counters are a bit tricky with the node reclaim because
> > > we have to track each page separately.
> > > 
> > 
> > I'm quite unhappy with this. Each move back increases the cache footprint
> > because of the counters
> 
> Why would per zone counters cause an increased cache footprint?
> 

Because there are multiple counters, each of which needs to be updated.

> > but it's not clear at all this patch actually helps anything.
> 
> Yes, I cannot prove any real issue so far. The main motivation was the
> patch 2 which needs per-zone accounting to use it in the retry logic
> (should_reclaim_retry). I've spotted too_many_isolated issues on the
> way.
> 

You don't appear to directly use that information in patch 2. The primary
breakout is returning after stalling at least once. You could also avoid
an infinite loop by using a waitqueue that sleeps on too many isolated.
That would both avoid the clunky congestion_wait() and guarantee forward
progress. If the primary motivation is to avoid an infinite loop with
too_many_isolated then there are ways of handling that without reintroducing
zone-based counters.
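
As a minimal sketch of what I mean (the pgdat field and the helper
names are illustrative only, not a worked-out implementation):

/* hypothetical field in struct pglist_data */
	wait_queue_head_t isolated_wait;

/* direct reclaimers block here instead of the congestion_wait() loop;
 * exclusive waiters mean each wake_up releases exactly one task */
static void throttle_isolated(struct pglist_data *pgdat, int file,
			      struct scan_control *sc)
{
	wait_event_interruptible_exclusive(pgdat->isolated_wait,
			!too_many_isolated(pgdat, file, sc));
}

/* called wherever NR_ISOLATED_* is decreased, e.g. from
 * putback_inactive_pages(), so waiters drain one at a time */
static void isolated_wakeup(struct pglist_data *pgdat)
{
	if (waitqueue_active(&pgdat->isolated_wait))
		wake_up(&pgdat->isolated_wait);
}

The fairness comes from the exclusive wait: each putback wakes exactly
one throttled task instead of letting everyone stampede.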

> > Heavy memory pressure on highmem should be spread across the whole node as
> > we no longer are applying the fair zone allocation policy. The processes
> > with highmem requirements will be reclaiming from all zones and when it
> > finishes, it's possible that a lowmem-specific request will be clear to make
> > progress. It's all the same LRU so if there are too many pages isolated,
> > it makes sense to wait regardless of the allocation request.
> 
> This is true but I am not sure how it is related to the patch.

Because heavy pressure that is enough to trigger too many isolated pages
is unlikely to be specifically targeting a lower zone. There is general
pressure with multiple direct reclaimers being applied. If the system is
under enough pressure with parallel reclaimers to trigger too_many_isolated
checks then the system is grinding already and making little progress. Adding
multiple counters to allow a lowmem reclaimer to potentially make faster
progress is going to be marginal at best.

> Also consider that lowmem throttling in too_many_isolated has only a small
> chance to ever work with the node counters because highmem >> lowmem in
> many/most configurations.
> 

While true, it's also not that important.

> > More importantly, this patch may make things worse and delay reclaim. If
> > this patch allowed a lowmem request to make progress that would have
> > previously stalled, it's going to spend time skipping pages in the LRU
> > instead of letting kswapd and the highmem pressured processes make progress.
> 
> I am not sure I understand this part. Say that we have highmem pressure
> which would isolate too many pages from the LRU.

Which requires multiple direct reclaimers or tiny inactive lists. In the
event there is such highmem pressure, it also means the lower zones are
depleted.

> A lowmem request would
> previously stall regardless of where those pages came from. With this
> patch it would stall only when we isolated too many pages from the
> eligible zones.

And when it makes progress, it's going to compete with the other direct
reclaimers except the lowmem reclaim is skipping some pages and
recycling them through the LRU. It chews up CPU that would probably have
been better spent letting kswapd and the other direct reclaimers do
their work.

> So let's assume that lowmem is not under pressure,

It has to be or the highmem request would have used memory from the
lower zones.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-18 15:54         ` Mel Gorman
@ 2017-01-18 16:17           ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-18 16:17 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > 
> > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > > especially for systems with high/lowmem because a heavy memory pressure
> > > > on the highmem zone might block lowmem requests from making progress. Or
> > > > we might allow reclaiming the lowmem zone even though there are too many
> > > > pages already isolated from the eligible zones just because highmem
> > > > pages will easily bias too_many_isolated to say no.
> > > > 
> > > > Fix these potential issues by moving isolated stats back to zones and
> > > > teach too_many_isolated to consider only eligible zones. Per zone
> > > > isolation counters are a bit tricky with the node reclaim because
> > > > we have to track each page separately.
> > > > 
> > > 
> > > I'm quite unhappy with this. Each move back increases the cache footprint
> > > because of the counters
> > 
> > Why would per zone counters cause an increased cache footprint?
> > 
> 
> Because there are multiple counters, each of which needs to be updated.

How does this differ from a per-node counter, though? We would need to do
the accounting anyway. Moreover none of the accounting is done in a hot
path.

> > > but it's not clear at all this patch actually helps anything.
> > 
> > Yes, I cannot prove any real issue so far. The main motivation was the
> > patch 2 which needs per-zone accounting to use it in the retry logic
> > (should_reclaim_retry). I've spotted too_many_isolated issues on the
> > way.
> > 
> 
> You don't appear to directly use that information in patch 2.

It is used via zone_reclaimable_pages in should_reclaim_retry

> The primary
> breakout is returning after stalling at least once. You could also avoid
> an infinite loop by using a waitqueue that sleeps on too many isolated.

That would be tricky on its own. Just consider the report from Tetsuo.
Basically all the direct reclaimers are looping on too_many_isolated
while kswapd is not making any progress because it is blocked on FS
locks which are held by flushers which are making dead slow progress.
Some of those direct reclaimers could have gone OOM instead and released
some memory if we decided so, which we cannot because we are deep down in
the reclaim path. Waiting on the reclaimer to increase the ISOLATED
counter wouldn't help in this situation.

> That would both avoid the clunky congestion_wait() and guarantee forward
> progress. If the primary motivation is to avoid an infinite loop with
> too_many_isolated then there are ways of handling that without reintroducing
> zone-based counters.
> 
> > > Heavy memory pressure on highmem should be spread across the whole node as
> > > we no longer are applying the fair zone allocation policy. The processes
> > > with highmem requirements will be reclaiming from all zones and when it
> > > finishes, it's possible that a lowmem-specific request will be clear to make
> > > progress. It's all the same LRU so if there are too many pages isolated,
> > > it makes sense to wait regardless of the allocation request.
> > 
> > This is true but I am not sure how it is related to the patch.
> 
> Because heavy pressure that is enough to trigger too many isolated pages
> is unlikely to be specifically targeting a lower zone.

Why? Basically any GFP_KERNEL allocation will create lowmem pressure, and
going OOM on lowmem is not all that unrealistic a scenario on 32b systems.

> There is general
> pressure with multiple direct reclaimers being applied. If the system is
> under enough pressure with parallel reclaimers to trigger too_many_isolated
> checks then the system is grinding already and making little progress. Adding
> multiple counters to allow a lowmem reclaimer to potentially make faster
> progress is going to be marginal at best.

OK, I agree that the situation where highmem blocks lowmem from making
progress is much less likely than the other situation described in the
changelog, where lowmem never gets throttled. That is the one I am
more interested in.

> > Also consider that lowmem throttling in too_many_isolated has only a small
> > chance to ever work with the node counters because highmem >> lowmem in
> > many/most configurations.
> > 
> 
> While true, it's also not that important.
> 
> > > More importantly, this patch may make things worse and delay reclaim. If
> > > this patch allowed a lowmem request to make progress that would have
> > > previously stalled, it's going to spend time skipping pages in the LRU
> > > instead of letting kswapd and the highmem pressured processes make progress.
> > 
> > I am not sure I understand this part. Say that we have highmem pressure
> > which would isolate too many pages from the LRU.
> 
> Which requires multiple direct reclaimers or tiny inactive lists. In the
> event there is such highmem pressure, it also means the lower zones are
> depleted.

But consider lowmem pressure without highmem pressure, e.g. a heavy parallel
fork or any other GFP_KERNEL intensive workload.
 
> > A lowmem request would
> > previously stall regardless of where those pages came from. With this
> > patch it would stall only when we isolated too many pages from the
> > eligible zones.
> 
> And when it makes progress, it's going to compete with the other direct
> reclaimers except the lowmem reclaim is skipping some pages and
> recycling them through the LRU. It chews up CPU that would probably have
> been better spent letting kswapd and the other direct reclaimers do
> their work.

OK, I guess we are talking past each other. What I meant to say is that
it doesn't really make any difference who is chewing through the LRU to
find the last few lowmem pages to reclaim. So I do not see much of a
difference between sleeping and postponing that to kswapd.

That being said, I _believe_ I will need per zone ISOLATED counters in
order to make the other patch work reliably and not declare OOM
prematurely. Maybe there is some other way around that (hence this RFC).
Would you be strongly opposed to a patch which would make the counters per
zone without touching too_many_isolated?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-18 16:17           ` Michal Hocko
@ 2017-01-18 17:00             ` Mel Gorman
  -1 siblings, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2017-01-18 17:00 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > 
> > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > > > especially for systems with high/lowmem because a heavy memory pressure
> > > > > on the highmem zone might block lowmem requests from making progress. Or
> > > > > we might allow reclaiming the lowmem zone even though there are too many
> > > > > pages already isolated from the eligible zones just because highmem
> > > > > pages will easily bias too_many_isolated to say no.
> > > > > 
> > > > > Fix these potential issues by moving isolated stats back to zones and
> > > > > teach too_many_isolated to consider only eligible zones. Per zone
> > > > > isolation counters are a bit tricky with the node reclaim because
> > > > > we have to track each page separately.
> > > > > 
> > > > 
> > > > I'm quite unhappy with this. Each move back increases the cache footprint
> > > > because of the counters
> > > 
> > > Why would per zone counters cause an increased cache footprint?
> > > 
> > 
> > Because there are multiple counters, each of which needs to be updated.
> 
> How does this differ from a per-node counter, though?

A per-node counter is 2 * nr_online_nodes
A per-zone counter is 2 * nr_populated_zones

> We would need to do
> the accounting anyway. Moreover none of the accounting is done in a hot
> path.
> 
> > > > but it's not clear at all this patch actually helps anything.
> > > 
> > > Yes, I cannot prove any real issue so far. The main motivation was the
> > > patch 2 which needs per-zone accounting to use it in the retry logic
> > > (should_reclaim_retry). I've spotted too_many_isolated issues on the
> > > way.
> > > 
> > 
> > You don't appear to directly use that information in patch 2.
> 
> It is used via zone_reclaimable_pages in should_reclaim_retry
> 

Which is still not directly required to avoid the infinite loop. There
is even a small inherent risk if the too_many_isolated condition no longer
applies by the time should_reclaim_retry is attempted.

> > The primary
> > breakout is returning after stalling at least once. You could also avoid
> > an infinite loop by using a waitqueue that sleeps on too many isolated.
> 
> That would be tricky on its own. Just consider the report from Tetsuo.
> Basically all the direct reclaimers are looping on too_many_isolated
> while kswapd is not making any progress because it is blocked on FS
> locks which are held by flushers which are making dead slow progress.
> Some of those direct reclaimers could have gone OOM instead and released
> some memory if we decided so, which we cannot because we are deep down in
> the reclaim path. Waiting on the reclaimer to increase the ISOLATED
> counter wouldn't help in this situation.
> 

If it's a waitqueue waking one process at a time, the progress may be
slow but it'll still exit the loop, attempt the reclaim and then
potentially OOM if no progress is made. The key is using the waitqueue
to have a fair queue of processes making progress instead of a
potentially infinite loop that never meets the exit conditions.

> > That would both avoid the clunky congestion_wait() and guarantee forward
> > progress. If the primary motivation is to avoid an infinite loop with
> > too_many_isolated then there are ways of handling that without reintroducing
> > zone-based counters.
> > 
> > > > Heavy memory pressure on highmem should be spread across the whole node as
> > > > we no longer are applying the fair zone allocation policy. The processes
> > > > with highmem requirements will be reclaiming from all zones and when it
> > > > finishes, it's possible that a lowmem-specific request will be clear to make
> > > > progress. It's all the same LRU so if there are too many pages isolated,
> > > > it makes sense to wait regardless of the allocation request.
> > > 
> > > This is true but I am not sure how it is related to the patch.
> > 
> > Because heavy pressure that is enough to trigger too many isolated pages
> > is unlikely to be specifically targeting a lower zone.
> 
> Why? Basically any GFP_KERNEL allocation will create lowmem pressure, and
> going OOM on lowmem is not all that unrealistic a scenario on 32b systems.
> 

If the sole source of pressure is from GFP_KERNEL allocations then the
isolated counter will also be specific to the lower zones and there is no
benefit from the patch.

If there is a combination of highmem and lowmem pressure then the highmem
reclaimers will also reclaim lowmem memory.

> > There is general
> > pressure with multiple direct reclaimers being applied. If the system is
> > under enough pressure with parallel reclaimers to trigger too_many_isolated
> > checks then the system is grinding already and making little progress. Adding
> > multiple counters to allow a lowmem reclaimer to potentially make faster
> > progress is going to be marginal at best.
> 
> OK, I agree that the situation where highmem blocks lowmem from making
> progress is much less likely than the other situation described in the
> changelog, where lowmem never gets throttled. That is the one I am
> more interested in.
> 

That is of some concern but could be handled by having too_many_isolated
take into account if it's a zone-restricted allocation and if so, then
decrement the LRU counts from the higher zones. Counters already exist
there. It would not be as strict but it should be sufficient.
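
Roughly something like this (a sketch only; the name is made up and it
builds solely on counters that already exist today):

static bool too_many_isolated_restricted(struct pglist_data *pgdat,
					 int file, struct scan_control *sc)
{
	unsigned long inactive, isolated;
	int z;

	inactive = node_page_state(pgdat,
			file ? NR_INACTIVE_FILE : NR_INACTIVE_ANON);
	isolated = node_page_state(pgdat,
			file ? NR_ISOLATED_FILE : NR_ISOLATED_ANON);

	/* discount inactive pages sitting in zones this allocation
	 * cannot use; isolated pages from those zones still count,
	 * which is why this is less strict than per-zone counters */
	for (z = sc->reclaim_idx + 1; z < MAX_NR_ZONES; z++) {
		struct zone *zone = &pgdat->node_zones[z];

		if (!managed_zone(zone))
			continue;

		inactive -= min(inactive, zone_page_state(zone, file ?
			NR_ZONE_INACTIVE_FILE : NR_ZONE_INACTIVE_ANON));
	}

	return isolated > inactive;
}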

> > > Also consider that lowmem throttling in too_many_isolated has only a small
> > > chance to ever work with the node counters because highmem >> lowmem in
> > > many/most configurations.
> > > 
> > 
> > While true, it's also not that important.
> > 
> > > > More importantly, this patch may make things worse and delay reclaim. If
> > > > this patch allowed a lowmem request to make progress that would have
> > > > previously stalled, it's going to spend time skipping pages in the LRU
> > > > instead of letting kswapd and the highmem pressured processes make progress.
> > > 
> > > I am not sure I understand this part. Say that we have highmem pressure
> > > which would isolate too many pages from the LRU.
> > 
> > Which requires multiple direct reclaimers or tiny inactive lists. In the
> > event there is such highmem pressure, it also means the lower zones are
> > depleted.
> 
> But consider lowmem pressure without highmem pressure, e.g. a heavy parallel
> fork or any other GFP_KERNEL intensive workload.
>  

Lowmem without highmem pressure means all isolated pages are in the lowmem
nodes and the per-zone counters are unnecessary.

> > > A lowmem request would
> > > previously stall regardless of where those pages came from. With this
> > > patch it would stall only when we isolated too many pages from the
> > > eligible zones.
> > 
> > And when it makes progress, it's going to compete with the other direct
> > reclaimers except the lowmem reclaim is skipping some pages and
> > recycling them through the LRU. It chews up CPU that would probably have
> > been better spent letting kswapd and the other direct reclaimers do
> > their work.
> 
> OK, I guess we are talking past each other. What I meant to say is that
> it doesn't really make any difference who is chewing through the LRU to
> find the last few lowmem pages to reclaim. So I do not see much of a
> difference between sleeping and postponing that to kswapd.
> 
> That being said, I _believe_ I will need per zone ISOLATED counters in
> order to make the other patch work reliably and not declare OOM
> prematurely. Maybe there is some other way around that (hence this RFC).
> Would you be strongly opposed to a patch which would make the counters per
> zone without touching too_many_isolated?

I'm resistant to the per-zone counters in general but it's unfortunate to
add them just to avoid a potentially infinite loop from isolated pages.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-18 17:00             ` Mel Gorman
@ 2017-01-18 17:29               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-18 17:29 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed 18-01-17 17:00:10, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote:
> > On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > 
> > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit
> > > > > > especially for systems with high/lowmem because a heavy memory pressure
> > > > > > on the highmem zone might block lowmem requests from making progress. Or
> > > > > > we might allow reclaiming the lowmem zone even though there are too many
> > > > > > pages already isolated from the eligible zones just because highmem
> > > > > > pages will easily bias too_many_isolated to say no.
> > > > > > 
> > > > > > Fix these potential issues by moving isolated stats back to zones and
> > > > > > teach too_many_isolated to consider only eligible zones. Per zone
> > > > > > isolation counters are a bit tricky with the node reclaim because
> > > > > > we have to track each page separately.
> > > > > > 
> > > > > 
> > > > > I'm quite unhappy with this. Each move back increases the cache footprint
> > > > > because of the counters
> > > > 
> > > > Why would per zone counters cause an increased cache footprint?
> > > > 
> > > 
> > > Because there are multiple counters, each of which needs to be updated.
> > 
> > How does this differ from a per-node counter, though?
> 
> A per-node counter is 2 * nr_online_nodes
> A per-zone counter is 2 * nr_populated_zones
> 
> > We would need to do
> > the accounting anyway. Moreover none of the accounting is done in a hot
> > path.
> > 
> > > > > but it's not clear at all this patch actually helps anything.
> > > > 
> > > > Yes, I cannot prove any real issue so far. The main motivation was the
> > > > patch 2 which needs per-zone accounting to use it in the retry logic
> > > > (should_reclaim_retry). I've spotted too_many_isolated issues on the
> > > > way.
> > > > 
> > > 
> > > You don't appear to directly use that information in patch 2.
> > 
> > It is used via zone_reclaimable_pages in should_reclaim_retry
> > 
> 
> Which is still not directly required to avoid the infinite loop. There
> is even a small inherent risk if the too_many_isolated condition no longer
> applies by the time should_reclaim_retry is attempted.

Not really, because if those pages are no longer isolated then they have
either been reclaimed - and NR_FREE_PAGES will increase - or been put
back on the LRU, in which case we will see them in the regular LRU
counters. I need to catch the case where there are still too many pages
isolated, which would skew the should_reclaim_retry watermark check.
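
To spell out what I mean, the watermark side of should_reclaim_retry()
boils down to something like the following, and the NR_ISOLATED_* part
is only possible with per-zone counters (a sketch; whether the isolated
pages are added explicitly like this or folded into
zone_reclaimable_pages is an implementation detail):

	unsigned long available;

	available = zone_page_state_snapshot(zone, NR_FREE_PAGES);
	available += zone_reclaimable_pages(zone);
	/* isolated pages are neither free nor on the LRU, so without
	 * them we underestimate what is reclaimable and can declare
	 * OOM while reclaim is still in flight */
	available += zone_page_state_snapshot(zone, NR_ISOLATED_ANON) +
		     zone_page_state_snapshot(zone, NR_ISOLATED_FILE);

	if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
				ac_classzone_idx(ac), alloc_flags, available))
		return true;	/* worth another reclaim retry */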
 
> > > The primary
> > > breakout is returning after stalling at least once. You could also avoid
> > > an infinite loop by using a waitqueue that sleeps on too many isolated.
> > 
> > That would be tricky on its own. Just consider the report from Tetsuo.
> > Basically all the direct reclaimers are looping on too_many_isolated
> > while kswapd is not making any progress because it is blocked on FS
> > locks which are held by flushers which are making dead slow progress.
> > Some of those direct reclaimers could have gone OOM instead and released
> > some memory if we decided so, which we cannot because we are deep down in
> > the reclaim path. Waiting on the reclaimer to increase the ISOLATED
> > counter wouldn't help in this situation.
> > 
> 
> If it's a waitqueue waking one process at a time, the progress may be
> slow but it'll still exit the loop, attempt the reclaim and then
> potentially OOM if no progress is made. The key is using the waitqueue
> to have a fair queue of processes making progress instead of a
> potentially infinite loop that never meets the exit conditions.

It is not clear to me who would wake waiters on the queue. You cannot
rely on kswapd to do that as already mentioned.

> > > That would both avoid the clunky congestion_wait() and guarantee forward
> > > progress. If the primary motivation is to avoid an infinite loop with
> > > too_many_isolated then there are ways of handling that without reintroducing
> > > zone-based counters.
> > > 
> > > > > Heavy memory pressure on highmem should be spread across the whole node as
> > > > > we no longer are applying the fair zone allocation policy. The processes
> > > > > with highmem requirements will be reclaiming from all zones and when it
> > > > > finishes, it's possible that a lowmem-specific request will be clear to make
> > > > > progress. It's all the same LRU so if there are too many pages isolated,
> > > > > it makes sense to wait regardless of the allocation request.
> > > > 
> > > > This is true but I am not sure how it is related to the patch.
> > > 
> > > Because heavy pressure that is enough to trigger too many isolated pages
> > > is unlikely to be specifically targeting a lower zone.
> > 
> > Why? Basically any GFP_KERNEL allocation will create lowmem pressure, and
> > going OOM on lowmem is not all that unrealistic a scenario on 32b systems.
> > 
> 
> If the sole source of pressure is from GFP_KERNEL allocations then the
> isolated counter will also be specific to the lower zones and there is no
> benefit from the patch.

I believe you are wrong here. Just consider that you have isolated
basically all lowmem pages. too_many_isolated will still happily tell
you not to throttle or back off because NR_INACTIVE_* are way bigger
than all lowmem pages altogether. Or am I still missing your point?

> If there is a combination of highmem and lowmem pressure then the highmem
> reclaimers will also reclaim lowmem memory.
> 
> > > There is general
> > > pressure with multiple direct reclaimers being applied. If the system is
> > > under enough pressure with parallel reclaimers to trigger too_many_isolated
> > > checks then the system is grinding already and making little progress. Adding
> > > multiple counters to allow a lowmem reclaimer to potentially make faster
> > > progress is going to be marginal at best.
> > 
> > OK, I agree that the situation where highmem blocks lowmem from making
> > progress is much less likely than the other situation described in the
> > changelog, where lowmem never gets throttled. That is the one I am
> > more interested in.
> > 
> 
> That is of some concern but could be handled by having too_many_isolated
> take into account if it's a zone-restricted allocation and if so, then
> decrement the LRU counts from the higher zones. Counters already exist
> there. It would not be as strict but it should be sufficient.

Well, this is what this patch tries to do. Which other counters can I
use to consider only the eligible zones when evaluating the number of
isolated pages?

> > > > Also consider that lowmem throttling in too_many_isolated has only a small
> > > > chance to ever work with the node counters because highmem >> lowmem in
> > > > many/most configurations.
> > > > 
> > > 
> > > While true, it's also not that important.
> > > 
> > > > > More importantly, this patch may make things worse and delay reclaim. If
> > > > > this patch allowed a lowmem request to make progress that would have
> > > > > previously stalled, it's going to spend time skipping pages in the LRU
> > > > > instead of letting kswapd and the highmem pressured processes make progress.
> > > > 
> > > > I am not sure I understand this part. Say that we have highmem pressure
> > > > which would isolate too many pages from the LRU.
> > > 
> > > Which requires multiple direct reclaimers or tiny inactive lists. In the
> > > event there is such highmem pressure, it also means the lower zones are
> > > depleted.
> > 
> > But consider lowmem pressure without highmem pressure, e.g. a heavy parallel
> > fork or any other GFP_KERNEL intensive workload.
> >  
> 
> Lowmem without highmem pressure means all isolated pages are in the lowmem
> nodes and the per-zone counters are unnecessary.

But most configurations will have highmem and lowmem zones in the same
node...
 
> > > > A lowmem request would
> > > > previously stall regardless of where those pages came from. With this
> > > > patch it would stall only when we isolated too many pages from the
> > > > eligible zones.
> > > 
> > > And when it makes progress, it's going to compete with the other direct
> > > reclaimers except the lowmem reclaim is skipping some pages and
> > > recycling them through the LRU. It chews up CPU that would probably have
> > > been better spent letting kswapd and the other direct reclaimers do
> > > their work.
> > 
> > OK, I guess we are talking past each other. What I meant to say is that
> > it doesn't really make any difference who is chewing through the LRU to
> > find the last few lowmem pages to reclaim. So I do not see much of a
> > difference between sleeping and postponing that to kswapd.
> > 
> > That being said, I _believe_ I will need per zone ISOLATED counters in
> > order to make the other patch work reliably and not declare OOM
> > prematurely. Maybe there is some other way around that (hence this RFC).
> > Would you be strongly opposed to a patch which would make the counters per
> > zone without touching too_many_isolated?
> 
> I'm resistant to the per-zone counters in general but it's unfortunate to
> add them just to avoid a potentially infinite loop from isolated pages.

I am really open to any alternative solutions, of course. This is
the best I could come up with. I will keep thinking, but removing
too_many_isolated without considering isolated pages during the OOM
detection is just too risky. Too many pages can be isolated for us to
simply ignore them.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-18 17:29               ` Michal Hocko
@ 2017-01-19 10:07                 ` Mel Gorman
  -1 siblings, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2017-01-19 10:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed, Jan 18, 2017 at 06:29:46PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 17:00:10, Mel Gorman wrote:
> > > > You don't appear to directly use that information in patch 2.
> > > 
> > > It is used via zone_reclaimable_pages in should_reclaim_retry
> > > 
> > 
> > Which is still not directly required to avoid the infinite loop. There
> > is even a small inherent risk if the too_many_isolated condition no longer
> > applies at the time should_reclaim_retry is attempted.
> 
> Not really, because if those pages are no longer isolated then they
> either have been reclaimed - and NR_FREE_PAGES will increase - or they
> have been put back on the LRU in which case we will see them in regular LRU
> counters. I need to catch the case where there are still too many pages
> isolated which would skew the should_reclaim_retry watermark check.
>  

We can also rely on the no_progress_loops counter to trigger OOM. It'll
take longer but has a lower risk of premature OOM.
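
As a simplified sketch of that fallback (the real logic lives in
should_reclaim_retry() in mm/page_alloc.c and also consults watermarks;
the helper below is invented for illustration):

#define MAX_RECLAIM_RETRIES 16

/*
 * Once no reclaim progress has been made for MAX_RECLAIM_RETRIES
 * consecutive attempts, the allocator stops retrying and the OOM
 * killer may be considered.
 */
static bool keep_retrying(int *no_progress_loops, bool made_progress)
{
	if (made_progress)
		*no_progress_loops = 0;
	else
		(*no_progress_loops)++;

	return *no_progress_loops <= MAX_RECLAIM_RETRIES;
}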

> > > > The primary
> > > > breakout is returning after stalling at least once. You could also avoid
> > > > an infinite loop by using a waitqueue that sleeps on too many isolated.
> > > 
> > > That would be tricky on its own. Just consider the report from Tetsuo.
> > > Basically all the direct reclaimers are looping on too_many_isolated
> > > while the kswapd is not making any progress because it is blocked on FS
> > > locks which are held by flushers which are making dead slow progress.
> > > Some of those direct reclaimers could have gone oom instead and released
> > > some memory if we decided so, which we cannot because we are deep down in
> > > the reclaim path. Waiting on the reclaimer to decrease the ISOLATED
> > > counter wouldn't help in this situation.
> > > 
> > 
> > If it's a waitqueue waking one process at a time, the progress may be
> > slow but it'll still exit the loop, attempt the reclaim and then
> > potentially OOM if no progress is made. The key is using the waitqueue
> > to have a fair queue of processes making progress instead of a
> > potentially infinite loop that never meets the exit conditions.
> 
> It is not clear to me who would wake waiters on the queue. You cannot
> rely on kswapd to do that as already mentioned.
> 

We can use timeouts to guard against an infinite wait. Besides, updating
every single place where pages are put back on the LRU would be fragile
and too easy to break.

> > > > That would both avoid the clunky congestion_wait() and guarantee forward
> > > > progress. If the primary motivation is to avoid an infinite loop with
> > > > too_many_isolated then there are ways of handling that without reintroducing
> > > > zone-based counters.
> > > > 
> > > > > > Heavy memory pressure on highmem should be spread across the whole node as
> > > > > > we are no longer applying the fair zone allocation policy. The processes
> > > > > > with highmem requirements will be reclaiming from all zones and when that
> > > > > > finishes, it's possible that a lowmem-specific request will be free to make
> > > > > > progress. It's all the same LRU so if there are too many pages isolated,
> > > > > > it makes sense to wait regardless of the allocation request.
> > > > > 
> > > > > This is true but I am not sure how it is related to the patch.
> > > > 
> > > > Because heavy pressure that is enough to trigger too many isolated pages
> > > > is unlikely to be specifically targeting a lower zone.
> > > 
> > > Why? Basically any GFP_KERNEL allocation will create lowmem pressure, and
> > > going OOM on lowmem is not all that unrealistic a scenario on 32b systems.
> > > 
> > 
> > If the sole source of pressure is from GFP_KERNEL allocations then the
> > isolated counter will also be specific to the lower zones and there is no
> > benefit from the patch.
> 
> I believe you are wrong here. Just consider that you have isolated
> basically all lowmem pages. too_many_isolated will still happily tell
> you not to throttle or back off because NR_INACTIVE_* are way bigger
> than all lowmem pages altogether. Or am I still missing your point?
> 

This is a potential risk. It could be accounted for by including the node
isolated counters in the calculation but it'll be inherently fuzzy and
may stall a lowmem direct reclaimer unnecessarily in the presence of
highmem reclaim.
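
A rough sketch of that fuzzy variant - illustrative only, the helper is
invented, and it uses only counters that exist today:

/*
 * Treat node-wide isolated pages as if they all came from eligible
 * zones. Fuzzy: isolated highmem pages may throttle a lowmem request
 * unnecessarily, but a fully isolated lowmem zone can no longer hide
 * behind large highmem LRU lists.
 */
static bool too_many_isolated_fuzzy(struct pglist_data *pgdat, int file,
				    unsigned long eligible_inactive)
{
	unsigned long isolated = node_page_state(pgdat,
			file ? NR_ISOLATED_FILE : NR_ISOLATED_ANON);

	return isolated > eligible_inactive;
}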

> > If there is a combination of highmem and lowmem pressure then the highmem
> > reclaimers will also reclaim lowmem memory.
> > 
> > > > There is general
> > > > pressure with multiple direct reclaimers being applied. If the system is
> > > > under enough pressure with parallel reclaimers to trigger too_many_isolated
> > > > checks then the system is grinding already and making little progress. Adding
> > > > multiple counters to allow a lowmem reclaimer to potentially make faster
> > > > progress is going to be marginal at best.
> > > 
> > OK, I agree that the situation where highmem blocks lowmem from making
> > progress is much less likely than the other situation described in the
> > changelog where lowmem never gets throttled. That is the one I am
> > more interested in.
> > > 
> > 
> That is of some concern but could be handled by having too_many_isolated
> > take into account if it's a zone-restricted allocation and if so, then
> > decrement the LRU counts from the higher zones. Counters already exist
> > there. It would not be as strict but it should be sufficient.
> 
Well, this is what this patch tries to do. Which other counters can I
use to consider only eligible zones when evaluating the number of
isolated pages?
> 

The LRU anon/file counters. It'll reduce the number of eligible pages
for reclaim.
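
A sketch of what using those existing counters could look like - the
helper name is invented; it shrinks the node-wide inactive total by the
LRU pages sitting in zones a zone-restricted allocation cannot use:

static unsigned long eligible_inactive(struct pglist_data *pgdat, int file,
				       enum zone_type high_zoneidx)
{
	unsigned long inactive = node_page_state(pgdat,
			file ? NR_INACTIVE_FILE : NR_INACTIVE_ANON);
	enum zone_type zid;

	/* Subtract LRU pages from zones above the allocation's limit. */
	for (zid = high_zoneidx + 1; zid < MAX_NR_ZONES; zid++) {
		struct zone *zone = &pgdat->node_zones[zid];

		if (!populated_zone(zone))
			continue;

		inactive -= min(inactive, zone_page_state(zone, file ?
				NR_ZONE_INACTIVE_FILE : NR_ZONE_INACTIVE_ANON));
	}

	return inactive;
}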

> > > > > Also consider that lowmem throttling in too_many_isolated has only a small
> > > > > chance of ever working with the node counters because highmem >> lowmem in
> > > > > many/most configurations.
> > > > > 
> > > > 
> > > > While true, it's also not that important.
> > > > 
> > > > > > More importantly, this patch may make things worse and delay reclaim. If
> > > > > > this patch allowed a lowmem request to make progress that would have
> > > > > > previously stalled, it's going to spend time skipping pages in the LRU
> > > > > > instead of letting kswapd and the highmem pressured processes make progress.
> > > > > 
> > > > > I am not sure I understand this part. Say that we have highmem pressure
> > > > > which would isolate too many pages from the LRU.
> > > > 
> > > > Which requires multiple direct reclaimers or tiny inactive lists. In the
> > > > event there is such highmem pressure, it also means the lower zones are
> > > > depleted.
> > > 
> > > But consider lowmem pressure without highmem pressure. E.g. a heavy parallel
> > > fork or any other GFP_KERNEL intensive workload.
> > >  
> > 
> > Lowmem without highmem pressure means all isolated pages are in the lowmem
> > nodes and the per-zone counters are unnecessary.
> 
> But most configurations will have highmem and lowmem zones in the same
> node...

True but if it's only lowmem pressure it doesn't matter.

>  
> > > OK, I guess we are talking past each other. What I meant to say is that
> > > it doesn't really make any difference who is chewing through the LRU to
> > > find the last few lowmem pages to reclaim. So I do not see much of a
> > > difference between sleeping and postponing that to the kswapd.
> > > 
> > > That being said, I _believe_ I will need per zone ISOLATED counters in
> > > order to make the other patch work reliably and not declare oom
> > > prematurely. Maybe there is some other way around that (hence this RFC).
> > > Would you be strongly opposed to the patch which would make counters per
> > > zone without touching too_many_isolated?
> > 
> > I'm resistant to the per-zone counters in general but it's unfortunate to
> > add them just to avoid a potentially infinite loop from isolated pages.
> 
> I am really open to any alternative solutions, of course. This is
> the best I could come up with. I will keep thinking, but removing
> too_many_isolated without considering isolated pages during the oom
> detection is just too risky. We can have too many pages isolated to
> simply ignore them.

If it's definitely required and is proven to fix the
infinite-loop-without-oom workload then I'll back off and withdraw my
objections. However, I'd at least like the following untested patch to
be considered as an alternative. It has some weaknesses and would be
slower to OOM than your patch, but it avoids reintroducing zone counters.

---8<---
mm, vmscan: Wait on a waitqueue when too many pages are isolated

When too many pages are isolated, direct reclaim waits on congestion to clear
for up to a tenth of a second. There is no reason to believe that too many
pages are isolated due to dirty pages, reclaim efficiency or congestion.
It may simply be because an extremely large number of processes have entered
direct reclaim at the same time. However, it is possible for the situation
to persist forever and never reach OOM.

This patch queues processes on a waitqueue when too many pages are isolated.
When parallel reclaimers finish shrink_page_list, they wake the waiters
to recheck whether too many pages are isolated.

The wait on the queue has a timeout as not all sites that isolate pages
will do the wakeup. Depending on every isolation of LRU pages being perfect
forever is potentially fragile. The specific wakeups occur for page reclaim
and compaction. If too many pages are isolated due to memory failure,
hotplug or directly calling migration from a syscall then the waiting
processes may wait the full timeout.

Note that the timeout allows the use of waitqueue_active() on the basis
that a race will cause the full timeout to be reached due to a missed
wakeup. This is relatively harmless and still a massive improvement over
unconditionally calling congestion_wait.

Direct reclaimers that cannot isolate pages within the timeout will consider
returning to the caller. This is somewhat clunky as it won't return immediately
and instead goes through the other priorities and slab shrinking. Eventually,
it'll go through a few iterations of should_reclaim_retry and reach the
MAX_RECLAIM_RETRIES limit and consider going OOM.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..3dd617d0c8c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,7 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t isolated_wait;
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..1b1ff6da7401 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
 	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
 
+	/* Page reclaim could have stalled due to isolated pages */
+	if (waitqueue_active(&zone->zone_pgdat->isolated_wait))
+		wake_up(&zone->zone_pgdat->isolated_wait);
+
 	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
 				cc->free_pfn, end_pfn, sync, ret);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ff25883c172..d848c9f31bff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->isolated_wait);
 #ifdef CONFIG_COMPACTION
 	init_waitqueue_head(&pgdat->kcompactd_wait);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2281ad310d06..c93f299fbad7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static bool safe_to_isolate(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
 
 	if (current_is_kswapd())
-		return 0;
+		return true;
 
-	if (!sane_reclaim(sc))
-		return 0;
+	if (sane_reclaim(sc))
+		return true;
 
 	if (file) {
 		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
@@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
-	return isolated > inactive;
+	return isolated < inactive;
 }
 
 static noinline_for_stack void
@@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	while (!safe_to_isolate(pgdat, file, sc)) {
+		long ret;
+
+		ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
+			safe_to_isolate(pgdat, file, sc), HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+		if (fatal_signal_pending(current)) {
+			nr_reclaimed = SWAP_CLUSTER_MAX;
+			goto out;
+		}
+
+		/*
+		 * If we reached the timeout, this is direct reclaim, and
+		 * pages cannot be isolated then return. If the situation
+		 * persists for a long time then it'll eventually reach
+		 * the no_progress limit in should_reclaim_retry and consider
+		 * going OOM. In this case, do not wake the isolated_wait
+		 * queue as the wakee will still not be able to make progress.
+		 */
+		if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc))
+			return 0;
 	}
 
 	lru_add_drain();
@@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			stat.nr_activate, stat.nr_ref_keep,
 			stat.nr_unmap_fail,
 			sc->priority, file);
+
+out:
+	if (waitqueue_active(&pgdat->isolated_wait))
+		wake_up(&pgdat->isolated_wait);
 	return nr_reclaimed;
 }
 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-19 10:07                 ` Mel Gorman
@ 2017-01-19 11:23                   ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-19 11:23 UTC (permalink / raw)
  To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Thu 19-01-17 10:07:55, Mel Gorman wrote:
[...]
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
> 
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
> 
> This patch queues processes on a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
> 
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages being perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
> 
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
> 
> Direct reclaimers that cannot isolate pages within the timeout will consider
> returning to the caller. This is somewhat clunky as it won't return immediately
> and instead goes through the other priorities and slab shrinking. Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.

I cannot really say I would like this. It's just much more complex than
necessary. I definitely agree that congestion_wait while waiting for
too_many_isolated is a crude hack. This patch doesn't really resolve
my biggest worry, though, that we go OOM with too many pages isolated
as your patch doesn't alter zone_reclaimable_pages to reflect those
numbers.

Anyway, I think both of us are probably overcomplicating things a bit.
Your waitqueue approach is definitely better semantically than the
congestion_wait because we are waiting for a different event than the
API is intended for. On the other hand, a mere
schedule_timeout_interruptible might work equally well in real life.
Then again, I might really be overemphasising the role of NR_ISOLATED*
counts. It might turn out that we can safely ignore them and it
won't be the end of the world. So what do you think about the following
as a starting point? If we ever see oom reports with high NR_ISOLATED*
numbers - they are part of the oom report - then we know we have to do
something about that. Those changes would at least be driven by a real
usecase rather than theoretical scenarios.

So what do you think about the following? Tetsuo, would you be willing
to run this patch through your torture testing please?
---
From 47cba23b5b50260b533d7ad57a4c9e6a800d9b20 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 19 Jan 2017 12:11:56 +0100
Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
the too_many_isolated loop basically forever because the last few pages on
the LRU lists are isolated by kswapd, which is stuck on fs locks when
doing the pageout. This in turn means that there is nobody to actually
trigger the oom killer and the system is basically unusable.

too_many_isolated was introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations because back then a lack of reclaim
progress could indeed trigger the OOM killer too early. But since the
oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection")
the allocation/reclaim retry loop considers all the reclaimable pages
and throttles the allocation at that layer, so we can loosen the direct
reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and return
immediately when the situation has not resolved after the first sleep.
Replace congestion_wait with a simple schedule_timeout_interruptible because
we are not really waiting on IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for reclaim
which makes progress only very slowly. This would be obvious from the oom
report, as the number of isolated pages is printed there. If we ever hit
this, should_reclaim_retry should consider those numbers in the evaluation
in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a60066d4521b..d07380ba1f9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1718,9 +1718,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-19 11:23                   ` Michal Hocko
@ 2017-01-19 13:11                     ` Mel Gorman
  -1 siblings, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2017-01-19 13:11 UTC (permalink / raw)
  To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> On Thu 19-01-17 10:07:55, Mel Gorman wrote:
> [...]
> > mm, vmscan: Wait on a waitqueue when too many pages are isolated
> > 
> > When too many pages are isolated, direct reclaim waits on congestion to clear
> > for up to a tenth of a second. There is no reason to believe that too many
> > pages are isolated due to dirty pages, reclaim efficiency or congestion.
> > It may simply be because an extremely large number of processes have entered
> > direct reclaim at the same time. However, it is possible for the situation
> > to persist forever and never reach OOM.
> > 
> > This patch queues processes on a waitqueue when too many pages are isolated.
> > When parallel reclaimers finish shrink_page_list, they wake the waiters
> > to recheck whether too many pages are isolated.
> > 
> > The wait on the queue has a timeout as not all sites that isolate pages
> > will do the wakeup. Depending on every isolation of LRU pages being perfect
> > forever is potentially fragile. The specific wakeups occur for page reclaim
> > and compaction. If too many pages are isolated due to memory failure,
> > hotplug or directly calling migration from a syscall then the waiting
> > processes may wait the full timeout.
> > 
> > Note that the timeout allows the use of waitqueue_active() on the basis
> > that a race will cause the full timeout to be reached due to a missed
> > wakeup. This is relatively harmless and still a massive improvement over
> > unconditionally calling congestion_wait.
> > 
> > Direct reclaimers that cannot isolate pages within the timeout will consider
> > returning to the caller. This is somewhat clunky as it won't return immediately
> > and instead goes through the other priorities and slab shrinking. Eventually,
> > it'll go through a few iterations of should_reclaim_retry and reach the
> > MAX_RECLAIM_RETRIES limit and consider going OOM.
> 
> I cannot really say I would like this. It's just much more complex than
> necessary.

I guess it's a difference in opinion. Mixing per-zone and per-node
information is, for me, complex. I liked the waitqueue because it was an
example of waiting on a specific event instead of relying completely on
time.

> I definitely agree that congestion_wait while waiting for
> too_many_isolated is a crude hack. This patch doesn't really resolve
> my biggest worry, though, that we go OOM with too many pages isolated
> as your patch doesn't alter zone_reclaimable_pages to reflect those
> numbers.
> 

Indeed, but such cases are also caught by the no_progress_loops logic to
avoid a premature OOM.

> Anyway, I think both of us are probably overcomplicating things a bit.
> Your waitqueue approach is definitely better semantically than the
> congestion_wait because we are waiting for a different event than the
> API is intended for. On the other hand, a mere
> schedule_timeout_interruptible might work equally well in real life.
> Then again, I might really be overemphasising the role of NR_ISOLATED*
> counts. It might turn out that we can safely ignore them and it
> won't be the end of the world. So what do you think about the following
> as a starting point? If we ever see oom reports with high NR_ISOLATED*
> numbers - they are part of the oom report - then we know we have to do
> something about that. Those changes would at least be driven by a real
> usecase rather than theoretical scenarios.
> 
> So what do you think about the following? Tetsuo, would you be willing
> to run this patch through your torture testing please?

I'm fine with treating this as a starting point.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-19 10:07                 ` Mel Gorman
@ 2017-01-20  6:42                   ` Hillf Danton
  -1 siblings, 0 replies; 110+ messages in thread
From: Hillf Danton @ 2017-01-20  6:42 UTC (permalink / raw)
  To: 'Mel Gorman', 'Michal Hocko'
  Cc: linux-mm, 'Johannes Weiner', 'Tetsuo Handa',
	'LKML'


On Thursday, January 19, 2017 6:08 PM Mel Gorman wrote: 
> 
> If it's definitely required and is proven to fix the
> infinite-loop-without-oom workload then I'll back off and withdraw my
> objections. However, I'd at least like the following untested patch to
> be considered as an alternative. It has some weaknesses and would be
> slower to OOM than your patch, but it avoids reintroducing zone counters.
> 
> ---8<---
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
> 
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
> 
> This patch queues processes on a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
> 
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages being perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
> 
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
> 
> Direct reclaimers that cannot isolate pages within the timeout will consider
> returning to the caller. This is somewhat clunky as it won't return immediately
> and may go through the other priorities and slab shrinking. Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 91f69aa0d581..3dd617d0c8c4 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -628,6 +628,7 @@ typedef struct pglist_data {
>  	int node_id;
>  	wait_queue_head_t kswapd_wait;
>  	wait_queue_head_t pfmemalloc_wait;
> +	wait_queue_head_t isolated_wait;
>  	struct task_struct *kswapd;	/* Protected by
>  					   mem_hotplug_begin/end() */
>  	int kswapd_order;
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 43a6cf1dc202..1b1ff6da7401 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
>  	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
> 
> +	/* Page reclaim could have stalled due to isolated pages */
> +	if (waitqueue_active(&zone->zone_pgdat->isolated_wait))
> +		wake_up(&zone->zone_pgdat->isolated_wait);
> +
>  	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
>  				cc->free_pfn, end_pfn, sync, ret);
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8ff25883c172..d848c9f31bff 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
>  #endif
>  	init_waitqueue_head(&pgdat->kswapd_wait);
>  	init_waitqueue_head(&pgdat->pfmemalloc_wait);
> +	init_waitqueue_head(&pgdat->isolated_wait);
>  #ifdef CONFIG_COMPACTION
>  	init_waitqueue_head(&pgdat->kcompactd_wait);
>  #endif
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2281ad310d06..c93f299fbad7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
>   * the LRU list will go small and be scanned faster than necessary, leading to
>   * unnecessary swapping, thrashing and OOM.
>   */
> -static int too_many_isolated(struct pglist_data *pgdat, int file,
> +static bool safe_to_isolate(struct pglist_data *pgdat, int file,
>  		struct scan_control *sc)

I prefer the current function name.

>  {
>  	unsigned long inactive, isolated;
> 
>  	if (current_is_kswapd())
> -		return 0;
> +		return true;
> 
> -	if (!sane_reclaim(sc))
> -		return 0;
> +	if (sane_reclaim(sc))
> +		return true;

We only need a one-line change.
> 
>  	if (file) {
>  		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
>  	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>  		inactive >>= 3;
> 
> -	return isolated > inactive;
> +	return isolated < inactive;
>  }
> 
>  static noinline_for_stack void
> @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> 
> -	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> +	while (!safe_to_isolate(pgdat, file, sc)) {
> +		long ret;
> +
> +		ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
> +			safe_to_isolate(pgdat, file, sc), HZ/10);
> 
>  		/* We are about to die and free our memory. Return now. */
> -		if (fatal_signal_pending(current))
> -			return SWAP_CLUSTER_MAX;
> +		if (fatal_signal_pending(current)) {
> +			nr_reclaimed = SWAP_CLUSTER_MAX;
> +			goto out;
> +		}
> +
> +		/*
> +		 * If we reached the timeout, this is direct reclaim, and
> +		 * pages cannot be isolated then return. If the situation

Please add a comment saying that we would rather shrink slab than take
another nap.

> +		 * persists for a long time then it'll eventually reach
> +		 * the no_progress limit in should_reclaim_retry and consider
> +		 * going OOM. In this case, do not wake the isolated_wait
> +		 * queue as the wakee will still not be able to make progress.
> +		 */
> +		if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc))
> +			return 0;
>  	}
> 
>  	lru_add_drain();
> @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
>  			stat.nr_activate, stat.nr_ref_keep,
>  			stat.nr_unmap_fail,
>  			sc->priority, file);
> +
> +out:
> +	if (waitqueue_active(&pgdat->isolated_wait))
> +		wake_up(&pgdat->isolated_wait);
>  	return nr_reclaimed;
>  }
> 
Is it also needed to check whether isolated_wait is active before
kswapd takes a nap?

thanks
Hillf

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-20  6:42                   ` Hillf Danton
@ 2017-01-20  9:25                     ` Mel Gorman
  -1 siblings, 0 replies; 110+ messages in thread
From: Mel Gorman @ 2017-01-20  9:25 UTC (permalink / raw)
  To: Hillf Danton
  Cc: 'Michal Hocko', linux-mm, 'Johannes Weiner',
	'Tetsuo Handa', 'LKML'

On Fri, Jan 20, 2017 at 02:42:24PM +0800, Hillf Danton wrote:
> > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
> >   * the LRU list will go small and be scanned faster than necessary, leading to
> >   * unnecessary swapping, thrashing and OOM.
> >   */
> > -static int too_many_isolated(struct pglist_data *pgdat, int file,
> > +static bool safe_to_isolate(struct pglist_data *pgdat, int file,
> >  		struct scan_control *sc)
> 
> I prefer the current function name.
> 

The restructuring is to work with the waitqueue API.

> >  {
> >  	unsigned long inactive, isolated;
> > 
> >  	if (current_is_kswapd())
> > -		return 0;
> > +		return true;
> > 
> > -	if (!sane_reclaim(sc))
> > -		return 0;
> > +	if (sane_reclaim(sc))
> > +		return true;
> 
> We only need a one-line change.

It's a bool now, so the return values are converted to bool while the
function is being changed anyway.

> > 
> >  	if (file) {
> >  		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
> > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
> >  	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
> >  		inactive >>= 3;
> > 
> > -	return isolated > inactive;
> > +	return isolated < inactive;
> >  }
> > 
> >  static noinline_for_stack void
> > @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > 
> > -	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> > -		congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +	while (!safe_to_isolate(pgdat, file, sc)) {
> > +		long ret;
> > +
> > +		ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
> > +			safe_to_isolate(pgdat, file, sc), HZ/10);
> > 
> >  		/* We are about to die and free our memory. Return now. */
> > -		if (fatal_signal_pending(current))
> > -			return SWAP_CLUSTER_MAX;
> > +		if (fatal_signal_pending(current)) {
> > +			nr_reclaimed = SWAP_CLUSTER_MAX;
> > +			goto out;
> > +		}
> > +
> > +		/*
> > +		 * If we reached the timeout, this is direct reclaim, and
> > +		 * pages cannot be isolated then return. If the situation
> 
> Please add something that we would rather shrink slab than go
> another round of nap.
> 

That's not necessarily true or even a good idea. It could result in
excessive slab shrinking that is no longer in proportion to LRU scanning
and increased contention within shrinkers.

> > +		 * persists for a long time then it'll eventually reach
> > +		 * the no_progress limit in should_reclaim_retry and consider
> > +		 * going OOM. In this case, do not wake the isolated_wait
> > +		 * queue as the wakee will still not be able to make progress.
> > +		 */
> > +		if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc))
> > +			return 0;
> >  	}
> > 
> >  	lru_add_drain();
> > @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  			stat.nr_activate, stat.nr_ref_keep,
> >  			stat.nr_unmap_fail,
> >  			sc->priority, file);
> > +
> > +out:
> > +	if (waitqueue_active(&pgdat->isolated_wait))
> > +		wake_up(&pgdat->isolated_wait);
> >  	return nr_reclaimed;
> >  }
> > 
> Is it also needed to check isolated_wait active before kswapd 
> takes nap?
> 

No, because this is where pages were isolated and there is no putback
event that would justify waking the queue. There is a race between
waitqueue_active() and going to sleep that we rely on the timeout to
recover from.
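
To spell out the window (a condensed sketch of the two sides as they
already appear in the patch above, no new mechanism):

----------
/* Sleeper side (direct reclaim): */
ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
		safe_to_isolate(pgdat, file, sc), HZ/10);

/* Waker side (end of shrink_inactive_list()/compact_zone()): */
if (waitqueue_active(&pgdat->isolated_wait))
	wake_up(&pgdat->isolated_wait);

/*
 * The waker can observe waitqueue_active() == false in the instant
 * before the sleeper enqueues itself, and therefore skip the wake_up().
 * The sleeper then misses that event, but the damage is bounded by the
 * HZ/10 timeout: one extra stall of up to 100ms rather than an
 * unbounded sleep.  This is why a plain wait_event_interruptible()
 * without a timeout would not be safe here.
 */
----------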

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-19 13:11                     ` Mel Gorman
@ 2017-01-20 13:27                       ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-20 13:27 UTC (permalink / raw)
  To: mgorman, mhocko; +Cc: linux-mm, hannes, linux-kernel

Mel Gorman wrote:
> On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> > So what do you think about the following? Tetsuo, would you be willing
> > to run this patch through your torture testing please?
> 
> I'm fine with treating this as a starting point.

OK. So I tried to test this patch but I failed at the preparation step.
There are too many pending mm patches and I'm not sure which patch on
which linux-next snapshot I should try. Also as another question,
too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c
but why this patch does not touch the loop in mm/compaction.c part?
Is there a guarantee that the problem can be avoided by tweaking only
too_many_isolated() part?

Anyway I tried linux-next-20170119 snapshot in order to confirm that
my reproducer can still reproduce the problem before trying this patch.
But I was not able to reproduce the problem today, because the mm part
is changing rapidly and existing reproducers need tuning.

And I think that there is a different problem if I tune the reproducer
as below (i.e. increase the buffer size passed to write()/fsync() from
4096).

----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[10485760] = { }; /* or 1048576 */
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
----------

The above reproducer sometimes kills all OOM-killable processes and the
system finally panics. I guess that somebody is abusing TIF_MEMDIE for
needless allocations to the point where GFP_ATOMIC allocations start
failing.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170120.txt.xz .
----------
[  184.482761] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
(...snipped...)
[  184.482955] Node 0 active_anon:1418748kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB isolated(anon):0kB isolated(file):132kB mapped:13744kB dirty:25872kB writeback:376kB shmem:0kB shmem_thp: 0kB sh\
mem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:95127 all_unreclaimable? yes
[  184.482956] Node 0 DMA free:7660kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:40\
kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  184.482959] lowmem_reserve[]: 0 1823 1823 1823
[  184.482963] Node 0 DMA32 free:44636kB min:44672kB low:55840kB high:67008kB active_anon:1410572kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB writepending:26248kB present:2080640kB managed:1866768kB\
 mlocked:0kB slab_reclaimable:85544kB slab_unreclaimable:128876kB kernel_stack:20496kB pagetables:40712kB bounce:0kB free_pcp:1136kB local_pcp:656kB free_cma:0kB
[  184.482966] lowmem_reserve[]: 0 0 0 0
[  184.482970] Node 0 DMA: 9*4kB (UE) 5*8kB (E) 2*16kB (ME) 0*32kB 2*64kB (U) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ME) 0*4096kB = 7660kB
[  184.482994] Node 0 DMA32: 3845*4kB (UME) 1809*8kB (UME) 600*16kB (UME) 134*32kB (UME) 14*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44636kB
(...snipped...)
[  187.477371] Node 0 active_anon:1415648kB inactive_anon:13548kB active_file:11452kB inactive_file:79120kB unevictable:0kB isolated(anon):0kB isolated(file):5220kB mapped:13748kB dirty:83484kB writeback:376kB shmem:0kB shmem_thp: 0kB s\
hmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:16058 all_unreclaimable? no
[  187.477372] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:6976kB unevictable:0kB writepending:7492kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable\
:172kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:64kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  187.477375] lowmem_reserve[]: 0 1823 1823 1823
[  187.477378] Node 0 DMA32 free:0kB min:44672kB low:55840kB high:67008kB active_anon:1407472kB inactive_anon:13548kB active_file:11452kB inactive_file:71928kB unevictable:0kB writepending:76368kB present:2080640kB managed:1866768kB mlo\
cked:0kB slab_reclaimable:85580kB slab_unreclaimable:128824kB kernel_stack:20496kB pagetables:39460kB bounce:0kB free_pcp:52kB local_pcp:0kB free_cma:0kB
[  187.477381] lowmem_reserve[]: 0 0 0 0
[  187.477385] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[  187.477394] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
(...snipped...)
[  318.524868] Node 0 active_anon:7064kB inactive_anon:12088kB active_file:13272kB inactive_file:1520272kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:10276kB dirty:1520264kB writeback:44kB shmem:0kB shmem_thp: 0kB sh\
mem_pmdmapped: 0kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:3542854 all_unreclaimable? yes
[  318.524869] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:14752kB unevictable:0kB writepending:14808kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:\
1096kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  318.524872] lowmem_reserve[]: 0 1823 1823 1823
[  318.524876] Node 0 DMA32 free:0kB min:44672kB low:55840kB high:67008kB active_anon:7064kB inactive_anon:12088kB active_file:13272kB inactive_file:1505460kB unevictable:0kB writepending:1505500kB present:2080640kB managed:1866768kB ml\
ocked:0kB slab_reclaimable:147588kB slab_unreclaimable:99652kB kernel_stack:16512kB pagetables:2016kB bounce:0kB free_pcp:788kB local_pcp:512kB free_cma:0kB
[  318.524879] lowmem_reserve[]: 0 0 0 0
[  318.524882] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[  318.524893] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[  318.524903] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  318.524904] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  318.524905] 386967 total pagecache pages
[  318.524908] 0 pages in swap cache
[  318.524909] Swap cache stats: add 0, delete 0, find 0/0
[  318.524909] Free swap  = 0kB
[  318.524910] Total swap = 0kB
[  318.524912] 524157 pages RAM
[  318.524912] 0 pages HighMem/MovableOnly
[  318.524913] 53489 pages reserved
[  318.524914] 0 pages cma reserved
[  318.524914] 0 pages hwpoisoned
[  318.524916] Kernel panic - not syncing: Out of memory and no killable processes...
----------

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-20 13:27                       ` Tetsuo Handa
@ 2017-01-21  7:42                         ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-21  7:42 UTC (permalink / raw)
  To: mgorman, mhocko, viro; +Cc: linux-mm, hannes, linux-kernel

Tetsuo Handa wrote:
> And I think that there is a different problem if I tune the reproducer
> as below (i.e. increase the buffer size passed to write()/fsync() from
> 4096).
> 
> ----------
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> 
> int main(int argc, char *argv[])
> {
> 	static char buffer[10485760] = { }; /* or 1048576 */
> 	char *buf = NULL;
> 	unsigned long size;
> 	unsigned long i;
> 	for (i = 0; i < 1024; i++) {
> 		if (fork() == 0) {
> 			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> 			write(fd, "1000", 4);
> 			close(fd);
> 			sleep(1);
> 			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
> 			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
> 			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
> 				fsync(fd);
> 			_exit(0);
> 		}
> 	}
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	sleep(2);
> 	/* Will cause OOM due to overcommit */
> 	for (i = 0; i < size; i += 4096)
> 		buf[i] = 0;
> 	pause();
> 	return 0;
> }
> ----------
> 
> The above reproducer sometimes kills all OOM-killable processes and the
> system finally panics. I guess that somebody is abusing TIF_MEMDIE for
> needless allocations to the point where GFP_ATOMIC allocations start
> failing.

I tracked down who is abusing TIF_MEMDIE using the patch below.

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ea088e1..d9ac53d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3038,7 +3038,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
+	if (1 || (gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
 	    debug_guardpage_minorder() > 0)
 		return;
 
@@ -3573,6 +3573,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	int no_progress_loops = 0;
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+	bool victim = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3656,8 +3657,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
+	if (gfp_pfmemalloc_allowed(gfp_mask)) {
 		alloc_flags = ALLOC_NO_WATERMARKS;
+		victim = test_thread_flag(TIF_MEMDIE);
+	}
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
@@ -3790,6 +3793,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	warn_alloc(gfp_mask, ac->nodemask,
 			"page allocation failure: order:%u", order);
 got_pg:
+	if (page && victim) {
+		pr_warn("%s(%u): TIF_MEMDIE allocation: order=%d mode=%#x(%pGg)\n",
+			current->comm, current->pid, order, gfp_mask, &gfp_mask);
+		dump_stack();
+	}
 	return page;
 }
 
----------------------------------------

And I got a flood of traces, shown below. The task seems to keep consuming
memory reserves until the full size passed to the write() request is stored
in the page cache, even after being OOM-killed.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz .
----------------------------------------
[  202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[  202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 4.10.0-rc4-next-20170120+ #492
[  202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  202.315429] Call Trace:
[  202.316902]  dump_stack+0x85/0xc9
[  202.318810]  __alloc_pages_slowpath+0xa99/0xd7c
[  202.320697]  ? node_dirty_ok+0xef/0x130
[  202.322454]  __alloc_pages_nodemask+0x436/0x4d0
[  202.324506]  alloc_pages_current+0x97/0x1b0
[  202.326397]  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
[  202.328209]  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
[  202.329989]  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
[  202.331905]  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
[  202.333641]  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
[  202.335377]  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
[  202.337090]  iomap_apply+0xb3/0x130                  fs/iomap.c:79
[  202.338721]  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
[  202.340613]  ? iomap_write_end+0x80/0x80
[  202.342471]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[  202.344501]  ? remove_wait_queue+0x59/0x60
[  202.346261]  xfs_file_write_iter+0x90/0x130 [xfs]
[  202.348082]  __vfs_write+0xe5/0x140
[  202.349743]  vfs_write+0xc7/0x1f0
[  202.351214]  ? syscall_trace_enter+0x1d0/0x380
[  202.353155]  SyS_write+0x58/0xc0
[  202.354628]  do_syscall_64+0x6c/0x200
[  202.356100]  entry_SYSCALL64_slow_path+0x25/0x25
----------------------------------------

Do we need to allow access to memory reserves for this allocation?
Or should the caller check for SIGKILL rather than iterating the loop?

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-20 13:27                       ` Tetsuo Handa
@ 2017-01-25  9:53                         ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25  9:53 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mgorman, linux-mm, hannes, linux-kernel

On Fri 20-01-17 22:27:27, Tetsuo Handa wrote:
> Mel Gorman wrote:
> > On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> > > So what do you think about the following? Tetsuo, would you be willing
> > > to run this patch through your torture testing please?
> > 
> > I'm fine with treating this as a starting point.
> 
> OK. So I tried to test this patch but I failed at preparation step.
> There are too many pending mm patches and I'm not sure which patch on
> which linux-next snapshot I should try.

The current linux-next should be good to test. It contains all patches
sitting in the mmotm tree. If you want a more stable base then you can
use mmotm git tree
(git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git #since-4.9
or its #auto-latest alias)

> Also, as another question: a
> too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c,
> so why does this patch not touch the loop in the mm/compaction.c part?

I am not yet convinced that compaction suffers from the same problem.
Compaction backs off much sooner, so that path shouldn't get into a
pathological situation AFAICS. I might be wrong here, but I think we
should start with the reclaim path first.
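
For reference, the compaction-side loop looks roughly like this (a
paraphrase from memory, worth double-checking against mm/compaction.c):

----------
	while (unlikely(too_many_isolated(zone))) {
		/* async compaction aborts instead of waiting */
		if (cc->mode == MIGRATE_ASYNC)
			return 0;

		congestion_wait(BLK_RW_ASYNC, HZ/10);

		if (fatal_signal_pending(current))
			return 0;
	}
----------

So the open-ended waiting is confined to sync compaction, and even there
each iteration can bail out on a fatal signal.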
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-21  7:42                         ` Tetsuo Handa
@ 2017-01-25 10:15                           ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25 10:15 UTC (permalink / raw)
  To: Tetsuo Handa, Christoph Hellwig
  Cc: mgorman, viro, linux-mm, hannes, linux-kernel

[Let's add Christoph]

The below insane^Wstress test should exercise the OOM killer behavior.

On Sat 21-01-17 16:42:42, Tetsuo Handa wrote:
> Tetsuo Handa wrote:
> > And I think that there is a different problem if I tune the reproducer
> > as below (i.e. increase the buffer size passed to write()/fsync() from
> > 4096).
> > 
> > ----------
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > 
> > int main(int argc, char *argv[])
> > {
> > 	static char buffer[10485760] = { }; /* or 1048576 */
> > 	char *buf = NULL;
> > 	unsigned long size;
> > 	unsigned long i;
> > 	for (i = 0; i < 1024; i++) {
> > 		if (fork() == 0) {
> > 			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> > 			write(fd, "1000", 4);
> > 			close(fd);
> > 			sleep(1);
> > 			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
> > 			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
> > 			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
> > 				fsync(fd);
> > 			_exit(0);
> > 		}
> > 	}
> > 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> > 		char *cp = realloc(buf, size);
> > 		if (!cp) {
> > 			size >>= 1;
> > 			break;
> > 		}
> > 		buf = cp;
> > 	}
> > 	sleep(2);
> > 	/* Will cause OOM due to overcommit */
> > 	for (i = 0; i < size; i += 4096)
> > 		buf[i] = 0;
> > 	pause();
> > 	return 0;
> > }
> > ----------
> > 
> > The above reproducer sometimes kills all OOM-killable processes and the
> > system finally panics. I guess that somebody is abusing TIF_MEMDIE for
> > needless allocations to the point where GFP_ATOMIC allocations start
> > failing.
[...] 
> And I got a flood of traces, shown below. The task seems to keep consuming
> memory reserves until the full size passed to the write() request is stored
> in the page cache, even after being OOM-killed.
> 
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz .
> ----------------------------------------
> [  202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
> [  202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 4.10.0-rc4-next-20170120+ #492
> [  202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  202.315429] Call Trace:
> [  202.316902]  dump_stack+0x85/0xc9
> [  202.318810]  __alloc_pages_slowpath+0xa99/0xd7c
> [  202.320697]  ? node_dirty_ok+0xef/0x130
> [  202.322454]  __alloc_pages_nodemask+0x436/0x4d0
> [  202.324506]  alloc_pages_current+0x97/0x1b0
> [  202.326397]  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> [  202.328209]  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> [  202.329989]  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> [  202.331905]  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> [  202.333641]  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> [  202.335377]  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
> [  202.337090]  iomap_apply+0xb3/0x130                  fs/iomap.c:79
> [  202.338721]  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> [  202.340613]  ? iomap_write_end+0x80/0x80
> [  202.342471]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> [  202.344501]  ? remove_wait_queue+0x59/0x60
> [  202.346261]  xfs_file_write_iter+0x90/0x130 [xfs]
> [  202.348082]  __vfs_write+0xe5/0x140
> [  202.349743]  vfs_write+0xc7/0x1f0
> [  202.351214]  ? syscall_trace_enter+0x1d0/0x380
> [  202.353155]  SyS_write+0x58/0xc0
> [  202.354628]  do_syscall_64+0x6c/0x200
> [  202.356100]  entry_SYSCALL64_slow_path+0x25/0x25
> ----------------------------------------
> 
> Do we need to allow access to memory reserves for this allocation?
> Or should the caller check for SIGKILL rather than iterating the loop?

I think we are missing a check for fatal_signal_pending in
iomap_file_buffered_write. This means that an oom victim can consume the
full memory reserves. What do you think about the following? I haven't
tested this, but it mimics generic_perform_write, so I guess it should
work.
---
From d56b54b708d403d1bf39fccb89750bab31c19032 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

The OOM victim is given access to all memory reserves so that it can make
forward progress and exit more easily. But iomap_file_buffered_write loops
until the full request is completed. We need to check for fatal signals and
back off with a short write.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/iomap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..a22672387549 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -238,6 +238,10 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
 
 	while (iov_iter_count(iter)) {
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
 		ret = iomap_apply(inode, pos, iov_iter_count(iter),
 				IOMAP_WRITE, ops, iter, iomap_write_actor);
 		if (ret <= 0)
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:15                           ` Michal Hocko
@ 2017-01-25 10:19                             ` Christoph Hellwig
  -1 siblings, 0 replies; 110+ messages in thread
From: Christoph Hellwig @ 2017-01-25 10:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Christoph Hellwig, mgorman, viro, linux-mm, hannes,
	linux-kernel

On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> I think we are missing a check for fatal_signal_pending in
> iomap_file_buffered_write. This means that an oom victim can consume the
> full memory reserves. What do you think about the following? I haven't
> tested this but it mimics generic_perform_write so I guess it should
> work.

Hi Michal,

this looks reasonable to me.  But we have a few more such loops,
maybe it makes sense to move the check into iomap_apply?

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:15                           ` Michal Hocko
@ 2017-01-25 10:33                             ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-25 10:33 UTC (permalink / raw)
  To: mhocko, hch; +Cc: mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> I think we are missing a check for fatal_signal_pending in
> iomap_file_buffered_write. This means that an oom victim can consume the
> full memory reserves. What do you think about the following? I haven't
> tested this but it mimics generic_perform_write so I guess it should
> work.

Looks OK to me. I was worried about

#define AOP_FLAG_UNINTERRUPTIBLE        0x0001 /* will not do a short write */

which forbids (!?) aborting the loop. But it seems that this flag is
no longer checked (i.e. it is set but never tested). So everybody should be
ready for a short write, although I don't know whether exofs / hfs / hfsplus
are doing appropriate error handling.
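
For illustration, being ready for a short write is the same contract that
user space has always had to honor for write(): retry on a short count and
on EINTR. A minimal sketch, independent of any particular filesystem:
----------
#include <errno.h>
#include <unistd.h>

/* Write all of buf, retrying short writes and EINTR.
 * Returns 0 on success, -1 with errno set on failure. */
static int write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;		/* real error */
		}
		done += n;	/* short write: advance and retry */
	}
	return 0;
}
----------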

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:19                             ` Christoph Hellwig
@ 2017-01-25 10:46                               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25 10:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > I think we are missing a check for fatal_signal_pending in
> > iomap_file_buffered_write. This means that an oom victim can consume the
> > full memory reserves. What do you think about the following? I haven't
> > tested this but it mimics generic_perform_write so I guess it should
> > work.
> 
> Hi Michal,
> 
> this looks reasonable to me.  But we have a few more such loops,
> maybe it makes sense to move the check into iomap_apply?

I wasn't sure about the expected semantics of iomap_apply but now that
I've actually checked all the callers I believe all of them should be
able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
iomap_fiemap and iomap_page_mkwrite do not seem to follow the standard
pattern of returning the number of written pages or an error but rather
propagate the error out. From my limited understanding of those code
paths that should be just fine. I was not all that sure about iomap_dio_rw,
which is just too convoluted for me. If that one is OK as well then
the following patch should indeed be better.
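
For reference, the convention the callers are expected to follow looks
roughly like this (a paraphrase of the generic write path, not the exact
tree code): return the short byte count if any progress was made, and only
propagate the error when nothing was written.
----------
	while (iov_iter_count(iter)) {
		ret = iomap_apply(inode, pos, iov_iter_count(iter),
				IOMAP_WRITE, ops, iter, iomap_write_actor);
		if (ret <= 0)
			break;	/* error (e.g. -EINTR) or no progress */
		pos += ret;
		written += ret;
	}

	return written ? written : ret;	/* short write beats -EINTR */
----------
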
---
From d99c9d4115bed69a5d71281f59c190b0b26627cf Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

The OOM victim is given access to all memory reserves so that it can make
forward progress and exit more easily. But iomap_file_buffered_write and
other callers of iomap_apply loop until the full request is completed. We
need to check for fatal signals and back off with a short write instead.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/iomap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..a58190f7a3e4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -46,6 +46,9 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	struct iomap iomap = { 0 };
 	loff_t written = 0, ret;
 
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	/*
 	 * Need to map a range from start position for length bytes. This can
 	 * span multiple pages - it is only guaranteed to return a range of a
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:46                               ` Michal Hocko
@ 2017-01-25 11:09                                 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-25 11:09 UTC (permalink / raw)
  To: mhocko, hch; +Cc: mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> > 
> > Hi Michal,
> > 
> > this looks reasonable to me.  But we have a few more such loops,
> > maybe it makes sense to move the check into iomap_apply?
> 
> I wasn't sure about the expected semantics of iomap_apply but now that
> I've actually checked all the callers I believe all of them should be
> able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> iomap_fiemap and iomap_page_mkwrite do not seem to follow the standard
> pattern of returning the number of written pages or an error but rather
> propagate the error out. From my limited understanding of those code
> paths that should be just fine. I was not all that sure about iomap_dio_rw,
> which is just too convoluted for me. If that one is OK as well then
> the following patch should indeed be better.

Is "length" in

   written = actor(inode, pos, length, data, &iomap);

call guaranteed to be small enough? If not guaranteed,
don't we need to check SIGKILL inside "actor" functions?

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:33                             ` Tetsuo Handa
@ 2017-01-25 12:34                               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25 12:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 19:33:59, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I think we are missing a check for fatal_signal_pending in
> > iomap_file_buffered_write. This means that an oom victim can consume the
> > full memory reserves. What do you think about the following? I haven't
> > tested this but it mimics generic_perform_write so I guess it should
> > work.
> 
> Looks OK to me. I was worried about
> 
> #define AOP_FLAG_UNINTERRUPTIBLE        0x0001 /* will not do a short write */
> 
> which forbids (!?) aborting the loop. But it seems that this flag is
> no longer checked (i.e. it is set but never tested). So everybody should be
> ready for a short write, although I don't know whether exofs / hfs / hfsplus
> are doing appropriate error handling.

Those were using the generic implementation before, and that handles this
case AFAICS.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 11:09                                 ` Tetsuo Handa
@ 2017-01-25 13:00                                   ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25 13:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > I think we are missing a check for fatal_signal_pending in
> > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > full memory reserves. What do you think about the following? I haven't
> > > > tested this but it mimics generic_perform_write so I guess it should
> > > > work.
> > > 
> > > Hi Michal,
> > > 
> > > this looks reasonable to me.  But we have a few more such loops,
> > > maybe it makes sense to move the check into iomap_apply?
> > 
> > I wasn't sure about the expected semantics of iomap_apply but now that
> > I've actually checked all the callers I believe all of them should be
> > able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> > iomap_fiemap and iomap_page_mkwrite do not seem to follow the standard
> > pattern of returning the number of written pages or an error but rather
> > propagate the error out. From my limited understanding of those code
> > paths that should be just fine. I was not all that sure about iomap_dio_rw,
> > which is just too convoluted for me. If that one is OK as well then
> > the following patch should indeed be better.
> 
> Is the "length" in the
> 
>    written = actor(inode, pos, length, data, &iomap);
> 
> call guaranteed to be small enough? If it is not guaranteed,
> don't we need to check for SIGKILL inside the "actor" functions?

You are right! Checking for signals inside iomap_apply doesn't really
solve anything because basically all users pass the full iov_iter_count()
as the length. Blee. So we have loops around iomap_apply which itself
loops inside the actor. iomap_write_begin seems to be used by most of
them, and it is also where we get the pagecache page, so I guess this
should be the "right" place to put the check. Things like dax_iomap_actor
will need an explicit check. This is quite unfortunate but I do not see
any better solution. What do you think, Christoph?
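
To make the nesting explicit, here is the rough shape of the paths in
question (heavily simplified, not the exact fs/iomap.c code):
----------
/*
 * iomap_file_buffered_write()          outer loop over the iov_iter
 *	while (iov_iter_count(iter))
 *		iomap_apply(..., iov_iter_count(iter), ...,
 *			    iomap_write_actor);
 *
 * iomap_apply()                        maps one extent, no loop
 *	ops->iomap_begin(inode, pos, length, flags, &iomap);
 *	written = actor(inode, pos, length, data, &iomap);
 *	ops->iomap_end(...);
 *
 * iomap_write_actor()                  inner loop, one page at a time
 *	do {
 *		iomap_write_begin(...);    <-- page cache allocation
 *		copy from the iov_iter into the page
 *		iomap_write_end(...);
 *	} while more of "length" remains;
 *
 * A single iomap_apply() call can therefore allocate an unbounded
 * number of pages, which is why the check belongs in
 * iomap_write_begin().
 */
----------
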
---
From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path
	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

The OOM victim is given access to all memory reserves so that it can make
forward progress and exit more easily. But iomap_file_buffered_write and
other callers of iomap_apply loop until the full request is completed. We
need to check for fatal signals and back off with a short write instead.
As iomap_apply delegates all the work down to the actors we have to hook
into those. All callers that work with the page cache call
iomap_write_begin, so we check for signals there. dax_iomap_actor has to
handle the situation explicitly because it copies data to userspace
directly. Other callers either work on a single page (iomap_page_mkwrite)
or do not allocate memory based on the given len (iomap_fiemap_actor).

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/dax.c   | 5 +++++
 fs/iomap.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 413a91db9351..0e263dacf9cf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
 		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..691eada58b06 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	BUG_ON(pos + len > iomap->offset + iomap->length);
 
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 12:34                               ` Michal Hocko
@ 2017-01-25 13:13                                 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-25 13:13 UTC (permalink / raw)
  To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Wed 25-01-17 19:33:59, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> > 
> > Looks OK to me. I was worried about
> > 
> > #define AOP_FLAG_UNINTERRUPTIBLE        0x0001 /* will not do a short write */
> > 
> > which forbids (!?) aborting the loop. But it seems that this flag is
> > no longer checked (i.e. it is set but never tested). So everybody should be
> > ready for a short write, although I don't know whether exofs / hfs / hfsplus
> > are doing appropriate error handling.
> 
> Those were using the generic implementation before, and that handles this
> case AFAICS.

What I wanted to say is: "We can remove AOP_FLAG_UNINTERRUPTIBLE completely
because grep does not find that flag used in any condition check, can't we?".

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 13:00                                   ` Michal Hocko
@ 2017-01-27 14:49                                     ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-27 14:49 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

Tetsuo,
before we settle on the proper fix for this issue, could you give the
patch below a try and try to reproduce the too_many_isolated() issue, or
just see whether patch [1] has any negative effect on your OOM stress
testing?

[1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
[...]
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> 
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
> 	__alloc_pages_nodemask+0x436/0x4d0
> 	alloc_pages_current+0x97/0x1b0
> 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> 	? iomap_write_end+0x80/0x80
> 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> 	? remove_wait_queue+0x59/0x60
> 	xfs_file_write_iter+0x90/0x130 [xfs]
> 	__vfs_write+0xe5/0x140
> 	vfs_write+0xc7/0x1f0
> 	? syscall_trace_enter+0x1d0/0x380
> 	SyS_write+0x58/0xc0
> 	do_syscall_64+0x6c/0x200
> 	entry_SYSCALL64_slow_path+0x25/0x25
> 
> The OOM victim is given access to all memory reserves so that it can make
> forward progress and exit more easily. But iomap_file_buffered_write and
> other callers of iomap_apply loop until the full request is completed. We
> need to check for fatal signals and back off with a short write instead.
> As iomap_apply delegates all the work down to the actors we have to hook
> into those. All callers that work with the page cache call
> iomap_write_begin, so we check for signals there. dax_iomap_actor has to
> handle the situation explicitly because it copies data to userspace
> directly. Other callers either work on a single page (iomap_page_mkwrite)
> or do not allocate memory based on the given len (iomap_fiemap_actor).
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
>  		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	BUG_ON(pos + len > iomap->offset + iomap->length);
>  
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
>  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> -- 
> 2.11.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-27 14:49                                     ` Michal Hocko
@ 2017-01-28 15:27                                       ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-28 15:27 UTC (permalink / raw)
  To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> Tetsuo,
> before we settle on the proper fix for this issue, could you give the
> patch below a try and try to reproduce the too_many_isolated() issue, or
> just see whether patch [1] has any negative effect on your OOM stress
> testing?
> 
> [1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz

I tested with both [1] and the below patch applied on linux-next-20170125 and
the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .

Regarding the below patch, it helped avoid complete memory depletion with
large write() requests. I don't know whether it also helps avoid complete
memory depletion when reading a large amount (in other words, I don't know
whether this check is done for large read() requests). But I believe that
__GFP_KILLABLE (despite the limitation that there are unkillable waits in
the reclaim path) is a better solution than scattering
fatal_signal_pending() checks around in the callers. The reason we check
for SIGKILL here is to avoid allocating more memory than needed. If we
check for SIGKILL at the entry point of __alloc_pages_nodemask() and at
the retry: label in __alloc_pages_slowpath(), we waste zero pages.
Regardless of whether the OOM killer is invoked, or whether memory can be
allocated without a direct reclaim operation, not allocating memory unless
it is needed (in other words, allowing the page allocator to fail
immediately when the caller can give up on SIGKILL and SIGKILL is pending)
makes sense. It will reduce the possibility of OOM livelock on
CONFIG_MMU=n kernels where the OOM reaper is not available.
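
To illustrate the idea (note that __GFP_KILLABLE does not exist in the
tree; the flag value and the helper below are invented purely for this
sketch):
----------
/* HYPOTHETICAL: __GFP_KILLABLE is not an upstream flag and the value
 * below is made up. The idea: a caller which can cope with allocation
 * failure opts in, and the allocator fails fast for a SIGKILLed task
 * instead of doing more reclaim/OOM work or dipping into reserves. */
#define __GFP_KILLABLE	((__force gfp_t)0x4000000u)	/* invented */

static inline bool gfp_killable_bail(gfp_t gfp_mask)
{
	return (gfp_mask & __GFP_KILLABLE) &&
	       fatal_signal_pending(current);
}

/* to be called at the entry point of __alloc_pages_nodemask() and at
 * the retry: label in __alloc_pages_slowpath():
 *
 *	if (gfp_killable_bail(gfp_mask))
 *		return NULL;
 */
----------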

> 
> On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> [...]
> > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> > 
> > Tetsuo has noticed that an OOM stress test which performs large write
> > requests can cause the full memory reserves depletion. He has tracked
> > this down to the following path
> > 	__alloc_pages_nodemask+0x436/0x4d0
> > 	alloc_pages_current+0x97/0x1b0
> > 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> > 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> > 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> > 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> > 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> > 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> > 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> > 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> > 	? iomap_write_end+0x80/0x80
> > 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> > 	? remove_wait_queue+0x59/0x60
> > 	xfs_file_write_iter+0x90/0x130 [xfs]
> > 	__vfs_write+0xe5/0x140
> > 	vfs_write+0xc7/0x1f0
> > 	? syscall_trace_enter+0x1d0/0x380
> > 	SyS_write+0x58/0xc0
> > 	do_syscall_64+0x6c/0x200
> > 	entry_SYSCALL64_slow_path+0x25/0x25
> > 
> > the oom victim has access to all memory reserves to make a forward
> > progress to exit easier. But iomap_file_buffered_write and other callers
> > of iomap_apply loop to complete the full request. We need to check for
> > fatal signals and back off with a short write instead. As the
> > iomap_apply delegates all the work down to the actor we have to hook
> > into those. All callers that work with the page cache are calling
> > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > has to handle the situation explicitly because it copies data to the
> > userspace directly. Other callers like iomap_page_mkwrite work on a
> > single page or iomap_fiemap_actor do not allocate memory based on the
> > given len.
> > 
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Cc: stable # 4.8+
> > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/dax.c   | 5 +++++
> >  fs/iomap.c | 3 +++
> >  2 files changed, 8 insertions(+)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 413a91db9351..0e263dacf9cf 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  		struct blk_dax_ctl dax = { 0 };
> >  		ssize_t map_len;
> >  
> > +		if (fatal_signal_pending(current)) {
> > +			ret = -EINTR;
> > +			break;
> > +		}
> > +
> >  		dax.sector = dax_iomap_sector(iomap, pos);
> >  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
> >  		map_len = dax_map_atomic(iomap->bdev, &dax);
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index e57b90b5ff37..691eada58b06 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> >  
> >  	BUG_ON(pos + len > iomap->offset + iomap->length);
> >  
> > +	if (fatal_signal_pending(current))
> > +		return -EINTR;
> > +
> >  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
> >  	if (!page)
> >  		return -ENOMEM;
> > -- 
> > 2.11.0

Regarding [1], it helped avoid the too_many_isolated() issue. I can't
tell whether it has any negative effect, but on the first trial I saw all
allocating threads blocked on wait_for_completion() from flush_work() in
drain_all_pages(), introduced by "mm, page_alloc: drain per-cpu pages from
workqueue context". There was no warn_alloc() stall warning message afterwards.

----------
[  540.039842] kworker/1:1: page allocation stalls for 10079ms, order:0, mode:0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null)
[  540.041961] kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null),  order=2, oom_score_adj=0
[  540.041970] kthreadd cpuset=/ mems_allowed=0
[  540.041984] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.10.0-rc5-next-20170125+ #495
[  540.041987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  540.041989] Call Trace:
[  540.042008]  dump_stack+0x85/0xc9
[  540.042016]  dump_header+0x9f/0x296
[  540.042028]  ? trace_hardirqs_on+0xd/0x10
[  540.042039]  oom_kill_process+0x219/0x400
[  540.042046]  out_of_memory+0x13d/0x580
[  540.042049]  ? out_of_memory+0x20d/0x580
[  540.042058]  __alloc_pages_slowpath+0x951/0xe02
[  540.042063]  ? deactivate_slab+0x1fb/0x690
[  540.042082]  __alloc_pages_nodemask+0x382/0x3d0
[  540.042091]  new_slab+0x450/0x6b0
[  540.042100]  ___slab_alloc+0x3a3/0x620
[  540.042109]  ? copy_process.part.31+0x122/0x2200
[  540.042116]  ? cpuacct_charge+0x38/0x1e0
[  540.042122]  ? copy_process.part.31+0x122/0x2200
[  540.042129]  __slab_alloc+0x46/0x7d
[  540.042135]  kmem_cache_alloc_node+0xab/0x3a0
[  540.042144]  copy_process.part.31+0x122/0x2200
[  540.042150]  ? cpuacct_charge+0xf3/0x1e0
[  540.042153]  ? cpuacct_charge+0x38/0x1e0
[  540.042164]  ? kthread_create_on_node+0x70/0x70
[  540.042168]  ? finish_task_switch+0x70/0x240
[  540.042175]  _do_fork+0xf3/0x750
[  540.042183]  ? kthreadd+0x2f2/0x3c0
[  540.042193]  kernel_thread+0x29/0x30
[  540.042196]  kthreadd+0x35a/0x3c0
[  540.042206]  ? ret_from_fork+0x31/0x40
[  540.042218]  ? kthread_create_on_cpu+0xb0/0xb0
[  540.042225]  ret_from_fork+0x31/0x40
[  540.042237] Mem-Info:
[  540.042248] active_anon:170208 inactive_anon:2096 isolated_anon:0
[  540.042248]  active_file:40034 inactive_file:40034 isolated_file:32
[  540.042248]  unevictable:0 dirty:78514 writeback:1568 unstable:0
[  540.042248]  slab_reclaimable:19763 slab_unreclaimable:47744
[  540.042248]  mapped:491 shmem:2162 pagetables:4842 bounce:0
[  540.042248]  free:12698 free_pcp:637 free_cma:0
[  540.042258] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160136kB inactive_file:160136kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:1964kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:561618 all_unreclaimable? yes
[  540.042260] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  540.042270] lowmem_reserve[]: 0 1443 1443 1443
[  540.042279] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160132kB inactive_file:160132kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2548kB local_pcp:728kB free_cma:0kB
[  540.042288] lowmem_reserve[]: 0 0 0 0
[  540.042296] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB
[  540.042330] Node 0 DMA32: 764*4kB (UME) 1122*8kB (UME) 536*16kB (UME) 210*32kB (UME) 107*64kB (UE) 41*128kB (EH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB
[  540.042363] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  540.042366] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  540.042368] 82262 total pagecache pages
[  540.042371] 0 pages in swap cache
[  540.042374] Swap cache stats: add 0, delete 0, find 0/0
[  540.042376] Free swap  = 0kB
[  540.042377] Total swap = 0kB
[  540.042380] 524157 pages RAM
[  540.042382] 0 pages HighMem/MovableOnly
[  540.042383] 150519 pages reserved
[  540.042384] 0 pages cma reserved
[  540.042386] 0 pages hwpoisoned
[  540.042390] Out of memory: Kill process 10688 (a.out) score 998 or sacrifice child
[  540.042401] Killed process 10688 (a.out) total-vm:14404kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  540.043111] oom_reaper: reaped process 10688 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  540.212629] kworker/1:1 cpuset=/ mems_allowed=0
[  540.214404] CPU: 1 PID: 51 Comm: kworker/1:1 Not tainted 4.10.0-rc5-next-20170125+ #495
[  540.216858] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  540.219901] Workqueue: events pcpu_balance_workfn
[  540.221740] Call Trace:
[  540.223154]  dump_stack+0x85/0xc9
[  540.224724]  warn_alloc+0x11e/0x1d0
[  540.226333]  __alloc_pages_slowpath+0x3d4/0xe02
[  540.228160]  __alloc_pages_nodemask+0x382/0x3d0
[  540.229970]  pcpu_populate_chunk+0xc2/0x440
[  540.231724]  pcpu_balance_workfn+0x615/0x670
[  540.233483]  ? process_one_work+0x194/0x760
[  540.235405]  process_one_work+0x22b/0x760
[  540.237133]  ? process_one_work+0x194/0x760
[  540.238943]  worker_thread+0x243/0x4b0
[  540.240588]  kthread+0x10f/0x150
[  540.242125]  ? process_one_work+0x760/0x760
[  540.243865]  ? kthread_create_on_node+0x70/0x70
[  540.245631]  ret_from_fork+0x31/0x40
[  540.247278] Mem-Info:
[  540.248572] active_anon:170208 inactive_anon:2096 isolated_anon:0
[  540.248572]  active_file:40163 inactive_file:40049 isolated_file:32
[  540.248572]  unevictable:0 dirty:78514 writeback:1568 unstable:0
[  540.248572]  slab_reclaimable:19763 slab_unreclaimable:47744
[  540.248572]  mapped:522 shmem:2162 pagetables:4842 bounce:0
[  540.248572]  free:12698 free_pcp:500 free_cma:0
[  540.259735] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160412kB inactive_file:160436kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2088kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:519289 all_unreclaimable? yes
[  540.267919] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  540.276033] lowmem_reserve[]: 0 1443 1443 1443
[  540.277629] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160408kB inactive_file:160432kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2000kB local_pcp:352kB free_cma:0kB
[  540.286732] lowmem_reserve[]: 0 0 0 0
[  540.288204] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB
[  540.292593] Node 0 DMA32: 738*4kB (ME) 1125*8kB (ME) 539*16kB (UME) 209*32kB (ME) 106*64kB (E) 42*128kB (UEH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB
[  540.297228] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  540.299825] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  540.302365] 82400 total pagecache pages
[  540.304010] 0 pages in swap cache
[  540.305535] Swap cache stats: add 0, delete 0, find 0/0
[  540.307302] Free swap  = 0kB
[  540.308600] Total swap = 0kB
[  540.309915] 524157 pages RAM
[  540.311187] 0 pages HighMem/MovableOnly
[  540.312613] 150519 pages reserved
[  540.314026] 0 pages cma reserved
[  540.315325] 0 pages hwpoisoned
[  540.317504] kworker/1:1 invoked oom-killer: gfp_mask=0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  540.320589] kworker/1:1 cpuset=/ mems_allowed=0
[  540.322213] CPU: 1 PID: 51 Comm: kworker/1:1 Not tainted 4.10.0-rc5-next-20170125+ #495
[  540.324410] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  540.327138] Workqueue: events pcpu_balance_workfn
[  540.328821] Call Trace:
[  540.330060]  dump_stack+0x85/0xc9
[  540.331449]  dump_header+0x9f/0x296
[  540.332925]  ? trace_hardirqs_on+0xd/0x10
[  540.334436]  oom_kill_process+0x219/0x400
[  540.335963]  out_of_memory+0x13d/0x580
[  540.337615]  ? out_of_memory+0x20d/0x580
[  540.339214]  __alloc_pages_slowpath+0x951/0xe02
[  540.340875]  __alloc_pages_nodemask+0x382/0x3d0
[  540.342544]  pcpu_populate_chunk+0xc2/0x440
[  540.344125]  pcpu_balance_workfn+0x615/0x670
[  540.345729]  ? process_one_work+0x194/0x760
[  540.347301]  process_one_work+0x22b/0x760
[  540.349042]  ? process_one_work+0x194/0x760
[  540.350616]  worker_thread+0x243/0x4b0
[  540.352245]  kthread+0x10f/0x150
[  540.353613]  ? process_one_work+0x760/0x760
[  540.355152]  ? kthread_create_on_node+0x70/0x70
[  540.356709]  ret_from_fork+0x31/0x40
[  540.358083] Mem-Info:
[  540.359191] active_anon:170208 inactive_anon:2096 isolated_anon:0
[  540.359191]  active_file:40103 inactive_file:40109 isolated_file:32
[  540.359191]  unevictable:0 dirty:78514 writeback:1568 unstable:0
[  540.359191]  slab_reclaimable:19763 slab_unreclaimable:47744
[  540.359191]  mapped:522 shmem:2162 pagetables:4842 bounce:0
[  540.359191]  free:12698 free_pcp:500 free_cma:0
[  540.369461] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160412kB inactive_file:160436kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2088kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:519430 all_unreclaimable? yes
[  540.376876] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[  540.384224] lowmem_reserve[]: 0 1443 1443 1443
[  540.385668] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160408kB inactive_file:160432kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2000kB local_pcp:352kB free_cma:0kB
[  540.394066] lowmem_reserve[]: 0 0 0 0
[  540.395479] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB
[  540.399533] Node 0 DMA32: 738*4kB (ME) 1125*8kB (ME) 539*16kB (UME) 209*32kB (ME) 106*64kB (E) 42*128kB (UEH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB
[  540.403793] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[  540.406130] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[  540.408490] 82400 total pagecache pages
[  540.409942] 0 pages in swap cache
[  540.411320] Swap cache stats: add 0, delete 0, find 0/0
[  540.412992] Free swap  = 0kB
[  540.414260] Total swap = 0kB
[  540.415633] 524157 pages RAM
[  540.416877] 0 pages HighMem/MovableOnly
[  540.418307] 150519 pages reserved
[  540.419695] 0 pages cma reserved
[  540.421020] 0 pages hwpoisoned
[  540.422293] Out of memory: Kill process 10689 (a.out) score 998 or sacrifice child
[  540.424450] Killed process 10689 (a.out) total-vm:14404kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  540.430407] oom_reaper: reaped process 10689 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  575.747685] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 242s!
[  575.757497] Showing busy workqueues and worker pools:
[  575.765110] workqueue events: flags=0x0
[  575.772069]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=26/256
[  575.780544]     pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801)
[  575.827418] workqueue writeback: flags=0x4e
[  575.829234]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  575.831299]     in-flight: 425:wb_workfn wb_workfn
[  575.834155] workqueue xfs-eofblocks/sda1: flags=0xc
[  575.836083]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.838318]     in-flight: 123:xfs_eofblocks_worker [xfs]
[  575.840396] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=242s workers=2 manager: 80
[  575.843446] pool 256: cpus=0-127 flags=0x4 nice=0 hung=35s workers=3 idle: 424 423
[  605.951087] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 272s!
[  605.961096] Showing busy workqueues and worker pools:
[  605.968703] workqueue events: flags=0x0
[  605.975212]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=27/256
[  605.982787]     pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801)
[  606.010284] , drain_local_pages_wq BAR(47)
[  606.012955] workqueue writeback: flags=0x4e
[  606.014860]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  606.016732]     in-flight: 425:wb_workfn wb_workfn
[  606.019085] workqueue mpt_poll_0: flags=0x8
[  606.020678]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  606.022521]     pending: mpt_fault_reset_work [mptbase]
[  606.024445] workqueue xfs-eofblocks/sda1: flags=0xc
[  606.026148]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  606.027992]     in-flight: 123:xfs_eofblocks_worker [xfs]
[  606.029904] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=272s workers=2 manager: 80
[  606.032120] pool 256: cpus=0-127 flags=0x4 nice=0 hung=65s workers=3 idle: 424 423
(...snipped...)
[  908.869406] sysrq: SysRq : Show State
[  908.875534]   task                        PC stack   pid father
[  908.883117] systemd         D11784     1      0 0x00000000
[  908.890352] Call Trace:
[  908.893121]  __schedule+0x345/0xdd0
[  908.895830]  ? __list_lru_count_one.isra.2+0x22/0x80
[  908.899036]  schedule+0x3d/0x90
[  908.901616]  schedule_timeout+0x287/0x540
[  908.904485]  ? wait_for_completion+0x4c/0x190
[  908.907488]  wait_for_completion+0x12c/0x190
[  908.910423]  ? wake_up_q+0x80/0x80
[  908.913060]  flush_work+0x230/0x310
[  908.915699]  ? flush_work+0x2b4/0x310
[  908.918382]  ? work_busy+0xb0/0xb0
[  908.920976]  drain_all_pages.part.88+0x319/0x390
[  908.923312]  ? drain_local_pages+0x30/0x30
[  908.924833]  __alloc_pages_slowpath+0x4dc/0xe02
[  908.926380]  ? alloc_pages_current+0x193/0x1b0
[  908.927887]  __alloc_pages_nodemask+0x382/0x3d0
[  908.929406]  ? __radix_tree_lookup+0x84/0xf0
[  908.930879]  alloc_pages_current+0x97/0x1b0
[  908.932333]  ? find_get_entry+0x5/0x300
[  908.933683]  __page_cache_alloc+0x15d/0x1a0
[  908.935069]  ? pagecache_get_page+0x2c/0x2b0
[  908.936447]  filemap_fault+0x4df/0x8b0
[  908.937728]  ? filemap_fault+0x373/0x8b0
[  908.939078]  ? xfs_ilock+0x22c/0x360 [xfs]
[  908.940393]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[  908.941775]  ? down_read_nested+0x7b/0xc0
[  908.943046]  ? xfs_ilock+0x22c/0x360 [xfs]
[  908.944290]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[  908.945587]  __do_fault+0x1e/0xa0
[  908.946647]  ? _raw_spin_unlock+0x27/0x40
[  908.947823]  handle_mm_fault+0xd75/0x10d0
[  908.948954]  ? handle_mm_fault+0x5e/0x10d0
[  908.950079]  __do_page_fault+0x24a/0x530
[  908.951158]  do_page_fault+0x30/0x80
[  908.952199]  page_fault+0x28/0x30
(...snipped...)
[  909.537512] kswapd0         D11112    68      2 0x00000000
[  909.538860] Call Trace:
[  909.539675]  __schedule+0x345/0xdd0
[  909.540670]  schedule+0x3d/0x90
[  909.541619]  rwsem_down_read_failed+0x10e/0x1a0
[  909.542827]  ? xfs_map_blocks+0x98/0x5a0 [xfs]
[  909.543992]  call_rwsem_down_read_failed+0x18/0x30
[  909.545218]  down_read_nested+0xaf/0xc0
[  909.546316]  ? xfs_ilock+0x154/0x360 [xfs]
[  909.547519]  xfs_ilock+0x154/0x360 [xfs]
[  909.548608]  xfs_map_blocks+0x98/0x5a0 [xfs]
[  909.549754]  xfs_do_writepage+0x215/0x920 [xfs]
[  909.550954]  ? clear_page_dirty_for_io+0xb4/0x310
[  909.552188]  xfs_vm_writepage+0x3b/0x70 [xfs]
[  909.553340]  pageout.isra.54+0x1a4/0x460
[  909.554428]  shrink_page_list+0xa86/0xcf0
[  909.555529]  shrink_inactive_list+0x1d3/0x680
[  909.556680]  ? shrink_active_list+0x44f/0x590
[  909.557829]  shrink_node_memcg+0x535/0x7f0
[  909.558952]  ? mem_cgroup_iter+0x14d/0x720
[  909.560050]  shrink_node+0xe1/0x310
[  909.561043]  kswapd+0x362/0x9b0
[  909.561976]  kthread+0x10f/0x150
[  909.562974]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[  909.564199]  ? kthread_create_on_node+0x70/0x70
[  909.565375]  ret_from_fork+0x31/0x40
(...snipped...)
[  998.658049] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 665s!
[  998.667526] Showing busy workqueues and worker pools:
[  998.673851] workqueue events: flags=0x0
[  998.676147]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=28/256
[  998.678935]     pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801)
[  998.705187] , drain_local_pages_wq BAR(47), drain_local_pages_wq BAR(10805)
[  998.707558]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  998.709548]     pending: e1000_watchdog [e1000], vmstat_shepherd
[  998.711593] workqueue events_power_efficient: flags=0x80
[  998.713479]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  998.715399]     pending: neigh_periodic_work
[  998.717075] workqueue writeback: flags=0x4e
[  998.718656]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  998.720587]     in-flight: 425:wb_workfn wb_workfn
[  998.723062] workqueue mpt_poll_0: flags=0x8
[  998.724712]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  998.726601]     pending: mpt_fault_reset_work [mptbase]
[  998.728548] workqueue xfs-eofblocks/sda1: flags=0xc
[  998.730292]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  998.732178]     in-flight: 123:xfs_eofblocks_worker [xfs]
[  998.733997] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=665s workers=2 manager: 80
[  998.736251] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=2 manager: 53 idle: 10804
[  998.738634] pool 256: cpus=0-127 flags=0x4 nice=0 hung=458s workers=3 idle: 424 423
----------

So, you believed that the too_many_isolated() issue was the only problem which
could prevent a reasonable return to the page allocator [2]. But the reality is
that we are about to introduce a new problem without knowing all the
possibilities which can prevent a reasonable return to the page allocator.

So, would you please please please accept the asynchronous watchdog [3]? I said
"the cause of allocation stall might be due to running out of idle workqueue
threads" in that post, and I think the lockup above is exactly such a case. We
cannot be careful enough to prove otherwise; we will forever risk failing to
warn as long as we depend only on a synchronous watchdog.

[2] http://lkml.kernel.org/r/201701141910.ACF73418.OJHFVFStQOOMFL@I-love.SAKURA.ne.jp
[3] http://lkml.kernel.org/r/201701261928.DIG05227.OtOVFHOJMFLSQF@I-love.SAKURA.ne.jp
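
(For illustration only: the asynchronous watchdog idea in [3] boils down to a
dedicated kthread which scans tasks from the outside and warns about stalled
allocations, so that detection does not depend on the stalling context being
able to run. Below is a minimal sketch, not the actual patch in [3]; the
memalloc_start field is hypothetical bookkeeping which would have to be set
when a task enters the allocator and cleared when it leaves.)

static int kmallocwd_sketch(void *unused)
{
	struct task_struct *g, *p;

	while (!kthread_should_stop()) {
		schedule_timeout_interruptible(10 * HZ);
		rcu_read_lock();
		for_each_process_thread(g, p) {
			/* hypothetical: jiffies when p entered the allocator */
			unsigned long start = READ_ONCE(p->memalloc_start);

			if (start && time_after(jiffies, start + 10 * HZ))
				pr_warn("%s/%d possibly stalled in allocation for %ums\n",
					p->comm, p->pid,
					jiffies_to_msecs(jiffies - start));
		}
		rcu_read_unlock();
	}
	return 0;
}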

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-28 15:27                                       ` Tetsuo Handa
@ 2017-01-30  8:55                                         ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-30  8:55 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > Tetsuo,
> > before we settle on the proper fix for this issue, could you give the
> > patch a try and try to reproduce the too_many_isolated() issue or
> > just see whether patch [1] has any negative effect on your oom stress
> > testing?
> > 
> > [1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz
> 
> I tested with both [1] and below patch applied on linux-next-20170125 and
> the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .
> 
> Regarding below patch, it helped avoiding complete memory depletion with
> large write() request. I don't know whether below patch helps avoiding
> complete memory depletion when reading large amount (in other words, I
> don't know whether this check is done for large read() request).

It's not AFAICS. do_generic_file_read doesn't do the
fatal_signal_pending check.

> But
> I believe that __GFP_KILLABLE (despite the limitation that there are
> unkillable waits in the reclaim path) is better solution compared to
> scattering around fatal_signal_pending() in the callers. The reason
> we check SIGKILL here is to avoid allocating memory more than needed.
> If we check SIGKILL in the entry point of __alloc_pages_nodemask() and
> retry: label in __alloc_pages_slowpath(), we waste 0 page. Regardless
> of whether the OOM killer is invoked, whether memory can be allocated
> without direct reclaim operation, not allocating memory unless needed
> (in other words, allow page allocator fail immediately if the caller
> can give up on SIGKILL and SIGKILL is pending) makes sense. It will
> reduce possibility of OOM livelock on CONFIG_MMU=n kernels where the
> OOM reaper is not available.

I am not really convinced this is a good idea. Putting aside the fuzzy
semantics of __GFP_KILLABLE, we would have to use this flag in all
potentially allocating places in the read/write paths, and then it is
just easier to do the explicit checks in the loops around those
allocations.
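
To illustrate the pattern I mean: the bail-out lives in the loop which
drives the per-page allocations, not inside the allocator. A minimal
sketch, with a made-up copy_one_page() helper standing in for the
write_begin/copy/write_end sequence (generic_perform_write and the patch
below follow this shape):

static ssize_t write_loop_sketch(struct inode *inode, struct iov_iter *i)
{
	ssize_t written = 0;

	while (iov_iter_count(i)) {
		ssize_t copied;

		/* back off with a short write instead of depleting reserves */
		if (fatal_signal_pending(current)) {
			if (!written)
				written = -EINTR;
			break;
		}

		/* each iteration may allocate a pagecache page */
		copied = copy_one_page(inode, i);	/* hypothetical helper */
		if (copied <= 0) {
			if (!written)
				written = copied;
			break;
		}
		written += copied;
	}
	return written;
}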
 
> > On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> > [...]
> > > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> > > 
> > > Tetsuo has noticed that an OOM stress test which performs large write
> > > requests can cause the full memory reserves depletion. He has tracked
> > > this down to the following path
> > > 	__alloc_pages_nodemask+0x436/0x4d0
> > > 	alloc_pages_current+0x97/0x1b0
> > > 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> > > 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> > > 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> > > 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> > > 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> > > 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> > > 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> > > 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> > > 	? iomap_write_end+0x80/0x80
> > > 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> > > 	? remove_wait_queue+0x59/0x60
> > > 	xfs_file_write_iter+0x90/0x130 [xfs]
> > > 	__vfs_write+0xe5/0x140
> > > 	vfs_write+0xc7/0x1f0
> > > 	? syscall_trace_enter+0x1d0/0x380
> > > 	SyS_write+0x58/0xc0
> > > 	do_syscall_64+0x6c/0x200
> > > 	entry_SYSCALL64_slow_path+0x25/0x25
> > > 
> > > the oom victim has access to all memory reserves to make a forward
> > > progress to exit easier. But iomap_file_buffered_write and other callers
> > > of iomap_apply loop to complete the full request. We need to check for
> > > fatal signals and back off with a short write instead. As the
> > > iomap_apply delegates all the work down to the actor we have to hook
> > > into those. All callers that work with the page cache are calling
> > > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > > has to handle the situation explicitly because it copies data to the
> > > userspace directly. Other callers like iomap_page_mkwrite work on a
> > > single page or iomap_fiemap_actor do not allocate memory based on the
> > > given len.
> > > 
> > > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > > Cc: stable # 4.8+
> > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  fs/dax.c   | 5 +++++
> > >  fs/iomap.c | 3 +++
> > >  2 files changed, 8 insertions(+)
> > > 
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 413a91db9351..0e263dacf9cf 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> > >  		struct blk_dax_ctl dax = { 0 };
> > >  		ssize_t map_len;
> > >  
> > > +		if (fatal_signal_pending(current)) {
> > > +			ret = -EINTR;
> > > +			break;
> > > +		}
> > > +
> > >  		dax.sector = dax_iomap_sector(iomap, pos);
> > >  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
> > >  		map_len = dax_map_atomic(iomap->bdev, &dax);
> > > diff --git a/fs/iomap.c b/fs/iomap.c
> > > index e57b90b5ff37..691eada58b06 100644
> > > --- a/fs/iomap.c
> > > +++ b/fs/iomap.c
> > > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> > >  
> > >  	BUG_ON(pos + len > iomap->offset + iomap->length);
> > >  
> > > +	if (fatal_signal_pending(current))
> > > +		return -EINTR;
> > > +
> > >  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
> > >  	if (!page)
> > >  		return -ENOMEM;
> > > -- 
> > > 2.11.0
> 
> Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> tell whether it has any negative effect, but I got on the first trial that
> all allocating threads are blocked on wait_for_completion() from flush_work()
> in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> workqueue context". There was no warn_alloc() stall warning message afterwords.

That patch is buggy and there is a follow-up [1] which is not sitting in
mmotm (and thus linux-next) yet. I didn't get to review it properly and
I cannot say I would be too happy about using a WQ from the page
allocator. I believe even the follow-up needs a WQ_MEM_RECLAIM WQ.

[1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
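
For completeness, any WQ used from the allocator would at least need a
rescuer thread so that queued drain work can run even when memory
pressure prevents forking new kworkers; that is what WQ_MEM_RECLAIM
provides. A minimal sketch (the queue name and init placement are
illustrative only):

static struct workqueue_struct *mm_percpu_wq;

static int __init init_mm_percpu_wq(void)
{
	/*
	 * WQ_MEM_RECLAIM guarantees a rescuer thread, so queued
	 * drain_local_pages_wq items can still make progress when no
	 * new kworker can be created under memory pressure.
	 */
	mm_percpu_wq = alloc_workqueue("mm_percpu_wq", WQ_MEM_RECLAIM, 0);
	return mm_percpu_wq ? 0 : -ENOMEM;
}
early_initcall(init_mm_percpu_wq);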

Thanks for your testing!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 13:00                                   ` Michal Hocko
@ 2017-01-31 11:58                                     ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-31 11:58 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > > I think we are missing a check for fatal_signal_pending in
> > > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > > full memory reserves. What do you think about the following? I haven't
> > > > > tested this but it mimics generic_perform_write so I guess it should
> > > > > work.
> > > > 
> > > > Hi Michal,
> > > > 
> > > > this looks reasonable to me.  But we have a few more such loops,
> > > > maybe it makes sense to move the check into iomap_apply?
> > > 
> > > I wasn't sure about the expected semantics of iomap_apply, but now that
> > > I've actually checked all the callers I believe all of them should be
> > > able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> > > iomap_fiemap and iomap_page_mkwrite seem not to follow the standard
> > > pattern of returning the number of written pages or an error but rather
> > > propagate the error out. From my limited understanding of those code
> > > paths that should just be OK. I was not all that sure about iomap_dio_rw,
> > > which is just too convoluted for me. If that one is OK as well then
> > > the following patch should indeed be better.
> > 
> > Is "length" in
> > 
> >    written = actor(inode, pos, length, data, &iomap);
> > 
> > call guaranteed to be small enough? If not guaranteed,
> > don't we need to check SIGKILL inside "actor" functions?
> 
> You are right! Checking for signals inside iomap_apply doesn't really
> solve anything because basically all users do iov_iter_count(). Blee. So
> we have loops around iomap_apply, which itself loops inside the actor.
> iomap_write_begin seems to be used by most of them, and it is also where
> we get the pagecache page, so I guess this should be the "right" place to
> put the check. Things like dax_iomap_actor will need an explicit check.
> This is quite unfortunate but I do not see any better solution.
> What do you think, Christoph?
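
To restate the nesting from the quote above, here is a condensed sketch of
the 4.10-era call structure (paraphrased from fs/iomap.c, not the literal
code):

	/* outer loop: one iomap_apply() call per extent mapping */
	ssize_t iomap_file_buffered_write(struct kiocb *iocb,
			struct iov_iter *iter, struct iomap_ops *ops)
	{
		while (iov_iter_count(iter)) {
			written = iomap_apply(inode, pos, iov_iter_count(iter),
					IOMAP_WRITE, ops, iter,
					iomap_write_actor);
			/* advance pos, accumulate written, handle errors */
		}
	}

	/* inner loop: one pagecache page at a time, up to "length" bytes */
	static loff_t iomap_write_actor(struct inode *inode, loff_t pos,
			loff_t length, void *data, struct iomap *iomap)
	{
		struct iov_iter *i = data;

		do {
			/* the fatal_signal_pending() check goes in here */
			status = iomap_write_begin(...);
			/* copy a page worth of data, then iomap_write_end() */
		} while (iov_iter_count(i) && length);
	}

So a single iomap_apply() call can be asked to write everything that is
left in the iov_iter, which is why the back-off has to live inside the
per-page loop rather than around iomap_apply.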

What do you think, Christoph? I have an additional patch to handle
do_generic_file_read and a similar one to back off in
__vmalloc_area_node. I would like to post them all as one series, but
first I would like to know that this one is OK.

Thanks!

> ---
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> 
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can fully deplete the memory reserves. He has tracked
> this down to the following path:
> 	__alloc_pages_nodemask+0x436/0x4d0
> 	alloc_pages_current+0x97/0x1b0
> 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> 	? iomap_write_end+0x80/0x80
> 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> 	? remove_wait_queue+0x59/0x60
> 	xfs_file_write_iter+0x90/0x130 [xfs]
> 	__vfs_write+0xe5/0x140
> 	vfs_write+0xc7/0x1f0
> 	? syscall_trace_enter+0x1d0/0x380
> 	SyS_write+0x58/0xc0
> 	do_syscall_64+0x6c/0x200
> 	entry_SYSCALL64_slow_path+0x25/0x25
> 
> The OOM victim has access to all memory reserves so that it can make
> forward progress and exit more easily. But iomap_file_buffered_write and
> other callers of iomap_apply loop to complete the full request. We need
> to check for fatal signals and back off with a short write instead. As
> iomap_apply delegates all the work down to the actors, we have to hook
> into those. All callers that work with the page cache call
> iomap_write_begin, so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to
> userspace directly. The remaining callers are fine: iomap_page_mkwrite
> works on a single page and iomap_fiemap_actor does not allocate memory
> based on the given len.
> 
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
>  		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	BUG_ON(pos + len > iomap->offset + iomap->length);
>  
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
>  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> -- 
> 2.11.0
> 
> 
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-31 11:58                                     ` Michal Hocko
@ 2017-01-31 12:51                                       ` Christoph Hellwig
  -1 siblings, 0 replies; 110+ messages in thread
From: Christoph Hellwig @ 2017-01-31 12:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Hellwig, Tetsuo Handa, mgorman, viro, linux-mm, hannes,
	linux-kernel

On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote:
> What do you think, Christoph? I have an additional patch to handle
> do_generic_file_read and a similar one to back off in
> __vmalloc_area_node. I would like to post them all as one series, but
> first I would like to know that this one is OK.

Well, that patch you posted is okay, but you probably need additional
ones for the other interesting users of iomap_apply.
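
For reference, the iomap_apply users around 4.10 (listed from memory, so
the actor names may be slightly off; the entry points all appear earlier
in this thread):

	fs/iomap.c: iomap_file_buffered_write() -> iomap_write_actor()
	fs/iomap.c: iomap_file_dirty()          -> iomap_dirty_actor()
	fs/iomap.c: iomap_zero_range()          -> iomap_zero_range_actor()
	fs/iomap.c: iomap_page_mkwrite()        -> iomap_page_mkwrite_actor()
	fs/iomap.c: iomap_fiemap()              -> iomap_fiemap_actor()
	fs/iomap.c: iomap_dio_rw()              -> iomap_dio_actor()
	fs/dax.c:   dax_iomap_rw()              -> dax_iomap_actor()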

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-31 12:51                                       ` Christoph Hellwig
@ 2017-01-31 13:21                                         ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-31 13:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel

On Tue 31-01-17 13:51:40, Christoph Hellwig wrote:
> On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote:
> > What do you think, Christoph? I have an additional patch to handle
> > do_generic_file_read and a similar one to back off in
> > __vmalloc_area_node. I would like to post them all as one series, but
> > first I would like to know that this one is OK.
> 
> Well, that patch you posted is okay, but you probably need additional
> ones for the other interesting users of iomap_apply.

I have checked all of them, I guess/hope. Which ones do you have in mind?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-30  8:55                                         ` Michal Hocko
@ 2017-02-02 10:14                                           ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-02 10:14 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
[...]
> > Regarding [1], it helped avoid the too_many_isolated() issue. I can't
> > tell whether it has any negative effect, but on the first trial I saw that
> > all allocating threads were blocked on wait_for_completion() from flush_work()
> > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > workqueue context". There was no warn_alloc() stall warning message afterwards.
> 
> That patch is buggy and there is a follow up [1] which is not sitting in
> mmotm (and thus linux-next) yet. I haven't had a chance to review it
> properly, and I cannot say I would be too happy about using a WQ from the
> page allocator. I believe even the follow up needs to use a WQ_RECLAIM WQ.
> 
> [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net

Did you get a chance to test with this follow-up patch? It would be
interesting to see whether the OOM situation can still starve the waiter.
The current linux-next should contain this patch.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-02 10:14                                           ` Michal Hocko
@ 2017-02-03 10:57                                             ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-03 10:57 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> [...]
> > > Regarding [1], it helped avoid the too_many_isolated() issue. I can't
> > > tell whether it has any negative effect, but on the first trial I saw that
> > > all allocating threads were blocked on wait_for_completion() from flush_work()
> > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > > workqueue context". There was no warn_alloc() stall warning message afterwards.
> > 
> > That patch is buggy and there is a follow up [1] which is not sitting in
> > mmotm (and thus linux-next) yet. I haven't had a chance to review it
> > properly, and I cannot say I would be too happy about using a WQ from the
> > page allocator. I believe even the follow up needs to use a WQ_RECLAIM WQ.
> > 
> > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
> 
> Did you get a chance to test with this follow-up patch? It would be
> interesting to see whether the OOM situation can still starve the waiter.
> The current linux-next should contain this patch.

So far I can't reproduce any problems except the two listed below (the
cond_resched() trap in printk() and the IDLE priority trap are excluded
from the list). But I agree that the follow-up patch needs to use a
WQ_RECLAIM WQ. It is theoretically possible that an allocation request
which can trigger the OOM killer waits for the system_wq while there is
already a work item on the system_wq which loops forever inside the page
allocator without ever triggering the OOM killer.
Maybe the follow-up patch can share the vmstat WQ?
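
To spell that dependency out (a hypothetical illustration; the work
function below is made up, only the workqueue API and the vmstat WQ
flags are real):

	#include <linux/slab.h>
	#include <linux/workqueue.h>

	/* occupies a system_wq kworker and never completes */
	static void stuck_work_fn(struct work_struct *work)
	{
		/*
		 * GFP_NOFS | __GFP_NOFAIL retries forever but never
		 * invokes the OOM killer: the looping allocation
		 * described above.
		 */
		kfree(kmalloc(PAGE_SIZE, GFP_NOFS | __GFP_NOFAIL));
	}

	/*
	 * Meanwhile an allocating task reaches drain_all_pages() and its
	 * flush_work() waits for the drain work, which cannot start: the
	 * pool's worker is occupied by stuck_work_fn() and forking a new
	 * kworker needs memory.  A rescuer-backed queue, like the one
	 * vmstat already allocates, guarantees one worker for the drain:
	 *
	 *	vmstat_wq = alloc_workqueue("vmstat",
	 *			WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);
	 */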

(1) I got an assertion failure.

[  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
[  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
[  972.125085] ------------[ cut here ]------------
[  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
[  972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
[  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
[  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  972.178381] Call Trace:
[  972.180003]  dump_stack+0x85/0xc9
[  972.181682]  __warn+0xd1/0xf0
[  972.183374]  warn_slowpath_null+0x1d/0x20
[  972.185223]  asswarn+0x33/0x40 [xfs]
[  972.186950]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[  972.189055]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[  972.191263]  ? xfs_ilock+0x1c9/0x360 [xfs]
[  972.193414]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[  972.195300]  ? iomap_write_end+0x80/0x80
[  972.196980]  iomap_apply+0x6c/0x130
[  972.198539]  iomap_file_buffered_write+0x68/0xa0
[  972.200316]  ? iomap_write_end+0x80/0x80
[  972.201950]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[  972.203868]  ? _raw_spin_unlock+0x27/0x40
[  972.205470]  xfs_file_write_iter+0x90/0x130 [xfs]
[  972.207167]  __vfs_write+0xe5/0x140
[  972.208752]  vfs_write+0xc7/0x1f0
[  972.210233]  ? syscall_trace_enter+0x1d0/0x380
[  972.211809]  SyS_write+0x58/0xc0
[  972.213166]  do_int80_syscall_32+0x6c/0x1f0
[  972.214676]  entry_INT80_compat+0x38/0x50
[  972.216168] RIP: 0023:0x8048076
[  972.217494] RSP: 002b:00000000ff997020 EFLAGS: 00000202 ORIG_RAX: 0000000000000004
[  972.219635] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000
[  972.221679] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
[  972.223774] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  972.225905] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  972.227946] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  972.230064] ---[ end trace d498098daec56c11 ]---
[  984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  984.224191] vmtoolsd cpuset=/ mems_allowed=0
[  984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: G        W       4.10.0-rc6-next-20170202 #498

(2) I got a lockdep warning. (A new false positive?)

[  243.036975] =====================================================
[  243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
[  243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted
[  243.054619] -----------------------------------------------------
[  243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
[  243.060310]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80
[  243.063462] 
[  243.063462] and this task is already holding:
[  243.066851]  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.069949] which would create a new lock dependency:
[  243.072143]  (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++}
[  243.074789] 
[  243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock:
[  243.078735]  (&xfs_dir_ilock_class){++++-.}
[  243.078739] 
[  243.078739] ... which became RECLAIM_FS-irq-safe at:
[  243.084175]   
[  243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
[  243.087257]   
[  243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.090027]   
[  243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.092838]   
[  243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.095453]   
[  243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
[  243.098083]   
[  243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
[  243.100668]   
[  243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
[  243.103191]   
[  243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
[  243.105710]   
[  243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
[  243.107947]   
[  243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
[  243.110133]   
[  243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320
[  243.112262]   
[  243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10
[  243.114323]   
[  243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140
[  243.116448]   
[  243.116452] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.118692] 
[  243.118692] to a RECLAIM_FS-irq-unsafe lock:
[  243.120636]  (cpu_hotplug.dep_map){++++++}
[  243.120638] 
[  243.120638] ... which became RECLAIM_FS-irq-unsafe at:
[  243.124021] ...
[  243.124022]   
[  243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[  243.127033]   
[  243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[  243.129228]   
[  243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[  243.131534]   
[  243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
[  243.133850]   
[  243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
[  243.136113]   
[  243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[  243.138319]   
[  243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
[  243.140479]   
[  243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
[  243.142484]   
[  243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.144716]   
[  243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.146684]   
[  243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.148755]   
[  243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.150932]   
[  243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.153088]   
[  243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.155135] 
[  243.155135] other info that might help us debug this:
[  243.155135] 
[  243.157724]  Possible interrupt unsafe locking scenario:
[  243.157724] 
[  243.159877]        CPU0                    CPU1
[  243.161047]        ----                    ----
[  243.162210]   lock(cpu_hotplug.dep_map);
[  243.163279]                                local_irq_disable();
[  243.164669]                                lock(&xfs_dir_ilock_class);
[  243.166148]                                lock(cpu_hotplug.dep_map);
[  243.167653]   <Interrupt>
[  243.168594]     lock(&xfs_dir_ilock_class);
[  243.169694] 
[  243.169694]  *** DEADLOCK ***
[  243.169694] 
[  243.171864] 3 locks held by awk/8767:
[  243.172872]  #0:  (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90
[  243.174791]  #1:  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.176899]  #2:  (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] drain_all_pages.part.80+0x1a/0x320
[  243.178875] 
[  243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock:
[  243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 {
[  243.182610]    HARDIRQ-ON-W at:
[  243.183603]                     
[  243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0
[  243.186056]                     
[  243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.188419]                     
[  243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.190909]                     
[  243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.193257]                     
[  243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.195795]                     
[  243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.198204]                     
[  243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.200570]                     
[  243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.203086]                     
[  243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.205417]                     
[  243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.207711]                     
[  243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.210092]                     
[  243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180
[  243.212427]                     
[  243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40
[  243.214664]                     
[  243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0
[  243.217045]                     
[  243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0
[  243.219501]                     
[  243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00
[  243.222056]                     
[  243.222058] [<ffffffff81266767>] do_execve+0x27/0x30
[  243.224471]                     
[  243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30
[  243.226787]                     
[  243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.229178]                     
[  243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.231695]    HARDIRQ-ON-R at:
[  243.232709]                     
[  243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0
[  243.235161]                     
[  243.235164] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.237547]                     
[  243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
[  243.239930]                     
[  243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.242353]                     
[  243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.244978]                     
[  243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.247493]                     
[  243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.249910]                     
[  243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.252407]                     
[  243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
[  243.254747]                     
[  243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
[  243.257126]                     
[  243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
[  243.259495]                     
[  243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90
[  243.261804]                     
[  243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.264184]                     
[  243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.266595]                     
[  243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.268984]                     
[  243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  243.271702]    SOFTIRQ-ON-W at:
[  243.272726]                     
[  243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.275109]                     
[  243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.277426]                     
[  243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.279790]                     
[  243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.282192]                     
[  243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.284794]                     
[  243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.287259]                     
[  243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.289735]                     
[  243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.292205]                     
[  243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.294555]                     
[  243.294558] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.296897]                     
[  243.296900] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.299242]                     
[  243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180
[  243.301754]                     
[  243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40
[  243.304037]                     
[  243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0
[  243.306531]                     
[  243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0
[  243.308976]                     
[  243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00
[  243.311506]                     
[  243.311508] [<ffffffff81266767>] do_execve+0x27/0x30
[  243.313777]                     
[  243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30
[  243.316067]                     
[  243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.318429]                     
[  243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.320884]    SOFTIRQ-ON-R at:
[  243.321860]                     
[  243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.324251]                     
[  243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.326601]                     
[  243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
[  243.328966]                     
[  243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.331384]                     
[  243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.333978]                     
[  243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.336492]                     
[  243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.338926]                     
[  243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.341365]                     
[  243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
[  243.343694]                     
[  243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
[  243.346074]                     
[  243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
[  243.348443]                     
[  243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90
[  243.350753]                     
[  243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.353240]                     
[  243.353244] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.355581]                     
[  243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.358015]                     
[  243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  243.360586]    IN-RECLAIM_FS-W at:
[  243.361628]                        
[  243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
[  243.364273]                        
[  243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.366710]                        
[  243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.369153]                        
[  243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.371597]                        
[  243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
[  243.374339]                        
[  243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
[  243.377009]                        
[  243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
[  243.379659]                        
[  243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
[  243.382349]                        
[  243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
[  243.384907]                        
[  243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
[  243.387690]                        
[  243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320
[  243.390148]                        
[  243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10
[  243.392517]                        
[  243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140
[  243.394851]                        
[  243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.397246]    INITIAL USE at:
[  243.398227]                    
[  243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0
[  243.400646]                    
[  243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.402997]                    
[  243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
[  243.405351]                    
[  243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.407778]                    
[  243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.410364]                    
[  243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.412989]                    
[  243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.415416]                    
[  243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.417871]                    
[  243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
[  243.420641]                    
[  243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
[  243.423039]                    
[  243.423041] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
[  243.425553]                    
[  243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90
[  243.427891]                    
[  243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.430249]                    
[  243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.432586]                    
[  243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.434839]                    
[  243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  243.437343]  }
[  243.438115]  ... key      at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs]
[  243.440082]  ... acquired at:
[  243.441047]    
[  243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0
[  243.443169]    
[  243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0
[  243.445366]    
[  243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.447471]    
[  243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.449601]    
[  243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320
[  243.452123]    
[  243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20
[  243.454264]    
[  243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630
[  243.456596]    
[  243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630
[  243.458774]    
[  243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290
[  243.460952]    
[  243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240
[  243.463199]    
[  243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0
[  243.465482]    
[  243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs]
[  243.467754]    
[  243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs]
[  243.470083]    
[  243.470101] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[  243.472427]    
[  243.472445] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[  243.474705]    
[  243.474726] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.476933]    
[  243.476954] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.479178]    
[  243.479180] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.481350]    
[  243.481352] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.483907]    
[  243.483910] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.486070]    
[  243.486073] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.488334]    
[  243.488338] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.490476]    
[  243.490480] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.492619]    
[  243.492623] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.494864] 
[  243.495618] 
[  243.495618] the dependencies between the lock to be acquired
[  243.495619]  and RECLAIM_FS-irq-unsafe lock:
[  243.498973] -> (cpu_hotplug.dep_map){++++++} ops: 838 {
[  243.500297]    HARDIRQ-ON-W at:
[  243.501292]                     
[  243.501295] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0
[  243.503718]                     
[  243.503719] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.506059]                     
[  243.506061] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0
[  243.508471]                     
[  243.508473] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0
[  243.510708]                     
[  243.510709] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.512997]                     
[  243.512999] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.515556]                     
[  243.515561] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.517807]                     
[  243.517810] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.520271]                     
[  243.520275] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.522538]                     
[  243.522540] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.524833]    HARDIRQ-ON-R at:
[  243.525801]                     
[  243.525803] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0
[  243.528152]                     
[  243.528153] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.530416]                     
[  243.530419] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.532696]                     
[  243.532698] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0
[  243.535039]                     
[  243.535041] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5
[  243.537451]                     
[  243.537453] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2
[  243.539744]                     
[  243.539746] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
[  243.542186]                     
[  243.542188] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
[  243.544603]                     
[  243.544605] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  243.547245]    SOFTIRQ-ON-W at:
[  243.548241]                     
[  243.548243] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.550559]                     
[  243.550561] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.552841]                     
[  243.552842] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0
[  243.555186]                     
[  243.555187] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0
[  243.557404]                     
[  243.557405] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.559654]                     
[  243.559656] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.561824]                     
[  243.561827] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.564048]                     
[  243.564050] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.566455]                     
[  243.566457] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.568731]                     
[  243.568733] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.571014]    SOFTIRQ-ON-R at:
[  243.571975]                     
[  243.571976] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.574328]                     
[  243.574330] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.576610]                     
[  243.576612] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.579161]                     
[  243.579165] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0
[  243.581537]                     
[  243.581539] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5
[  243.583982]                     
[  243.583984] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2
[  243.586304]                     
[  243.586306] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
[  243.588819]                     
[  243.588821] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
[  243.591227]                     
[  243.591229] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  243.593507]    RECLAIM_FS-ON-W at:
[  243.594519]                        
[  243.594520] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[  243.596888]                        
[  243.596895] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[  243.599331]                        
[  243.599334] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[  243.601872]                        
[  243.601874] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
[  243.604460]                        
[  243.604461] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
[  243.606950]                        
[  243.606952] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[  243.609463]                        
[  243.609465] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
[  243.612282]                        
[  243.612285] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
[  243.614604]                        
[  243.614606] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.616929]                        
[  243.616930] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.619208]                        
[  243.619211] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.621518]                        
[  243.621520] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.624018]                        
[  243.624020] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.626374]                        
[  243.626376] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.628771]    RECLAIM_FS-ON-R at:
[  243.629802]                        
[  243.629803] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[  243.632201]                        
[  243.632203] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[  243.634692]                        
[  243.634695] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[  243.637277]                        
[  243.637279] [<ffffffff8100cbb4>] allocate_shared_regs+0x24/0x70
[  243.639777]                        
[  243.639779] [<ffffffff8100cc32>] intel_pmu_cpu_prepare+0x32/0x140
[  243.643062]                        
[  243.643066] [<ffffffff810053db>] x86_pmu_prepare_cpu+0x3b/0x40
[  243.645553]                        
[  243.645556] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[  243.648095]                        
[  243.648097] [<ffffffff8108f29c>] cpuhp_issue_call+0xec/0x160
[  243.650536]                        
[  243.650539] [<ffffffff8108f6bb>] __cpuhp_setup_state+0x13b/0x1a0
[  243.653126]                        
[  243.653130] [<ffffffff81f427e9>] init_hw_perf_events+0x402/0x5b6
[  243.655652]                        
[  243.655655] [<ffffffff8100217c>] do_one_initcall+0x4c/0x1b0
[  243.658127]                        
[  243.658130] [<ffffffff81f3f333>] kernel_init_freeable+0x155/0x2a7
[  243.660653]                        
[  243.660656] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.663048]                        
[  243.663050] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.665436]    INITIAL USE at:
[  243.666403]                    
[  243.666405] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0
[  243.668790]                    
[  243.668791] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.671093]                    
[  243.671095] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.673455]                    
[  243.673458] [<ffffffff8108f5be>] __cpuhp_setup_state+0x3e/0x1a0
[  243.676126]                    
[  243.676130] [<ffffffff81f7660e>] page_alloc_init+0x23/0x3a
[  243.678510]                    
[  243.678512] [<ffffffff81f3eebe>] start_kernel+0x1a2/0x4c2
[  243.680851]                    
[  243.680853] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
[  243.683367]                    
[  243.683369] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
[  243.685812]                    
[  243.685815] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  243.688133]  }
[  243.688907]  ... key      at: [<ffffffff81c56848>] cpu_hotplug+0x108/0x140
[  243.690542]  ... acquired at:
[  243.691514]    
[  243.691517] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0
[  243.693655]    
[  243.693656] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0
[  243.695820]    
[  243.695822] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.697926]    
[  243.697929] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.700042]    
[  243.700044] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320
[  243.702285]    
[  243.702286] [<ffffffff811c2039>] drain_all_pages+0x19/0x20
[  243.704405]    
[  243.704407] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630
[  243.706721]    
[  243.706724] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630
[  243.708867]    
[  243.708870] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290
[  243.711000]    
[  243.711002] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240
[  243.713211]    
[  243.713213] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0
[  243.715366]    
[  243.715410] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs]
[  243.717625]    
[  243.717644] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs]
[  243.719889]    
[  243.719918] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[  243.722224]    
[  243.722242] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[  243.724493]    
[  243.724514] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.726690]    
[  243.726710] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.728933]    
[  243.728936] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.731064]    
[  243.731066] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.733192]    
[  243.733194] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.735312]    
[  243.735315] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.737523]    
[  243.737527] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.739577]    
[  243.739579] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.741702]    
[  243.741706] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.743932] 
[  243.744661] 
[  243.744661] stack backtrace:
[  243.746302] CPU: 1 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46
[  243.747963] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  243.750166] Call Trace:
[  243.751071]  dump_stack+0x85/0xc9
[  243.752110]  check_usage+0x4f9/0x680
[  243.753188]  check_irq_usage+0x4a/0xb0
[  243.754280]  __lock_acquire+0x1364/0x1bb0
[  243.755410]  lock_acquire+0xe0/0x2a0
[  243.756467]  ? get_online_cpus+0x32/0x80
[  243.757580]  get_online_cpus+0x58/0x80
[  243.758664]  ? get_online_cpus+0x32/0x80
[  243.759764]  drain_all_pages.part.80+0x27/0x320
[  243.760972]  drain_all_pages+0x19/0x20
[  243.762039]  __alloc_pages_nodemask+0x784/0x1630
[  243.763249]  ? rcu_read_lock_sched_held+0x91/0xa0
[  243.764466]  ? __alloc_pages_nodemask+0x2e6/0x1630
[  243.765689]  ? mark_held_locks+0x71/0x90
[  243.766780]  ? cache_grow_begin+0x4ac/0x630
[  243.767912]  cache_grow_begin+0xcf/0x630
[  243.768985]  ? ____cache_alloc_node+0x1bf/0x240
[  243.770173]  fallback_alloc+0x1e5/0x290
[  243.771233]  ____cache_alloc_node+0x235/0x240
[  243.772403]  ? kmem_zone_alloc+0x91/0x120 [xfs]
[  243.773576]  kmem_cache_alloc+0x26c/0x3e0
[  243.774671]  kmem_zone_alloc+0x91/0x120 [xfs]
[  243.775816]  xfs_da_state_alloc+0x15/0x20 [xfs]
[  243.776989]  xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[  243.778188]  xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[  243.779327]  xfs_lookup+0x7f/0x250 [xfs]
[  243.780394]  xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.781466]  lookup_open+0x54c/0x790
[  243.782440]  path_openat+0x55a/0xa90
[  243.783412]  do_filp_open+0x8c/0x100
[  243.784377]  ? _raw_spin_unlock+0x22/0x30
[  243.785418]  ? __alloc_fd+0xf2/0x210
[  243.786378]  do_sys_open+0x13a/0x200
[  243.787361]  SyS_open+0x19/0x20
[  243.788252]  do_syscall_64+0x67/0x1f0
[  243.789228]  entry_SYSCALL64_slow_path+0x25/0x25
[  243.790347] RIP: 0033:0x7fcf8dda06c7
[  243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7
[  243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08
[  243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000
[  243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000
[  243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890
[  253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  253.546121] awk cpuset=/ mems_allowed=0
[  253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
@ 2017-02-03 10:57                                             ` Tetsuo Handa
  0 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-03 10:57 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> [...]
> > > Regarding [1], it helped avoid the too_many_isolated() issue. I can't
> > > tell whether it has any negative effect, but on the first trial I saw that
> > > all allocating threads were blocked on wait_for_completion() from flush_work()
> > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > > workqueue context". There was no warn_alloc() stall warning message afterwards.
> > 
> > That patch is buggy and there is a follow up [1] which is not sitting in
> > mmotm (and thus linux-next) yet. I haven't had a chance to review it
> > properly, and I cannot say I would be too happy about using a WQ from the
> > page allocator. I believe even the follow up needs to use a WQ_RECLAIM WQ.
> > 
> > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
> 
> Did you get a chance to test with this follow-up patch? It would be
> interesting to see whether the OOM situation can still starve the waiter.
> The current linux-next should contain this patch.

So far I can't reproduce any problems except the two listed below (the
cond_resched() trap in printk() and the IDLE priority trap are excluded
from the list). But I agree that the follow-up patch needs to use a
WQ_RECLAIM WQ. It is theoretically possible that an allocation request
which can trigger the OOM killer waits for the system_wq while there is
already a work item on the system_wq which loops forever inside the page
allocator without ever triggering the OOM killer.
Maybe the follow-up patch can share the vmstat WQ?

(1) I got an assertion failure.

[  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
[  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
[  972.125085] ------------[ cut here ]------------
[  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
[  972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
[  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
[  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  972.178381] Call Trace:
[  972.180003]  dump_stack+0x85/0xc9
[  972.181682]  __warn+0xd1/0xf0
[  972.183374]  warn_slowpath_null+0x1d/0x20
[  972.185223]  asswarn+0x33/0x40 [xfs]
[  972.186950]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[  972.189055]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[  972.191263]  ? xfs_ilock+0x1c9/0x360 [xfs]
[  972.193414]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[  972.195300]  ? iomap_write_end+0x80/0x80
[  972.196980]  iomap_apply+0x6c/0x130
[  972.198539]  iomap_file_buffered_write+0x68/0xa0
[  972.200316]  ? iomap_write_end+0x80/0x80
[  972.201950]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[  972.203868]  ? _raw_spin_unlock+0x27/0x40
[  972.205470]  xfs_file_write_iter+0x90/0x130 [xfs]
[  972.207167]  __vfs_write+0xe5/0x140
[  972.208752]  vfs_write+0xc7/0x1f0
[  972.210233]  ? syscall_trace_enter+0x1d0/0x380
[  972.211809]  SyS_write+0x58/0xc0
[  972.213166]  do_int80_syscall_32+0x6c/0x1f0
[  972.214676]  entry_INT80_compat+0x38/0x50
[  972.216168] RIP: 0023:0x8048076
[  972.217494] RSP: 002b:00000000ff997020 EFLAGS: 00000202 ORIG_RAX: 0000000000000004
[  972.219635] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000
[  972.221679] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
[  972.223774] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  972.225905] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  972.227946] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  972.230064] ---[ end trace d498098daec56c11 ]---
[  984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  984.224191] vmtoolsd cpuset=/ mems_allowed=0
[  984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: G        W       4.10.0-rc6-next-20170202 #498

(2) I got a lockdep warning. (A new false positive?)

[  243.036975] =====================================================
[  243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
[  243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted
[  243.054619] -----------------------------------------------------
[  243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
[  243.060310]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80
[  243.063462] 
[  243.063462] and this task is already holding:
[  243.066851]  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.069949] which would create a new lock dependency:
[  243.072143]  (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++}
[  243.074789] 
[  243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock:
[  243.078735]  (&xfs_dir_ilock_class){++++-.}
[  243.078739] 
[  243.078739] ... which became RECLAIM_FS-irq-safe at:
[  243.084175]   
[  243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
[  243.087257]   
[  243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.090027]   
[  243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.092838]   
[  243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.095453]   
[  243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
[  243.098083]   
[  243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
[  243.100668]   
[  243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
[  243.103191]   
[  243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
[  243.105710]   
[  243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
[  243.107947]   
[  243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
[  243.110133]   
[  243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320
[  243.112262]   
[  243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10
[  243.114323]   
[  243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140
[  243.116448]   
[  243.116452] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.118692] 
[  243.118692] to a RECLAIM_FS-irq-unsafe lock:
[  243.120636]  (cpu_hotplug.dep_map){++++++}
[  243.120638] 
[  243.120638] ... which became RECLAIM_FS-irq-unsafe at:
[  243.124021] ...
[  243.124022]   
[  243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[  243.127033]   
[  243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[  243.129228]   
[  243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[  243.131534]   
[  243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
[  243.133850]   
[  243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
[  243.136113]   
[  243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[  243.138319]   
[  243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
[  243.140479]   
[  243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
[  243.142484]   
[  243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.144716]   
[  243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.146684]   
[  243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.148755]   
[  243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.150932]   
[  243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.153088]   
[  243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.155135] 
[  243.155135] other info that might help us debug this:
[  243.155135] 
[  243.157724]  Possible interrupt unsafe locking scenario:
[  243.157724] 
[  243.159877]        CPU0                    CPU1
[  243.161047]        ----                    ----
[  243.162210]   lock(cpu_hotplug.dep_map);
[  243.163279]                                local_irq_disable();
[  243.164669]                                lock(&xfs_dir_ilock_class);
[  243.166148]                                lock(cpu_hotplug.dep_map);
[  243.167653]   <Interrupt>
[  243.168594]     lock(&xfs_dir_ilock_class);
[  243.169694] 
[  243.169694]  *** DEADLOCK ***
[  243.169694] 
[  243.171864] 3 locks held by awk/8767:
[  243.172872]  #0:  (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90
[  243.174791]  #1:  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.176899]  #2:  (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] drain_all_pages.part.80+0x1a/0x320
[  243.178875] 
[  243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock:
[  243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 {
[  243.182610]    HARDIRQ-ON-W at:
[  243.183603]                     
[  243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0
[  243.186056]                     
[  243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.188419]                     
[  243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.190909]                     
[  243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.193257]                     
[  243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.195795]                     
[  243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.198204]                     
[  243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.200570]                     
[  243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.203086]                     
[  243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.205417]                     
[  243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.207711]                     
[  243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.210092]                     
[  243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180
[  243.212427]                     
[  243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40
[  243.214664]                     
[  243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0
[  243.217045]                     
[  243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0
[  243.219501]                     
[  243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00
[  243.222056]                     
[  243.222058] [<ffffffff81266767>] do_execve+0x27/0x30
[  243.224471]                     
[  243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30
[  243.226787]                     
[  243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.229178]                     
[  243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.231695]    HARDIRQ-ON-R at:
[  243.232709]                     
[  243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0
[  243.235161]                     
[  243.235164] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.237547]                     
[  243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
[  243.239930]                     
[  243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.242353]                     
[  243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.244978]                     
[  243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.247493]                     
[  243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.249910]                     
[  243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.252407]                     
[  243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
[  243.254747]                     
[  243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
[  243.257126]                     
[  243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
[  243.259495]                     
[  243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90
[  243.261804]                     
[  243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.264184]                     
[  243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.266595]                     
[  243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.268984]                     
[  243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  243.271702]    SOFTIRQ-ON-W at:
[  243.272726]                     
[  243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.275109]                     
[  243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.277426]                     
[  243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.279790]                     
[  243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.282192]                     
[  243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.284794]                     
[  243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.287259]                     
[  243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.289735]                     
[  243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.292205]                     
[  243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.294555]                     
[  243.294558] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.296897]                     
[  243.296900] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.299242]                     
[  243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180
[  243.301754]                     
[  243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40
[  243.304037]                     
[  243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0
[  243.306531]                     
[  243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0
[  243.308976]                     
[  243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00
[  243.311506]                     
[  243.311508] [<ffffffff81266767>] do_execve+0x27/0x30
[  243.313777]                     
[  243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30
[  243.316067]                     
[  243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.318429]                     
[  243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.320884]    SOFTIRQ-ON-R at:
[  243.321860]                     
[  243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.324251]                     
[  243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.326601]                     
[  243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
[  243.328966]                     
[  243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.331384]                     
[  243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.333978]                     
[  243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.336492]                     
[  243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.338926]                     
[  243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.341365]                     
[  243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
[  243.343694]                     
[  243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
[  243.346074]                     
[  243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
[  243.348443]                     
[  243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90
[  243.350753]                     
[  243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.353240]                     
[  243.353244] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.355581]                     
[  243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.358015]                     
[  243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  243.360586]    IN-RECLAIM_FS-W at:
[  243.361628]                        
[  243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
[  243.364273]                        
[  243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.366710]                        
[  243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[  243.369153]                        
[  243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[  243.371597]                        
[  243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
[  243.374339]                        
[  243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
[  243.377009]                        
[  243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
[  243.379659]                        
[  243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
[  243.382349]                        
[  243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
[  243.384907]                        
[  243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
[  243.387690]                        
[  243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320
[  243.390148]                        
[  243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10
[  243.392517]                        
[  243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140
[  243.394851]                        
[  243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.397246]    INITIAL USE at:
[  243.398227]                    
[  243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0
[  243.400646]                    
[  243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.402997]                    
[  243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
[  243.405351]                    
[  243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[  243.407778]                    
[  243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
[  243.410364]                    
[  243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
[  243.412989]                    
[  243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.415416]                    
[  243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.417871]                    
[  243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
[  243.420641]                    
[  243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
[  243.423039]                    
[  243.423041] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
[  243.425553]                    
[  243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90
[  243.427891]                    
[  243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.430249]                    
[  243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.432586]                    
[  243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.434839]                    
[  243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  243.437343]  }
[  243.438115]  ... key      at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs]
[  243.440082]  ... acquired at:
[  243.441047]    
[  243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0
[  243.443169]    
[  243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0
[  243.445366]    
[  243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.447471]    
[  243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.449601]    
[  243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320
[  243.452123]    
[  243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20
[  243.454264]    
[  243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630
[  243.456596]    
[  243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630
[  243.458774]    
[  243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290
[  243.460952]    
[  243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240
[  243.463199]    
[  243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0
[  243.465482]    
[  243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs]
[  243.467754]    
[  243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs]
[  243.470083]    
[  243.470101] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[  243.472427]    
[  243.472445] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[  243.474705]    
[  243.474726] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.476933]    
[  243.476954] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.479178]    
[  243.479180] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.481350]    
[  243.481352] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.483907]    
[  243.483910] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.486070]    
[  243.486073] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.488334]    
[  243.488338] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.490476]    
[  243.490480] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.492619]    
[  243.492623] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.494864] 
[  243.495618] 
[  243.495618] the dependencies between the lock to be acquired
[  243.495619]  and RECLAIM_FS-irq-unsafe lock:
[  243.498973] -> (cpu_hotplug.dep_map){++++++} ops: 838 {
[  243.500297]    HARDIRQ-ON-W at:
[  243.501292]                     
[  243.501295] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0
[  243.503718]                     
[  243.503719] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.506059]                     
[  243.506061] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0
[  243.508471]                     
[  243.508473] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0
[  243.510708]                     
[  243.510709] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.512997]                     
[  243.512999] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.515556]                     
[  243.515561] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.517807]                     
[  243.517810] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.520271]                     
[  243.520275] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.522538]                     
[  243.522540] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.524833]    HARDIRQ-ON-R at:
[  243.525801]                     
[  243.525803] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0
[  243.528152]                     
[  243.528153] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.530416]                     
[  243.530419] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.532696]                     
[  243.532698] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0
[  243.535039]                     
[  243.535041] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5
[  243.537451]                     
[  243.537453] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2
[  243.539744]                     
[  243.539746] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
[  243.542186]                     
[  243.542188] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
[  243.544603]                     
[  243.544605] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  243.547245]    SOFTIRQ-ON-W at:
[  243.548241]                     
[  243.548243] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.550559]                     
[  243.550561] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.552841]                     
[  243.552842] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0
[  243.555186]                     
[  243.555187] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0
[  243.557404]                     
[  243.557405] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.559654]                     
[  243.559656] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.561824]                     
[  243.561827] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.564048]                     
[  243.564050] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.566455]                     
[  243.566457] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.568731]                     
[  243.568733] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.571014]    SOFTIRQ-ON-R at:
[  243.571975]                     
[  243.571976] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
[  243.574328]                     
[  243.574330] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.576610]                     
[  243.576612] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.579161]                     
[  243.579165] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0
[  243.581537]                     
[  243.581539] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5
[  243.583982]                     
[  243.583984] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2
[  243.586304]                     
[  243.586306] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
[  243.588819]                     
[  243.588821] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
[  243.591227]                     
[  243.591229] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  243.593507]    RECLAIM_FS-ON-W at:
[  243.594519]                        
[  243.594520] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[  243.596888]                        
[  243.596895] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[  243.599331]                        
[  243.599334] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[  243.601872]                        
[  243.601874] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
[  243.604460]                        
[  243.604461] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
[  243.606950]                        
[  243.606952] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[  243.609463]                        
[  243.609465] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
[  243.612282]                        
[  243.612285] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
[  243.614604]                        
[  243.614606] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[  243.616929]                        
[  243.616930] [<ffffffff8109023e>] cpu_up+0xe/0x10
[  243.619208]                        
[  243.619211] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[  243.621518]                        
[  243.621520] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[  243.624018]                        
[  243.624020] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.626374]                        
[  243.626376] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.628771]    RECLAIM_FS-ON-R at:
[  243.629802]                        
[  243.629803] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[  243.632201]                        
[  243.632203] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[  243.634692]                        
[  243.634695] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[  243.637277]                        
[  243.637279] [<ffffffff8100cbb4>] allocate_shared_regs+0x24/0x70
[  243.639777]                        
[  243.639779] [<ffffffff8100cc32>] intel_pmu_cpu_prepare+0x32/0x140
[  243.643062]                        
[  243.643066] [<ffffffff810053db>] x86_pmu_prepare_cpu+0x3b/0x40
[  243.645553]                        
[  243.645556] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[  243.648095]                        
[  243.648097] [<ffffffff8108f29c>] cpuhp_issue_call+0xec/0x160
[  243.650536]                        
[  243.650539] [<ffffffff8108f6bb>] __cpuhp_setup_state+0x13b/0x1a0
[  243.653126]                        
[  243.653130] [<ffffffff81f427e9>] init_hw_perf_events+0x402/0x5b6
[  243.655652]                        
[  243.655655] [<ffffffff8100217c>] do_one_initcall+0x4c/0x1b0
[  243.658127]                        
[  243.658130] [<ffffffff81f3f333>] kernel_init_freeable+0x155/0x2a7
[  243.660653]                        
[  243.660656] [<ffffffff817048e9>] kernel_init+0x9/0x100
[  243.663048]                        
[  243.663050] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[  243.665436]    INITIAL USE at:
[  243.666403]                    
[  243.666405] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0
[  243.668790]                    
[  243.668791] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.671093]                    
[  243.671095] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.673455]                    
[  243.673458] [<ffffffff8108f5be>] __cpuhp_setup_state+0x3e/0x1a0
[  243.676126]                    
[  243.676130] [<ffffffff81f7660e>] page_alloc_init+0x23/0x3a
[  243.678510]                    
[  243.678512] [<ffffffff81f3eebe>] start_kernel+0x1a2/0x4c2
[  243.680851]                    
[  243.680853] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
[  243.683367]                    
[  243.683369] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
[  243.685812]                    
[  243.685815] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  243.688133]  }
[  243.688907]  ... key      at: [<ffffffff81c56848>] cpu_hotplug+0x108/0x140
[  243.690542]  ... acquired at:
[  243.691514]    
[  243.691517] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0
[  243.693655]    
[  243.693656] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0
[  243.695820]    
[  243.695822] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[  243.697926]    
[  243.697929] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
[  243.700042]    
[  243.700044] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320
[  243.702285]    
[  243.702286] [<ffffffff811c2039>] drain_all_pages+0x19/0x20
[  243.704405]    
[  243.704407] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630
[  243.706721]    
[  243.706724] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630
[  243.708867]    
[  243.708870] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290
[  243.711000]    
[  243.711002] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240
[  243.713211]    
[  243.713213] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0
[  243.715366]    
[  243.715410] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs]
[  243.717625]    
[  243.717644] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs]
[  243.719889]    
[  243.719918] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[  243.722224]    
[  243.722242] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[  243.724493]    
[  243.724514] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
[  243.726690]    
[  243.726710] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.728933]    
[  243.728936] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
[  243.731064]    
[  243.731066] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
[  243.733192]    
[  243.733194] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
[  243.735312]    
[  243.735315] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
[  243.737523]    
[  243.737527] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
[  243.739577]    
[  243.739579] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
[  243.741702]    
[  243.741706] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
[  243.743932] 
[  243.744661] 
[  243.744661] stack backtrace:
[  243.746302] CPU: 1 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46
[  243.747963] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[  243.750166] Call Trace:
[  243.751071]  dump_stack+0x85/0xc9
[  243.752110]  check_usage+0x4f9/0x680
[  243.753188]  check_irq_usage+0x4a/0xb0
[  243.754280]  __lock_acquire+0x1364/0x1bb0
[  243.755410]  lock_acquire+0xe0/0x2a0
[  243.756467]  ? get_online_cpus+0x32/0x80
[  243.757580]  get_online_cpus+0x58/0x80
[  243.758664]  ? get_online_cpus+0x32/0x80
[  243.759764]  drain_all_pages.part.80+0x27/0x320
[  243.760972]  drain_all_pages+0x19/0x20
[  243.762039]  __alloc_pages_nodemask+0x784/0x1630
[  243.763249]  ? rcu_read_lock_sched_held+0x91/0xa0
[  243.764466]  ? __alloc_pages_nodemask+0x2e6/0x1630
[  243.765689]  ? mark_held_locks+0x71/0x90
[  243.766780]  ? cache_grow_begin+0x4ac/0x630
[  243.767912]  cache_grow_begin+0xcf/0x630
[  243.768985]  ? ____cache_alloc_node+0x1bf/0x240
[  243.770173]  fallback_alloc+0x1e5/0x290
[  243.771233]  ____cache_alloc_node+0x235/0x240
[  243.772403]  ? kmem_zone_alloc+0x91/0x120 [xfs]
[  243.773576]  kmem_cache_alloc+0x26c/0x3e0
[  243.774671]  kmem_zone_alloc+0x91/0x120 [xfs]
[  243.775816]  xfs_da_state_alloc+0x15/0x20 [xfs]
[  243.776989]  xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[  243.778188]  xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[  243.779327]  xfs_lookup+0x7f/0x250 [xfs]
[  243.780394]  xfs_vn_lookup+0x6b/0xb0 [xfs]
[  243.781466]  lookup_open+0x54c/0x790
[  243.782440]  path_openat+0x55a/0xa90
[  243.783412]  do_filp_open+0x8c/0x100
[  243.784377]  ? _raw_spin_unlock+0x22/0x30
[  243.785418]  ? __alloc_fd+0xf2/0x210
[  243.786378]  do_sys_open+0x13a/0x200
[  243.787361]  SyS_open+0x19/0x20
[  243.788252]  do_syscall_64+0x67/0x1f0
[  243.789228]  entry_SYSCALL64_slow_path+0x25/0x25
[  243.790347] RIP: 0033:0x7fcf8dda06c7
[  243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7
[  243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08
[  243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000
[  243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000
[  243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890
[  253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
[  253.546121] awk cpuset=/ mems_allowed=0
[  253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 10:57                                             ` Tetsuo Handa
@ 2017-02-03 14:41                                               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-03 14:41 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> > [...]
> > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > > > tell whether it has any negative effect, but I got on the first trial that
> > > > all allocating threads are blocked on wait_for_completion() from flush_work()
> > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > > > workqueue context". There was no warn_alloc() stall warning message afterwords.
> > > 
> > > That patch is buggy and there is a follow up [1] which is not sitting in the
> > > mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > > I cannot say I would be too happy about using WQ from the page
> > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
> > > 
> > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
> > 
> > Did you get chance to test with this follow up patch? It would be
> > interesting to see whether OOM situation can still starve the waiter.
> > The current linux-next should contain this patch.
> 
> So far I can't reproduce problems except two listed below (cond_resched() trap
> in printk() and IDLE priority trap are excluded from the list). But I agree that
> the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically possible
> that an allocation request which can trigger the OOM killer waits for the
> system_wq while there is already a work which is in system_wq which is looping
> forever inside the page allocator without triggering the OOM killer.

Well, this shouldn't happen AFAICS, because a new worker would be
requested and creating it would certainly require a memory allocation,
which would itself trigger the OOM killer. On the other hand, I agree
that it would be safer not to depend on memory allocations from within
the page allocator.

> Maybe the follow up patch can share the vmstat WQ?

Yes, this would be an option.
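
To be explicit about what I mean by a WQ_RECLAIM WQ, a dedicated
reclaim-safe workqueue would be allocated along these lines (just a
sketch; the "pcpu_drain" name and the init hook are made up here, not
taken from any actual patch):

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

static struct workqueue_struct *pcpu_drain_wq;

static int __init pcpu_drain_wq_init(void)
{
	/*
	 * WQ_MEM_RECLAIM guarantees a rescuer thread, so work queued
	 * here can still make forward progress even when a new worker
	 * cannot be created because the system is short on memory.
	 */
	pcpu_drain_wq = alloc_workqueue("pcpu_drain", WQ_MEM_RECLAIM, 0);
	if (!pcpu_drain_wq)
		return -ENOMEM;
	return 0;
}
subsys_initcall(pcpu_drain_wq_init);

Queueing the drain work on such a WQ (or on an existing WQ_MEM_RECLAIM
one like the vmstat WQ) removes the dependency on allocating a worker
from within the page allocator.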
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 10:57                                             ` Tetsuo Handa
@ 2017-02-03 14:50                                               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-03 14:50 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes,
	linux-kernel, Darrick J. Wong, linux-xfs

[Let's CC more xfs people]

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
[...]
> (1) I got an assertion failure.

I suspect this is a result of
http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
but I have no idea what the assert means.

> 
> [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> [  972.125085] ------------[ cut here ]------------
> [  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
> [  972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
> [  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
> [  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
> [  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  972.178381] Call Trace:
> [  972.180003]  dump_stack+0x85/0xc9
> [  972.181682]  __warn+0xd1/0xf0
> [  972.183374]  warn_slowpath_null+0x1d/0x20
> [  972.185223]  asswarn+0x33/0x40 [xfs]
> [  972.186950]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
> [  972.189055]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
> [  972.191263]  ? xfs_ilock+0x1c9/0x360 [xfs]
> [  972.193414]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
> [  972.195300]  ? iomap_write_end+0x80/0x80
> [  972.196980]  iomap_apply+0x6c/0x130
> [  972.198539]  iomap_file_buffered_write+0x68/0xa0
> [  972.200316]  ? iomap_write_end+0x80/0x80
> [  972.201950]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> [  972.203868]  ? _raw_spin_unlock+0x27/0x40
> [  972.205470]  xfs_file_write_iter+0x90/0x130 [xfs]
> [  972.207167]  __vfs_write+0xe5/0x140
> [  972.208752]  vfs_write+0xc7/0x1f0
> [  972.210233]  ? syscall_trace_enter+0x1d0/0x380
> [  972.211809]  SyS_write+0x58/0xc0
> [  972.213166]  do_int80_syscall_32+0x6c/0x1f0
> [  972.214676]  entry_INT80_compat+0x38/0x50
> [  972.216168] RIP: 0023:0x8048076
> [  972.217494] RSP: 002b:00000000ff997020 EFLAGS: 00000202 ORIG_RAX: 0000000000000004
> [  972.219635] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000
> [  972.221679] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
> [  972.223774] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [  972.225905] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [  972.227946] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  972.230064] ---[ end trace d498098daec56c11 ]---
> [  984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
> [  984.224191] vmtoolsd cpuset=/ mems_allowed=0
> [  984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: G        W       4.10.0-rc6-next-20170202 #498
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 10:57                                             ` Tetsuo Handa
@ 2017-02-03 14:55                                               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-03 14:55 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, Peter Zijlstra

[CC Petr]

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
[...]
> (2) I got a lockdep warning. (A new false positive?)

Yes, I suspect this is a false positive. I do not see how we could
deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL), which
means that a potential recursion into the page allocator during draining
would just bail out on the trylock.
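For reference, the locking at the top of drain_all_pages looks roughly
like this (paraphrased from memory, not a verbatim copy of
mm/page_alloc.c):

	/*
	 * Do not block on the mutex for a global (zone == NULL) drain;
	 * if a drain is already in flight, just bail out.
	 */
	if (!mutex_trylock(&pcpu_drain_mutex)) {
		if (!zone)
			return;
		mutex_lock(&pcpu_drain_mutex);
	}

so a reclaim-context caller recursing into drain_all_pages(NULL) would
return immediately rather than block on pcpu_drain_mutex. Maybe I am
misinterpreting the report, though.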

> [  243.036975] =====================================================
> [  243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
> [  243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted
> [  243.054619] -----------------------------------------------------
> [  243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> [  243.060310]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80
> [  243.063462] 
> [  243.063462] and this task is already holding:
> [  243.066851]  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
> [  243.069949] which would create a new lock dependency:
> [  243.072143]  (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++}
> [  243.074789] 
> [  243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> [  243.078735]  (&xfs_dir_ilock_class){++++-.}
> [  243.078739] 
> [  243.078739] ... which became RECLAIM_FS-irq-safe at:
> [  243.084175]   
> [  243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
> [  243.087257]   
> [  243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.090027]   
> [  243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
> [  243.092838]   
> [  243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
> [  243.095453]   
> [  243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
> [  243.098083]   
> [  243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
> [  243.100668]   
> [  243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
> [  243.103191]   
> [  243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
> [  243.105710]   
> [  243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
> [  243.107947]   
> [  243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
> [  243.110133]   
> [  243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320
> [  243.112262]   
> [  243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10
> [  243.114323]   
> [  243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140
> [  243.116448]   
> [  243.116452] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.118692] 
> [  243.118692] to a RECLAIM_FS-irq-unsafe lock:
> [  243.120636]  (cpu_hotplug.dep_map){++++++}
> [  243.120638] 
> [  243.120638] ... which became RECLAIM_FS-irq-unsafe at:
> [  243.124021] ...
> [  243.124022]   
> [  243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
> [  243.127033]   
> [  243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
> [  243.129228]   
> [  243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
> [  243.131534]   
> [  243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
> [  243.133850]   
> [  243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
> [  243.136113]   
> [  243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
> [  243.138319]   
> [  243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
> [  243.140479]   
> [  243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
> [  243.142484]   
> [  243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
> [  243.144716]   
> [  243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10
> [  243.146684]   
> [  243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141
> [  243.148755]   
> [  243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
> [  243.150932]   
> [  243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100
> [  243.153088]   
> [  243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.155135] 
> [  243.155135] other info that might help us debug this:
> [  243.155135] 
> [  243.157724]  Possible interrupt unsafe locking scenario:
> [  243.157724] 
> [  243.159877]        CPU0                    CPU1
> [  243.161047]        ----                    ----
> [  243.162210]   lock(cpu_hotplug.dep_map);
> [  243.163279]                                local_irq_disable();
> [  243.164669]                                lock(&xfs_dir_ilock_class);
> [  243.166148]                                lock(cpu_hotplug.dep_map);
> [  243.167653]   <Interrupt>
> [  243.168594]     lock(&xfs_dir_ilock_class);
> [  243.169694] 
> [  243.169694]  *** DEADLOCK ***
> [  243.169694] 
> [  243.171864] 3 locks held by awk/8767:
> [  243.172872]  #0:  (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90
> [  243.174791]  #1:  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
> [  243.176899]  #2:  (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] drain_all_pages.part.80+0x1a/0x320
> [  243.178875] 
> [  243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock:
> [  243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 {
> [  243.182610]    HARDIRQ-ON-W at:
> [  243.183603]                     
> [  243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0
> [  243.186056]                     
> [  243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.188419]                     
> [  243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
> [  243.190909]                     
> [  243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
> [  243.193257]                     
> [  243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
> [  243.195795]                     
> [  243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
> [  243.198204]                     
> [  243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.200570]                     
> [  243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.203086]                     
> [  243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
> [  243.205417]                     
> [  243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
> [  243.207711]                     
> [  243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.210092]                     
> [  243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180
> [  243.212427]                     
> [  243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40
> [  243.214664]                     
> [  243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0
> [  243.217045]                     
> [  243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0
> [  243.219501]                     
> [  243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00
> [  243.222056]                     
> [  243.222058] [<ffffffff81266767>] do_execve+0x27/0x30
> [  243.224471]                     
> [  243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30
> [  243.226787]                     
> [  243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
> [  243.229178]                     
> [  243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
> [  243.231695]    HARDIRQ-ON-R at:
> [  243.232709]                     
> [  243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0
> [  243.235161]                     
> [  243.235164] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.237547]                     
> [  243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
> [  243.239930]                     
> [  243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
> [  243.242353]                     
> [  243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
> [  243.244978]                     
> [  243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
> [  243.247493]                     
> [  243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.249910]                     
> [  243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.252407]                     
> [  243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
> [  243.254747]                     
> [  243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
> [  243.257126]                     
> [  243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
> [  243.259495]                     
> [  243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90
> [  243.261804]                     
> [  243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.264184]                     
> [  243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
> [  243.266595]                     
> [  243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
> [  243.268984]                     
> [  243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
> [  243.271702]    SOFTIRQ-ON-W at:
> [  243.272726]                     
> [  243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
> [  243.275109]                     
> [  243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.277426]                     
> [  243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
> [  243.279790]                     
> [  243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
> [  243.282192]                     
> [  243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
> [  243.284794]                     
> [  243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
> [  243.287259]                     
> [  243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.289735]                     
> [  243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.292205]                     
> [  243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
> [  243.294555]                     
> [  243.294558] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
> [  243.296897]                     
> [  243.296900] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.299242]                     
> [  243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180
> [  243.301754]                     
> [  243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40
> [  243.304037]                     
> [  243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0
> [  243.306531]                     
> [  243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0
> [  243.308976]                     
> [  243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00
> [  243.311506]                     
> [  243.311508] [<ffffffff81266767>] do_execve+0x27/0x30
> [  243.313777]                     
> [  243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30
> [  243.316067]                     
> [  243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
> [  243.318429]                     
> [  243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
> [  243.320884]    SOFTIRQ-ON-R at:
> [  243.321860]                     
> [  243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
> [  243.324251]                     
> [  243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.326601]                     
> [  243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
> [  243.328966]                     
> [  243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
> [  243.331384]                     
> [  243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
> [  243.333978]                     
> [  243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
> [  243.336492]                     
> [  243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.338926]                     
> [  243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.341365]                     
> [  243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
> [  243.343694]                     
> [  243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
> [  243.346074]                     
> [  243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
> [  243.348443]                     
> [  243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90
> [  243.350753]                     
> [  243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.353240]                     
> [  243.353244] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
> [  243.355581]                     
> [  243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
> [  243.358015]                     
> [  243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
> [  243.360586]    IN-RECLAIM_FS-W at:
> [  243.361628]                        
> [  243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
> [  243.364273]                        
> [  243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.366710]                        
> [  243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
> [  243.369153]                        
> [  243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
> [  243.371597]                        
> [  243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
> [  243.374339]                        
> [  243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
> [  243.377009]                        
> [  243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
> [  243.379659]                        
> [  243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
> [  243.382349]                        
> [  243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
> [  243.384907]                        
> [  243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
> [  243.387690]                        
> [  243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320
> [  243.390148]                        
> [  243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10
> [  243.392517]                        
> [  243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140
> [  243.394851]                        
> [  243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.397246]    INITIAL USE at:
> [  243.398227]                    
> [  243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0
> [  243.400646]                    
> [  243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.402997]                    
> [  243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0
> [  243.405351]                    
> [  243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
> [  243.407778]                    
> [  243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs]
> [  243.410364]                    
> [  243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs]
> [  243.412989]                    
> [  243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.415416]                    
> [  243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.417871]                    
> [  243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220
> [  243.420641]                    
> [  243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0
> [  243.423039]                    
> [  243.423041] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580
> [  243.425553]                    
> [  243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90
> [  243.427891]                    
> [  243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.430249]                    
> [  243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
> [  243.432586]                    
> [  243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
> [  243.434839]                    
> [  243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2
> [  243.437343]  }
> [  243.438115]  ... key      at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs]
> [  243.440082]  ... acquired at:
> [  243.441047]    
> [  243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0
> [  243.443169]    
> [  243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0
> [  243.445366]    
> [  243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.447471]    
> [  243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
> [  243.449601]    
> [  243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320
> [  243.452123]    
> [  243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20
> [  243.454264]    
> [  243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630
> [  243.456596]    
> [  243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630
> [  243.458774]    
> [  243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290
> [  243.460952]    
> [  243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240
> [  243.463199]    
> [  243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0
> [  243.465482]    
> [  243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs]
> [  243.467754]    
> [  243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs]
> [  243.470083]    
> [  243.470101] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
> [  243.472427]    
> [  243.472445] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
> [  243.474705]    
> [  243.474726] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.476933]    
> [  243.476954] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.479178]    
> [  243.479180] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
> [  243.481350]    
> [  243.481352] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
> [  243.483907]    
> [  243.483910] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.486070]    
> [  243.486073] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
> [  243.488334]    
> [  243.488338] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
> [  243.490476]    
> [  243.490480] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
> [  243.492619]    
> [  243.492623] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
> [  243.494864] 
> [  243.495618] 
> [  243.495618] the dependencies between the lock to be acquired
> [  243.495619]  and RECLAIM_FS-irq-unsafe lock:
> [  243.498973] -> (cpu_hotplug.dep_map){++++++} ops: 838 {
> [  243.500297]    HARDIRQ-ON-W at:
> [  243.501292]                     
> [  243.501295] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0
> [  243.503718]                     
> [  243.503719] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.506059]                     
> [  243.506061] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0
> [  243.508471]                     
> [  243.508473] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0
> [  243.510708]                     
> [  243.510709] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
> [  243.512997]                     
> [  243.512999] [<ffffffff8109023e>] cpu_up+0xe/0x10
> [  243.515556]                     
> [  243.515561] [<ffffffff81f6f446>] smp_init+0xd5/0x141
> [  243.517807]                     
> [  243.517810] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
> [  243.520271]                     
> [  243.520275] [<ffffffff817048e9>] kernel_init+0x9/0x100
> [  243.522538]                     
> [  243.522540] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.524833]    HARDIRQ-ON-R at:
> [  243.525801]                     
> [  243.525803] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0
> [  243.528152]                     
> [  243.528153] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.530416]                     
> [  243.530419] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
> [  243.532696]                     
> [  243.532698] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0
> [  243.535039]                     
> [  243.535041] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5
> [  243.537451]                     
> [  243.537453] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2
> [  243.539744]                     
> [  243.539746] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
> [  243.542186]                     
> [  243.542188] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
> [  243.544603]                     
> [  243.544605] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
> [  243.547245]    SOFTIRQ-ON-W at:
> [  243.548241]                     
> [  243.548243] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
> [  243.550559]                     
> [  243.550561] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.552841]                     
> [  243.552842] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0
> [  243.555186]                     
> [  243.555187] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0
> [  243.557404]                     
> [  243.557405] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
> [  243.559654]                     
> [  243.559656] [<ffffffff8109023e>] cpu_up+0xe/0x10
> [  243.561824]                     
> [  243.561827] [<ffffffff81f6f446>] smp_init+0xd5/0x141
> [  243.564048]                     
> [  243.564050] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
> [  243.566455]                     
> [  243.566457] [<ffffffff817048e9>] kernel_init+0x9/0x100
> [  243.568731]                     
> [  243.568733] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.571014]    SOFTIRQ-ON-R at:
> [  243.571975]                     
> [  243.571976] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0
> [  243.574328]                     
> [  243.574330] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.576610]                     
> [  243.576612] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
> [  243.579161]                     
> [  243.579165] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0
> [  243.581537]                     
> [  243.581539] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5
> [  243.583982]                     
> [  243.583984] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2
> [  243.586304]                     
> [  243.586306] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
> [  243.588819]                     
> [  243.588821] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
> [  243.591227]                     
> [  243.591229] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
> [  243.593507]    RECLAIM_FS-ON-W at:
> [  243.594519]                        
> [  243.594520] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
> [  243.596888]                        
> [  243.596895] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
> [  243.599331]                        
> [  243.599334] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
> [  243.601872]                        
> [  243.601874] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
> [  243.604460]                        
> [  243.604461] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
> [  243.606950]                        
> [  243.606952] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
> [  243.609463]                        
> [  243.609465] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
> [  243.612282]                        
> [  243.612285] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
> [  243.614604]                        
> [  243.614606] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
> [  243.616929]                        
> [  243.616930] [<ffffffff8109023e>] cpu_up+0xe/0x10
> [  243.619208]                        
> [  243.619211] [<ffffffff81f6f446>] smp_init+0xd5/0x141
> [  243.621518]                        
> [  243.621520] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
> [  243.624018]                        
> [  243.624020] [<ffffffff817048e9>] kernel_init+0x9/0x100
> [  243.626374]                        
> [  243.626376] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.628771]    RECLAIM_FS-ON-R at:
> [  243.629802]                        
> [  243.629803] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
> [  243.632201]                        
> [  243.632203] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
> [  243.634692]                        
> [  243.634695] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
> [  243.637277]                        
> [  243.637279] [<ffffffff8100cbb4>] allocate_shared_regs+0x24/0x70
> [  243.639777]                        
> [  243.639779] [<ffffffff8100cc32>] intel_pmu_cpu_prepare+0x32/0x140
> [  243.643062]                        
> [  243.643066] [<ffffffff810053db>] x86_pmu_prepare_cpu+0x3b/0x40
> [  243.645553]                        
> [  243.645556] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
> [  243.648095]                        
> [  243.648097] [<ffffffff8108f29c>] cpuhp_issue_call+0xec/0x160
> [  243.650536]                        
> [  243.650539] [<ffffffff8108f6bb>] __cpuhp_setup_state+0x13b/0x1a0
> [  243.653126]                        
> [  243.653130] [<ffffffff81f427e9>] init_hw_perf_events+0x402/0x5b6
> [  243.655652]                        
> [  243.655655] [<ffffffff8100217c>] do_one_initcall+0x4c/0x1b0
> [  243.658127]                        
> [  243.658130] [<ffffffff81f3f333>] kernel_init_freeable+0x155/0x2a7
> [  243.660653]                        
> [  243.660656] [<ffffffff817048e9>] kernel_init+0x9/0x100
> [  243.663048]                        
> [  243.663050] [<ffffffff81715081>] ret_from_fork+0x31/0x40
> [  243.665436]    INITIAL USE at:
> [  243.666403]                    
> [  243.666405] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0
> [  243.668790]                    
> [  243.668791] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.671093]                    
> [  243.671095] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
> [  243.673455]                    
> [  243.673458] [<ffffffff8108f5be>] __cpuhp_setup_state+0x3e/0x1a0
> [  243.676126]                    
> [  243.676130] [<ffffffff81f7660e>] page_alloc_init+0x23/0x3a
> [  243.678510]                    
> [  243.678512] [<ffffffff81f3eebe>] start_kernel+0x1a2/0x4c2
> [  243.680851]                    
> [  243.680853] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c
> [  243.683367]                    
> [  243.683369] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f
> [  243.685812]                    
> [  243.685815] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
> [  243.688133]  }
> [  243.688907]  ... key      at: [<ffffffff81c56848>] cpu_hotplug+0x108/0x140
> [  243.690542]  ... acquired at:
> [  243.691514]    
> [  243.691517] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0
> [  243.693655]    
> [  243.693656] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0
> [  243.695820]    
> [  243.695822] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
> [  243.697926]    
> [  243.697929] [<ffffffff8108de18>] get_online_cpus+0x58/0x80
> [  243.700042]    
> [  243.700044] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320
> [  243.702285]    
> [  243.702286] [<ffffffff811c2039>] drain_all_pages+0x19/0x20
> [  243.704405]    
> [  243.704407] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630
> [  243.706721]    
> [  243.706724] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630
> [  243.708867]    
> [  243.708870] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290
> [  243.711000]    
> [  243.711002] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240
> [  243.713211]    
> [  243.713213] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0
> [  243.715366]    
> [  243.715410] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs]
> [  243.717625]    
> [  243.717644] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs]
> [  243.719889]    
> [  243.719918] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
> [  243.722224]    
> [  243.722242] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
> [  243.724493]    
> [  243.724514] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs]
> [  243.726690]    
> [  243.726710] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.728933]    
> [  243.728936] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790
> [  243.731064]    
> [  243.731066] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90
> [  243.733192]    
> [  243.733194] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100
> [  243.735312]    
> [  243.735315] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200
> [  243.737523]    
> [  243.737527] [<ffffffff8125c1c9>] SyS_open+0x19/0x20
> [  243.739577]    
> [  243.739579] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0
> [  243.741702]    
> [  243.741706] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a
> [  243.743932] 
> [  243.744661] 
> [  243.744661] stack backtrace:
> [  243.746302] CPU: 1 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46
> [  243.747963] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [  243.750166] Call Trace:
> [  243.751071]  dump_stack+0x85/0xc9
> [  243.752110]  check_usage+0x4f9/0x680
> [  243.753188]  check_irq_usage+0x4a/0xb0
> [  243.754280]  __lock_acquire+0x1364/0x1bb0
> [  243.755410]  lock_acquire+0xe0/0x2a0
> [  243.756467]  ? get_online_cpus+0x32/0x80
> [  243.757580]  get_online_cpus+0x58/0x80
> [  243.758664]  ? get_online_cpus+0x32/0x80
> [  243.759764]  drain_all_pages.part.80+0x27/0x320
> [  243.760972]  drain_all_pages+0x19/0x20
> [  243.762039]  __alloc_pages_nodemask+0x784/0x1630
> [  243.763249]  ? rcu_read_lock_sched_held+0x91/0xa0
> [  243.764466]  ? __alloc_pages_nodemask+0x2e6/0x1630
> [  243.765689]  ? mark_held_locks+0x71/0x90
> [  243.766780]  ? cache_grow_begin+0x4ac/0x630
> [  243.767912]  cache_grow_begin+0xcf/0x630
> [  243.768985]  ? ____cache_alloc_node+0x1bf/0x240
> [  243.770173]  fallback_alloc+0x1e5/0x290
> [  243.771233]  ____cache_alloc_node+0x235/0x240
> [  243.772403]  ? kmem_zone_alloc+0x91/0x120 [xfs]
> [  243.773576]  kmem_cache_alloc+0x26c/0x3e0
> [  243.774671]  kmem_zone_alloc+0x91/0x120 [xfs]
> [  243.775816]  xfs_da_state_alloc+0x15/0x20 [xfs]
> [  243.776989]  xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
> [  243.778188]  xfs_dir_lookup+0x1a5/0x1c0 [xfs]
> [  243.779327]  xfs_lookup+0x7f/0x250 [xfs]
> [  243.780394]  xfs_vn_lookup+0x6b/0xb0 [xfs]
> [  243.781466]  lookup_open+0x54c/0x790
> [  243.782440]  path_openat+0x55a/0xa90
> [  243.783412]  do_filp_open+0x8c/0x100
> [  243.784377]  ? _raw_spin_unlock+0x22/0x30
> [  243.785418]  ? __alloc_fd+0xf2/0x210
> [  243.786378]  do_sys_open+0x13a/0x200
> [  243.787361]  SyS_open+0x19/0x20
> [  243.788252]  do_syscall_64+0x67/0x1f0
> [  243.789228]  entry_SYSCALL64_slow_path+0x25/0x25
> [  243.790347] RIP: 0033:0x7fcf8dda06c7
> [  243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
> [  243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7
> [  243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08
> [  243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000
> [  243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000
> [  243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890
> [  253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null),  order=0, oom_score_adj=0
> [  253.546121] awk cpuset=/ mems_allowed=0
> [  253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 14:50                                               ` Michal Hocko
@ 2017-02-03 17:24                                                 ` Brian Foster
  -1 siblings, 0 replies; 110+ messages in thread
From: Brian Foster @ 2017-02-03 17:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, david, dchinner, hch, mgorman, viro, linux-mm,
	hannes, linux-kernel, Darrick J. Wong, linux-xfs

On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> [Let's CC more xfs people]
> 
> On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> [...]
> > (1) I got an assertion failure.
> 
> I suspect this is a result of
> http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> I have no idea what the assert means though.
> 
> > 
> > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867

Indirect block reservation underrun on delayed allocation extent merge.
These extra blocks are used for the inode bmap btree when a delalloc
extent is converted to physical blocks. We're in a case where we expect
only ever to free excess blocks on a merge of extents with independent
reservations, but a situation occurs where we actually need blocks and
hence the assert fails. This can occur if an extent is merged with one
whose reservation is smaller than the expected worst case for its size
(due to previous extent splits caused by hole punches, for example).
Therefore, I think the core expectation that
xfs_bmap_add_extent_hole_delay() will always have enough blocks
pre-reserved is invalid.
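
To make the failure mode concrete, here is a minimal sketch (plain C
with hypothetical numbers, not XFS code; worst_indlen() just stands in
for xfs_bmap_worst_indlen()) of how a merge can need more indlen blocks
than the two extents have reserved between them:

#include <assert.h>
#include <stdio.h>

/* stand-in for xfs_bmap_worst_indlen(): the worst-case bmbt block
 * reservation grows with the extent size */
static unsigned long worst_indlen(unsigned long blockcount)
{
	return blockcount / 64 + 1;
}

int main(void)
{
	/* the left extent was split by hole punches, so it kept only
	 * a small slice of its original worst-case reservation */
	unsigned long left_len = 1000, left_indlen = 4;
	unsigned long new_len = 100, new_indlen = worst_indlen(new_len);

	unsigned long oldlen = left_indlen + new_indlen;	 /* 6  */
	unsigned long newlen = worst_indlen(left_len + new_len); /* 18 */

	printf("oldlen=%lu newlen=%lu\n", oldlen, newlen);
	assert(oldlen > newlen);	/* trips: the merge *needs* blocks */
	return 0;
}

Clamping newlen to min(worst_indlen, oldlen), as the diff below does
with XFS_FILBLKS_MIN(), means a merge can never demand more than what
was actually reserved.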

Can you describe the workload that reproduces this? FWIW, I think the
way xfs_bmap_add_extent_hole_delay() currently works is likely broken,
and I have a couple of patches to fix up indlen reservation that I
haven't posted yet. The diff that deals with this particular bit is
appended. Care to give that a try?

Brian

> > [  972.125085] ------------[ cut here ]------------
> > [  972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
> > [  972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
> > [  972.163630]  sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
> > [  972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
> > [  972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> > [  972.178381] Call Trace:
...

---8<---

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bfc00de..d2e48ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2809,7 +2809,8 @@ xfs_bmap_add_extent_hole_delay(
 		oldlen = startblockval(left.br_startblock) +
 			startblockval(new->br_startblock) +
 			startblockval(right.br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
 			nullstartblock((int)newlen));
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2830,7 +2831,8 @@ xfs_bmap_add_extent_hole_delay(
 		xfs_bmbt_set_blockcount(xfs_iext_get_ext(ifp, *idx), temp);
 		oldlen = startblockval(left.br_startblock) +
 			startblockval(new->br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
 			nullstartblock((int)newlen));
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2846,7 +2848,8 @@ xfs_bmap_add_extent_hole_delay(
 		temp = new->br_blockcount + right.br_blockcount;
 		oldlen = startblockval(new->br_startblock) +
 			startblockval(right.br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_allf(xfs_iext_get_ext(ifp, *idx),
 			new->br_startoff,
 			nullstartblock((int)newlen), temp, right.br_state);

^ permalink raw reply related	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 14:55                                               ` Michal Hocko
@ 2017-02-05 10:43                                                 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-05 10:43 UTC (permalink / raw)
  To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, peterz

Michal Hocko wrote:
> [CC Petr]
> 
> On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> [...]
> > (2) I got a lockdep warning. (A new false positive?)
> 
> Yes, I suspect this is a false positive. I do not see how we can
> deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which
> means that a potential recursion to the page allocator during draining
> would just bail out on the trylock. Maybe I am misinterpreting the
> report though.
> 
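
For reference, the bail-out he means is the trylock at the top of
drain_all_pages(); roughly like this sketch (simplified from memory,
not the exact mm/page_alloc.c code):

#include <linux/mutex.h>
#include <linux/cpu.h>

static DEFINE_MUTEX(pcpu_drain_mutex);

static void drain_all_pages_sketch(void)
{
	/* a recursive entry from within the drain (e.g. an allocation
	 * made while draining) gives up here instead of blocking */
	if (!mutex_trylock(&pcpu_drain_mutex))
		return;

	get_online_cpus();	/* the cpu_hotplug dependency lockdep
				 * complains about below */
	/* ... queue the per-cpu drain work and wait for it ... */
	put_online_cpus();
	mutex_unlock(&pcpu_drain_mutex);
}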

I got the same warning with ext4. Maybe we need to check this carefully.

[  511.215743] =====================================================
[  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
[  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
[  511.221689] -----------------------------------------------------
[  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
[  511.225533]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80
[  511.227795] 
[  511.227795] and this task is already holding:
[  511.230082]  (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590
[  511.232592] which would create a new lock dependency:
[  511.234192]  (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++}
[  511.235966] 
[  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
[  511.238563]  (jbd2_handle){++++-.}
[  511.238564] 
[  511.238564] ... which became RECLAIM_FS-irq-safe at:
[  511.242078]   
[  511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640
[  511.244492]   
[  511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.246694]   
[  511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.249323]   
[  511.249328] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.252069]   
[  511.252074] [<ffffffff813592d6>] ext4_evict_inode+0x356/0x760
[  511.254753]   
[  511.254757] [<ffffffff812c9f61>] evict+0xd1/0x1a0
[  511.257062]   
[  511.257065] [<ffffffff812ca07d>] dispose_list+0x4d/0x80
[  511.259531]   
[  511.259535] [<ffffffff812cb3da>] prune_icache_sb+0x5a/0x80
[  511.261953]   
[  511.261957] [<ffffffff812acf41>] super_cache_scan+0x141/0x190
[  511.264540]   
[  511.264545] [<ffffffff812102ef>] shrink_slab+0x29f/0x6d0
[  511.267165]   
[  511.267171] [<ffffffff812154aa>] shrink_node+0x2fa/0x310
[  511.269455]   
[  511.269459] [<ffffffff812169d2>] kswapd+0x362/0x9b0
[  511.271831]   
[  511.271834] [<ffffffff810ca72f>] kthread+0x10f/0x150
[  511.274031]   
[  511.274035] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.276216] 
[  511.276216] to a RECLAIM_FS-irq-unsafe lock:
[  511.278128]  (cpu_hotplug.dep_map){++++++}
[  511.278130] 
[  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
[  511.281809] ...
[  511.281811]   
[  511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90
[  511.284852]   
[  511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
[  511.287215]   
[  511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.289751]   
[  511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0
[  511.292326]   
[  511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90
[  511.295025]   
[  511.295030] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0
[  511.299245]   
[  511.299253] [<ffffffff810a2b57>] cpuhp_up_callbacks+0x37/0xb0
[  511.301889]   
[  511.301894] [<ffffffff810a37b9>] _cpu_up+0x89/0xf0
[  511.304270]   
[  511.304275] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.306428]   
[  511.306431] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.308533]   
[  511.308535] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.310710]   
[  511.310713] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.313232]   
[  511.313235] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.315616]   
[  511.315620] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.317867] 
[  511.317867] other info that might help us debug this:
[  511.317867] 
[  511.320920]  Possible interrupt unsafe locking scenario:
[  511.320920] 
[  511.323218]        CPU0                    CPU1
[  511.324622]        ----                    ----
[  511.325973]   lock(cpu_hotplug.dep_map);
[  511.327246]                                local_irq_disable();
[  511.328870]                                lock(jbd2_handle);
[  511.330483]                                lock(cpu_hotplug.dep_map);
[  511.332259]   <Interrupt>
[  511.333187]     lock(jbd2_handle);
[  511.334304] 
[  511.334304]  *** DEADLOCK ***
[  511.334304] 
[  511.336749] 4 locks held by a.out/49302:
[  511.338129]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff812d11d4>] mnt_want_write+0x24/0x50
[  511.340768]  #1:  (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff812ba06b>] path_openat+0x60b/0xd50
[  511.343744]  #2:  (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590
[  511.345743]  #3:  (pcpu_drain_mutex){+.+...}, at: [<ffffffff811fc96f>] drain_all_pages.part.89+0x1f/0x2c0
[  511.348605] 
[  511.348605] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock:
[  511.351336] -> (jbd2_handle){++++-.} ops: 203220 {
[  511.352768]    HARDIRQ-ON-W at:
[  511.353827]                     
[  511.353833] [<ffffffff8110906e>] __lock_acquire+0x9de/0x1640
[  511.356489]                     
[  511.356492] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.359063]                     
[  511.359067] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.361905]                     
[  511.361908] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.364560]                     
[  511.364563] [<ffffffff8134bec7>] ext4_sync_file+0x2e7/0x5e0
[  511.367362]                     
[  511.367367] [<ffffffff812e74ad>] vfs_fsync_range+0x3d/0xb0
[  511.369950]                     
[  511.369953] [<ffffffff812e757d>] do_fsync+0x3d/0x70
[  511.372400]                     
[  511.372402] [<ffffffff812e7840>] SyS_fsync+0x10/0x20
[  511.374821]                     
[  511.374824] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.377422]                     
[  511.377425] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.380273]    HARDIRQ-ON-R at:
[  511.381791]                     
[  511.381815] [<ffffffff8110896d>] __lock_acquire+0x2dd/0x1640
[  511.384693]                     
[  511.384697] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.387195]                     
[  511.387198] [<ffffffff813a8c65>] start_this_handle+0x225/0x590
[  511.389888]                     
[  511.389891] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340
[  511.392522]                     
[  511.392525] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240
[  511.395341]                     
[  511.395344] [<ffffffff8134af58>] ext4_file_open+0x188/0x230
[  511.397886]                     
[  511.397889] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340
[  511.400727]                     
[  511.400730] [<ffffffff812a6922>] vfs_open+0x52/0x80
[  511.403297]                     
[  511.403301] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50
[  511.405752]                     
[  511.405755] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.408229]                     
[  511.408231] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.410820]                     
[  511.410822] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.413158]                     
[  511.413161] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  511.416074]    SOFTIRQ-ON-W at:
[  511.417069]                     
[  511.417073] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.419681]                     
[  511.419684] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.422516]                     
[  511.422520] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.425157]                     
[  511.425160] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.427862]                     
[  511.427865] [<ffffffff8134bec7>] ext4_sync_file+0x2e7/0x5e0
[  511.430379]                     
[  511.430382] [<ffffffff812e74ad>] vfs_fsync_range+0x3d/0xb0
[  511.433412]                     
[  511.433418] [<ffffffff812e757d>] do_fsync+0x3d/0x70
[  511.436064]                     
[  511.436067] [<ffffffff812e7840>] SyS_fsync+0x10/0x20
[  511.438498]                     
[  511.438502] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.441519]                     
[  511.441524] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.444325]    SOFTIRQ-ON-R at:
[  511.445358]                     
[  511.445362] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.448298]                     
[  511.448312] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.451096]                     
[  511.451100] [<ffffffff813a8c65>] start_this_handle+0x225/0x590
[  511.453784]                     
[  511.453786] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340
[  511.456659]                     
[  511.456664] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240
[  511.459638]                     
[  511.459643] [<ffffffff8134af58>] ext4_file_open+0x188/0x230
[  511.462384]                     
[  511.462389] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340
[  511.465550]                     
[  511.465558] [<ffffffff812a6922>] vfs_open+0x52/0x80
[  511.468141]                     
[  511.468145] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50
[  511.470816]                     
[  511.470819] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.473441]                     
[  511.473443] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.476079]                     
[  511.476081] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.478584]                     
[  511.478587] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  511.481394]    IN-RECLAIM_FS-W at:
[  511.482680]                        
[  511.482691] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640
[  511.485262]                        
[  511.485264] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.487862]                        
[  511.487865] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.490707]                        
[  511.490710] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.493524]                        
[  511.493527] [<ffffffff813592d6>] ext4_evict_inode+0x356/0x760
[  511.496251]                        
[  511.496255] [<ffffffff812c9f61>] evict+0xd1/0x1a0
[  511.498817]                        
[  511.498821] [<ffffffff812ca07d>] dispose_list+0x4d/0x80
[  511.501361]                        
[  511.501364] [<ffffffff812cb3da>] prune_icache_sb+0x5a/0x80
[  511.504069]                        
[  511.504072] [<ffffffff812acf41>] super_cache_scan+0x141/0x190
[  511.506890]                        
[  511.506895] [<ffffffff812102ef>] shrink_slab+0x29f/0x6d0
[  511.509465]                        
[  511.509467] [<ffffffff812154aa>] shrink_node+0x2fa/0x310
[  511.512228]                        
[  511.512233] [<ffffffff812169d2>] kswapd+0x362/0x9b0
[  511.514724]                        
[  511.514728] [<ffffffff810ca72f>] kthread+0x10f/0x150
[  511.517264]                        
[  511.517269] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.519827]    INITIAL USE at:
[  511.520829]                    
[  511.520833] [<ffffffff811089ff>] __lock_acquire+0x36f/0x1640
[  511.523377]                    
[  511.523380] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.525781]                    
[  511.525784] [<ffffffff813a8c65>] start_this_handle+0x225/0x590
[  511.528372]                    
[  511.528375] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340
[  511.531138]                    
[  511.531141] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240
[  511.533905]                    
[  511.533908] [<ffffffff8134af58>] ext4_file_open+0x188/0x230
[  511.536467]                    
[  511.536471] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340
[  511.538990]                    
[  511.538992] [<ffffffff812a6922>] vfs_open+0x52/0x80
[  511.541457]                    
[  511.541461] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50
[  511.544036]                    
[  511.544039] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.546642]                    
[  511.546644] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.549354]                    
[  511.549370] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.551781]                    
[  511.551784] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  511.554410]  }
[  511.555145]  ... key      at: [<ffffffff8335b518>] jbd2_trans_commit_key.48870+0x0/0x8
[  511.557051]  ... acquired at:
[  511.558047]    
[  511.558050] [<ffffffff81107d0a>] check_irq_usage+0x4a/0xb0
[  511.560268]    
[  511.560270] [<ffffffff8110950b>] __lock_acquire+0xe7b/0x1640
[  511.562536]    
[  511.562538] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.564779]    
[  511.564783] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.567230]    
[  511.567234] [<ffffffff811fc97c>] drain_all_pages.part.89+0x2c/0x2c0
[  511.569585]    
[  511.569588] [<ffffffff812a1cfb>] __alloc_pages_slowpath+0x509/0xe36
[  511.572289]    
[  511.572292] [<ffffffff812018a2>] __alloc_pages_nodemask+0x382/0x3d0
[  511.574744]    
[  511.574747] [<ffffffff81265077>] alloc_pages_current+0x97/0x1b0
[  511.577103]    
[  511.577106] [<ffffffff811f22fd>] __page_cache_alloc+0x15d/0x1a0
[  511.579483]    
[  511.579486] [<ffffffff811f494a>] pagecache_get_page+0x5a/0x2b0
[  511.581935]    
[  511.581940] [<ffffffff812eca32>] __getblk_gfp+0x112/0x390
[  511.584220]    
[  511.584223] [<ffffffff813514ca>] __ext4_get_inode_loc+0x10a/0x560
[  511.586627]    
[  511.586630] [<ffffffff81353e50>] ext4_get_inode_loc+0x20/0x30
[  511.589802]    
[  511.589808] [<ffffffff81355ec6>] ext4_reserve_inode_write+0x26/0x90
[  511.592471]    
[  511.592476] [<ffffffff81355fbe>] ext4_mark_inode_dirty+0x8e/0x390
[  511.594926]    
[  511.594930] [<ffffffff8138325a>] ext4_ext_tree_init+0x3a/0x40
[  511.597306]    
[  511.597308] [<ffffffff8134eaaa>] __ext4_new_inode+0x12da/0x1540
[  511.599962]    
[  511.599969] [<ffffffff81363602>] ext4_create+0xd2/0x1a0
[  511.602484]    
[  511.602489] [<ffffffff812b9903>] lookup_open+0x653/0x7b0
[  511.604699]    
[  511.604701] [<ffffffff812ba086>] path_openat+0x626/0xd50
[  511.606890]    
[  511.606893] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.609097]    
[  511.609099] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.611346]    
[  511.611348] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.613431]    
[  511.613434] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.615967]    
[  511.615979] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.618303] 
[  511.619062] 
[  511.619062] the dependencies between the lock to be acquired
[  511.619063]  and RECLAIM_FS-irq-unsafe lock:
[  511.622794] -> (cpu_hotplug.dep_map){++++++} ops: 1130 {
[  511.624286]    HARDIRQ-ON-W at:
[  511.625479]                     
[  511.625485] [<ffffffff8110906e>] __lock_acquire+0x9de/0x1640
[  511.627957]                     
[  511.627959] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.630609]                     
[  511.630612] [<ffffffff810a3603>] cpu_hotplug_begin+0x73/0xe0
[  511.633682]                     
[  511.633697] [<ffffffff810a3762>] _cpu_up+0x32/0xf0
[  511.636022]                     
[  511.636024] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.638397]                     
[  511.638399] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.640852]                     
[  511.640866] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.643507]                     
[  511.643511] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.646002]                     
[  511.646005] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.648600]                     
[  511.648611] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.651115]    HARDIRQ-ON-R at:
[  511.652080]                     
[  511.652084] [<ffffffff8110896d>] __lock_acquire+0x2dd/0x1640
[  511.654554]                     
[  511.654557] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.656983]                     
[  511.656986] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.659442]                     
[  511.659445] [<ffffffff8122a55a>] kmem_cache_create+0x3a/0x2d0
[  511.662336]                     
[  511.662342] [<ffffffff821fd151>] numa_policy_init+0x43/0x24a
[  511.665117]                     
[  511.665121] [<ffffffff821c313c>] start_kernel+0x3f6/0x4d6
[  511.667566]                     
[  511.667568] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c
[  511.670245]                     
[  511.670247] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f
[  511.673050]                     
[  511.673054] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  511.675400]    SOFTIRQ-ON-W at:
[  511.676405]                     
[  511.676408] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.679556]                     
[  511.679563] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.683155]                     
[  511.683164] [<ffffffff810a3603>] cpu_hotplug_begin+0x73/0xe0
[  511.686224]                     
[  511.686231] [<ffffffff810a3762>] _cpu_up+0x32/0xf0
[  511.689073]                     
[  511.689078] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.691573]                     
[  511.691575] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.694007]                     
[  511.694010] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.696524]                     
[  511.696528] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.699401]                     
[  511.699405] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.701956]                     
[  511.701959] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.704520]    SOFTIRQ-ON-R at:
[  511.705530]                     
[  511.705534] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.708036]                     
[  511.708038] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.710516]                     
[  511.710518] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.713771]                     
[  511.713780] [<ffffffff8122a55a>] kmem_cache_create+0x3a/0x2d0
[  511.716681]                     
[  511.716688] [<ffffffff821fd151>] numa_policy_init+0x43/0x24a
[  511.719450]                     
[  511.719455] [<ffffffff821c313c>] start_kernel+0x3f6/0x4d6
[  511.722114]                     
[  511.722117] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c
[  511.724864]                     
[  511.724866] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f
[  511.727552]                     
[  511.727555] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  511.729936]    RECLAIM_FS-ON-W at:
[  511.731059]                        
[  511.731063] [<ffffffff81108141>] mark_held_locks+0x71/0x90
[  511.733851]                        
[  511.733857] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
[  511.736601]                        
[  511.736604] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.739325]                        
[  511.739329] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0
[  511.742499]                        
[  511.742503] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90
[  511.745233]                        
[  511.745236] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0
[  511.747909]                        
[  511.747911] [<ffffffff810a2b57>] cpuhp_up_callbacks+0x37/0xb0
[  511.750604]                        
[  511.750606] [<ffffffff810a37b9>] _cpu_up+0x89/0xf0
[  511.753180]                        
[  511.753182] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.755982]                        
[  511.755986] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.758565]                        
[  511.758568] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.761138]                        
[  511.761141] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.763877]                        
[  511.763881] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.766703]                        
[  511.766709] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.769522]    RECLAIM_FS-ON-R at:
[  511.770730]                        
[  511.770735] [<ffffffff81108141>] mark_held_locks+0x71/0x90
[  511.773324]                        
[  511.773327] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
[  511.775897]                        
[  511.775900] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.778659]                        
[  511.778663] [<ffffffff8100d199>] allocate_shared_regs+0x29/0x70
[  511.781485]                        
[  511.781488] [<ffffffff8100d217>] intel_pmu_cpu_prepare+0x37/0x140
[  511.784574]                        
[  511.784578] [<ffffffff81005410>] x86_pmu_prepare_cpu+0x40/0x50
[  511.787169]                        
[  511.787172] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0
[  511.789906]                        
[  511.789909] [<ffffffff810a2e42>] cpuhp_issue_call+0xe2/0x140
[  511.792625]                        
[  511.792628] [<ffffffff810a321d>] __cpuhp_setup_state+0x12d/0x190
[  511.795441]                        
[  511.795446] [<ffffffff821c59b1>] init_hw_perf_events+0x402/0x5b6
[  511.798187]                        
[  511.798190] [<ffffffff81002191>] do_one_initcall+0x51/0x1c0
[  511.801133]                        
[  511.801139] [<ffffffff821c3371>] kernel_init_freeable+0x155/0x2ac
[  511.803812]                        
[  511.803816] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.806381]                        
[  511.806385] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.808849]    INITIAL USE at:
[  511.809876]                    
[  511.809881] [<ffffffff811089ff>] __lock_acquire+0x36f/0x1640
[  511.812607]                    
[  511.812610] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.815088]                    
[  511.815092] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.817776]                    
[  511.817779] [<ffffffff810a3133>] __cpuhp_setup_state+0x43/0x190
[  511.820394]                    
[  511.820397] [<ffffffff821f756b>] page_alloc_init+0x23/0x3a
[  511.823000]                    
[  511.823003] [<ffffffff821c2ee8>] start_kernel+0x1a2/0x4d6
[  511.825495]                    
[  511.825497] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c
[  511.828158]                    
[  511.828160] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f
[  511.830986]                    
[  511.830991] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  511.833452]  }
[  511.834219]  ... key      at: [<ffffffff81e59b08>] cpu_hotplug+0x108/0x140
[  511.835931]  ... acquired at:
[  511.836924]    
[  511.836927] [<ffffffff81107d0a>] check_irq_usage+0x4a/0xb0
[  511.839589]    
[  511.839593] [<ffffffff8110950b>] __lock_acquire+0xe7b/0x1640
[  511.842158]    
[  511.842162] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.844452]    
[  511.844454] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.846668]    
[  511.846671] [<ffffffff811fc97c>] drain_all_pages.part.89+0x2c/0x2c0
[  511.849257]    
[  511.849264] [<ffffffff812a1cfb>] __alloc_pages_slowpath+0x509/0xe36
[  511.852127]    
[  511.852132] [<ffffffff812018a2>] __alloc_pages_nodemask+0x382/0x3d0
[  511.854545]    
[  511.854549] [<ffffffff81265077>] alloc_pages_current+0x97/0x1b0
[  511.856942]    
[  511.856946] [<ffffffff811f22fd>] __page_cache_alloc+0x15d/0x1a0
[  511.859259]    
[  511.859262] [<ffffffff811f494a>] pagecache_get_page+0x5a/0x2b0
[  511.861595]    
[  511.861598] [<ffffffff812eca32>] __getblk_gfp+0x112/0x390
[  511.863893]    
[  511.863897] [<ffffffff813514ca>] __ext4_get_inode_loc+0x10a/0x560
[  511.866538]    
[  511.866542] [<ffffffff81353e50>] ext4_get_inode_loc+0x20/0x30
[  511.868929]    
[  511.868932] [<ffffffff81355ec6>] ext4_reserve_inode_write+0x26/0x90
[  511.871579]    
[  511.871584] [<ffffffff81355fbe>] ext4_mark_inode_dirty+0x8e/0x390
[  511.874088]    
[  511.874092] [<ffffffff8138325a>] ext4_ext_tree_init+0x3a/0x40
[  511.876398]    
[  511.876400] [<ffffffff8134eaaa>] __ext4_new_inode+0x12da/0x1540
[  511.878735]    
[  511.878737] [<ffffffff81363602>] ext4_create+0xd2/0x1a0
[  511.881170]    
[  511.881174] [<ffffffff812b9903>] lookup_open+0x653/0x7b0
[  511.883841]    
[  511.883848] [<ffffffff812ba086>] path_openat+0x626/0xd50
[  511.886058]    
[  511.886061] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.888285]    
[  511.888288] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.890642]    
[  511.890644] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.892781]    
[  511.892784] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.895050]    
[  511.895053] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.897382] 
[  511.898165] 
[  511.898165] stack backtrace:
[  511.900033] CPU: 0 PID: 49302 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500
[  511.901974] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  511.904851] Call Trace:
[  511.905789]  dump_stack+0x85/0xc9
[  511.906854]  check_usage+0x4ba/0x4d0
[  511.907984]  ? delayacct_end+0x56/0x60
[  511.909136]  check_irq_usage+0x4a/0xb0
[  511.910318]  __lock_acquire+0xe7b/0x1640
[  511.911470]  ? delayacct_end+0x56/0x60
[  511.912607]  lock_acquire+0xc9/0x250
[  511.913703]  ? get_online_cpus+0x37/0x80
[  511.914888]  get_online_cpus+0x5d/0x80
[  511.916137]  ? get_online_cpus+0x37/0x80
[  511.917287]  drain_all_pages.part.89+0x2c/0x2c0
[  511.918539]  __alloc_pages_slowpath+0x509/0xe36
[  511.919889]  __alloc_pages_nodemask+0x382/0x3d0
[  511.921673]  ? sched_clock_cpu+0x11/0xc0
[  511.922919]  alloc_pages_current+0x97/0x1b0
[  511.924123]  __page_cache_alloc+0x15d/0x1a0
[  511.925252]  pagecache_get_page+0x5a/0x2b0
[  511.926392]  __getblk_gfp+0x112/0x390
[  511.927524]  __ext4_get_inode_loc+0x10a/0x560
[  511.928723]  ? ext4_ext_tree_init+0x3a/0x40
[  511.929900]  ext4_get_inode_loc+0x20/0x30
[  511.931008]  ext4_reserve_inode_write+0x26/0x90
[  511.932370]  ? ext4_ext_tree_init+0x3a/0x40
[  511.933582]  ext4_mark_inode_dirty+0x8e/0x390
[  511.934807]  ext4_ext_tree_init+0x3a/0x40
[  511.935919]  __ext4_new_inode+0x12da/0x1540
[  511.937093]  ext4_create+0xd2/0x1a0
[  511.938106]  lookup_open+0x653/0x7b0
[  511.939108]  ? __wake_up+0x23/0x50
[  511.940131]  ? sched_clock+0x9/0x10
[  511.941184]  path_openat+0x626/0xd50
[  511.942194]  do_filp_open+0x91/0x100
[  511.943164]  ? _raw_spin_unlock+0x27/0x40
[  511.944335]  ? __alloc_fd+0xf7/0x210
[  511.945350]  do_sys_open+0x124/0x210
[  511.946333]  SyS_open+0x1e/0x20
[  511.947189]  do_syscall_64+0x6c/0x200
[  511.948208]  entry_SYSCALL64_slow_path+0x25/0x25
[  511.949587] RIP: 0033:0x7feb6a026a10
[  511.950555] RSP: 002b:00007ffce3579c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  511.952261] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007feb6a026a10
[  511.953864] RDX: 0000000000000180 RSI: 0000000000004441 RDI: 00000000006010c0
[  511.955566] RBP: 0000000000000000 R08: 00007feb69f86938 R09: 000000000000000f
[  511.957231] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040083b
[  511.958864] R13: 00007ffce3579d90 R14: 0000000000000000 R15: 0000000000000000

The one below is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT? (A concrete sketch follows the log.)

[  257.781715] Out of memory: Kill process 5171 (a.out) score 842 or sacrifice child
[  257.784726] Killed process 5171 (a.out) total-vm:2177096kB, anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB
[  257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN)
[  257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500
[  257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  257.794700] Call Trace:
[  257.795690]  dump_stack+0x85/0xc9
[  257.797224]  __alloc_pages_slowpath+0xacb/0xe36
[  257.798612]  __alloc_pages_nodemask+0x382/0x3d0
[  257.799942]  alloc_pages_current+0x97/0x1b0
[  257.801236]  __get_free_pages+0x14/0x50
[  257.802546]  __tlb_remove_page_size+0x70/0xd0
[  257.803810]  unmap_page_range+0x74b/0xa80
[  257.804992]  unmap_single_vma+0x81/0xf0
[  257.806131]  unmap_vmas+0x41/0x60
[  257.807179]  exit_mmap+0x97/0x150
[  257.808282]  ? __khugepaged_exit+0xe5/0x130
[  257.809594]  mmput+0x80/0x150
[  257.810566]  do_exit+0x2c0/0xd70
[  257.811609]  do_group_exit+0x4c/0xc0
[  257.813035]  get_signal+0x35f/0x9b0
[  257.814199]  do_signal+0x37/0x730
[  257.815215]  ? mutex_unlock+0x12/0x20
[  257.816285]  ? pagefault_out_of_memory+0x75/0x80
[  257.817872]  ? mm_fault_error+0x65/0x152
[  257.819027]  ? exit_to_usermode_loop+0x26/0x92
[  257.820277]  exit_to_usermode_loop+0x51/0x92
[  257.821480]  prepare_exit_to_usermode+0x7f/0x90
[  257.822756]  retint_user+0x8/0x23
[  257.823755] RIP: 0033:0x400780
[  257.824717] RSP: 002b:00007ffce4497640 EFLAGS: 00010206
[  257.826061] RAX: 000000005a1de000 RBX: 0000000080000000 RCX: 00007f11b8887650
[  257.827774] RDX: 0000000000000000 RSI: 00007ffce4497460 RDI: 00007ffce4497460
[  257.829770] RBP: 00007f10b89be010 R08: 00007ffce4497570 R09: 00007ffce44973b0
[  257.831714] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
[  257.833447] R13: 00007f10b89be010 R14: 0000000000000000 R15: 0000000000000000
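
Spelled out, the suggestion would look something like this (an untested
sketch against the mmu_gather batch allocation that emitted the log
above; the exact location is my assumption):

-	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
+	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN |
+					 __GFP_NOMEMALLOC, 0);

With __GFP_NOMEMALLOC such an allocation is not allowed to consume the
memory reserves, so the dying TIF_MEMDIE task's request should fail
quickly instead of looping.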

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
@ 2017-02-05 10:43                                                 ` Tetsuo Handa
  0 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-05 10:43 UTC (permalink / raw)
  To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, peterz

Michal Hocko wrote:
> [CC Petr]
> 
> On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> [...]
> > (2) I got a lockdep warning. (A new false positive?)
> 
> Yes, I suspect this is a false positive. I do not see how we can
> deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which
> means that a potential recursion to the page allocator during draining
> would just bail out on the trylock. Maybe I am misinterpreting the
> report though.
> 

I got the same warning with ext4. Maybe we need to check this carefully.

[  511.215743] =====================================================
[  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
[  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
[  511.221689] -----------------------------------------------------
[  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
[  511.225533]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80
[  511.227795] 
[  511.227795] and this task is already holding:
[  511.230082]  (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590
[  511.232592] which would create a new lock dependency:
[  511.234192]  (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++}
[  511.235966] 
[  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
[  511.238563]  (jbd2_handle){++++-.}
[  511.238564] 
[  511.238564] ... which became RECLAIM_FS-irq-safe at:
[  511.242078]   
[  511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640
[  511.244492]   
[  511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.246694]   
[  511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.249323]   
[  511.249328] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.252069]   
[  511.252074] [<ffffffff813592d6>] ext4_evict_inode+0x356/0x760
[  511.254753]   
[  511.254757] [<ffffffff812c9f61>] evict+0xd1/0x1a0
[  511.257062]   
[  511.257065] [<ffffffff812ca07d>] dispose_list+0x4d/0x80
[  511.259531]   
[  511.259535] [<ffffffff812cb3da>] prune_icache_sb+0x5a/0x80
[  511.261953]   
[  511.261957] [<ffffffff812acf41>] super_cache_scan+0x141/0x190
[  511.264540]   
[  511.264545] [<ffffffff812102ef>] shrink_slab+0x29f/0x6d0
[  511.267165]   
[  511.267171] [<ffffffff812154aa>] shrink_node+0x2fa/0x310
[  511.269455]   
[  511.269459] [<ffffffff812169d2>] kswapd+0x362/0x9b0
[  511.271831]   
[  511.271834] [<ffffffff810ca72f>] kthread+0x10f/0x150
[  511.274031]   
[  511.274035] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.276216] 
[  511.276216] to a RECLAIM_FS-irq-unsafe lock:
[  511.278128]  (cpu_hotplug.dep_map){++++++}
[  511.278130] 
[  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
[  511.281809] ...
[  511.281811]   
[  511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90
[  511.284852]   
[  511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
[  511.287215]   
[  511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.289751]   
[  511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0
[  511.292326]   
[  511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90
[  511.295025]   
[  511.295030] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0
[  511.299245]   
[  511.299253] [<ffffffff810a2b57>] cpuhp_up_callbacks+0x37/0xb0
[  511.301889]   
[  511.301894] [<ffffffff810a37b9>] _cpu_up+0x89/0xf0
[  511.304270]   
[  511.304275] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.306428]   
[  511.306431] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.308533]   
[  511.308535] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.310710]   
[  511.310713] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.313232]   
[  511.313235] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.315616]   
[  511.315620] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.317867] 
[  511.317867] other info that might help us debug this:
[  511.317867] 
[  511.320920]  Possible interrupt unsafe locking scenario:
[  511.320920] 
[  511.323218]        CPU0                    CPU1
[  511.324622]        ----                    ----
[  511.325973]   lock(cpu_hotplug.dep_map);
[  511.327246]                                local_irq_disable();
[  511.328870]                                lock(jbd2_handle);
[  511.330483]                                lock(cpu_hotplug.dep_map);
[  511.332259]   <Interrupt>
[  511.333187]     lock(jbd2_handle);
[  511.334304] 
[  511.334304]  *** DEADLOCK ***
[  511.334304] 
[  511.336749] 4 locks held by a.out/49302:
[  511.338129]  #0:  (sb_writers#8){.+.+.+}, at: [<ffffffff812d11d4>] mnt_want_write+0x24/0x50
[  511.340768]  #1:  (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff812ba06b>] path_openat+0x60b/0xd50
[  511.343744]  #2:  (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590
[  511.345743]  #3:  (pcpu_drain_mutex){+.+...}, at: [<ffffffff811fc96f>] drain_all_pages.part.89+0x1f/0x2c0
[  511.348605] 
[  511.348605] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock:
[  511.351336] -> (jbd2_handle){++++-.} ops: 203220 {
[  511.352768]    HARDIRQ-ON-W at:
[  511.353827]                     
[  511.353833] [<ffffffff8110906e>] __lock_acquire+0x9de/0x1640
[  511.356489]                     
[  511.356492] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.359063]                     
[  511.359067] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.361905]                     
[  511.361908] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.364560]                     
[  511.364563] [<ffffffff8134bec7>] ext4_sync_file+0x2e7/0x5e0
[  511.367362]                     
[  511.367367] [<ffffffff812e74ad>] vfs_fsync_range+0x3d/0xb0
[  511.369950]                     
[  511.369953] [<ffffffff812e757d>] do_fsync+0x3d/0x70
[  511.372400]                     
[  511.372402] [<ffffffff812e7840>] SyS_fsync+0x10/0x20
[  511.374821]                     
[  511.374824] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.377422]                     
[  511.377425] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.380273]    HARDIRQ-ON-R at:
[  511.381791]                     
[  511.381815] [<ffffffff8110896d>] __lock_acquire+0x2dd/0x1640
[  511.384693]                     
[  511.384697] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.387195]                     
[  511.387198] [<ffffffff813a8c65>] start_this_handle+0x225/0x590
[  511.389888]                     
[  511.389891] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340
[  511.392522]                     
[  511.392525] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240
[  511.395341]                     
[  511.395344] [<ffffffff8134af58>] ext4_file_open+0x188/0x230
[  511.397886]                     
[  511.397889] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340
[  511.400727]                     
[  511.400730] [<ffffffff812a6922>] vfs_open+0x52/0x80
[  511.403297]                     
[  511.403301] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50
[  511.405752]                     
[  511.405755] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.408229]                     
[  511.408231] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.410820]                     
[  511.410822] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.413158]                     
[  511.413161] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  511.416074]    SOFTIRQ-ON-W at:
[  511.417069]                     
[  511.417073] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.419681]                     
[  511.419684] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.422516]                     
[  511.422520] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.425157]                     
[  511.425160] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.427862]                     
[  511.427865] [<ffffffff8134bec7>] ext4_sync_file+0x2e7/0x5e0
[  511.430379]                     
[  511.430382] [<ffffffff812e74ad>] vfs_fsync_range+0x3d/0xb0
[  511.433412]                     
[  511.433418] [<ffffffff812e757d>] do_fsync+0x3d/0x70
[  511.436064]                     
[  511.436067] [<ffffffff812e7840>] SyS_fsync+0x10/0x20
[  511.438498]                     
[  511.438502] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.441519]                     
[  511.441524] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.444325]    SOFTIRQ-ON-R at:
[  511.445358]                     
[  511.445362] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.448298]                     
[  511.448312] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.451096]                     
[  511.451100] [<ffffffff813a8c65>] start_this_handle+0x225/0x590
[  511.453784]                     
[  511.453786] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340
[  511.456659]                     
[  511.456664] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240
[  511.459638]                     
[  511.459643] [<ffffffff8134af58>] ext4_file_open+0x188/0x230
[  511.462384]                     
[  511.462389] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340
[  511.465550]                     
[  511.465558] [<ffffffff812a6922>] vfs_open+0x52/0x80
[  511.468141]                     
[  511.468145] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50
[  511.470816]                     
[  511.470819] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.473441]                     
[  511.473443] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.476079]                     
[  511.476081] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.478584]                     
[  511.478587] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  511.481394]    IN-RECLAIM_FS-W at:
[  511.482680]                        
[  511.482691] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640
[  511.485262]                        
[  511.485264] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.487862]                        
[  511.487865] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[  511.490707]                        
[  511.490710] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90
[  511.493524]                        
[  511.493527] [<ffffffff813592d6>] ext4_evict_inode+0x356/0x760
[  511.496251]                        
[  511.496255] [<ffffffff812c9f61>] evict+0xd1/0x1a0
[  511.498817]                        
[  511.498821] [<ffffffff812ca07d>] dispose_list+0x4d/0x80
[  511.501361]                        
[  511.501364] [<ffffffff812cb3da>] prune_icache_sb+0x5a/0x80
[  511.504069]                        
[  511.504072] [<ffffffff812acf41>] super_cache_scan+0x141/0x190
[  511.506890]                        
[  511.506895] [<ffffffff812102ef>] shrink_slab+0x29f/0x6d0
[  511.509465]                        
[  511.509467] [<ffffffff812154aa>] shrink_node+0x2fa/0x310
[  511.512228]                        
[  511.512233] [<ffffffff812169d2>] kswapd+0x362/0x9b0
[  511.514724]                        
[  511.514728] [<ffffffff810ca72f>] kthread+0x10f/0x150
[  511.517264]                        
[  511.517269] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.519827]    INITIAL USE at:
[  511.520829]                    
[  511.520833] [<ffffffff811089ff>] __lock_acquire+0x36f/0x1640
[  511.523377]                    
[  511.523380] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.525781]                    
[  511.525784] [<ffffffff813a8c65>] start_this_handle+0x225/0x590
[  511.528372]                    
[  511.528375] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340
[  511.531138]                    
[  511.531141] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240
[  511.533905]                    
[  511.533908] [<ffffffff8134af58>] ext4_file_open+0x188/0x230
[  511.536467]                    
[  511.536471] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340
[  511.538990]                    
[  511.538992] [<ffffffff812a6922>] vfs_open+0x52/0x80
[  511.541457]                    
[  511.541461] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50
[  511.544036]                    
[  511.544039] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.546642]                    
[  511.546644] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.549354]                    
[  511.549370] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.551781]                    
[  511.551784] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
[  511.554410]  }
[  511.555145]  ... key      at: [<ffffffff8335b518>] jbd2_trans_commit_key.48870+0x0/0x8
[  511.557051]  ... acquired at:
[  511.558047]    
[  511.558050] [<ffffffff81107d0a>] check_irq_usage+0x4a/0xb0
[  511.560268]    
[  511.560270] [<ffffffff8110950b>] __lock_acquire+0xe7b/0x1640
[  511.562536]    
[  511.562538] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.564779]    
[  511.564783] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.567230]    
[  511.567234] [<ffffffff811fc97c>] drain_all_pages.part.89+0x2c/0x2c0
[  511.569585]    
[  511.569588] [<ffffffff812a1cfb>] __alloc_pages_slowpath+0x509/0xe36
[  511.572289]    
[  511.572292] [<ffffffff812018a2>] __alloc_pages_nodemask+0x382/0x3d0
[  511.574744]    
[  511.574747] [<ffffffff81265077>] alloc_pages_current+0x97/0x1b0
[  511.577103]    
[  511.577106] [<ffffffff811f22fd>] __page_cache_alloc+0x15d/0x1a0
[  511.579483]    
[  511.579486] [<ffffffff811f494a>] pagecache_get_page+0x5a/0x2b0
[  511.581935]    
[  511.581940] [<ffffffff812eca32>] __getblk_gfp+0x112/0x390
[  511.584220]    
[  511.584223] [<ffffffff813514ca>] __ext4_get_inode_loc+0x10a/0x560
[  511.586627]    
[  511.586630] [<ffffffff81353e50>] ext4_get_inode_loc+0x20/0x30
[  511.589802]    
[  511.589808] [<ffffffff81355ec6>] ext4_reserve_inode_write+0x26/0x90
[  511.592471]    
[  511.592476] [<ffffffff81355fbe>] ext4_mark_inode_dirty+0x8e/0x390
[  511.594926]    
[  511.594930] [<ffffffff8138325a>] ext4_ext_tree_init+0x3a/0x40
[  511.597306]    
[  511.597308] [<ffffffff8134eaaa>] __ext4_new_inode+0x12da/0x1540
[  511.599962]    
[  511.599969] [<ffffffff81363602>] ext4_create+0xd2/0x1a0
[  511.602484]    
[  511.602489] [<ffffffff812b9903>] lookup_open+0x653/0x7b0
[  511.604699]    
[  511.604701] [<ffffffff812ba086>] path_openat+0x626/0xd50
[  511.606890]    
[  511.606893] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.609097]    
[  511.609099] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.611346]    
[  511.611348] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.613431]    
[  511.613434] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.615967]    
[  511.615979] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.618303] 
[  511.619062] 
[  511.619062] the dependencies between the lock to be acquired
[  511.619063]  and RECLAIM_FS-irq-unsafe lock:
[  511.622794] -> (cpu_hotplug.dep_map){++++++} ops: 1130 {
[  511.624286]    HARDIRQ-ON-W at:
[  511.625479]                     
[  511.625485] [<ffffffff8110906e>] __lock_acquire+0x9de/0x1640
[  511.627957]                     
[  511.627959] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.630609]                     
[  511.630612] [<ffffffff810a3603>] cpu_hotplug_begin+0x73/0xe0
[  511.633682]                     
[  511.633697] [<ffffffff810a3762>] _cpu_up+0x32/0xf0
[  511.636022]                     
[  511.636024] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.638397]                     
[  511.638399] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.640852]                     
[  511.640866] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.643507]                     
[  511.643511] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.646002]                     
[  511.646005] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.648600]                     
[  511.648611] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.651115]    HARDIRQ-ON-R at:
[  511.652080]                     
[  511.652084] [<ffffffff8110896d>] __lock_acquire+0x2dd/0x1640
[  511.654554]                     
[  511.654557] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.656983]                     
[  511.656986] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.659442]                     
[  511.659445] [<ffffffff8122a55a>] kmem_cache_create+0x3a/0x2d0
[  511.662336]                     
[  511.662342] [<ffffffff821fd151>] numa_policy_init+0x43/0x24a
[  511.665117]                     
[  511.665121] [<ffffffff821c313c>] start_kernel+0x3f6/0x4d6
[  511.667566]                     
[  511.667568] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c
[  511.670245]                     
[  511.670247] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f
[  511.673050]                     
[  511.673054] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  511.675400]    SOFTIRQ-ON-W at:
[  511.676405]                     
[  511.676408] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.679556]                     
[  511.679563] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.683155]                     
[  511.683164] [<ffffffff810a3603>] cpu_hotplug_begin+0x73/0xe0
[  511.686224]                     
[  511.686231] [<ffffffff810a3762>] _cpu_up+0x32/0xf0
[  511.689073]                     
[  511.689078] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.691573]                     
[  511.691575] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.694007]                     
[  511.694010] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.696524]                     
[  511.696528] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.699401]                     
[  511.699405] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.701956]                     
[  511.701959] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.704520]    SOFTIRQ-ON-R at:
[  511.705530]                     
[  511.705534] [<ffffffff81108996>] __lock_acquire+0x306/0x1640
[  511.708036]                     
[  511.708038] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.710516]                     
[  511.710518] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.713771]                     
[  511.713780] [<ffffffff8122a55a>] kmem_cache_create+0x3a/0x2d0
[  511.716681]                     
[  511.716688] [<ffffffff821fd151>] numa_policy_init+0x43/0x24a
[  511.719450]                     
[  511.719455] [<ffffffff821c313c>] start_kernel+0x3f6/0x4d6
[  511.722114]                     
[  511.722117] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c
[  511.724864]                     
[  511.724866] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f
[  511.727552]                     
[  511.727555] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  511.729936]    RECLAIM_FS-ON-W at:
[  511.731059]                        
[  511.731063] [<ffffffff81108141>] mark_held_locks+0x71/0x90
[  511.733851]                        
[  511.733857] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
[  511.736601]                        
[  511.736604] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.739325]                        
[  511.739329] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0
[  511.742499]                        
[  511.742503] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90
[  511.745233]                        
[  511.745236] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0
[  511.747909]                        
[  511.747911] [<ffffffff810a2b57>] cpuhp_up_callbacks+0x37/0xb0
[  511.750604]                        
[  511.750606] [<ffffffff810a37b9>] _cpu_up+0x89/0xf0
[  511.753180]                        
[  511.753182] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0
[  511.755982]                        
[  511.755986] [<ffffffff810a38e3>] cpu_up+0x13/0x20
[  511.758565]                        
[  511.758568] [<ffffffff821eeee3>] smp_init+0x6b/0xcc
[  511.761138]                        
[  511.761141] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac
[  511.763877]                        
[  511.763881] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.766703]                        
[  511.766709] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.769522]    RECLAIM_FS-ON-R at:
[  511.770730]                        
[  511.770735] [<ffffffff81108141>] mark_held_locks+0x71/0x90
[  511.773324]                        
[  511.773327] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
[  511.775897]                        
[  511.775900] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
[  511.778659]                        
[  511.778663] [<ffffffff8100d199>] allocate_shared_regs+0x29/0x70
[  511.781485]                        
[  511.781488] [<ffffffff8100d217>] intel_pmu_cpu_prepare+0x37/0x140
[  511.784574]                        
[  511.784578] [<ffffffff81005410>] x86_pmu_prepare_cpu+0x40/0x50
[  511.787169]                        
[  511.787172] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0
[  511.789906]                        
[  511.789909] [<ffffffff810a2e42>] cpuhp_issue_call+0xe2/0x140
[  511.792625]                        
[  511.792628] [<ffffffff810a321d>] __cpuhp_setup_state+0x12d/0x190
[  511.795441]                        
[  511.795446] [<ffffffff821c59b1>] init_hw_perf_events+0x402/0x5b6
[  511.798187]                        
[  511.798190] [<ffffffff81002191>] do_one_initcall+0x51/0x1c0
[  511.801133]                        
[  511.801139] [<ffffffff821c3371>] kernel_init_freeable+0x155/0x2ac
[  511.803812]                        
[  511.803816] [<ffffffff81841b3e>] kernel_init+0xe/0x110
[  511.806381]                        
[  511.806385] [<ffffffff818531c1>] ret_from_fork+0x31/0x40
[  511.808849]    INITIAL USE at:
[  511.809876]                    
[  511.809881] [<ffffffff811089ff>] __lock_acquire+0x36f/0x1640
[  511.812607]                    
[  511.812610] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.815088]                    
[  511.815092] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.817776]                    
[  511.817779] [<ffffffff810a3133>] __cpuhp_setup_state+0x43/0x190
[  511.820394]                    
[  511.820397] [<ffffffff821f756b>] page_alloc_init+0x23/0x3a
[  511.823000]                    
[  511.823003] [<ffffffff821c2ee8>] start_kernel+0x1a2/0x4d6
[  511.825495]                    
[  511.825497] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c
[  511.828158]                    
[  511.828160] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f
[  511.830986]                    
[  511.830991] [<ffffffff810001c4>] verify_cpu+0x0/0xfc
[  511.833452]  }
[  511.834219]  ... key      at: [<ffffffff81e59b08>] cpu_hotplug+0x108/0x140
[  511.835931]  ... acquired at:
[  511.836924]    
[  511.836927] [<ffffffff81107d0a>] check_irq_usage+0x4a/0xb0
[  511.839589]    
[  511.839593] [<ffffffff8110950b>] __lock_acquire+0xe7b/0x1640
[  511.842158]    
[  511.842162] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
[  511.844452]    
[  511.844454] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80
[  511.846668]    
[  511.846671] [<ffffffff811fc97c>] drain_all_pages.part.89+0x2c/0x2c0
[  511.849257]    
[  511.849264] [<ffffffff812a1cfb>] __alloc_pages_slowpath+0x509/0xe36
[  511.852127]    
[  511.852132] [<ffffffff812018a2>] __alloc_pages_nodemask+0x382/0x3d0
[  511.854545]    
[  511.854549] [<ffffffff81265077>] alloc_pages_current+0x97/0x1b0
[  511.856942]    
[  511.856946] [<ffffffff811f22fd>] __page_cache_alloc+0x15d/0x1a0
[  511.859259]    
[  511.859262] [<ffffffff811f494a>] pagecache_get_page+0x5a/0x2b0
[  511.861595]    
[  511.861598] [<ffffffff812eca32>] __getblk_gfp+0x112/0x390
[  511.863893]    
[  511.863897] [<ffffffff813514ca>] __ext4_get_inode_loc+0x10a/0x560
[  511.866538]    
[  511.866542] [<ffffffff81353e50>] ext4_get_inode_loc+0x20/0x30
[  511.868929]    
[  511.868932] [<ffffffff81355ec6>] ext4_reserve_inode_write+0x26/0x90
[  511.871579]    
[  511.871584] [<ffffffff81355fbe>] ext4_mark_inode_dirty+0x8e/0x390
[  511.874088]    
[  511.874092] [<ffffffff8138325a>] ext4_ext_tree_init+0x3a/0x40
[  511.876398]    
[  511.876400] [<ffffffff8134eaaa>] __ext4_new_inode+0x12da/0x1540
[  511.878735]    
[  511.878737] [<ffffffff81363602>] ext4_create+0xd2/0x1a0
[  511.881170]    
[  511.881174] [<ffffffff812b9903>] lookup_open+0x653/0x7b0
[  511.883841]    
[  511.883848] [<ffffffff812ba086>] path_openat+0x626/0xd50
[  511.886058]    
[  511.886061] [<ffffffff812bba51>] do_filp_open+0x91/0x100
[  511.888285]    
[  511.888288] [<ffffffff812a6d44>] do_sys_open+0x124/0x210
[  511.890642]    
[  511.890644] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20
[  511.892781]    
[  511.892784] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200
[  511.895050]    
[  511.895053] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a
[  511.897382] 
[  511.898165] 
[  511.898165] stack backtrace:
[  511.900033] CPU: 0 PID: 49302 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500
[  511.901974] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  511.904851] Call Trace:
[  511.905789]  dump_stack+0x85/0xc9
[  511.906854]  check_usage+0x4ba/0x4d0
[  511.907984]  ? delayacct_end+0x56/0x60
[  511.909136]  check_irq_usage+0x4a/0xb0
[  511.910318]  __lock_acquire+0xe7b/0x1640
[  511.911470]  ? delayacct_end+0x56/0x60
[  511.912607]  lock_acquire+0xc9/0x250
[  511.913703]  ? get_online_cpus+0x37/0x80
[  511.914888]  get_online_cpus+0x5d/0x80
[  511.916137]  ? get_online_cpus+0x37/0x80
[  511.917287]  drain_all_pages.part.89+0x2c/0x2c0
[  511.918539]  __alloc_pages_slowpath+0x509/0xe36
[  511.919889]  __alloc_pages_nodemask+0x382/0x3d0
[  511.921673]  ? sched_clock_cpu+0x11/0xc0
[  511.922919]  alloc_pages_current+0x97/0x1b0
[  511.924123]  __page_cache_alloc+0x15d/0x1a0
[  511.925252]  pagecache_get_page+0x5a/0x2b0
[  511.926392]  __getblk_gfp+0x112/0x390
[  511.927524]  __ext4_get_inode_loc+0x10a/0x560
[  511.928723]  ? ext4_ext_tree_init+0x3a/0x40
[  511.929900]  ext4_get_inode_loc+0x20/0x30
[  511.931008]  ext4_reserve_inode_write+0x26/0x90
[  511.932370]  ? ext4_ext_tree_init+0x3a/0x40
[  511.933582]  ext4_mark_inode_dirty+0x8e/0x390
[  511.934807]  ext4_ext_tree_init+0x3a/0x40
[  511.935919]  __ext4_new_inode+0x12da/0x1540
[  511.937093]  ext4_create+0xd2/0x1a0
[  511.938106]  lookup_open+0x653/0x7b0
[  511.939108]  ? __wake_up+0x23/0x50
[  511.940131]  ? sched_clock+0x9/0x10
[  511.941184]  path_openat+0x626/0xd50
[  511.942194]  do_filp_open+0x91/0x100
[  511.943164]  ? _raw_spin_unlock+0x27/0x40
[  511.944335]  ? __alloc_fd+0xf7/0x210
[  511.945350]  do_sys_open+0x124/0x210
[  511.946333]  SyS_open+0x1e/0x20
[  511.947189]  do_syscall_64+0x6c/0x200
[  511.948208]  entry_SYSCALL64_slow_path+0x25/0x25
[  511.949587] RIP: 0033:0x7feb6a026a10
[  511.950555] RSP: 002b:00007ffce3579c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[  511.952261] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007feb6a026a10
[  511.953864] RDX: 0000000000000180 RSI: 0000000000004441 RDI: 00000000006010c0
[  511.955566] RBP: 0000000000000000 R08: 00007feb69f86938 R09: 000000000000000f
[  511.957231] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040083b
[  511.958864] R13: 00007ffce3579d90 R14: 0000000000000000 R15: 0000000000000000

The one below is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT?

[  257.781715] Out of memory: Kill process 5171 (a.out) score 842 or sacrifice child
[  257.784726] Killed process 5171 (a.out) total-vm:2177096kB, anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB
[  257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN)
[  257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500
[  257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  257.794700] Call Trace:
[  257.795690]  dump_stack+0x85/0xc9
[  257.797224]  __alloc_pages_slowpath+0xacb/0xe36
[  257.798612]  __alloc_pages_nodemask+0x382/0x3d0
[  257.799942]  alloc_pages_current+0x97/0x1b0
[  257.801236]  __get_free_pages+0x14/0x50
[  257.802546]  __tlb_remove_page_size+0x70/0xd0
[  257.803810]  unmap_page_range+0x74b/0xa80
[  257.804992]  unmap_single_vma+0x81/0xf0
[  257.806131]  unmap_vmas+0x41/0x60
[  257.807179]  exit_mmap+0x97/0x150
[  257.808282]  ? __khugepaged_exit+0xe5/0x130
[  257.809594]  mmput+0x80/0x150
[  257.810566]  do_exit+0x2c0/0xd70
[  257.811609]  do_group_exit+0x4c/0xc0
[  257.813035]  get_signal+0x35f/0x9b0
[  257.814199]  do_signal+0x37/0x730
[  257.815215]  ? mutex_unlock+0x12/0x20
[  257.816285]  ? pagefault_out_of_memory+0x75/0x80
[  257.817872]  ? mm_fault_error+0x65/0x152
[  257.819027]  ? exit_to_usermode_loop+0x26/0x92
[  257.820277]  exit_to_usermode_loop+0x51/0x92
[  257.821480]  prepare_exit_to_usermode+0x7f/0x90
[  257.822756]  retint_user+0x8/0x23
[  257.823755] RIP: 0033:0x400780
[  257.824717] RSP: 002b:00007ffce4497640 EFLAGS: 00010206
[  257.826061] RAX: 000000005a1de000 RBX: 0000000080000000 RCX: 00007f11b8887650
[  257.827774] RDX: 0000000000000000 RSI: 00007ffce4497460 RDI: 00007ffce4497460
[  257.829770] RBP: 00007f10b89be010 R08: 00007ffce4497570 R09: 00007ffce44973b0
[  257.831714] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007
[  257.833447] R13: 00007f10b89be010 R14: 0000000000000000 R15: 0000000000000000

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 17:24                                                 ` Brian Foster
@ 2017-02-06  6:29                                                   ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-06  6:29 UTC (permalink / raw)
  To: bfoster, mhocko
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes,
	linux-kernel, darrick.wong, linux-xfs

Brian Foster wrote:
> On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > [Let's CC more xfs people]
> > 
> > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > [...]
> > > (1) I got an assertion failure.
> > 
> > I suspect this is a result of
> > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > I have no idea what the assert means though.
> > 
> > > 
> > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> 
> Indirect block reservation underrun on delayed allocation extent merge.
> These are extra blocks that are used for the inode bmap btree when a delalloc
> extent is converted to physical blocks. We're in a case where we expect
> to only ever free excess blocks due to a merge of extents with
> independent reservations, but a situation occurs where we actually need
> blocks and hence the assert fails. This can occur if an extent is merged
> with one that has a reservation less than the expected worst case
> reservation for its size (due to previous extent splits due to hole
> punches, for example). Therefore, I think the core expectation that
> xfs_bmap_add_extent_hole_delay() will always have enough blocks
> pre-reserved is invalid.
> 
> Can you describe the workload that reproduces this? FWIW, I think the
> way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> and have a couple patches to fix up indlen reservation that I haven't
> posted yet. The diff that deals with this particular bit is appended.
> Care to give that a try?

The workload is to write to a single file on XFS from 10 processes demonstrated at
http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
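
The diff itself is trimmed from this quote. As a minimal sketch of the shape
such a clamp could take, assuming the 4.10-era locals in
xfs_bmap_add_extent_hole_delay(), where temp is the merged extent length and
oldlen is the sum of the reservations being merged:

	-	newlen = xfs_bmap_worst_indlen(ip, temp);
	+	/* never reserve more indlen blocks than the extents already hold */
	+	newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp), oldlen);

Capping newlen at oldlen means a merge can only ever release blocks, never
demand new ones, which matches the expectation behind the failed assert.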

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-05 10:43                                                 ` Tetsuo Handa
@ 2017-02-06 10:34                                                   ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-06 10:34 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, peterz

On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
[...]
> The one below is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT?

No, GFP_NOWAIT is just too generic to use this flag.
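
GFP_NOWAIT is a thin alias used all over the tree, so widening its definition
would silently change every caller. A sketch, assuming the 4.10-era
include/linux/gfp.h:

	/* GFP_NOWAIT may only wake kswapd, no direct reclaim */
	#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)

An individual call site can still opt out of the memory reserves locally:

	page = alloc_page(GFP_NOWAIT | __GFP_NOWARN | __GFP_NOMEMALLOC);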

> [  257.781715] Out of memory: Kill process 5171 (a.out) score 842 or sacrifice child
> [  257.784726] Killed process 5171 (a.out) total-vm:2177096kB, anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB
> [  257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN)
> [  257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500
> [  257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [  257.794700] Call Trace:
> [  257.795690]  dump_stack+0x85/0xc9
> [  257.797224]  __alloc_pages_slowpath+0xacb/0xe36
> [  257.798612]  __alloc_pages_nodemask+0x382/0x3d0
> [  257.799942]  alloc_pages_current+0x97/0x1b0
> [  257.801236]  __get_free_pages+0x14/0x50
> [  257.802546]  __tlb_remove_page_size+0x70/0xd0

This is bounded by MAX_GATHER_BATCH_COUNT, which shouldn't be a lot of
pages (20 or so). We could add __GFP_NOMEMALLOC into tlb_next_batch
but I am not entirely convinced it is really necessary.
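
A minimal sketch of that option, assuming the 4.10-era tlb_next_batch() in
mm/memory.c (the allocation is opportunistic; when it fails the gather code
simply flushes with the batches it already has):

	--- a/mm/memory.c
	+++ b/mm/memory.c
	@@ static bool tlb_next_batch(struct mmu_gather *tlb)
	-	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
	+	batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN |
	+					 __GFP_NOMEMALLOC, 0);
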
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-05 10:43                                                 ` Tetsuo Handa
@ 2017-02-06 10:39                                                   ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-06 10:39 UTC (permalink / raw)
  To: Tetsuo Handa, peterz; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> I got the same warning with ext4. Maybe we need to check carefully.
> 
> [  511.215743] =====================================================
> [  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
> [  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
> [  511.221689] -----------------------------------------------------
> [  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> [  511.225533]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80
> [  511.227795] 
> [  511.227795] and this task is already holding:
> [  511.230082]  (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590
> [  511.232592] which would create a new lock dependency:
> [  511.234192]  (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++}
> [  511.235966] 
> [  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> [  511.238563]  (jbd2_handle){++++-.}
> [  511.238564] 
> [  511.238564] ... which became RECLAIM_FS-irq-safe at:
> [  511.242078]   
> [  511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640
> [  511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
> [  511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
[...]
> [  511.276216] to a RECLAIM_FS-irq-unsafe lock:
> [  511.278128]  (cpu_hotplug.dep_map){++++++}
> [  511.278130] 
> [  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
> [  511.281809] ...
> [  511.281811]   
> [  511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90
> [  511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
> [  511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
> [  511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0
> [  511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90
[...]
> [  511.317867] other info that might help us debug this:
> [  511.317867] 
> [  511.320920]  Possible interrupt unsafe locking scenario:
> [  511.320920] 
> [  511.323218]        CPU0                    CPU1
> [  511.324622]        ----                    ----
> [  511.325973]   lock(cpu_hotplug.dep_map);
> [  511.327246]                                local_irq_disable();
> [  511.328870]                                lock(jbd2_handle);
> [  511.330483]                                lock(cpu_hotplug.dep_map);
> [  511.332259]   <Interrupt>
> [  511.333187]     lock(jbd2_handle);

Peter, is there any way to tell lockdep that this is in fact
reclaim safe? The direct reclaim only does the trylock and backs off so
we cannot deadlock here.

Or am I misinterpreting the trace?
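
For illustration, the pattern argued to be safe here is (a generic sketch,
not the exact jbd2/cpu_hotplug call chain):

	/* reclaim context: only ever trylock, and back off on failure */
	if (!mutex_trylock(&lock))
		return 0;	/* skip this reclaim attempt instead of blocking */
	/* ... do the work ... */
	mutex_unlock(&lock);

Since the reclaim side never blocks on the lock, the usual
hold-lock -> enter-reclaim -> wait-for-same-lock inversion cannot complete.
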
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06  6:29                                                   ` Tetsuo Handa
@ 2017-02-06 14:35                                                     ` Brian Foster
  -1 siblings, 0 replies; 110+ messages in thread
From: Brian Foster @ 2017-02-06 14:35 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, david, dchinner, hch, mgorman, viro, linux-mm, hannes,
	linux-kernel, darrick.wong, linux-xfs

On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> Brian Foster wrote:
> > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > [Let's CC more xfs people]
> > > 
> > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > [...]
> > > > (1) I got an assertion failure.
> > > 
> > > I suspect this is a result of
> > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > > I have no idea what the assert means though.
> > > 
> > > > 
> > > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > 
> > Indirect block reservation underrun on delayed allocation extent merge.
> > These are extra blocks that are used for the inode bmap btree when a delalloc
> > extent is converted to physical blocks. We're in a case where we expect
> > to only ever free excess blocks due to a merge of extents with
> > independent reservations, but a situation occurs where we actually need
> > blocks and hence the assert fails. This can occur if an extent is merged
> > with one that has a reservation less than the expected worst case
> > reservation for its size (due to previous extent splits due to hole
> > punches, for example). Therefore, I think the core expectation that
> > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > pre-reserved is invalid.
> > 
> > Can you describe the workload that reproduces this? FWIW, I think the
> > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > and have a couple patches to fix up indlen reservation that I haven't
> > posted yet. The diff that deals with this particular bit is appended.
> > Care to give that a try?
> 
> The workload is to write to a single file on XFS from 10 processes demonstrated at
> http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> 

Thanks for testing. Well, that's an interesting workload. I couldn't
reproduce on a few quick tries in a similarly configured vm.

Normally I'd expect to see this kind of thing on a hole punching
workload or dealing with large, sparse files that make use of
speculative preallocation (post-eof blocks allocated in anticipation of
file extending writes). I'm wondering if what is happening here is that
the appending writes and file closes due to oom kills are generating
speculative preallocs and prealloc truncates, respectively, and that
causes prealloc extents at the eof boundary to be split up and then
re-merged by surviving appending writers.

/tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
you still have the output file from a test that reproduced, could you
get the 'xfs_io -c "fiemap -v" <file>' output?

I suppose another possibility is that prealloc occurs, write failure(s)
leads to extent splits via unmapping the target range of the write, and
then surviving writers generate the warning on a delalloc extent merge..

Brian

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06 14:35                                                     ` Brian Foster
@ 2017-02-06 14:42                                                       ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-06 14:42 UTC (permalink / raw)
  To: Brian Foster
  Cc: Tetsuo Handa, david, dchinner, hch, mgorman, viro, linux-mm,
	hannes, linux-kernel, darrick.wong, linux-xfs

On Mon 06-02-17 09:35:33, Brian Foster wrote:
> On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > Brian Foster wrote:
> > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > [Let's CC more xfs people]
> > > > 
> > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > [...]
> > > > > (1) I got an assertion failure.
> > > > 
> > > > I suspect this is a result of
> > > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > > > I have no idea what the assert means though.
> > > > 
> > > > > 
> > > > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > > 
> > > Indirect block reservation underrun on delayed allocation extent merge.
> > > These are extra blocks that are used for the inode bmap btree when a delalloc
> > > extent is converted to physical blocks. We're in a case where we expect
> > > to only ever free excess blocks due to a merge of extents with
> > > independent reservations, but a situation occurs where we actually need
> > > blocks and hence the assert fails. This can occur if an extent is merged
> > > with one that has a reservation less than the expected worst case
> > > reservation for its size (due to previous extent splits due to hole
> > > punches, for example). Therefore, I think the core expectation that
> > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > pre-reserved is invalid.
> > > 
> > > Can you describe the workload that reproduces this? FWIW, I think the
> > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > and have a couple patches to fix up indlen reservation that I haven't
> > > posted yet. The diff that deals with this particular bit is appended.
> > > Care to give that a try?
> > 
> > The workload is to write to a single file on XFS from 10 processes demonstrated at
> > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> > 
> 
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.
> 
> Normally I'd expect to see this kind of thing on a hole punching
> workload or dealing with large, sparse files that make use of
> speculative preallocation (post-eof blocks allocated in anticipation of
> file extending writes). I'm wondering if what is happening here is that
> the appending writes and file closes due to oom kills are generating
> speculative preallocs and prealloc truncates, respectively, and that
> causes prealloc extents at the eof boundary to be split up and then
> re-merged by surviving appending writers.

Can those preallocs be affected by
http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org ?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06 14:42                                                       ` Michal Hocko
@ 2017-02-06 15:47                                                         ` Brian Foster
  -1 siblings, 0 replies; 110+ messages in thread
From: Brian Foster @ 2017-02-06 15:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, david, dchinner, hch, mgorman, viro, linux-mm,
	hannes, linux-kernel, darrick.wong, linux-xfs

On Mon, Feb 06, 2017 at 03:42:22PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 09:35:33, Brian Foster wrote:
> > On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > > Brian Foster wrote:
> > > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > > [Let's CC more xfs people]
> > > > > 
> > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > > [...]
> > > > > > (1) I got an assertion failure.
> > > > > 
> > > > > I suspect this is a result of
> > > > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > > > > I have no idea what the assert means though.
> > > > > 
> > > > > > 
> > > > > > [  969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > > [  969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > > [  972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > > > 
> > > > Indirect block reservation underrun on delayed allocation extent merge.
> > > > These are extra blocks that are used for the inode bmap btree when a delalloc
> > > > extent is converted to physical blocks. We're in a case where we expect
> > > > to only ever free excess blocks due to a merge of extents with
> > > > independent reservations, but a situation occurs where we actually need
> > > > blocks and hence the assert fails. This can occur if an extent is merged
> > > > with one that has a reservation less than the expected worst case
> > > > reservation for its size (due to previous extent splits due to hole
> > > > punches, for example). Therefore, I think the core expectation that
> > > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > > pre-reserved is invalid.
> > > > 
> > > > Can you describe the workload that reproduces this? FWIW, I think the
> > > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > > and have a couple patches to fix up indlen reservation that I haven't
> > > > posted yet. The diff that deals with this particular bit is appended.
> > > > Care to give that a try?
> > > 
> > > The workload is to write to a single file on XFS from 10 processes demonstrated at
> > > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> > > 
> > 
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
> > 
> > Normally I'd expect to see this kind of thing on a hole punching
> > workload or dealing with large, sparse files that make use of
> > speculative preallocation (post-eof blocks allocated in anticipation of
> > file extending writes). I'm wondering if what is happening here is that
> > the appending writes and file closes due to oom kills are generating
> > speculative preallocs and prealloc truncates, respectively, and that
> > causes prealloc extents at the eof boundary to be split up and then
> > re-merged by surviving appending writers.
> 
> Can those preallocs be affected by
> http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org ?
> 

Hmm, I wouldn't expect that to make much of a difference wrt the core
problem. The prealloc is created on a file extending write that requires
block allocation (we basically just tack on extra blocks to an extending
alloc based on some heuristics like the size of the file and the
previous extent). Whether that allocation occurs on one iomap iteration
or another due to a short write and retry, I wouldn't expect it to
matter that much.

I suppose it could change the behavior of a specialized workload though.
E.g., if it caused a write() call to return quicker and thus led to a
file close(). We do use file release as an indication that prealloc will
not be used and can reclaim it at that point (presumably causing an
extent split with pre-eof blocks).

Brian

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06 14:35                                                     ` Brian Foster
@ 2017-02-07 10:30                                                       ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-07 10:30 UTC (permalink / raw)
  To: bfoster
  Cc: mhocko, david, dchinner, hch, mgorman, viro, linux-mm, hannes,
	linux-kernel, darrick.wong, linux-xfs

Brian Foster wrote:
> > The workload is to write to a single file on XFS from 10 processes demonstrated at
> > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> > 
> 
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.

It takes 10 to 15 minutes. Maybe some size threshold is involved?

> /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
> you still have the output file from a test that reproduced, could you
> get the 'xfs_io -c "fiemap -v" <file>' output?

Here it is.

[  720.199748] 0 pages HighMem/MovableOnly
[  720.199749] 150524 pages reserved
[  720.199749] 0 pages cma reserved
[  720.199750] 0 pages hwpoisoned
[  722.187335] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
[  722.201784] ------------[ cut here ]------------
[  722.205940] WARNING: CPU: 0 PID: 4877 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
[  722.212333] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul vmw_vsock_vmci_transport crc32_pclmul ghash_clmulni_intel vsock aesni_intel crypto_simd cryptd glue_helper ppdev vmw_balloon pcspkr sg parport_pc i2c_piix4 shpchp vmw_vmci parport ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
[  722.243207]  sysimgblt fb_sys_fops mptspi scsi_transport_spi ata_piix ahci ttm mptscsih libahci drm libata mptbase e1000 i2c_core
[  722.247704] CPU: 0 PID: 4877 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
[  722.250612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[  722.254089] Call Trace:
[  722.255751]  dump_stack+0x85/0xc9
[  722.257650]  __warn+0xd1/0xf0
[  722.259420]  warn_slowpath_null+0x1d/0x20
[  722.261434]  asswarn+0x33/0x40 [xfs]
[  722.263356]  xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
[  722.265695]  xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
[  722.267792]  ? xfs_ilock+0x1c9/0x360 [xfs]
[  722.269559]  xfs_file_iomap_begin+0x880/0x1140 [xfs]
[  722.271606]  ? iomap_write_end+0x80/0x80
[  722.273377]  iomap_apply+0x6c/0x130
[  722.274969]  iomap_file_buffered_write+0x68/0xa0
[  722.276702]  ? iomap_write_end+0x80/0x80
[  722.278311]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[  722.280394]  ? _raw_spin_unlock+0x27/0x40
[  722.282247]  xfs_file_write_iter+0x90/0x130 [xfs]
[  722.284257]  __vfs_write+0xe5/0x140
[  722.285924]  vfs_write+0xc7/0x1f0
[  722.287536]  ? syscall_trace_enter+0x1d0/0x380
[  722.289490]  SyS_write+0x58/0xc0
[  722.291025]  do_int80_syscall_32+0x6c/0x1f0
[  722.292671]  entry_INT80_compat+0x38/0x50
[  722.294298] RIP: 0023:0x8048076
[  722.295684] RSP: 002b:00000000ffedf840 EFLAGS: 00000202 ORIG_RAX: 0000000000000004
[  722.298075] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000
[  722.300516] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
[  722.302902] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[  722.305278] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  722.307567] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  722.309792] ---[ end trace 5b7012eeb84093b7 ]---
[  732.650867] oom_reaper: reaped process 4876 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

# ls -l /tmp/file
-rw------- 1 kumaneko kumaneko 43426648064 Feb  7 19:25 /tmp/file
# xfs_io -c "fiemap -v" /tmp/file
/tmp/file:
 EXT: FILE-OFFSET          BLOCK-RANGE            TOTAL FLAGS
   0: [0..262015]:         358739712..359001727  262016   0x0
   1: [262016..524159]:    367651920..367914063  262144   0x0
   2: [524160..1048447]:   385063864..385588151  524288   0x0
   3: [1048448..1238031]:  463702512..463892095  189584   0x0
   4: [1238032..3335167]:  448234520..450331655 2097136   0x0
   5: [3335168..4769775]:  36165320..37599927   1434608   0x0
   6: [4769776..6897175]:  31677984..33805383   2127400   0x0
   7: [6897176..15285759]: 450331656..458720239 8388584   0x0
   8: [15285760..18520255]: 237497528..240732023 3234496   0x0
   9: [18520256..21063607]: 229750248..232293599 2543352   0x0
  10: [21063608..25257855]: 240732024..244926271 4194248   0x0
  11: [25257856..29452159]: 179523440..183717743 4194304   0x0
  12: [29452160..30380031]: 171930952..172858823  927872   0x0
  13: [30380032..31428607]: 185220160..186268735 1048576   0x0
  14: [31428608..32667751]: 232293600..233532743 1239144   0x0
  15: [32667752..38474351]: 172858824..178665423 5806600   0x0
  16: [38474352..39157119]: 188137184..188819951  682768   0x0
  17: [39157120..40205695]: 234837584..235886159 1048576   0x0
  18: [40205696..42302847]: 33805384..35902535   2097152   0x0
  19: [42302848..44188591]: 37599928..39485671   1885744   0x0
  20: [44188592..45112703]: 446735416..447659527  924112   0x0
  21: [45112704..45436343]: 445337184..445660823  323640   0x0
  22: [45436344..45960575]: 447659528..448183759  524232   0x0
  23: [45960576..46484863]: 463892096..464416383  524288   0x0
  24: [46484864..47533439]: 445660824..446709399 1048576   0x0
  25: [47533440..48541959]: 233532744..234541263 1008520   0x0
  26: [48541960..49294175]: 523533736..524285951  752216   0x0
  27: [49294176..49630591]: 376080552..376416967  336416   0x0
  28: [49630592..50154879]: 129846752..130371039  524288   0x0
  29: [50154880..50844383]: 244926272..245615775  689504   0x0
  30: [50844384..51203455]: 250812112..251171183  359072   0x0
  31: [51203456..51727743]: 259555424..260079711  524288   0x0
  32: [51727744..52295239]: 187350752..187918247  567496   0x0
  33: [52295240..52776319]: 188819952..189301031  481080   0x0
  34: [52776320..53300607]: 206841040..207365327  524288   0x0
  35: [53300608..53824775]: 386221504..386745671  524168   0x0
  36: [53824776..54348935]: 113736928..114261087  524160   0x0
  37: [54348936..54854007]: 228911704..229416775  505072   0x0
  38: [54854008..54905983]: 228760200..228812175   51976   0x0
  39: [54905984..54971519]: 228597920..228663455   65536   0x0
  40: [54971520..55364735]: 178998696..179391911  393216   0x0
  41: [55364736..55868119]: 392669176..393172559  503384   0x0
  42: [55868120..56370663]: 382896800..383399343  502544   0x0
  43: [56370664..56836311]: 464416384..464882031  465648   0x0
  44: [56836312..57085055]: 458720240..458968983  248744   0x0
  45: [57085056..57548743]: 92768112..93231799    463688   0x0
  46: [57548744..57871487]: 102724384..103047127  322744   0x0
  47: [57871488..58304623]: 124278664..124711799  433136   0x0
  48: [58304624..58526847]: 124712024..124934247  222224   0x0
  49: [58526848..58788991]: 125635832..125897975  262144   0x0
  50: [58788992..59203767]: 508031384..508446159  414776   0x0
  51: [59203768..59602871]: 109812624..110211727  399104   0x0
  52: [59602872..59992183]: 385736856..386126167  389312   0x0
  53: [59992184..60381311]: 237108384..237497511  389128   0x0
  54: [60381312..60756863]: 506355968..506731519  375552   0x0
  55: [60756864..61127487]: 186268736..186639359  370624   0x0
  56: [61127488..61490767]: 112848304..113211583  363280   0x0
  57: [61490768..61541503]: 113214200..113264935   50736   0x0
  58: [61541504..61904775]: 112246776..112610047  363272   0x0
  59: [61904776..62246247]: 106328512..106669983  341472   0x0
  60: [62246248..62571991]: 126075640..126401383  325744   0x0
  61: [62571992..62895759]: 108921744..109245511  323768   0x0
  62: [62895760..63219159]: 380153016..380476415  323400   0x0
  63: [63219160..63442047]: 381056248..381279135  222888   0x0
  64: [63442048..63704191]: 379768072..380030215  262144   0x0
  65: [63704192..64026847]: 108328888..108651543  322656   0x0
  66: [64026848..64342415]: 251387232..251702799  315568   0x0
  67: [64342416..64651407]: 183717744..184026735  308992   0x0
  68: [64651408..64947983]: 384092440..384389015  296576   0x0
  69: [64947984..65145983]: 381775560..381973559  198000   0x0
  70: [65145984..65408127]: 186914504..187176647  262144   0x0
  71: [65408128..65447943]: 125328232..125368047   39816   0x0
  72: [65447944..65690599]: 372579112..372821767  242656   0x0
  73: [65690600..65929863]: 130429664..130668927  239264   0x0
  74: [65929864..66168935]: 120951784..121190855  239072   0x0
  75: [66168936..66402279]: 372845976..373079319  233344   0x0
  76: [66402280..66633199]: 113372616..113603535  230920   0x0
  77: [66633200..66859943]: 115982256..116208999  226744   0x0
  78: [66859944..67082127]: 127187600..127409783  222184   0x0
  79: [67082128..67217407]: 127636680..127771959  135280   0x0
  80: [67217408..67280095]: 129510736..129573423   62688   0x0
  81: [67280096..67499063]: 119220288..119439255  218968   0x0
  82: [67499064..67717935]: 507320248..507539119  218872   0x0
  83: [67717936..67936119]: 129292544..129510727  218184   0x0
  84: [67936120..68153903]: 125368048..125585831  217784   0x0
  85: [68153904..68370703]: 117784232..118001031  216800   0x0
  86: [68370704..68586039]: 121997008..122212343  215336   0x0
  87: [68586040..68798855]: 379191840..379404655  212816   0x0
  88: [68798856..68983935]: 378690808..378875887  185080   0x0
  89: [68983936..69196727]: 90790848..91003639    212792   0x0
  90: [69196728..69409287]: 123091672..123304231  212560   0x0
  91: [69409288..69621503]: 377436856..377649071  212216   0x0
  92: [69621504..69828847]: 128990088..129197431  207344   0x0
  93: [69828848..70035391]: 497270968..497477511  206544   0x0
  94: [70035392..70241111]: 391898048..392103767  205720   0x0
  95: [70241112..70446207]: 260716672..260921767  205096   0x0
  96: [70446208..70507647]: 260079712..260141151   61440   0x0
  97: [70507648..70704255]: 245836040..246032647  196608   0x0
  98: [70704256..70906591]: 107009096..107211431  202336   0x0
  99: [70906592..71108807]: 389471224..389673439  202216   0x0
 100: [71108808..71309703]: 224305904..224506799  200896   0x0
 101: [71309704..71509487]: 388524632..388724415  199784   0x0
 102: [71509488..71707119]: 87983688..88181319    197632   0x0
 103: [71707120..71903015]: 236195680..236391575  195896   0x0
 104: [71903016..72098791]: 389000248..389196023  195776   0x0
 105: [72098792..72294471]: 386931872..387127551  195680   0x0
 106: [72294472..72342655]: 387127560..387175743   48184   0x0
 107: [72342656..72408191]: 388031464..388096999   65536   0x0
 108: [72408192..72539263]: 388194472..388325543  131072   0x0
 109: [72539264..72562039]: 369903992..369926767   22776   0x0
 110: [72562040..72753639]: 506916880..507108479  191600   0x0
 111: [72753640..72945143]: 360577376..360768879  191504   0x0
 112: [72945144..73136575]: 246426760..246618191  191432   0x0
 113: [73136576..73326047]: 116629288..116818759  189472   0x0
 114: [73326048..73515047]: 392203096..392392095  189000   0x0
 115: [73515048..73699967]: 223549160..223734079  184920   0x0
 116: [73699968..73883879]: 118860856..119044767  183912   0x0
 117: [73883880..74067175]: 506143208..506326503  183296   0x0
 118: [74067176..74249703]: 507108800..507291327  182528   0x0
 119: [74249704..74401335]: 258917640..259069271  151632   0x0
 120: [74401336..74583135]: 122742560..122924359  181800   0x0
 121: [74583136..74764223]: 374250096..374431183  181088   0x0
 122: [74764224..74945271]: 91175800..91356847    181048   0x0
 123: [74945272..75124183]: 362484776..362663687  178912   0x0
 124: [75124184..75302615]: 223086192..223264623  178432   0x0
 125: [75302616..75479279]: 359280032..359456695  176664   0x0
 126: [75479280..75655559]: 63083912..63260191    176280   0x0
 127: [75655560..75831487]: 384469152..384645079  175928   0x0
 128: [75831488..76006815]: 381459584..381634911  175328   0x0
 129: [76006816..76181255]: 110626376..110800815  174440   0x0
 130: [76181256..76355399]: 380785616..380959759  174144   0x0
 131: [76355400..76527527]: 362768136..362940263  172128   0x0
 132: [76527528..76698695]: 122571384..122742551  171168   0x0
 133: [76698696..76868951]: 382399576..382569831  170256   0x0
 134: [76868952..77039095]: 388353776..388523919  170144   0x0
 135: [77039096..77209183]: 120236192..120406279  170088   0x0
 136: [77209184..77379183]: 383464120..383634119  170000   0x0
 137: [77379184..77548655]: 369926768..370096239  169472   0x0
 138: [77548656..77717663]: 88823232..88992239    169008   0x0
 139: [77717664..77884951]: 365878672..366045959  167288   0x0
 140: [77884952..77897079]: 366445360..366457487   12128   0x0
 141: [77897080..78063423]: 391500528..391666871  166344   0x0
 142: [78063424..78229407]: 107876400..108042383  165984   0x0
 143: [78229408..78395135]: 358573976..358739703  165728   0x0
 144: [78395136..78560703]: 117078480..117244047  165568   0x0
 145: [78560704..78726063]: 257377088..257542447  165360   0x0
 146: [78726064..78889519]: 389678704..389842159  163456   0x0
 147: [78889520..79052607]: 225850112..226013199  163088   0x0
 148: [79052608..79215111]: 359822880..359985383  162504   0x0
 149: [79215112..79376559]: 357914720..358076167  161448   0x0
 150: [79376560..79538007]: 115473264..115634711  161448   0x0
 151: [79538008..79698815]: 112610056..112770863  160808   0x0
 152: [79698816..79857631]: 258732456..258891271  158816   0x0
 153: [79857632..80015807]: 388725328..388883503  158176   0x0
 154: [80015808..80173583]: 93847144..94004919    157776   0x0
 155: [80173584..80331295]: 362940272..363097983  157712   0x0
 156: [80331296..80488727]: 252008432..252165863  157432   0x0
 157: [80488728..80646055]: 118387696..118545023  157328   0x0
 158: [80646056..80803239]: 111368744..111525927  157184   0x0
 159: [80803240..80960055]: 129573424..129730239  156816   0x0
 160: [80960056..81116863]: 497936416..498093223  156808   0x0
 161: [81116864..81272623]: 492109560..492265319  155760   0x0
 162: [81272624..81427695]: 114554072..114709143  155072   0x0
 163: [81427696..81582519]: 106854264..107009087  154824   0x0
 164: [81582520..81735503]: 220700824..220853807  152984   0x0
 165: [81735504..81887807]: 490724024..490876327  152304   0x0
 166: [81887808..82038863]: 122393688..122544743  151056   0x0
 167: [82038864..82189151]: 91659448..91809735    150288   0x0
 168: [82189152..82337287]: 85811104..85959239    148136   0x0
 169: [82337288..82484743]: 235886160..236033615  147456   0x0
 170: [82484744..82631943]: 117486472..117633671  147200   0x0
 171: [82631944..82777887]: 491753616..491899559  145944   0x0
 172: [82777888..82923799]: 94927544..95073455    145912   0x0
 173: [82923800..83068527]: 373754864..373899591  144728   0x0
 174: [83068528..83116375]: 373980848..374028695   47848   0x0
 175: [83116376..83261039]: 361766120..361910783  144664   0x0
 176: [83261040..83404007]: 374431192..374574159  142968   0x0
 177: [83404008..83546815]: 484667976..484810783  142808   0x0
 178: [83546816..83689279]: 251702808..251845271  142464   0x0
 179: [83689280..83831711]: 90474240..90616671    142432   0x0
 180: [83831712..83972959]: 109362776..109504023  141248   0x0
 181: [83972960..84113743]: 377296064..377436847  140784   0x0
 182: [84113744..84254303]: 378416056..378556615  140560   0x0
 183: [84254304..84393663]: 89517888..89657247    139360   0x0
 184: [84393664..84532831]: 376569640..376708807  139168   0x0
 185: [84532832..84671975]: 108725224..108864367  139144   0x0
 186: [84671976..84810807]: 109637664..109776495  138832   0x0
 187: [84810808..84901119]: 110211736..110302047   90312   0x1

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-07 10:30                                                       ` Tetsuo Handa
@ 2017-02-07 16:54                                                         ` Brian Foster
  -1 siblings, 0 replies; 110+ messages in thread
From: Brian Foster @ 2017-02-07 16:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, david, dchinner, hch, mgorman, viro, linux-mm, hannes,
	linux-kernel, darrick.wong, linux-xfs

On Tue, Feb 07, 2017 at 07:30:54PM +0900, Tetsuo Handa wrote:
> Brian Foster wrote:
> > > The workload is to write to a single file on XFS from 10 processes demonstrated at
> > > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> > > 
> > 
> > Thanks for testing. Well, that's an interesting workload. I couldn't
> > reproduce on a few quick tries in a similarly configured vm.
> 
> It takes 10 to 15 minutes to reproduce. Maybe some size threshold is involved?
> 
> > /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if
> > you still have the output file from a test that reproduced, could you
> > get the 'xfs_io -c "fiemap -v" <file>' output?
> 
> Here it is.
> 
> [  720.199748] 0 pages HighMem/MovableOnly
> [  720.199749] 150524 pages reserved
> [  720.199749] 0 pages cma reserved
> [  720.199750] 0 pages hwpoisoned
> [  722.187335] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> [  722.201784] ------------[ cut here ]------------
...
> 
> # ls -l /tmp/file
> -rw------- 1 kumaneko kumaneko 43426648064 Feb  7 19:25 /tmp/file
> # xfs_io -c "fiemap -v" /tmp/file
> /tmp/file:
>  EXT: FILE-OFFSET          BLOCK-RANGE            TOTAL FLAGS
>    0: [0..262015]:         358739712..359001727  262016   0x0
...
>  187: [84810808..84901119]: 110211736..110302047   90312   0x1

Ok, from the size of the file I realized that I had missed that you were
running it in a loop the first time around. I tried playing with it some
more and still haven't been able to reproduce the failure.

Anyway, the patch intended to fix this has been reviewed [1] and queued
for the next release, so it's probably not a big deal since you've
already verified the fix.

Brian

[1] http://www.spinics.net/lists/linux-xfs/msg04083.html

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06 10:39                                                   ` Michal Hocko
@ 2017-02-07 21:12                                                     ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-07 21:12 UTC (permalink / raw)
  To: Tetsuo Handa, peterz; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Mon 06-02-17 11:39:18, Michal Hocko wrote:
> On Sun 05-02-17 19:43:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > I got the same warning with ext4. Maybe we need to check this carefully.
> > 
> > [  511.215743] =====================================================
> > [  511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
> > [  511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted
> > [  511.221689] -----------------------------------------------------
> > [  511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
> > [  511.225533]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80
> > [  511.227795] 
> > [  511.227795] and this task is already holding:
> > [  511.230082]  (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590
> > [  511.232592] which would create a new lock dependency:
> > [  511.234192]  (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++}
> > [  511.235966] 
> > [  511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock:
> > [  511.238563]  (jbd2_handle){++++-.}
> > [  511.238564] 
> > [  511.238564] ... which became RECLAIM_FS-irq-safe at:
> > [  511.242078]   
> > [  511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640
> > [  511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250
> > [  511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0
> [...]
> > [  511.276216] to a RECLAIM_FS-irq-unsafe lock:
> > [  511.278128]  (cpu_hotplug.dep_map){++++++}
> > [  511.278130] 
> > [  511.278130] ... which became RECLAIM_FS-irq-unsafe at:
> > [  511.281809] ...
> > [  511.281811]   
> > [  511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90
> > [  511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0
> > [  511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0
> > [  511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0
> > [  511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90
> [...]
> > [  511.317867] other info that might help us debug this:
> > [  511.317867] 
> > [  511.320920]  Possible interrupt unsafe locking scenario:
> > [  511.320920] 
> > [  511.323218]        CPU0                    CPU1
> > [  511.324622]        ----                    ----
> > [  511.325973]   lock(cpu_hotplug.dep_map);
> > [  511.327246]                                local_irq_disable();
> > [  511.328870]                                lock(jbd2_handle);
> > [  511.330483]                                lock(cpu_hotplug.dep_map);
> > [  511.332259]   <Interrupt>
> > [  511.333187]     lock(jbd2_handle);
> 
> Peter, is there any way to tell lockdep that this is in fact
> reclaim safe? Direct reclaim only does a trylock and backs off, so
> we cannot deadlock here.
> 
> Or am I misinterpreting the trace?

This is moot - http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org
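
For reference, a minimal sketch of the trylock-and-back-off pattern the
quoted paragraph refers to (illustrative only; the function name is made
up and this is not the actual jbd2 or shrinker code):

#include <linux/mutex.h>

static unsigned long scan_objects_sketch(struct mutex *lock)
{
	/*
	 * Reclaim context only trylocks: if the lock is contended we
	 * back off instead of sleeping, which is why the RECLAIM_FS
	 * inversion lockdep reports cannot become a real deadlock.
	 */
	if (!mutex_trylock(lock))
		return 0;	/* back off, reclaim elsewhere */

	/* ... free objects here ... */

	mutex_unlock(lock);
	return 1;
}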

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-07 21:12                                                     ` Michal Hocko
@ 2017-02-08  9:24                                                       ` Peter Zijlstra
  -1 siblings, 0 replies; 110+ messages in thread
From: Peter Zijlstra @ 2017-02-08  9:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Tue, Feb 07, 2017 at 10:12:12PM +0100, Michal Hocko wrote:
> This is moot - http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org

Thanks! I was just about to go stare at it in more detail.

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 10:57                                             ` Tetsuo Handa
@ 2017-02-21  9:40                                               ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-21  9:40 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> > [...]
> > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > > > tell whether it has any negative effect, but I got on the first trial that
> > > > all allocating threads are blocked on wait_for_completion() from flush_work()
> > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > > > workqueue context". There was no warn_alloc() stall warning message afterwords.
> > > 
> > > That patch is buggy and there is a follow up [1] which is not sitting in the
> > > mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > > I cannot say I would be too happy about using WQ from the page
> > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ.
> > > 
> > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
> > 
> > Did you get chance to test with this follow up patch? It would be
> > interesting to see whether OOM situation can still starve the waiter.
> > The current linux-next should contain this patch.
> 
> So far I can't reproduce any problems except the two listed below (the
> cond_resched() trap in printk() and the IDLE priority trap are excluded
> from the list).

OK, so it seems that all the distractions are handled now and linux-next
should provide a reasonable base for testing. You said you weren't able
to reproduce the original long stalls on too_many_isolated(). I would
still be interested to see those oom reports and any anomalies in the
isolated counts before I send the patch for inclusion, so your further
testing would be more than appreciated. Also, stalls longer than 10s
without any previous occurrences would be interesting.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-21  9:40                                               ` Michal Hocko
@ 2017-02-21 14:35                                                 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-21 14:35 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> OK, so it seems that all the distractions are handled now and linux-next
> should provide a reasonable base for testing. You said you weren't able
> to reproduce the original long stalls on too_many_isolated(). I would
> still be interested to see those oom reports and any anomalies in the
> isolated counts before I send the patch for inclusion, so your further
> testing would be more than appreciated. Also, stalls longer than 10s
> without any previous occurrences would be interesting.

I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
the infinite too_many_isolated() loop problem. Please send your patches to
linux-next.
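
For context, the loop in question is essentially the following (a
minimal sketch of the shrink_inactive_list() retry loop in mm/vmscan.c
at the time, not a verbatim copy):

	while (unlikely(too_many_isolated(pgdat, file, sc))) {
		/* back off and wait for the isolated counts to drop */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* we are about to die anyway, so return */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

With no upper bound on the retries, a direct reclaimer can sit here
indefinitely if the isolated counts never drop, which is what the traces
below show.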

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170221.txt.xz .
----------------------------------------
[ 1160.162013] Out of memory: Kill process 7523 (a.out) score 998 or sacrifice child
[ 1160.164422] Killed process 7523 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 1160.169699] oom_reaper: reaped process 7523 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible
[ 1209.799726] kswapd0         D10936    67      2 0x00000000
[ 1209.807326] Call Trace:
[ 1209.812581]  __schedule+0x336/0xe00
[ 1209.818599]  schedule+0x3d/0x90
[ 1209.823907]  schedule_timeout+0x26a/0x510
[ 1209.827218]  ? trace_hardirqs_on+0xd/0x10
[ 1209.830535]  __down_common+0xfb/0x131
[ 1209.833801]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.837372]  __down+0x1d/0x1f
[ 1209.840331]  down+0x41/0x50
[ 1209.843243]  xfs_buf_lock+0x64/0x370 [xfs]
[ 1209.846597]  _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1209.850031]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
[ 1209.853514]  xfs_buf_get_map+0x2a/0x480 [xfs]
[ 1209.855831]  xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1209.857388]  ? free_debug_processing+0x27d/0x2af
[ 1209.859037]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1209.860707]  xfs_read_agf+0xc8/0x2b0 [xfs]
[ 1209.862184]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
[ 1209.863728]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
[ 1209.865385]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
[ 1209.866974]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.868374]  ? xfs_perag_get+0x1a0/0x310 [xfs]
[ 1209.869798]  ? xfs_perag_get+0x5/0x310 [xfs]
[ 1209.871288]  xfs_alloc_vextent+0x161/0xda0 [xfs]
[ 1209.872757]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
[ 1209.874182]  ? save_stack_trace+0x1b/0x20
[ 1209.875542]  xfs_bmap_alloc+0x17/0x30 [xfs]
[ 1209.876847]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
[ 1209.878190]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[ 1209.879632]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
[ 1209.880909]  xfs_do_writepage+0x215/0x920 [xfs]
[ 1209.882255]  ? clear_page_dirty_for_io+0xb4/0x310
[ 1209.883598]  xfs_vm_writepage+0x3b/0x70 [xfs]
[ 1209.884841]  pageout.isra.54+0x1a4/0x460
[ 1209.886210]  shrink_page_list+0xa86/0xcf0
[ 1209.887441]  shrink_inactive_list+0x1c5/0x660
[ 1209.888682]  shrink_node_memcg+0x535/0x7f0
[ 1209.889975]  ? mem_cgroup_iter+0x14d/0x720
[ 1209.891197]  shrink_node+0xe1/0x310
[ 1209.892288]  kswapd+0x362/0x9b0
[ 1209.893308]  kthread+0x10f/0x150
[ 1209.894383]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1209.895703]  ? kthread_create_on_node+0x70/0x70
[ 1209.896956]  ret_from_fork+0x31/0x40
[ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible
[ 1209.902154] systemd-journal D11240   526      1 0x00000000
[ 1209.903642] Call Trace:
[ 1209.904574]  __schedule+0x336/0xe00
[ 1209.905734]  schedule+0x3d/0x90
[ 1209.906817]  schedule_timeout+0x20d/0x510
[ 1209.908025]  ? prepare_to_wait+0x2b/0xc0
[ 1209.909268]  ? lock_timer_base+0xa0/0xa0
[ 1209.910460]  io_schedule_timeout+0x1e/0x50
[ 1209.911681]  congestion_wait+0x86/0x260
[ 1209.912853]  ? remove_wait_queue+0x60/0x60
[ 1209.914115]  shrink_inactive_list+0x5b4/0x660
[ 1209.915385]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1209.916768]  shrink_node_memcg+0x535/0x7f0
[ 1209.918173]  shrink_node+0xe1/0x310
[ 1209.919288]  do_try_to_free_pages+0xe1/0x300
[ 1209.920548]  try_to_free_pages+0x131/0x3f0
[ 1209.921827]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1209.923137]  __alloc_pages_nodemask+0x3e4/0x460
[ 1209.924454]  ? __radix_tree_lookup+0x84/0xf0
[ 1209.925790]  alloc_pages_current+0x97/0x1b0
[ 1209.927021]  ? find_get_entry+0x5/0x300
[ 1209.928189]  __page_cache_alloc+0x15d/0x1a0
[ 1209.929471]  ? pagecache_get_page+0x2c/0x2b0
[ 1209.930716]  filemap_fault+0x4df/0x8b0
[ 1209.931867]  ? filemap_fault+0x373/0x8b0
[ 1209.933111]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.934510]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1209.935857]  ? down_read_nested+0x7b/0xc0
[ 1209.937123]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1209.938373]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1209.939691]  __do_fault+0x1e/0xa0
[ 1209.940807]  ? _raw_spin_unlock+0x27/0x40
[ 1209.942002]  __handle_mm_fault+0xbb1/0xf40
[ 1209.943228]  ? mutex_unlock+0x12/0x20
[ 1209.944410]  ? devkmsg_read+0x15c/0x330
[ 1209.945912]  handle_mm_fault+0x16b/0x390
[ 1209.947297]  ? handle_mm_fault+0x49/0x390
[ 1209.948868]  __do_page_fault+0x24a/0x530
[ 1209.950351]  do_page_fault+0x30/0x80
[ 1209.951615]  page_fault+0x28/0x30
[ 1209.952724] RIP: 0033:0x556f398d623f
[ 1209.953834] RSP: 002b:00007fff1da75710 EFLAGS: 00010206
[ 1209.955273] RAX: 0000556f3b12b9d0 RBX: 0000000000000009 RCX: 0000000000000020
[ 1209.957117] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1209.958849] RBP: 00007fff1da759b0 R08: 0000000000000000 R09: 0000000000000000
[ 1209.960659] R10: 00000000ffffffc0 R11: 00007fdc0df4ef10 R12: 00007fff1da75f30
[ 1209.962397] R13: 00007fff1da78810 R14: 0000000000000009 R15: 0000000000000006
[ 1209.964204] MemAlloc: auditd(563) flags=0x400900 switches=6443 seq=774 gfp=0x142134a(GFP_NOFS|__GFP_HIGHMEM|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE) order=0 delay=16511 uninterruptible
[ 1209.969005] auditd          D12280   563      1 0x00000000
[ 1209.970503] Call Trace:
[ 1209.971436]  __schedule+0x336/0xe00
[ 1209.972621]  schedule+0x3d/0x90
[ 1209.973696]  schedule_timeout+0x20d/0x510
[ 1209.974910]  ? prepare_to_wait+0x2b/0xc0
[ 1209.976155]  ? lock_timer_base+0xa0/0xa0
[ 1209.977350]  io_schedule_timeout+0x1e/0x50
[ 1209.978597]  congestion_wait+0x86/0x260
[ 1209.979795]  ? remove_wait_queue+0x60/0x60
[ 1209.981020]  shrink_inactive_list+0x5b4/0x660
[ 1209.982290]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1209.983748]  shrink_node_memcg+0x535/0x7f0
[ 1209.985041]  ? mem_cgroup_iter+0x14d/0x720
[ 1209.986267]  shrink_node+0xe1/0x310
[ 1209.987424]  do_try_to_free_pages+0xe1/0x300
[ 1209.988705]  try_to_free_pages+0x131/0x3f0
[ 1209.989935]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1209.991274]  __alloc_pages_nodemask+0x3e4/0x460
[ 1209.992601]  alloc_pages_current+0x97/0x1b0
[ 1209.993845]  __page_cache_alloc+0x15d/0x1a0
[ 1209.995120]  __do_page_cache_readahead+0x118/0x410
[ 1209.996535]  ? __do_page_cache_readahead+0x191/0x410
[ 1209.997946]  filemap_fault+0x35f/0x8b0
[ 1209.999199]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1210.000473]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1210.001843]  ? down_read_nested+0x7b/0xc0
[ 1210.003184]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1210.004471]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1210.005792]  __do_fault+0x1e/0xa0
[ 1210.006925]  __handle_mm_fault+0xbb1/0xf40
[ 1210.008241]  ? ep_poll+0x2ea/0x3b0
[ 1210.009373]  handle_mm_fault+0x16b/0x390
[ 1210.010572]  ? handle_mm_fault+0x49/0x390
[ 1210.011818]  __do_page_fault+0x24a/0x530
[ 1210.013059]  ? wake_up_q+0x80/0x80
[ 1210.014176]  do_page_fault+0x30/0x80
[ 1210.015367]  page_fault+0x28/0x30
[ 1210.016473] RIP: 0033:0x7fcb0c838d13
[ 1210.017635] RSP: 002b:00007ffe275b95a0 EFLAGS: 00010293
[ 1210.019120] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fcb0c838d13
[ 1210.020867] RDX: 0000000000000040 RSI: 0000559240b08d40 RDI: 0000000000000009
[ 1210.022769] RBP: 0000000000000000 R08: 00000000000cf8ba R09: 0000000000000001
[ 1210.024530] R10: 000000000000e95f R11: 0000000000000293 R12: 000055923fbe5e60
[ 1210.026308] R13: 0000000000000000 R14: 0000000000000000 R15: 000055923fbe5e60
[ 1210.028961] MemAlloc: vmtoolsd(723) flags=0x400900 switches=36213 seq=120979 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52811 uninterruptible
[ 1210.032683] vmtoolsd        D11240   723      1 0x00000080
[ 1210.034316] Call Trace:
[ 1210.035340]  __schedule+0x336/0xe00
[ 1210.036444]  schedule+0x3d/0x90
[ 1210.037462]  schedule_timeout+0x20d/0x510
[ 1210.038694]  ? prepare_to_wait+0x2b/0xc0
[ 1210.039849]  ? lock_timer_base+0xa0/0xa0
[ 1210.041005]  io_schedule_timeout+0x1e/0x50
[ 1210.042435]  congestion_wait+0x86/0x260
[ 1210.043575]  ? remove_wait_queue+0x60/0x60
[ 1210.044763]  shrink_inactive_list+0x5b4/0x660
[ 1210.046058]  ? __list_lru_count_one.isra.2+0x22/0x80
[ 1210.047419]  shrink_node_memcg+0x535/0x7f0
[ 1210.048609]  shrink_node+0xe1/0x310
[ 1210.049688]  do_try_to_free_pages+0xe1/0x300
[ 1210.051183]  try_to_free_pages+0x131/0x3f0
[ 1210.052421]  __alloc_pages_slowpath+0x3ec/0xd95
[ 1210.053717]  __alloc_pages_nodemask+0x3e4/0x460
[ 1210.055025]  ? __radix_tree_lookup+0x84/0xf0
[ 1210.056264]  alloc_pages_current+0x97/0x1b0
[ 1210.057466]  ? find_get_entry+0x5/0x300
[ 1210.058695]  __page_cache_alloc+0x15d/0x1a0
[ 1210.059894]  ? pagecache_get_page+0x2c/0x2b0
[ 1210.061128]  filemap_fault+0x4df/0x8b0
[ 1210.062340]  ? filemap_fault+0x373/0x8b0
[ 1210.063545]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1210.064766]  ? xfs_filemap_fault+0x64/0x1e0 [xfs]
[ 1210.066135]  ? down_read_nested+0x7b/0xc0
[ 1210.067405]  ? xfs_ilock+0x22c/0x360 [xfs]
[ 1210.068706]  xfs_filemap_fault+0x6c/0x1e0 [xfs]
[ 1210.070021]  __do_fault+0x1e/0xa0
[ 1210.071102]  __handle_mm_fault+0xbb1/0xf40
[ 1210.072296]  handle_mm_fault+0x16b/0x390
[ 1210.073509]  ? handle_mm_fault+0x49/0x390
[ 1210.074683]  __do_page_fault+0x24a/0x530
[ 1210.075872]  do_page_fault+0x30/0x80
[ 1210.076974]  page_fault+0x28/0x30
[ 1210.078090] RIP: 0033:0x7f12e9fd6420
[ 1210.079193] RSP: 002b:00007ffee98ba498 EFLAGS: 00010202
[ 1210.080605] RAX: 00007f12de02e0fe RBX: 00007ffee98ba4b0 RCX: 00007ffee98ba590
[ 1210.082383] RDX: 00007f12de02e0fe RSI: 0000000000000001 RDI: 00007ffee98ba4b0
[ 1210.084177] RBP: 0000000000000080 R08: 0000000000000000 R09: 000000000000000a
[ 1210.086134] R10: 00007f12eb61a010 R11: 0000000000000000 R12: 0000000000000080
[ 1210.087850] R13: 0000000000000000 R14: 00007f12ea006770 R15: 00005580adf3abc0
(...snipped...)
[ 1210.640170] MemAlloc: a.out(7523) flags=0x420040 switches=90 uninterruptible dying victim
[ 1210.642426] a.out           D11496  7523   7376 0x00100084
[ 1210.643999] Call Trace:
[ 1210.644921]  __schedule+0x336/0xe00
[ 1210.646007]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1210.647328]  schedule+0x3d/0x90
[ 1210.648441]  schedule_timeout+0x26a/0x510
[ 1210.649619]  ? trace_hardirqs_on+0xd/0x10
[ 1210.650792]  __down_common+0xfb/0x131
[ 1210.652188]  ? _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1210.653480]  __down+0x1d/0x1f
[ 1210.654483]  down+0x41/0x50
[ 1210.655462]  xfs_buf_lock+0x64/0x370 [xfs]
[ 1210.656618]  _xfs_buf_find+0x2cb/0xc10 [xfs]
[ 1210.657823]  ? _xfs_buf_find+0xa4/0xc10 [xfs]
[ 1210.659028]  xfs_buf_get_map+0x2a/0x480 [xfs]
[ 1210.660284]  xfs_buf_read_map+0x2c/0x400 [xfs]
[ 1210.661490]  ? del_timer_sync+0xb5/0xe0
[ 1210.662630]  xfs_trans_read_buf_map+0x186/0x830 [xfs]
[ 1210.664009]  xfs_read_agf+0xc8/0x2b0 [xfs]
[ 1210.665171]  xfs_alloc_read_agf+0x7a/0x300 [xfs]
[ 1210.666441]  ? xfs_alloc_space_available+0x7b/0x120 [xfs]
[ 1210.667923]  xfs_alloc_fix_freelist+0x3bc/0x490 [xfs]
[ 1210.669402]  ? __radix_tree_lookup+0x84/0xf0
[ 1210.670645]  ? xfs_perag_get+0x1a0/0x310 [xfs]
[ 1210.671949]  ? xfs_perag_get+0x5/0x310 [xfs]
[ 1210.673145]  xfs_alloc_vextent+0x161/0xda0 [xfs]
[ 1210.674402]  xfs_bmap_btalloc+0x46c/0x8b0 [xfs]
[ 1210.675774]  ? save_stack_trace+0x1b/0x20
[ 1210.676961]  xfs_bmap_alloc+0x17/0x30 [xfs]
[ 1210.678202]  xfs_bmapi_write+0x74e/0x11d0 [xfs]
[ 1210.679544]  xfs_iomap_write_allocate+0x199/0x3a0 [xfs]
[ 1210.680995]  xfs_map_blocks+0x2cc/0x5a0 [xfs]
[ 1210.682245]  xfs_do_writepage+0x215/0x920 [xfs]
[ 1210.683742]  ? clear_page_dirty_for_io+0xb4/0x310
[ 1210.685125]  write_cache_pages+0x2cb/0x6b0
[ 1210.686408]  ? xfs_map_blocks+0x5a0/0x5a0 [xfs]
[ 1210.687774]  ? xfs_vm_writepages+0x48/0xa0 [xfs]
[ 1210.689111]  xfs_vm_writepages+0x6b/0xa0 [xfs]
[ 1210.690529]  do_writepages+0x21/0x40
[ 1210.691680]  __filemap_fdatawrite_range+0xc6/0x100
[ 1210.693021]  filemap_write_and_wait_range+0x2d/0x70
[ 1210.694444]  xfs_file_fsync+0x8b/0x310 [xfs]
[ 1210.695728]  vfs_fsync_range+0x3d/0xb0
[ 1210.696874]  ? __do_page_fault+0x272/0x530
[ 1210.698102]  do_fsync+0x3d/0x70
[ 1210.699200]  SyS_fsync+0x10/0x20
[ 1210.700267]  do_syscall_64+0x6c/0x200
[ 1210.701498]  entry_SYSCALL64_slow_path+0x25/0x25
[ 1210.702861] RIP: 0033:0x7f504b072d30
[ 1210.704014] RSP: 002b:00007fffcb8f7898 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[ 1210.705994] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f504b072d30
[ 1210.707857] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003
[ 1210.709647] RBP: 0000000000000003 R08: 00007f504afcc938 R09: 000000000000000e
[ 1210.711632] R10: 00007fffcb8f7620 R11: 0000000000000246 R12: 0000000000400912
[ 1210.713520] R13: 00007fffcb8f79a0 R14: 0000000000000000 R15: 0000000000000000
(...snipped...)
[ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896
[ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
(...snipped...)
[ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896
[ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
(...snipped...)
[ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896
[ 1289.804580] sysrq: SysRq : Kill All Tasks
----------------------------------------

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-21 14:35                                                 ` Tetsuo Handa
@ 2017-02-21 15:53                                                   ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-21 15:53 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, so it seems that all the distractions are handled now and linux-next
> > should provide a reasonable base for testing. You said you weren't able
> > to reproduce the original long stalls on too_many_isolated(). I would
> > still be interested to see those oom reports and potential anomalies in
> > the isolated counts before I send the patch for inclusion, so your
> > further testing would be more than appreciated. Also, stalls > 10s without
> > any previous occurrences would be interesting.
> 
> I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> the infinite too_many_isolated() loop problem. Please send your patches to
> linux-next.

So I assume that you didn't see the lockup with the patch applied and that
the OOM killer resolved the situation by killing other tasks, right?
Can I assume your Tested-by?

Thanks for your testing!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-21 15:53                                                   ` Michal Hocko
@ 2017-02-22  2:02                                                     ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-22  2:02 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > OK, so it seems that all the distractions are handled now and linux-next
> > > should provide a reasonable base for testing. You said you weren't able
> > > to reproduce the original long stalls on too_many_isolated(). I would
> > > still be interested to see those oom reports and potential anomalies in
> > > the isolated counts before I send the patch for inclusion, so your
> > > further testing would be more than appreciated. Also, stalls > 10s without
> > > any previous occurrences would be interesting.
> > 
> > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > the infinite too_many_isolated() loop problem. Please send your patches to
> > linux-next.
> 
> So I assume that you didn't see the lockup with the patch applied and that
> the OOM killer resolved the situation by killing other tasks, right?
> Can I assume your Tested-by?

No. I tested linux-next-20170221, which does not include your patch.
I didn't test linux-next-20170221 with your patch applied. Your patch will
avoid the infinite too_many_isolated() loop problem in shrink_inactive_list().
But different workloads need to be tested by other people. Thus, I suggest
you send your patches to linux-next without waiting for my testing.
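
For context, the loop being discussed sits at the top of shrink_inactive_list()
in mm/vmscan.c. The following is a paraphrased sketch from memory of the
4.10-era code, not a verbatim quote; it is the source of the
congestion_wait+0x86/0x260 frames in the traces posted earlier in this
thread, and it has no retry limit:

----------
        while (unlikely(too_many_isolated(pgdat, file, sc))) {
                congestion_wait(BLK_RW_ASYNC, HZ/10);

                /* Only a fatal signal breaks the loop; otherwise re-check forever. */
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
        }
----------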

> 
> Thanks for your testing!
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-22  2:02                                                     ` Tetsuo Handa
@ 2017-02-22  7:54                                                       ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-22  7:54 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > should provide a reasonable base for testing. You said you weren't able
> > > > to reproduce the original long stalls on too_many_isolated(). I would
> > > > still be interested to see those oom reports and potential anomalies in
> > > > the isolated counts before I send the patch for inclusion, so your
> > > > further testing would be more than appreciated. Also, stalls > 10s without
> > > > any previous occurrences would be interesting.
> > > 
> > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > the infinite too_many_isolated() loop problem. Please send your patches to
> > > linux-next.
> > 
> > So I assume that you didn't see the lockup with the patch applied and that
> > the OOM killer resolved the situation by killing other tasks, right?
> > Can I assume your Tested-by?
> 
> No. I tested linux-next-20170221, which does not include your patch.
> I didn't test linux-next-20170221 with your patch applied. Your patch will
> avoid the infinite too_many_isolated() loop problem in shrink_inactive_list().
> But different workloads need to be tested by other people. Thus, I suggest
> you send your patches to linux-next without waiting for my testing.

I will send the patch to Andrew after the merge window closes. It would
be really helpful, though, to see how it handles your workload, which is
known to reproduce the oom starvation.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread

* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-22  7:54                                                       ` Michal Hocko
@ 2017-02-26  6:30                                                         ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-26  6:30 UTC (permalink / raw)
  To: mhocko
  Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > > should provide a reasonable base for testing. You said you weren't able
> > > > > to reproduce the original long stalls on too_many_isolated(). I would
> > > > > still be interested to see those oom reports and potential anomalies in
> > > > > the isolated counts before I send the patch for inclusion, so your
> > > > > further testing would be more than appreciated. Also, stalls > 10s without
> > > > > any previous occurrences would be interesting.
> > > > 
> > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > > the infinite too_many_isolated() loop problem. Please send your patches to
> > > > linux-next.
> > > 
> > > So I assume that you didn't see the lockup with the patch applied and that
> > > the OOM killer resolved the situation by killing other tasks, right?
> > > Can I assume your Tested-by?
> > 
> > No. I tested linux-next-20170221, which does not include your patch.
> > I didn't test linux-next-20170221 with your patch applied. Your patch will
> > avoid the infinite too_many_isolated() loop problem in shrink_inactive_list().
> > But different workloads need to be tested by other people. Thus, I suggest
> > you send your patches to linux-next without waiting for my testing.
> 
> I will send the patch to Andrew after the merge window closes. It would
> be really helpful, though, to see how it handles your workload, which is
> known to reproduce the oom starvation.

I tested http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz
on top of linux-next-20170221 with kmallocwd applied.

I did not hit the too_many_isolated() loop problem. But I hit an "unable to
invoke the OOM killer due to !__GFP_FS allocation" lockup problem, shown below.

The complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170226.txt.xz .
----------
[  444.281177] Killed process 9477 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  444.287046] oom_reaper: reaped process 9477 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  484.810225] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 38s!
[  484.812907] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 41s!
[  484.815546] Showing busy workqueues and worker pools:
[  484.817595] workqueue events: flags=0x0
[  484.819456]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256
[  484.821666]     pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx]
[  484.824356]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  484.826582]     pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000]
[  484.829091]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  484.831325]     in-flight: 7418:rht_deferred_worker
[  484.833336]     pending: rht_deferred_worker
[  484.835346] workqueue events_long: flags=0x0
[  484.837343]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  484.839566]     pending: gc_worker [nf_conntrack]
[  484.841691] workqueue events_power_efficient: flags=0x80
[  484.843873]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  484.846103]     pending: fb_flashcursor
[  484.847928]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  484.850149]     pending: neigh_periodic_work, neigh_periodic_work
[  484.852403] workqueue events_freezable_power_: flags=0x84
[  484.854534]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.856666]     in-flight: 27:disk_events_workfn
[  484.858621] workqueue writeback: flags=0x4e
[  484.860347]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  484.862415]     in-flight: 8444:wb_workfn wb_workfn
[  484.864602] workqueue vmstat: flags=0xc
[  484.866291]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.868307]     pending: vmstat_update
[  484.869876]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  484.871864]     pending: vmstat_update
[  484.874058] workqueue mpt_poll_0: flags=0x8
[  484.875698]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.877602]     pending: mpt_fault_reset_work [mptbase]
[  484.879502] workqueue xfs-buf/sda1: flags=0xc
[  484.881148]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
[  484.883011]     pending: xfs_buf_ioend_work [xfs]
[  484.884706] workqueue xfs-data/sda1: flags=0xc
[  484.886367]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY
[  484.888410]     in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs]
[  484.893483]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  484.902636]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=21/256 MAYDAY
[  484.904848]     in-flight: 535:xfs_end_io [xfs], 7416:xfs_end_io [xfs], 7415:xfs_end_io [xfs], 65:xfs_end_io [xfs]
[  484.907863]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  484.916767]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 MAYDAY
[  484.919024]     in-flight: 5357:xfs_end_io [xfs], 193:xfs_end_io [xfs], 52:xfs_end_io [xfs], 5358:xfs_end_io [xfs]
[  484.922143]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  484.924291]     in-flight: 2486:xfs_end_io [xfs]
[  484.926248] workqueue xfs-reclaim/sda1: flags=0xc
[  484.928216]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  484.930362]     pending: xfs_reclaim_worker [xfs]
[  484.932312] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3 6387
[  484.934766] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=38s workers=6 manager: 19
[  484.937206] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=41s workers=6 manager: 157
[  484.939629] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=41s workers=4 manager: 10499
[  484.942303] pool 256: cpus=0-127 flags=0x4 nice=0 hung=38s workers=3 idle: 425 426
[  518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307
(...snipped...)
[  518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[  518.902095] kswapd0         D10776    69      2 0x00000000
[  518.903784] Call Trace:
[  518.904849]  __schedule+0x336/0xe00
[  518.906118]  schedule+0x3d/0x90
[  518.907314]  io_schedule+0x16/0x40
[  518.908622]  __xfs_iflock+0x129/0x140 [xfs]
[  518.910027]  ? autoremove_wake_function+0x60/0x60
[  518.911559]  xfs_reclaim_inode+0x162/0x440 [xfs]
[  518.913068]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[  518.914611]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[  518.916148]  ? trace_hardirqs_on+0xd/0x10
[  518.917465]  ? try_to_wake_up+0x59/0x7a0
[  518.918758]  ? wake_up_process+0x15/0x20
[  518.920067]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[  518.921560]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[  518.923114]  super_cache_scan+0x181/0x190
[  518.924435]  shrink_slab+0x29f/0x6d0
[  518.925683]  shrink_node+0x2fa/0x310
[  518.926909]  kswapd+0x362/0x9b0
[  518.928061]  kthread+0x10f/0x150
[  518.929218]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[  518.930953]  ? kthread_create_on_node+0x70/0x70
[  518.932380]  ret_from_fork+0x31/0x40
(...snipped...)
[  553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507
[  575.432697] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 129s!
[  575.435276] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 131s!
[  575.437863] Showing busy workqueues and worker pools:
[  575.439837] workqueue events: flags=0x0
[  575.441605]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  575.443717]     pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx], check_corruption
[  575.446622]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  575.448763]     pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000]
[  575.451173]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  575.453323]     in-flight: 7418:rht_deferred_worker
[  575.455243]     pending: rht_deferred_worker
[  575.457100] workqueue events_long: flags=0x0
[  575.458960]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  575.461099]     pending: gc_worker [nf_conntrack]
[  575.463043] workqueue events_power_efficient: flags=0x80
[  575.465110]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  575.467252]     pending: fb_flashcursor
[  575.468966]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  575.471109]     pending: neigh_periodic_work, neigh_periodic_work
[  575.473289] workqueue events_freezable_power_: flags=0x84
[  575.475378]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.477526]     in-flight: 27:disk_events_workfn
[  575.479489] workqueue writeback: flags=0x4e
[  575.481257]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  575.483368]     in-flight: 8444:wb_workfn wb_workfn
[  575.485505] workqueue vmstat: flags=0xc
[  575.487196]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  575.489242]     pending: vmstat_update
[  575.491403] workqueue mpt_poll_0: flags=0x8
[  575.493106]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.495115]     pending: mpt_fault_reset_work [mptbase]
[  575.497086] workqueue xfs-buf/sda1: flags=0xc
[  575.498764]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
[  575.500654]     pending: xfs_buf_ioend_work [xfs]
[  575.502372] workqueue xfs-data/sda1: flags=0xc
[  575.504024]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY
[  575.506060]     in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs]
[  575.511096]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  575.520157]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=21/256 MAYDAY
[  575.522340]     in-flight: 535:xfs_end_io [xfs], 7416:xfs_end_io [xfs], 7415:xfs_end_io [xfs], 65:xfs_end_io [xfs]
[  575.525387]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  575.534089]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 MAYDAY
[  575.536407]     in-flight: 5357:xfs_end_io [xfs], 193:xfs_end_io [xfs], 52:xfs_end_io [xfs], 5358:xfs_end_io [xfs]
[  575.539496]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  575.541648]     in-flight: 2486:xfs_end_io [xfs]
[  575.543591] workqueue xfs-reclaim/sda1: flags=0xc
[  575.545540]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.547675]     pending: xfs_reclaim_worker [xfs]
[  575.549719] workqueue xfs-log/sda1: flags=0x1c
[  575.551591]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
[  575.553750]     pending: xfs_log_worker [xfs]
[  575.555552] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3 6387
[  575.557979] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=129s workers=6 manager: 19
[  575.560399] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=131s workers=6 manager: 157
[  575.562843] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=132s workers=4 manager: 10499
[  575.565450] pool 256: cpus=0-127 flags=0x4 nice=0 hung=129s workers=3 idle: 425 426
(...snipped...)
[  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
(...snipped...)
[  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
(...snipped...)
[  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
(...snipped...)
[  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
(...snipped...)
[  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
(...snipped...)
[  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
(...snipped...)
[  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
(...snipped...)
[  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
(...snipped...)
[  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
(...snipped...)
[ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
(...snipped...)
[ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
(...snipped...)
[ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[ 1095.632985] kswapd0         D10776    69      2 0x00000000
[ 1095.632988] Call Trace:
[ 1095.632991]  __schedule+0x336/0xe00
[ 1095.632994]  schedule+0x3d/0x90
[ 1095.632996]  io_schedule+0x16/0x40
[ 1095.633017]  __xfs_iflock+0x129/0x140 [xfs]
[ 1095.633021]  ? autoremove_wake_function+0x60/0x60
[ 1095.633051]  xfs_reclaim_inode+0x162/0x440 [xfs]
[ 1095.633072]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[ 1095.633106]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 1095.633114]  ? trace_hardirqs_on+0xd/0x10
[ 1095.633116]  ? try_to_wake_up+0x59/0x7a0
[ 1095.633120]  ? wake_up_process+0x15/0x20
[ 1095.633156]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 1095.633178]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 1095.633180]  super_cache_scan+0x181/0x190
[ 1095.633183]  shrink_slab+0x29f/0x6d0
[ 1095.633189]  shrink_node+0x2fa/0x310
[ 1095.633193]  kswapd+0x362/0x9b0
[ 1095.633200]  kthread+0x10f/0x150
[ 1095.633201]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1095.633202]  ? kthread_create_on_node+0x70/0x70
[ 1095.633205]  ret_from_fork+0x31/0x40
(...snipped...)
[ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
(...snipped...)
[ 1125.236970] sysrq: SysRq : Resetting
[ 1125.238669] ACPI MEMORY or I/O RESET_REG.
----------

The switches= value (which is "struct task_struct"->nvcsw +
"struct task_struct"->nivcsw) of kswapd0(69) remained at 23883, which means
that kswapd0 was waiting forever at

----------
void
__xfs_iflock(
        struct xfs_inode        *ip)
{
        wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IFLOCK_BIT);
        DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);

        do {
                prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
                if (xfs_isiflocked(ip))
                        io_schedule();      /***** <= This location. *****/
        } while (!xfs_iflock_nowait(ip));

        finish_wait(wq, &wait.wait);
}
----------

while the oom_count= value (which is the number of times out_of_memory() was
called) kept increasing over time without a "Killed process " message being
emitted.
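
That combination matches the !__GFP_FS bail-out near the top of
out_of_memory(). The following is a paraphrased sketch from memory of the
4.10-era mm/oom_kill.c, not a verbatim quote: every GFP_NOFS allocation
attempt still calls out_of_memory() (so oom_count keeps climbing), but the
function returns before a victim can be selected, so no "Killed process "
line is ever printed.

----------
        /*
         * The OOM killer does not compensate for IO-less reclaim.
         * Allocations without __GFP_FS (and without __GFP_NOFAIL) are
         * reported as handled without killing anything.
         */
        if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
                return true;
----------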

The reproducer I used is shown below.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>

static char use_delay = 0;

/* SIGCLD (SIGCHLD) handler: use_delay is set but never read (vestigial). */
static void sigcld_handler(int unused)
{
        use_delay = 1;
}

int main(int argc, char *argv[])
{
        static char buffer[4096] = { };
        char *buf = NULL;
        unsigned long size;
        unsigned long i; /* was int; an int index overflows below once size > 2GB */
        signal(SIGCLD, sigcld_handler);
        /* Spawn 1024 children that mark themselves as preferred OOM victims
         * and then generate continuous write()+fsync() traffic under /tmp. */
        for (i = 0; i < 1024; i++) {
                if (fork() == 0) {
                        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
                        write(fd, "1000", 4);
                        close(fd);
                        sleep(1);
                        if (!i) /* the first child never writes; it only waits */
                                pause();
                        snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
                        fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
                        while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
                                poll(NULL, 0, 10);
                                fsync(fd);
                        }
                        _exit(0);
                }
        }
        /* Grow an anonymous allocation as far as overcommit permits. */
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sleep(2);
        /* Will cause OOM due to overcommit: touch one byte per page. */
        for (i = 0; i < size; i += 4096)
                buf[i] = 0;
        pause();
        return 0;
}
----------
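
(An assumption inferred from the program itself rather than stated anywhere
above: /tmp has to live on the XFS filesystem under test, since the children
write to /tmp/file.<pid> and the stuck paths go through xfs_end_io and
xfs_reclaim_inode. The 1024 fsync()ing children keep XFS writeback saturated
while the parent's page-touching loop forces the OOM situation.)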

^ permalink raw reply	[flat|nested] 110+ messages in thread

[  575.465110]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  575.467252]     pending: fb_flashcursor
[  575.468966]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256
[  575.471109]     pending: neigh_periodic_work, neigh_periodic_work
[  575.473289] workqueue events_freezable_power_: flags=0x84
[  575.475378]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.477526]     in-flight: 27:disk_events_workfn
[  575.479489] workqueue writeback: flags=0x4e
[  575.481257]   pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256
[  575.483368]     in-flight: 8444:wb_workfn wb_workfn
[  575.485505] workqueue vmstat: flags=0xc
[  575.487196]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  575.489242]     pending: vmstat_update
[  575.491403] workqueue mpt_poll_0: flags=0x8
[  575.493106]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.495115]     pending: mpt_fault_reset_work [mptbase]
[  575.497086] workqueue xfs-buf/sda1: flags=0xc
[  575.498764]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1
[  575.500654]     pending: xfs_buf_ioend_work [xfs]
[  575.502372] workqueue xfs-data/sda1: flags=0xc
[  575.504024]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY
[  575.506060]     in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs]
[  575.511096]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  575.520157]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=21/256 MAYDAY
[  575.522340]     in-flight: 535:xfs_end_io [xfs], 7416:xfs_end_io [xfs], 7415:xfs_end_io [xfs], 65:xfs_end_io [xfs]
[  575.525387]     pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs]
[  575.534089]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 MAYDAY
[  575.536407]     in-flight: 5357:xfs_end_io [xfs], 193:xfs_end_io [xfs], 52:xfs_end_io [xfs], 5358:xfs_end_io [xfs]
[  575.539496]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[  575.541648]     in-flight: 2486:xfs_end_io [xfs]
[  575.543591] workqueue xfs-reclaim/sda1: flags=0xc
[  575.545540]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  575.547675]     pending: xfs_reclaim_worker [xfs]
[  575.549719] workqueue xfs-log/sda1: flags=0x1c
[  575.551591]   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256
[  575.553750]     pending: xfs_log_worker [xfs]
[  575.555552] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3 6387
[  575.557979] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=129s workers=6 manager: 19
[  575.560399] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=131s workers=6 manager: 157
[  575.562843] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=132s workers=4 manager: 10499
[  575.565450] pool 256: cpus=0-127 flags=0x4 nice=0 hung=129s workers=3 idle: 425 426
(...snipped...)
[  616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219
(...snipped...)
[  642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673
(...snipped...)
[  702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529
(...snipped...)
[  736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244
(...snipped...)
[  800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576
(...snipped...)
[  837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562
(...snipped...)
[  899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879
(...snipped...)
[  936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234
(...snipped...)
[  997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983
(...snipped...)
[ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912
(...snipped...)
[ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677
(...snipped...)
[ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible
[ 1095.632985] kswapd0         D10776    69      2 0x00000000
[ 1095.632988] Call Trace:
[ 1095.632991]  __schedule+0x336/0xe00
[ 1095.632994]  schedule+0x3d/0x90
[ 1095.632996]  io_schedule+0x16/0x40
[ 1095.633017]  __xfs_iflock+0x129/0x140 [xfs]
[ 1095.633021]  ? autoremove_wake_function+0x60/0x60
[ 1095.633051]  xfs_reclaim_inode+0x162/0x440 [xfs]
[ 1095.633072]  xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs]
[ 1095.633106]  ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 1095.633114]  ? trace_hardirqs_on+0xd/0x10
[ 1095.633116]  ? try_to_wake_up+0x59/0x7a0
[ 1095.633120]  ? wake_up_process+0x15/0x20
[ 1095.633156]  xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 1095.633178]  xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 1095.633180]  super_cache_scan+0x181/0x190
[ 1095.633183]  shrink_slab+0x29f/0x6d0
[ 1095.633189]  shrink_node+0x2fa/0x310
[ 1095.633193]  kswapd+0x362/0x9b0
[ 1095.633200]  kthread+0x10f/0x150
[ 1095.633201]  ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1095.633202]  ? kthread_create_on_node+0x70/0x70
[ 1095.633205]  ret_from_fork+0x31/0x40
(...snipped...)
[ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
(...snipped...)
[ 1125.236970] sysrq: SysRq : Resetting
[ 1125.238669] ACPI MEMORY or I/O RESET_REG.
----------

The switches= value (which is "struct task_struct"->nvcsw +
"struct task_struct"->nivcsw) of kswapd0(69) remained at 23883, which means
that kswapd0 was waiting forever at

----------
void
__xfs_iflock(
        struct xfs_inode        *ip)
{
        wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IFLOCK_BIT);
        DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);

        do {
                prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
                if (xfs_isiflocked(ip))
                        io_schedule();      /***** <= This location. *****/
        } while (!xfs_iflock_nowait(ip));

        finish_wait(wq, &wait.wait);
}
----------

while the oom_count= value (which is the number of times out_of_memory() was
called) kept increasing over time without a "Killed process " message ever
being emitted. That is, kswapd0 sat in uninterruptible sleep on the inode
flush lock, presumably because the XFS I/O completion work that would release
it (note the pending xfs_buf_ioend_work and the MAYDAY state of the xfs-data
queues above) could not make progress either, while out_of_memory() was being
called over and over without ever killing a new victim.
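
(As an aside, the same stall signal can be sampled from userspace, because the
kernel exports these counters as voluntary_ctxt_switches and
nonvoluntary_ctxt_switches in /proc/<pid>/status. Below is a minimal,
hypothetical sketch, not part of kmallocwd, that flags a task whose combined
counter stops changing across an interval:

----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Sum voluntary+nonvoluntary context switches from /proc/<pid>/status. */
static long ctxt_switches(int pid)
{
        char path[64], line[128];
        long total = 0;
        FILE *fp;

        snprintf(path, sizeof(path), "/proc/%d/status", pid);
        fp = fopen(path, "r");
        if (!fp)
                return -1;
        while (fgets(line, sizeof(line), fp))
                if (strstr(line, "ctxt_switches:"))
                        total += atol(strchr(line, ':') + 1);
        fclose(fp);
        return total;
}

int main(int argc, char *argv[])
{
        int pid = argc > 1 ? atoi(argv[1]) : 1;
        long before = ctxt_switches(pid);

        sleep(10);
        /* A task that never switched in 10 seconds is most likely blocked. */
        if (ctxt_switches(pid) == before)
                printf("pid %d looks stuck (switches=%ld)\n", pid, before);
        return 0;
}
----------

Run against kswapd0's pid, this would report the same thing the switches=
field above shows: the task was never scheduled again after entering
io_schedule().)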

The reproducer I used is shown below.

----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>

static char use_delay = 0;

/* Note: use_delay is never read; installing the handler also makes
   sleep()/pause() below return early when a child exits. */
static void sigcld_handler(int unused)
{
        use_delay = 1;
}

int main(int argc, char *argv[])
{
        static char buffer[4096] = { };
        char *buf = NULL;
        unsigned long size;
        int i;
        signal(SIGCLD, sigcld_handler);
        /*
         * Children: mark themselves as preferred OOM victims
         * (oom_score_adj = 1000); the first child just waits, the rest
         * keep appending to /tmp/file.<pid> with fsync() so that XFS
         * writeback is always in flight.
         */
        for (i = 0; i < 1024; i++) {
                if (fork() == 0) {
                        int fd = open("/proc/self/oom_score_adj", O_WRONLY);
                        write(fd, "1000", 4);
                        close(fd);
                        sleep(1);
                        if (!i)
                                pause();
                        snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
                        fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
                        while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
                                poll(NULL, 0, 10);
                                fsync(fd);
                        }
                        _exit(0);
                }
        }
        /* Parent: find the largest anonymous buffer realloc() will give us. */
        for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
                char *cp = realloc(buf, size);
                if (!cp) {
                        size >>= 1;
                        break;
                }
                buf = cp;
        }
        sleep(2);
        /* Will cause OOM due to overcommit */
        for (i = 0; i < size; i += 4096)
                buf[i] = 0;
        pause();
        return 0;
}
----------
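
To summarize what the reproducer does: each of the 1024 children marks itself
as a preferred OOM victim and keeps small write()+fsync() bursts going so that
XFS writeback and I/O completion work are always in flight, while the parent
dirties the largest anonymous buffer it could allocate, page by page, to force
the OOM killer. Nothing special should be needed to run it, e.g.
"gcc -O2 -o a.out reproducer.c" (file name here is just illustrative) as an
unprivileged user, with /tmp on XFS as it presumably is here, given the sda1
workqueues in the log.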

^ permalink raw reply	[flat|nested] 110+ messages in thread

Thread overview: 110+ messages
2017-01-18 13:44 [RFC PATCH 0/2] fix unbounded too_many_isolated Michal Hocko
2017-01-18 13:44 ` [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone Michal Hocko
2017-01-18 14:46   ` Mel Gorman
2017-01-18 15:15     ` Michal Hocko
2017-01-18 15:54       ` Mel Gorman
2017-01-18 16:17         ` Michal Hocko
2017-01-18 17:00           ` Mel Gorman
2017-01-18 17:29             ` Michal Hocko
2017-01-19 10:07               ` Mel Gorman
2017-01-19 11:23                 ` Michal Hocko
2017-01-19 13:11                   ` Mel Gorman
2017-01-20 13:27                     ` Tetsuo Handa
2017-01-21  7:42                       ` Tetsuo Handa
2017-01-25 10:15                         ` Michal Hocko
2017-01-25 10:19                           ` Christoph Hellwig
2017-01-25 10:46                             ` Michal Hocko
2017-01-25 11:09                               ` Tetsuo Handa
2017-01-25 13:00                                 ` Michal Hocko
2017-01-27 14:49                                   ` Michal Hocko
2017-01-28 15:27                                     ` Tetsuo Handa
2017-01-30  8:55                                       ` Michal Hocko
2017-02-02 10:14                                         ` Michal Hocko
2017-02-03 10:57                                           ` Tetsuo Handa
2017-02-03 14:41                                             ` Michal Hocko
2017-02-03 14:50                                             ` Michal Hocko
2017-02-03 17:24                                               ` Brian Foster
2017-02-06  6:29                                                 ` Tetsuo Handa
2017-02-06 14:35                                                   ` Brian Foster
2017-02-06 14:42                                                     ` Michal Hocko
2017-02-06 15:47                                                       ` Brian Foster
2017-02-07 10:30                                                     ` Tetsuo Handa
2017-02-07 16:54                                                       ` Brian Foster
2017-02-03 14:55                                             ` Michal Hocko
2017-02-05 10:43                                               ` Tetsuo Handa
2017-02-06 10:34                                                 ` Michal Hocko
2017-02-06 10:39                                                 ` Michal Hocko
2017-02-07 21:12                                                   ` Michal Hocko
2017-02-08  9:24                                                     ` Peter Zijlstra
2017-02-21  9:40                                             ` Michal Hocko
2017-02-21 14:35                                               ` Tetsuo Handa
2017-02-21 15:53                                                 ` Michal Hocko
2017-02-22  2:02                                                   ` Tetsuo Handa
2017-02-22  7:54                                                     ` Michal Hocko
2017-02-26  6:30                                                       ` Tetsuo Handa
2017-01-31 11:58                                   ` Michal Hocko
2017-01-31 12:51                                     ` Christoph Hellwig
2017-01-31 13:21                                       ` Michal Hocko
2017-01-25 10:33                           ` [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone Tetsuo Handa
2017-01-25 12:34                             ` Michal Hocko
2017-01-25 13:13                               ` Tetsuo Handa
2017-01-25  9:53                       ` Michal Hocko
2017-01-20  6:42                 ` Hillf Danton
2017-01-20  9:25                   ` Mel Gorman
2017-01-18 13:44 ` [RFC PATCH 2/2] mm, vmscan: do not loop on too_many_isolated for ever Michal Hocko
2017-01-18 14:50   ` Mel Gorman
