* [RFC PATCH 0/2] fix unbounded too_many_isolated
From: Michal Hocko @ 2017-01-18 13:44 UTC
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Tetsuo Handa, LKML

Hi,
this is based on top of [1]. The first patch continues in the direction
of moving some decisions to zones rather than nodes. In this case it is
the NR_ISOLATED* counters, which I believe need to be zone aware as
well. See patch 1 for more information on why.

The second patch builds on top of that and tries to address the problem
which has been reported by Tetsuo several times already. In the current
implementation we can loop deep in the reclaim path without any
effective way out to re-evaluate our decisions about the reclaim
retries. Patch 2 says more about that, but in principle we should locate
the retry logic as high in the allocator chain as possible, and so we
should get rid of any unbounded retry loops inside the reclaim path.
This is what the patch does.

I am sending this as an RFC because I am not yet sure this is the best
way forward. My testing shows that the system behaves sanely.

Thoughts, comments?

[1] http://lkml.kernel.org/r/20170117103702.28542-1-mhocko@kernel.org

Michal Hocko (2):
      mm, vmscan: account the number of isolated pages per zone
      mm, vmscan: do not loop on too_many_isolated for ever

 include/linux/mmzone.h |  4 +--
 mm/compaction.c        | 16 ++++----
 mm/khugepaged.c        |  4 +--
 mm/memory_hotplug.c    |  2 +-
 mm/migrate.c           |  4 +--
 mm/page_alloc.c        | 14 ++++----
 mm/vmscan.c            | 93 ++++++++++++++++++++++++++++++++------------------
 mm/vmstat.c            |  4 +--
 8 files changed, 82 insertions(+), 59 deletions(-)
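[Editor's sketch] The throttle the series is about can be modelled outside the kernel. The function below is an illustrative simplification of the `too_many_isolated()` check in mm/vmscan.c, not kernel code; the name `too_many_isolated_model` and the exact condition are assumptions for illustration.

```c
#include <stdbool.h>

/*
 * Model of the direct-reclaim throttle: a direct reclaimer is held back
 * while more pages sit on private isolation lists than remain on the
 * inactive LRU. kswapd is exempt so it can always make progress. The
 * real kernel check also applies gfp-dependent adjustments omitted here.
 */
bool too_many_isolated_model(unsigned long inactive, unsigned long isolated,
			     bool is_kswapd)
{
	if (is_kswapd)
		return false;	/* kswapd is never throttled by this check */
	return isolated > inactive;
}
```

Direct reclaimers that hit this condition spin in a `congestion_wait()` loop, which is the unbounded retry the cover letter wants to remove.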
* [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
From: Michal Hocko @ 2017-01-18 13:44 UTC
To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved the
NR_ISOLATED* counters from zones to nodes. This is not the best fit,
especially for systems with high/lowmem, because heavy memory pressure
on the highmem zone might block lowmem requests from making progress. Or
we might allow reclaiming the lowmem zone even though there are too many
pages already isolated from the eligible zones, just because highmem
pages will easily bias too_many_isolated to say no.

Fix these potential issues by moving the isolated stats back to zones
and teaching too_many_isolated to consider only eligible zones. Per-zone
isolation counters are a bit tricky with node reclaim because we have to
track each page separately.
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  4 ++--
 mm/compaction.c        | 16 +++++++-------
 mm/khugepaged.c        |  4 ++--
 mm/memory_hotplug.c    |  2 +-
 mm/migrate.c           |  4 ++--
 mm/page_alloc.c        | 14 ++++++------
 mm/vmscan.c            | 58 ++++++++++++++++++++++++++++----------------------
 mm/vmstat.c            |  4 ++--
 8 files changed, 56 insertions(+), 50 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..100e7f37b7dc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -119,6 +119,8 @@ enum zone_stat_item {
 	NR_ZONE_INACTIVE_FILE,
 	NR_ZONE_ACTIVE_FILE,
 	NR_ZONE_UNEVICTABLE,
+	NR_ZONE_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
+	NR_ZONE_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_ZONE_WRITE_PENDING,	/* Count of dirty, writeback and unstable pages */
 	NR_MLOCK,		/* mlock()ed pages found and moved off LRU */
 	NR_SLAB_RECLAIMABLE,
@@ -148,8 +150,6 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       " */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       " */
 	NR_UNEVICTABLE,		/*  "     "     "   "       " */
-	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
-	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	NR_PAGES_SCANNED,	/* pages scanned since last reclaim */
 	WORKINGSET_REFAULT,
 	WORKINGSET_ACTIVATE,
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..f84104217887 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -639,12 +639,12 @@ static bool too_many_isolated(struct zone *zone)
 {
 	unsigned long active, inactive, isolated;
 
-	inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-			node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
-	active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
-			node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
-	isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
-			node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
+	inactive = zone_page_state(zone, NR_ZONE_INACTIVE_FILE) +
+			zone_page_state(zone, NR_ZONE_INACTIVE_ANON);
+	active = zone_page_state(zone, NR_ZONE_ACTIVE_FILE) +
+			zone_page_state(zone, NR_ZONE_ACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ZONE_ISOLATED_FILE) +
+			zone_page_state(zone, NR_ZONE_ISOLATED_ANON);
 
 	return isolated > (inactive + active) / 2;
 }
@@ -857,8 +857,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/* Successfully isolated */
 		del_page_from_lru_list(page, lruvec, page_lru(page));
-		inc_node_page_state(page,
-				NR_ISOLATED_ANON + page_is_file_cache(page));
+		inc_zone_page_state(page,
+				NR_ZONE_ISOLATED_ANON + page_is_file_cache(page));
 
 isolate_success:
 		list_add(&page->lru, &cc->migratepages);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 34bce5c308e3..8e692b683cac 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -482,7 +482,7 @@ void __khugepaged_exit(struct mm_struct *mm)
 static void release_pte_page(struct page *page)
 {
 	/* 0 stands for page_is_file_cache(page) == false */
-	dec_node_page_state(page, NR_ISOLATED_ANON + 0);
+	dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
 	unlock_page(page);
 	putback_lru_page(page);
 }
@@ -578,7 +578,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			goto out;
 		}
 		/* 0 stands for page_is_file_cache(page) == false */
-		inc_node_page_state(page, NR_ISOLATED_ANON + 0);
+		inc_zone_page_state(page, NR_ZONE_ISOLATED_ANON + 0);
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
 		VM_BUG_ON_PAGE(PageLRU(page), page);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d47b186892b4..8b88dd63bf3d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1616,7 +1616,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			put_page(page);
 			list_add_tail(&page->lru, &source);
 			move_pages--;
-			inc_node_page_state(page, NR_ISOLATED_ANON +
+			inc_zone_page_state(page, NR_ZONE_ISOLATED_ANON +
 					    page_is_file_cache(page));
 
 		} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 87f4d0f81819..e5589dee3022 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,7 +184,7 @@ void putback_movable_pages(struct list_head *l)
 			put_page(page);
 		} else {
 			putback_lru_page(page);
-			dec_node_page_state(page, NR_ISOLATED_ANON +
+			dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON +
 					page_is_file_cache(page));
 		}
 	}
@@ -1130,7 +1130,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		 * as __PageMovable
 		 */
 		if (likely(!__PageMovable(page)))
-			dec_node_page_state(page, NR_ISOLATED_ANON +
+			dec_zone_page_state(page, NR_ZONE_ISOLATED_ANON +
 					page_is_file_cache(page));
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ff25883c172..997c9bfdf9e5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4318,18 +4318,16 @@ void show_free_areas(unsigned int filter)
 			free_pcp += per_cpu_ptr(zone->pageset, cpu)->pcp.count;
 	}
 
-	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
-		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
+	printk("active_anon:%lu inactive_anon:%lu\n"
+		" active_file:%lu inactive_file:%lu\n"
 		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
 		global_node_page_state(NR_INACTIVE_ANON),
-		global_node_page_state(NR_ISOLATED_ANON),
 		global_node_page_state(NR_ACTIVE_FILE),
 		global_node_page_state(NR_INACTIVE_FILE),
-		global_node_page_state(NR_ISOLATED_FILE),
 		global_node_page_state(NR_UNEVICTABLE),
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
@@ -4351,8 +4349,6 @@ void show_free_areas(unsigned int filter)
 			" active_file:%lukB"
 			" inactive_file:%lukB"
 			" unevictable:%lukB"
-			" isolated(anon):%lukB"
-			" isolated(file):%lukB"
 			" mapped:%lukB"
 			" dirty:%lukB"
 			" writeback:%lukB"
@@ -4373,8 +4369,6 @@ void show_free_areas(unsigned int filter)
 			K(node_page_state(pgdat, NR_ACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_INACTIVE_FILE)),
 			K(node_page_state(pgdat, NR_UNEVICTABLE)),
-			K(node_page_state(pgdat, NR_ISOLATED_ANON)),
-			K(node_page_state(pgdat, NR_ISOLATED_FILE)),
 			K(node_page_state(pgdat, NR_FILE_MAPPED)),
 			K(node_page_state(pgdat, NR_FILE_DIRTY)),
 			K(node_page_state(pgdat, NR_WRITEBACK)),
@@ -4410,8 +4404,10 @@ void show_free_areas(unsigned int filter)
 			" high:%lukB"
 			" active_anon:%lukB"
 			" inactive_anon:%lukB"
+			" isolated_anon:%lukB"
 			" active_file:%lukB"
 			" inactive_file:%lukB"
+			" isolated_file:%lukB"
 			" unevictable:%lukB"
 			" writepending:%lukB"
 			" present:%lukB"
@@ -4433,8 +4429,10 @@ void show_free_areas(unsigned int filter)
 			K(high_wmark_pages(zone)),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_ANON)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_ANON)),
+			K(zone_page_state(zone, NR_ZONE_ISOLATED_ANON)),
 			K(zone_page_state(zone, NR_ZONE_ACTIVE_FILE)),
 			K(zone_page_state(zone, NR_ZONE_INACTIVE_FILE)),
+			K(zone_page_state(zone, NR_ZONE_ISOLATED_FILE)),
 			K(zone_page_state(zone, NR_ZONE_UNEVICTABLE)),
 			K(zone_page_state(zone, NR_ZONE_WRITE_PENDING)),
 			K(zone->present_pages),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f3255702f3df..4b1ed1b1f1db 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -216,14 +216,13 @@ unsigned long pgdat_reclaimable_pages(struct pglist_data *pgdat)
 {
 	unsigned long nr;
 
+	/* TODO can we live without NR_*ISOLATED*? */
 	nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
-	     node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
+	     node_page_state_snapshot(pgdat, NR_INACTIVE_FILE);
 
 	if (get_nr_swap_pages() > 0)
 		nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
-		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON) +
-		      node_page_state_snapshot(pgdat, NR_ISOLATED_ANON);
+		      node_page_state_snapshot(pgdat, NR_INACTIVE_ANON);
 
 	return nr;
 }
@@ -1245,8 +1244,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					 * increment nr_reclaimed here (and
 					 * leave it off the LRU).
 					 */
-					nr_reclaimed++;
-					continue;
+					goto drop_isolated;
 				}
 			}
 		}
@@ -1267,13 +1265,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		if (ret == SWAP_LZFREE)
 			count_vm_event(PGLAZYFREED);
 
-		nr_reclaimed++;
-
 		/*
 		 * Is there need to periodically free_page_list? It would
 		 * appear not as the counts should be low
 		 */
 		list_add(&page->lru, &free_pages);
+drop_isolated:
+		nr_reclaimed++;
+		mod_zone_page_state(page_zone(page),
+				NR_ZONE_ISOLATED_ANON + page_is_file_cache(page),
+				-hpage_nr_pages(page));
 		continue;
 
 cull_mlocked:
@@ -1340,7 +1341,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 	ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
 			TTU_UNMAP|TTU_IGNORE_ACCESS, NULL, true);
 	list_splice(&clean_pages, page_list);
-	mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
 	return ret;
 }
@@ -1433,6 +1433,9 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
 			continue;
 
 		__update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
+		mod_zone_page_state(&lruvec_pgdat(lruvec)->node_zones[zid],
+				NR_ZONE_ISOLATED_ANON + !!is_file_lru(lru),
+				nr_zone_taken[zid]);
 #ifdef CONFIG_MEMCG
 		mem_cgroup_update_lru_size(lruvec, lru, zid, -nr_zone_taken[zid]);
 #endif
@@ -1603,10 +1606,11 @@ int isolate_lru_page(struct page *page)
 * the LRU list will go small and be scanned faster than necessary, leading to
 * unnecessary swapping, thrashing and OOM.
 */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static int too_many_isolated(struct pglist_data *pgdat, enum lru_list lru,
		struct scan_control *sc)
 {
-	unsigned long inactive, isolated;
+	unsigned long inactive = 0, isolated = 0;
+	int zid;
 
 	if (current_is_kswapd())
 		return 0;
@@ -1614,12 +1618,12 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if (!sane_reclaim(sc))
 		return 0;
 
-	if (file) {
-		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
-		isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
-	} else {
-		inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
-		isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
+	for (zid = 0; zid <= sc->reclaim_idx; zid++) {
+		struct zone *zone = &pgdat->node_zones[zid];
+
+		inactive += zone_page_state_snapshot(zone, NR_ZONE_LRU_BASE + lru);
+		isolated += zone_page_state_snapshot(zone,
+				NR_ZONE_ISOLATED_ANON + !!is_file_lru(lru));
 	}
 
 	/*
@@ -1649,6 +1653,11 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
 		VM_BUG_ON_PAGE(PageLRU(page), page);
 		list_del(&page->lru);
+
+		mod_zone_page_state(page_zone(page),
+				NR_ZONE_ISOLATED_ANON + !!page_is_file_cache(page),
+				-hpage_nr_pages(page));
+
 		if (unlikely(!page_evictable(page))) {
 			spin_unlock_irq(&pgdat->lru_lock);
 			putback_lru_page(page);
@@ -1719,7 +1728,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
+	while (unlikely(too_many_isolated(pgdat, lru, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
@@ -1739,7 +1748,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc)) {
@@ -1768,8 +1776,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	putback_inactive_pages(lruvec, &page_list);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&page_list);
@@ -1939,7 +1945,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
 				     &nr_scanned, sc, isolate_mode, lru);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
 	reclaim_stat->recent_scanned[file] += nr_taken;
 
 	if (global_reclaim(sc))
@@ -1955,7 +1960,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
 
 		if (unlikely(!page_evictable(page))) {
 			putback_lru_page(page);
-			continue;
+			goto drop_isolated;
 		}
 
 		if (unlikely(buffer_heads_over_limit)) {
@@ -1980,12 +1985,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
 			 */
 			if ((vm_flags & VM_EXEC) && page_is_file_cache(page)) {
 				list_add(&page->lru, &l_active);
-				continue;
+				goto drop_isolated;
 			}
 		}
 
 		ClearPageActive(page);	/* we are de-activating */
 		list_add(&page->lru, &l_inactive);
+drop_isolated:
+		mod_zone_page_state(page_zone(page),
+				NR_ZONE_ISOLATED_ANON + !!is_file_lru(lru),
+				-hpage_nr_pages(page));
 	}
 
 	/*
@@ -2002,7 +2011,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
 	nr_activate = move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
 	nr_deactivate = move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
 
-	__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
 	spin_unlock_irq(&pgdat->lru_lock);
 
 	mem_cgroup_uncharge_list(&l_hold);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bed3c3845936..059c29d14d23 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -926,6 +926,8 @@ const char * const vmstat_text[] = {
 	"nr_zone_inactive_file",
 	"nr_zone_active_file",
 	"nr_zone_unevictable",
+	"nr_zone_anon_isolated",
+	"nr_zone_file_isolated",
 	"nr_zone_write_pending",
 	"nr_mlock",
 	"nr_slab_reclaimable",
@@ -952,8 +954,6 @@ const char * const vmstat_text[] = {
 	"nr_inactive_file",
 	"nr_active_file",
 	"nr_unevictable",
-	"nr_isolated_anon",
-	"nr_isolated_file",
 	"nr_pages_scanned",
 	"workingset_refault",
 	"workingset_activate",
-- 
2.11.0
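[Editor's sketch] The new too_many_isolated() above sums counters only over zones eligible for the allocation (zid <= sc->reclaim_idx). The user-space model below illustrates that effect; the struct and function names are hypothetical simplifications, not kernel code.

```c
#include <stdbool.h>

#define NR_ZONES 4	/* hypothetical layout: DMA, DMA32, NORMAL, HIGHMEM */

struct zone_model {
	unsigned long inactive;	/* pages on the zone's inactive LRU */
	unsigned long isolated;	/* pages isolated from that LRU */
};

/*
 * Sum inactive and isolated pages only over zones the allocation may use,
 * then apply the isolated > inactive throttle. Isolated highmem pages
 * therefore cannot bias the decision for a lowmem request.
 */
bool too_many_isolated_zoned(const struct zone_model *zones, int reclaim_idx)
{
	unsigned long inactive = 0, isolated = 0;
	int zid;

	for (zid = 0; zid <= reclaim_idx; zid++) {
		inactive += zones[zid].inactive;
		isolated += zones[zid].isolated;
	}
	return isolated > inactive;
}
```

With many pages isolated from highmem only, a highmem-capable request (reclaim_idx covering HIGHMEM) is throttled while a lowmem request is not, which is exactly the asymmetry the per-node counters could not express.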
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Mel Gorman @ 2017-01-18 14:46 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko

On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> NR_ISOLATED* counters from zones to nodes. This is not the best fit,
> especially for systems with high/lowmem, because heavy memory pressure
> on the highmem zone might block lowmem requests from making progress. Or
> we might allow reclaim of the lowmem zone even though there are too many
> pages already isolated from the eligible zones, just because highmem
> pages will easily bias too_many_isolated to say no.
>
> Fix these potential issues by moving the isolated stats back to zones
> and teaching too_many_isolated to consider only eligible zones. Per zone
> isolation counters are a bit tricky with node reclaim because we have
> to track each page separately.
>

I'm quite unhappy with this. Each move back increases the cache footprint
because of the counters, but it's not clear at all that this patch
actually helps anything.

Heavy memory pressure on highmem should be spread across the whole node,
as we no longer apply the fair zone allocation policy. The processes with
highmem requirements will be reclaiming from all zones and when that
finishes, it's possible that a lowmem-specific request will be clear to
make progress. It's all the same LRU, so if there are too many pages
isolated, it makes sense to wait regardless of the allocation request.

More importantly, this patch may make things worse and delay reclaim.
If this patch allowed a lowmem request to make progress that would have
previously stalled, it's going to spend time skipping pages in the LRU
instead of letting kswapd and the highmem-pressured processes make
progress.

-- 
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Michal Hocko @ 2017-01-18 15:15 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved
> > NR_ISOLATED* counters from zones to nodes. This is not the best fit,
> > especially for systems with high/lowmem, because heavy memory pressure
> > on the highmem zone might block lowmem requests from making progress. Or
> > we might allow reclaim of the lowmem zone even though there are too many
> > pages already isolated from the eligible zones, just because highmem
> > pages will easily bias too_many_isolated to say no.
> >
> > Fix these potential issues by moving the isolated stats back to zones
> > and teaching too_many_isolated to consider only eligible zones. Per zone
> > isolation counters are a bit tricky with node reclaim because we have
> > to track each page separately.
>
> I'm quite unhappy with this. Each move back increases the cache footprint
> because of the counters

Why would per zone counters cause an increased cache footprint?

> but it's not clear at all this patch actually helps anything.

Yes, I cannot prove any real issue so far. The main motivation was
patch 2, which needs the per-zone accounting in the retry logic
(should_reclaim_retry). I spotted the too_many_isolated issues on the
way.

> Heavy memory pressure on highmem should be spread across the whole node,
> as we no longer apply the fair zone allocation policy. The processes with
> highmem requirements will be reclaiming from all zones and when that
> finishes, it's possible that a lowmem-specific request will be clear to
> make progress. It's all the same LRU, so if there are too many pages
> isolated, it makes sense to wait regardless of the allocation request.

This is true, but I am not sure how it is related to the patch. If we
have heavy highmem memory pressure then we will throttle based on pages
isolated from the respective zones. So if there is lowmem pressure at the
same time then we throttle it only when we need to.

Also consider that lowmem throttling in too_many_isolated has only a
small chance to ever work with the node counters, because highmem >>
lowmem in many/most configurations.

> More importantly, this patch may make things worse and delay reclaim. If
> this patch allowed a lowmem request to make progress that would have
> previously stalled, it's going to spend time skipping pages in the LRU
> instead of letting kswapd and the highmem pressured processes make progress.

I am not sure I understand this part. Say that we have highmem pressure
which isolates too many pages from the LRU. A lowmem request would
previously stall regardless of where those pages came from. With this
patch it would stall only when we have isolated too many pages from the
eligible zones. So let's assume that lowmem is not under pressure: why
should we stall? And why would it delay reclaim? Whoever wants to make
progress on that zone has to iterate and potentially skip many pages
anyway.

-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Mel Gorman @ 2017-01-18 15:54 UTC (permalink / raw)
To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > I'm quite unhappy with this. Each move back increases the cache footprint
> > because of the counters
>
> Why would per zone counters cause an increased cache footprint?

Because there are multiple counters, each of which needs to be updated.

> > but it's not clear at all this patch actually helps anything.
>
> Yes, I cannot prove any real issue so far. The main motivation was
> patch 2, which needs the per-zone accounting in the retry logic
> (should_reclaim_retry). I spotted the too_many_isolated issues on the
> way.

You don't appear to directly use that information in patch 2. The primary
breakout is returning after stalling at least once. You could also avoid
an infinite loop by using a waitqueue that sleeps on too many isolated.
That would both avoid the clunky congestion_wait() and guarantee forward
progress. If the primary motivation is to avoid an infinite loop with
too_many_isolated then there are ways of handling that without
reintroducing zone-based counters.

> > Heavy memory pressure on highmem should be spread across the whole node,
> > as we no longer apply the fair zone allocation policy. The processes with
> > highmem requirements will be reclaiming from all zones and when that
> > finishes, it's possible that a lowmem-specific request will be clear to
> > make progress. It's all the same LRU, so if there are too many pages
> > isolated, it makes sense to wait regardless of the allocation request.
>
> This is true, but I am not sure how it is related to the patch.

Because heavy pressure that is enough to trigger too many isolated pages
is unlikely to be specifically targeting a lower zone. There is general
pressure with multiple direct reclaimers being applied. If the system is
under enough pressure with parallel reclaimers to trigger
too_many_isolated checks then the system is grinding already and making
little progress. Adding multiple counters to allow a lowmem reclaimer to
potentially make faster progress is going to be marginal at best.

> Also consider that lowmem throttling in too_many_isolated has only a
> small chance to ever work with the node counters, because highmem >>
> lowmem in many/most configurations.

While true, it's also not that important.

> > More importantly, this patch may make things worse and delay reclaim. If
> > this patch allowed a lowmem request to make progress that would have
> > previously stalled, it's going to spend time skipping pages in the LRU
> > instead of letting kswapd and the highmem pressured processes make progress.
>
> I am not sure I understand this part. Say that we have highmem pressure
> which isolates too many pages from the LRU.

Which requires multiple direct reclaimers or tiny inactive lists. In the
event there is such highmem pressure, it also means the lower zones are
depleted.

> A lowmem request would
> previously stall regardless of where those pages came from. With this
> patch it would stall only when we have isolated too many pages from the
> eligible zones.

And when it makes progress, it's going to compete with the other direct
reclaimers, except the lowmem reclaim is skipping some pages and
recycling them through the LRU. It chews up CPU that would probably have
been better spent letting kswapd and the other direct reclaimers do their
work.

> So let's assume that lowmem is not under pressure,

It has to be, or the highmem request would have used memory from the
lower zones.

-- 
Mel Gorman
SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone

From: Michal Hocko @ 2017-01-18 16:17 UTC (permalink / raw)
To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Wed 18-01-17 15:54:30, Mel Gorman wrote:
> On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote:
> > On Wed 18-01-17 14:46:55, Mel Gorman wrote:
> > > I'm quite unhappy with this. Each move back increases the cache footprint
> > > because of the counters
> >
> > Why would per zone counters cause an increased cache footprint?
>
> Because there are multiple counters, each of which needs to be updated.

How does this differ from a per node counter, though? We would need to do
the accounting anyway. Moreover, none of the accounting is done in a hot
path.

> > > but it's not clear at all this patch actually helps anything.
> >
> > Yes, I cannot prove any real issue so far. The main motivation was
> > patch 2, which needs the per-zone accounting in the retry logic
> > (should_reclaim_retry). I spotted the too_many_isolated issues on the
> > way.
>
> You don't appear to directly use that information in patch 2.

It is used via zone_reclaimable_pages in should_reclaim_retry.

> The primary
> breakout is returning after stalling at least once. You could also avoid
> an infinite loop by using a waitqueue that sleeps on too many isolated.

That would be tricky on its own. Just consider the report from Tetsuo.
Basically all the direct reclaimers are looping on too_many_isolated
while kswapd is not making any progress because it is blocked on FS locks
which are held by flushers which are making dead slow progress. Some of
those direct reclaimers could have gone OOM instead and released some
memory if we decided so, which we cannot because we are deep down in the
reclaim path. Waiting for the reclaimers which increased the ISOLATED
counters wouldn't help in this situation.

> That would both avoid the clunky congestion_wait() and guarantee forward
> progress. If the primary motivation is to avoid an infinite loop with
> too_many_isolated then there are ways of handling that without
> reintroducing zone-based counters.
>
> > This is true, but I am not sure how it is related to the patch.
>
> Because heavy pressure that is enough to trigger too many isolated pages
> is unlikely to be specifically targeting a lower zone.

Why? Basically any GFP_KERNEL allocation will create lowmem pressure, and
going OOM on lowmem is not all that unrealistic a scenario on 32b
systems.

> There is general
> pressure with multiple direct reclaimers being applied. If the system is
> under enough pressure with parallel reclaimers to trigger
> too_many_isolated checks then the system is grinding already and making
> little progress. Adding multiple counters to allow a lowmem reclaimer to
> potentially make faster progress is going to be marginal at best.

OK, I agree that the situation where highmem blocks lowmem from making
progress is much less likely than the other situation described in the
changelog, where lowmem never gets throttled. That is the one I am more
interested in.

> > Also consider that lowmem throttling in too_many_isolated has only a
> > small chance to ever work with the node counters, because highmem >>
> > lowmem in many/most configurations.
>
> While true, it's also not that important.
>
> > I am not sure I understand this part. Say that we have highmem pressure
> > which isolates too many pages from the LRU.
>
> Which requires multiple direct reclaimers or tiny inactive lists. In the
> event there is such highmem pressure, it also means the lower zones are
> depleted.

But consider lowmem without highmem pressure, e.g. a heavy parallel fork
or any other GFP_KERNEL-intensive workload.

> > A lowmem request would
> > previously stall regardless of where those pages came from. With this
> > patch it would stall only when we have isolated too many pages from the
> > eligible zones.
>
> And when it makes progress, it's going to compete with the other direct
> reclaimers, except the lowmem reclaim is skipping some pages and
> recycling them through the LRU. It chews up CPU that would probably have
> been better spent letting kswapd and the other direct reclaimers do their
> work.

OK, I guess we are talking past each other. What I meant to say is that
it doesn't really make any difference who is chewing through the LRU to
find the last few lowmem pages to reclaim, so I do not see much of a
difference between sleeping and postponing that work to kswapd.

That being said, I _believe_ I will need per zone ISOLATED counters in
order to make the other patch work reliably and not declare OOM
prematurely. Maybe there is some other way around that (hence this RFC).
Would you be strongly opposed to a patch which makes the counters per
zone without touching too_many_isolated?

-- 
Michal Hocko
SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-01-18 16:17 ` Michal Hocko 0 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-18 16:17 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML On Wed 18-01-17 15:54:30, Mel Gorman wrote: > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > especially for systems with high/lowmem because a heavy memory pressure > > > > on the highmem zone might block lowmem requests from making progress. Or > > > > we might allow to reclaim lowmem zone even though there are too many > > > > pages already isolated from the eligible zones just because highmem > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > isolation counters are a bit tricky with the node reclaim because > > > > we have to track each page separatelly. > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache footprint > > > because of the counters > > > > Why would per zone counters cause an increased cache footprint? > > > > Because there are multiple counters, each of which need to be updated. How does this differ from per node counter though. We would need to do the accounting anyway. Moreover none of the accounting is done in a hot path. > > > but it's not clear at all this patch actually helps anything. > > > > Yes, I cannot prove any real issue so far. 
The main motivation was the > > patch 2 which needs per-zone accounting to use it in the retry logic > > (should_reclaim_retry). I've spotted too_many_isoalated issues on the > > way. > > > > You don't appear to directly use that information in patch 2. It is used via zone_reclaimable_pages in should_reclaim_retry > The primary > breakout is returning after stalling at least once. You could also avoid > an infinite loop by using a waitqueue that sleeps on too many isolated. That would be tricky on its own. Just consider the report form Tetsuo. Basically all the direct reclamers are looping on too_many_isolated while the kswapd is not making any progres because it is blocked on FS locks which are held by flushers which are making dead slow progress. Some of those direct reclaimers could have gone oom instead and release some memory if we decide so, which we cannot because we are deep down in the reclaim path. Waiting for on the reclaimer to increase the ISOLATED counter wouldn't help in this situation. > That would both avoid the clunky congestion_wait() and guarantee forward > progress. If the primary motivation is to avoid an infinite loop with > too_many_isolated then there are ways of handling that without reintroducing > zone-based counters. > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > we no longer are applying the fair zone allocation policy. The processes > > > with highmem requirements will be reclaiming from all zones and when it > > > finishes, it's possible that a lowmem-specific request will be clear to make > > > progress. It's all the same LRU so if there are too many pages isolated, > > > it makes sense to wait regardless of the allocation request. > > > > This is true but I am not sure how it is realated to the patch. > > Because heavy pressure that is enough to trigger too many isolated pages > is unlikely to be specifically targetting a lower zone. Why? 
Basically any GFP_KERNEL allocation will generate lowmem pressure, and going OOM on lowmem is not at all an unrealistic scenario on 32b systems. > There is general > pressure with multiple direct reclaimers being applied. If the system is > under enough pressure with parallel reclaimers to trigger too_many_isolated > checks then the system is grinding already and making little progress. Adding > multiple counters to allow a lowmem reclaimer to potentially make faster > progress is going to be marginal at best. OK, I agree that the situation where highmem blocks lowmem from making progress is much less likely than the other situation described in the changelog, where lowmem never gets throttled. That is the one I am more interested in. > > Also consider that lowmem throttling in too_many_isolated has only a small > > chance to ever work with the node counters because highmem >> lowmem in > > many/most configurations. > > > > While true, it's also not that important. > > > More importantly, this patch may make things worse and delay reclaim. If > > > this patch allowed a lowmem request to make progress that would have > > > previously stalled, it's going to spend time skipping pages in the LRU > > > instead of letting kswapd and the highmem pressured processes make progress. > > > > I am not sure I understand this part. Say that we have highmem pressure > > which would isolate too many pages from the LRU. > > Which requires multiple direct reclaimers or tiny inactive lists. In the > event there is such highmem pressure, it also means the lower zones are > depleted. But consider lowmem pressure without highmem pressure, e.g. a heavy parallel fork or any other GFP_KERNEL-intensive workload. > > lowmem request would > > stall previously regardless of where those pages came from. With this > > patch it would stall only when we isolated too many pages from the > > eligible zones.
> > And when it makes progress, it's going to compete with the other direct > reclaimers except the lowmem reclaim is skipping some pages and > recycling them through the LRU. It chews up CPU that would probably have > been better spent letting kswapd and the other direct reclaimers do > their work. OK, I guess we are talking past each other. What I meant to say is that it doesn't really make any difference who is chewing through the LRU to find the last few lowmem pages to reclaim. So I do not see much of a difference between sleeping and postponing that to the kswapd. That being said, I _believe_ I will need per-zone ISOLATED counters in order to make the other patch work reliably and not declare OOM prematurely. Maybe there is some other way around that (hence this RFC). Would you be strongly opposed to a patch which would make the counters per zone without touching too_many_isolated? -- Michal Hocko SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-18 16:17 ` Michal Hocko @ 2017-01-18 17:00 ` Mel Gorman -1 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-18 17:00 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote: > On Wed 18-01-17 15:54:30, Mel Gorman wrote: > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > > especially for systems with high/lowmem because a heavy memory pressure > > > > > on the highmem zone might block lowmem requests from making progress. Or > > > > > we might allow to reclaim lowmem zone even though there are too many > > > > > pages already isolated from the eligible zones just because highmem > > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > > isolation counters are a bit tricky with the node reclaim because > > > > > we have to track each page separately. > > > > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache footprint > > > > because of the counters > > > > > > Why would per zone counters cause an increased cache footprint? > > > > > > > Because there are multiple counters, each of which needs to be updated. > > How does this differ from a per-node counter, though? A per-node counter is 2 * nr_online_nodes; a per-zone counter is 2 * nr_populated_zones. > We would need to do > the accounting anyway.
Moreover, none of the accounting is done in a hot > path. > > > > > but it's not clear at all this patch actually helps anything. > > > > Yes, I cannot prove any real issue so far. The main motivation was the > > patch 2 which needs per-zone accounting to use it in the retry logic > > (should_reclaim_retry). I've spotted too_many_isolated issues on the > > way. > > > > > You don't appear to directly use that information in patch 2. > > It is used via zone_reclaimable_pages in should_reclaim_retry > Which is still not directly required to avoid the infinite loop. There even is a small inherent risk if the too_many_isolated condition no longer applies at the time should_reclaim_retry is attempted. > > The primary > > breakout is returning after stalling at least once. You could also avoid > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > That would be tricky on its own. Just consider the report from Tetsuo. > Basically all the direct reclaimers are looping on too_many_isolated > while the kswapd is not making any progress because it is blocked on FS > locks which are held by flushers which are making dead slow progress. > Some of those direct reclaimers could have gone oom instead and released > some memory if we decided so, which we cannot because we are deep down in > the reclaim path. Waiting on the reclaimers to decrease the ISOLATED > counter wouldn't help in this situation. > If it's a waitqueue waking one process at a time, the progress may be slow but it'll still exit the loop, attempt the reclaim and then potentially OOM if no progress is made. The key is using the waitqueue to have a fair queue of processes making progress instead of a potentially infinite loop that never meets the exit conditions. > > That would both avoid the clunky congestion_wait() and guarantee forward > > progress.
If the primary motivation is to avoid an infinite loop with > > too_many_isolated then there are ways of handling that without reintroducing > > zone-based counters. > > > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > > we no longer are applying the fair zone allocation policy. The processes > > > > with highmem requirements will be reclaiming from all zones and when it > > > > finishes, it's possible that a lowmem-specific request will be clear to make > > > > progress. It's all the same LRU so if there are too many pages isolated, > > > > it makes sense to wait regardless of the allocation request. > > > > > > This is true but I am not sure how it is related to the patch. > > > > Because heavy pressure that is enough to trigger too many isolated pages > > is unlikely to be specifically targeting a lower zone. > > Why? Basically any GFP_KERNEL allocation will generate lowmem pressure and > going OOM on lowmem is not at all an unrealistic scenario on 32b systems. > If the sole source of pressure is from GFP_KERNEL allocations then the isolated counter will also be specific to the lower zones and there is no benefit from the patch. If there is a combination of highmem and lowmem pressure then the highmem reclaimers will also reclaim lowmem memory. > > There is general > > pressure with multiple direct reclaimers being applied. If the system is > > under enough pressure with parallel reclaimers to trigger too_many_isolated > > checks then the system is grinding already and making little progress. Adding > > multiple counters to allow a lowmem reclaimer to potentially make faster > > progress is going to be marginal at best. > > OK, I agree that the situation where highmem blocks lowmem from making > progress is much less likely than the other situation described in the > changelog, where lowmem never gets throttled. That is the one I am > more interested in.
> That is of some concern but could be handled by having too_many_isolated take into account whether it's a zone-restricted allocation and, if so, decrement the LRU counts from the higher zones. Counters already exist there. It would not be as strict but it should be sufficient. > > > Also consider that lowmem throttling in too_many_isolated has only a small > > > chance to ever work with the node counters because highmem >> lowmem in > > > many/most configurations. > > > > > > > While true, it's also not that important. > > > > > > More importantly, this patch may make things worse and delay reclaim. If > > > > this patch allowed a lowmem request to make progress that would have > > > > previously stalled, it's going to spend time skipping pages in the LRU > > > > instead of letting kswapd and the highmem pressured processes make progress. > > > > > > I am not sure I understand this part. Say that we have highmem pressure > > > which would isolate too many pages from the LRU. > > > > Which requires multiple direct reclaimers or tiny inactive lists. In the > > event there is such highmem pressure, it also means the lower zones are > > depleted. > > But consider lowmem pressure without highmem pressure, e.g. a heavy parallel > fork or any other GFP_KERNEL-intensive workload. > > Lowmem without highmem pressure means all isolated pages are in the lowmem zones and the per-zone counters are unnecessary. > > > lowmem request would > > > stall previously regardless of where those pages came from. With this > > > patch it would stall only when we isolated too many pages from the > > > eligible zones. > > > > And when it makes progress, it's going to compete with the other direct > > reclaimers except the lowmem reclaim is skipping some pages and > > recycling them through the LRU. It chews up CPU that would probably have > > been better spent letting kswapd and the other direct reclaimers do > > their work. > > OK, I guess we are talking past each other.
What I meant to say is that > it doesn't really make any difference who is chewing through the LRU to > find the last few lowmem pages to reclaim. So I do not see much of a > difference between sleeping and postponing that to the kswapd. > > That being said, I _believe_ I will need per-zone ISOLATED counters in > order to make the other patch work reliably and not declare OOM > prematurely. Maybe there is some other way around that (hence this RFC). > Would you be strongly opposed to a patch which would make the counters per > zone without touching too_many_isolated? I'm resistant to the per-zone counters in general, but it's unfortunate to add them just to avoid a potentially infinite loop from isolated pages. -- Mel Gorman SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-18 17:00 ` Mel Gorman @ 2017-01-18 17:29 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-18 17:29 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML On Wed 18-01-17 17:00:10, Mel Gorman wrote: > On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote: > > On Wed 18-01-17 15:54:30, Mel Gorman wrote: > > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > > > especially for systems with high/lowmem because a heavy memory pressure > > > > > > on the highmem zone might block lowmem requests from making progress. Or > > > > > > we might allow to reclaim lowmem zone even though there are too many > > > > > > pages already isolated from the eligible zones just because highmem > > > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > > > isolation counters are a bit tricky with the node reclaim because > > > > > > we have to track each page separately. > > > > > > > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache footprint > > > > > because of the counters > > > > > > > > Why would per zone counters cause an increased cache footprint? > > > > > > > > > > Because there are multiple counters, each of which needs to be updated. > > > > How does this differ from a per-node counter, though?
> > A per-node counter is 2 * nr_online_nodes; > a per-zone counter is 2 * nr_populated_zones. > > > We would need to do > > the accounting anyway. Moreover, none of the accounting is done in a hot > > path. > > > > > > > but it's not clear at all this patch actually helps anything. > > > > > > Yes, I cannot prove any real issue so far. The main motivation was the > > > > patch 2 which needs per-zone accounting to use it in the retry logic > > > > (should_reclaim_retry). I've spotted too_many_isolated issues on the > > > > way. > > > > > > > > > > You don't appear to directly use that information in patch 2. > > > > It is used via zone_reclaimable_pages in should_reclaim_retry > > > > Which is still not directly required to avoid the infinite loop. There > even is a small inherent risk if the too_many_isolated condition no longer > applies at the time should_reclaim_retry is attempted. Not really, because if those pages are no longer isolated then they have either been reclaimed - and NR_FREE_PAGES will increase - or they have been put back on the LRU, in which case we will see them in the regular LRU counters. I need to catch the case where there are still too many pages isolated, which would skew the should_reclaim_retry watermark check. > > > The primary > > > breakout is returning after stalling at least once. You could also avoid > > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > > > That would be tricky on its own. Just consider the report from Tetsuo. > > Basically all the direct reclaimers are looping on too_many_isolated > > while the kswapd is not making any progress because it is blocked on FS > > locks which are held by flushers which are making dead slow progress. > > Some of those direct reclaimers could have gone oom instead and released > > some memory if we decided so, which we cannot because we are deep down in > > the reclaim path. Waiting on the reclaimers to decrease the ISOLATED > > counter wouldn't help in this situation.
> > > > If it's a waitqueue waking one process at a time, the progress may be > slow but it'll still exit the loop, attempt the reclaim and then > potentially OOM if no progress is made. The key is using the waitqueue > to have a fair queue of processes making progress instead of a > potentially infinite loop that never meets the exit conditions. It is not clear to me who would wake the waiters on the queue. You cannot rely on kswapd to do that, as already mentioned. > > > That would both avoid the clunky congestion_wait() and guarantee forward > > > progress. If the primary motivation is to avoid an infinite loop with > > > too_many_isolated then there are ways of handling that without reintroducing > > > zone-based counters. > > > > > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > > > we no longer are applying the fair zone allocation policy. The processes > > > > > with highmem requirements will be reclaiming from all zones and when it > > > > > finishes, it's possible that a lowmem-specific request will be clear to make > > > > > progress. It's all the same LRU so if there are too many pages isolated, > > > > > it makes sense to wait regardless of the allocation request. > > > > > > > > This is true but I am not sure how it is related to the patch. > > > > > > Because heavy pressure that is enough to trigger too many isolated pages > > > is unlikely to be specifically targeting a lower zone. > > > > Why? Basically any GFP_KERNEL allocation will generate lowmem pressure and > > going OOM on lowmem is not at all an unrealistic scenario on 32b systems. > > > > If the sole source of pressure is from GFP_KERNEL allocations then the > isolated counter will also be specific to the lower zones and there is no > benefit from the patch. I believe you are wrong here. Just consider that you have isolated basically all lowmem pages.
too_many_isolated will still happily tell you not to throttle or back off because the NR_INACTIVE_* counters are way bigger than all the lowmem pages altogether. Or am I still missing your point? > If there is a combination of highmem and lowmem pressure then the highmem > reclaimers will also reclaim lowmem memory. > > > > There is general > > > pressure with multiple direct reclaimers being applied. If the system is > > > under enough pressure with parallel reclaimers to trigger too_many_isolated > > > checks then the system is grinding already and making little progress. Adding > > > multiple counters to allow a lowmem reclaimer to potentially make faster > > > progress is going to be marginal at best. > > > > OK, I agree that the situation where highmem blocks lowmem from making > > progress is much less likely than the other situation described in the > > changelog, where lowmem never gets throttled. That is the one I am > > more interested in. > > That is of some concern but could be handled by having too_many_isolated > take into account whether it's a zone-restricted allocation and, if so, > decrement the LRU counts from the higher zones. Counters already exist > there. It would not be as strict but it should be sufficient. Well, this is what this patch tries to do. Which other counters can I use to consider only eligible zones when evaluating the number of isolated pages? > > > > Also consider that lowmem throttling in too_many_isolated has only a small > > > > chance to ever work with the node counters because highmem >> lowmem in > > > > many/most configurations. > > > > > > > > > > While true, it's also not that important. > > > > > > > > More importantly, this patch may make things worse and delay reclaim. If > > > > > this patch allowed a lowmem request to make progress that would have > > > > > previously stalled, it's going to spend time skipping pages in the LRU > > > > > instead of letting kswapd and the highmem pressured processes make progress.
> > > > > > > > I am not sure I understand this part. Say that we have highmem pressure > > > > which would isolate too many pages from the LRU. > > > > > > Which requires multiple direct reclaimers or tiny inactive lists. In the > > > event there is such highmem pressure, it also means the lower zones are > > > depleted. > > > > But consider lowmem pressure without highmem pressure, e.g. a heavy parallel > > fork or any other GFP_KERNEL-intensive workload. > > > > Lowmem without highmem pressure means all isolated pages are in the lowmem > zones and the per-zone counters are unnecessary. But most configurations will have highmem and lowmem zones in the same node... > > > > lowmem request would > > > > stall previously regardless of where those pages came from. With this > > > > patch it would stall only when we isolated too many pages from the > > > > eligible zones. > > > > > > And when it makes progress, it's going to compete with the other direct > > > reclaimers except the lowmem reclaim is skipping some pages and > > > recycling them through the LRU. It chews up CPU that would probably have > > > been better spent letting kswapd and the other direct reclaimers do > > > their work. > > > > OK, I guess we are talking past each other. What I meant to say is that > > it doesn't really make any difference who is chewing through the LRU to > > find the last few lowmem pages to reclaim. So I do not see much of a > > difference between sleeping and postponing that to the kswapd. > > > > That being said, I _believe_ I will need per-zone ISOLATED counters in > > order to make the other patch work reliably and not declare OOM > > prematurely. Maybe there is some other way around that (hence this RFC). > > Would you be strongly opposed to a patch which would make the counters per > > zone without touching too_many_isolated? > > I'm resistant to the per-zone counters in general, but it's unfortunate to > add them just to avoid a potentially infinite loop from isolated pages.
I am really open to any alternative solutions, of course. This is the best I could come up with. I will keep thinking, but removing too_many_isolated without considering isolated pages during the OOM detection is just too risky. We can have too many pages isolated to simply ignore them. -- Michal Hocko SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-01-18 17:29 ` Michal Hocko 0 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-18 17:29 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML On Wed 18-01-17 17:00:10, Mel Gorman wrote: > On Wed, Jan 18, 2017 at 05:17:31PM +0100, Michal Hocko wrote: > > On Wed 18-01-17 15:54:30, Mel Gorman wrote: > > > On Wed, Jan 18, 2017 at 04:15:31PM +0100, Michal Hocko wrote: > > > > On Wed 18-01-17 14:46:55, Mel Gorman wrote: > > > > > On Wed, Jan 18, 2017 at 02:44:52PM +0100, Michal Hocko wrote: > > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > > > 599d0c954f91 ("mm, vmscan: move LRU lists to node") has moved > > > > > > NR_ISOLATED* counters from zones to nodes. This is not the best fit > > > > > > especially for systems with high/lowmem because a heavy memory pressure > > > > > > on the highmem zone might block lowmem requests from making progress. Or > > > > > > we might allow to reclaim lowmem zone even though there are too many > > > > > > pages already isolated from the eligible zones just because highmem > > > > > > pages will easily bias too_many_isolated to say no. > > > > > > > > > > > > Fix these potential issues by moving isolated stats back to zones and > > > > > > teach too_many_isolated to consider only eligible zones. Per zone > > > > > > isolation counters are a bit tricky with the node reclaim because > > > > > > we have to track each page separatelly. > > > > > > > > > > > > > > > > I'm quite unhappy with this. Each move back increases the cache footprint > > > > > because of the counters > > > > > > > > Why would per zone counters cause an increased cache footprint? > > > > > > > > > > Because there are multiple counters, each of which need to be updated. > > > > How does this differ from per node counter though. 
> > A per-node counter is 2 * nr_online_nodes > A per-zone counter is 2 * nr_populated_zones > > > We would need to do > > the accounting anyway. Moreover none of the accounting is done in a hot > > path. > > > > > > but it's not clear at all this patch actually helps anything. > > > > > > > > Yes, I cannot prove any real issue so far. The main motivation was the > > > > patch 2 which needs per-zone accounting to use it in the retry logic > > > > (should_reclaim_retry). I've spotted too_many_isolated issues on the > > > > way. > > > > > > > > > > You don't appear to directly use that information in patch 2. > > > > It is used via zone_reclaimable_pages in should_reclaim_retry > > > > Which is still not directly required to avoid the infinite loop. There > even is a small inherent risk if the too_isolated_condition no longer > applies at the time should_reclaim_retry is attempted. Not really, because if those pages are no longer isolated then they have either been reclaimed - and NR_FREE_PAGES will increase - or they have been put back on the LRU, in which case we will see them in the regular LRU counters. I need to catch the case where there are still too many pages isolated which would skew the should_reclaim_retry watermark check. > > > The primary > > > breakout is returning after stalling at least once. You could also avoid > > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > > > That would be tricky on its own. Just consider the report from Tetsuo. > > Basically all the direct reclaimers are looping on too_many_isolated > > while the kswapd is not making any progress because it is blocked on FS > > locks which are held by flushers which are making dead slow progress. > > Some of those direct reclaimers could have gone OOM instead and released > > some memory if we decided so, which we cannot because we are deep down in > > the reclaim path. Waiting for the reclaimer to increase the ISOLATED > > counter wouldn't help in this situation.
> > > > If it's a waitqueue waking one process at a time, the progress may be > slow but it'll still exit the loop, attempt the reclaim and then > potentially OOM if no progress is made. The key is using the waitqueue > to have a fair queue of processes making progress instead of a > potentially infinite loop that never meets the exit conditions. It is not clear to me who would wake waiters on the queue. You cannot rely on kswapd to do that as already mentioned. > > > That would both avoid the clunky congestion_wait() and guarantee forward > > > progress. If the primary motivation is to avoid an infinite loop with > > > too_many_isolated then there are ways of handling that without reintroducing > > > zone-based counters. > > > > > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > > > we no longer are applying the fair zone allocation policy. The processes > > > > > with highmem requirements will be reclaiming from all zones and when it > > > > > finishes, it's possible that a lowmem-specific request will be clear to make > > > > > progress. It's all the same LRU so if there are too many pages isolated, > > > > > it makes sense to wait regardless of the allocation request. > > > > > > > > This is true but I am not sure how it is related to the patch. > > > > > > Because heavy pressure that is enough to trigger too many isolated pages > > > is unlikely to be specifically targeting a lower zone. > > > > Why? Basically any GFP_KERNEL allocation will make lowmem pressure and > > going OOM on lowmem is not all that unrealistic a scenario on 32b systems. > > > > If the sole source of pressure is from GFP_KERNEL allocations then the > isolated counter will also be specific to the lower zones and there is no > benefit from the patch. I believe you are wrong here. Just consider that you have isolated basically all lowmem pages.
too_many_isolated will still happily tell you to not throttle or back off because NR_INACTIVE_* are way bigger than all lowmem pages altogether. Or am I still missing your point? > If there is a combination of highmem and lowmem pressure then the highmem > reclaimers will also reclaim lowmem memory. > > > > There is general > > > pressure with multiple direct reclaimers being applied. If the system is > > > under enough pressure with parallel reclaimers to trigger too_many_isolated > > > checks then the system is grinding already and making little progress. Adding > > > multiple counters to allow a lowmem reclaimer to potentially make faster > > > progress is going to be marginal at best. > > > > OK, I agree that the situation where highmem blocks lowmem from making > > progress is much less likely than the other situation described in the > > changelog when lowmem doesn't get throttled ever. Which is the one I am > > more interested in. > > > > That is of some concern but could be handled by having too_many_isolated > take into account if it's a zone-restricted allocation and if so, then > decrement the LRU counts from the higher zones. Counters already exist > there. It would not be as strict but it should be sufficient. Well, this is what this patch tries to do. Which other counters can I use to consider only eligible zones when evaluating the number of isolated pages? > > > > Also consider that lowmem throttling in too_many_isolated has only small > > > > chance to ever work with the node counters because highmem >> lowmem in > > > > many/most configurations. > > > > > > > > While true, it's also not that important. > > > > > > More importantly, this patch may make things worse and delay reclaim. If > > > > > this patch allowed a lowmem request to make progress that would have > > > > > previously stalled, it's going to spend time skipping pages in the LRU > > > > > instead of letting kswapd and the highmem pressured processes make progress.
> > > > I am not sure I understand this part. Say that we have highmem pressure > > > > which would isolate too many pages from the LRU. > > > Which requires multiple direct reclaimers or tiny inactive lists. In the > > > event there is such highmem pressure, it also means the lower zones are > > > depleted. > > But consider a lowmem without highmem pressure. E.g. a heavy parallel > > fork or any other GFP_KERNEL intensive workload. > > > > Lowmem without highmem pressure means all isolated pages are in the lowmem > > nodes and the per-zone counters are unnecessary. But most configurations will have highmem and lowmem zones in the same node... > > > > lowmem request would > > > > stall previously regardless of where those pages came from. With this > > > > patch it would stall only when we isolated too many pages from the > > > > eligible zones. > > > And when it makes progress, it's going to compete with the other direct > > > reclaimers except the lowmem reclaim is skipping some pages and > > > recycling them through the LRU. It chews up CPU that would probably have > > > been better spent letting kswapd and the other direct reclaimers do > > > their work. > > OK, I guess we are talking past each other. What I meant to say is that > > it doesn't really make any difference who is chewing through the LRU to > > find the last few lowmem pages to reclaim. So I do not see much of a > > difference sleeping and postponing that to the kswapd. > > > > That being said, I _believe_ I will need per zone ISOLATED counters in > > order to make the other patch work reliably and not declare OOM > > prematurely. Maybe there is some other way around that (hence this RFC). > > Would you be strongly opposed to the patch which would make counters per > > zone without touching too_many_isolated? > > I'm resistant to the per-zone counters in general but it's unfortunate to > add them just to avoid a potentially infinite loop from isolated pages.
I am really open to any alternative solutions, of course. This is the best I could come up with. I will keep thinking but removing too_many_isolated without considering isolated pages during the oom detection is just too risky. We can isolate many pages to ignore them. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-18 17:29 ` Michal Hocko @ 2017-01-19 10:07 ` Mel Gorman -1 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-19 10:07 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML On Wed, Jan 18, 2017 at 06:29:46PM +0100, Michal Hocko wrote: > On Wed 18-01-17 17:00:10, Mel Gorman wrote: > > > > You don't appear to directly use that information in patch 2. > > > > > > It is used via zone_reclaimable_pages in should_reclaim_retry > > > > > > > Which is still not directly required to avoid the infinite loop. There > > even is a small inherent risk if the too_isolated_condition no longer > > applies at the time should_reclaim_retry is attempted. > > Not really, because if those pages are no longer isolated then they > have either been reclaimed - and NR_FREE_PAGES will increase - or they > have been put back on the LRU, in which case we will see them in the regular LRU > counters. I need to catch the case where there are still too many pages > isolated which would skew the should_reclaim_retry watermark check. > We can also rely on the no_progress_loops counter to trigger OOM. It'll take longer but has a lower risk of premature OOM. > > > > The primary > > > > breakout is returning after stalling at least once. You could also avoid > > > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > > > > > That would be tricky on its own. Just consider the report from Tetsuo. > > > Basically all the direct reclaimers are looping on too_many_isolated > > > while the kswapd is not making any progress because it is blocked on FS > > > locks which are held by flushers which are making dead slow progress. > > > Some of those direct reclaimers could have gone OOM instead and released > > > some memory if we decided so, which we cannot because we are deep down in > > > the reclaim path.
Waiting for the reclaimer to increase the ISOLATED > > > counter wouldn't help in this situation. > > > > > > If it's a waitqueue waking one process at a time, the progress may be > > slow but it'll still exit the loop, attempt the reclaim and then > > potentially OOM if no progress is made. The key is using the waitqueue > > to have a fair queue of processes making progress instead of a > > potentially infinite loop that never meets the exit conditions. > > It is not clear to me who would wake waiters on the queue. You cannot > rely on kswapd to do that as already mentioned. > We can use timeouts to guard against an infinite wait. Besides, updating every single place where pages are put back on the LRU would be fragile and too easy to break. > > > > That would both avoid the clunky congestion_wait() and guarantee forward > > > > progress. If the primary motivation is to avoid an infinite loop with > > > > too_many_isolated then there are ways of handling that without reintroducing > > > > zone-based counters. > > > > > > > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > > > > we no longer are applying the fair zone allocation policy. The processes > > > > > > with highmem requirements will be reclaiming from all zones and when it > > > > > > finishes, it's possible that a lowmem-specific request will be clear to make > > > > > > progress. It's all the same LRU so if there are too many pages isolated, > > > > > > it makes sense to wait regardless of the allocation request. > > > > > > > > > > This is true but I am not sure how it is related to the patch. > > > > > > > > Because heavy pressure that is enough to trigger too many isolated pages > > > > is unlikely to be specifically targeting a lower zone. > > > > > > Why? Basically any GFP_KERNEL allocation will make lowmem pressure and > > > going OOM on lowmem is not all that unrealistic a scenario on 32b systems.
> > > > > > > If the sole source of pressure is from GFP_KERNEL allocations then the > > isolated counter will also be specific to the lower zones and there is no > > benefit from the patch. > > I believe you are wrong here. Just consider that you have isolated > basically all lowmem pages. too_many_isolated will still happily tell > you to not throttle or back off because NR_INACTIVE_* are way bigger > than all lowmem pages altogether. Or am I still missing your point? > This is a potential risk. It could be accounted for by including the node isolated counters in the calculation but it'll be inherently fuzzy and may stall a lowmem direct reclaimer unnecessarily in the presence of highmem reclaim. > > If there is a combination of highmem and lowmem pressure then the highmem > > reclaimers will also reclaim lowmem memory. > > > > > > There is general > > > > pressure with multiple direct reclaimers being applied. If the system is > > > > under enough pressure with parallel reclaimers to trigger too_many_isolated > > > > checks then the system is grinding already and making little progress. Adding > > > > multiple counters to allow a lowmem reclaimer to potentially make faster > > > > progress is going to be marginal at best. > > > > > > OK, I agree that the situation where highmem blocks lowmem from making > > > progress is much less likely than the other situation described in the > > > changelog when lowmem doesn't get throttled ever. Which is the one I am > > > more interested in. > > > > > > > That is of some concern but could be handled by having too_many_isolated > > take into account if it's a zone-restricted allocation and if so, then > > decrement the LRU counts from the higher zones. Counters already exist > > there. It would not be as strict but it should be sufficient. > > Well, this is what this patch tries to do. Which other counters can I > use to consider only eligible zones when evaluating the number of > isolated pages?
> The LRU anon/file counters. It'll reduce the number of eligible pages for reclaim. > > > > > Also consider that lowmem throttling in too_many_isolated has only small > > > > > chance to ever work with the node counters because highmem >> lowmem in > > > > > many/most configurations. > > > > > > > > > > > > > While true, it's also not that important. > > > > > > > > > > More importantly, this patch may make things worse and delay reclaim. If > > > > > > this patch allowed a lowmem request to make progress that would have > > > > > > previously stalled, it's going to spend time skipping pages in the LRU > > > > > > instead of letting kswapd and the highmem pressured processes make progress. > > > > > > > > > > I am not sure I understand this part. Say that we have highmem pressure > > > > > which would isolate too many pages from the LRU. > > > > > > > > Which requires multiple direct reclaimers or tiny inactive lists. In the > > > > event there is such highmem pressure, it also means the lower zones are > > > > depleted. > > > > > > But consider a lowmem without highmem pressure. E.g. a heavy parallel > > > fork or any other GFP_KERNEL intensive workload. > > > > > > > Lowmem without highmem pressure means all isolated pages are in the lowmem > > nodes and the per-zone counters are unnecessary. > > But most configurations will have highmem and lowmem zones in the same > node... True but if it's only lowmem pressure it doesn't matter. > > > > OK, I guess we are talking past each other. What I meant to say is that > > > it doesn't really make any difference who is chewing through the LRU to > > > find the last few lowmem pages to reclaim. So I do not see much of a > > > difference sleeping and postponing that to the kswapd. > > > > > > That being said, I _believe_ I will need per zone ISOLATED counters in > > > order to make the other patch work reliably and not declare OOM > > > prematurely. Maybe there is some other way around that (hence this RFC).
> > > Would you be strongly opposed to the patch which would make counters per > > > zone without touching too_many_isolated? > > > > I'm resistant to the per-zone counters in general but it's unfortunate to > > add them just to avoid a potentially infinite loop from isolated pages. > > I am really open to any alternative solutions, of course. This is > the best I could come up with. I will keep thinking but removing > too_many_isolated without considering isolated pages during the oom > detection is just too risky. We can isolate many pages to ignore them.

If it's definitely required and is proven to fix the infinite-loop-without-oom workload then I'll back off and withdraw my objections. However, I'd at least like the following untested patch to be considered as an alternative. It has some weaknesses and would be slower to OOM than your patch but it avoids reintroducing zone counters.

---8<---
mm, vmscan: Wait on a waitqueue when too many pages are isolated

When too many pages are isolated, direct reclaim waits on congestion to clear for up to a tenth of a second. There is no reason to believe that too many pages are isolated due to dirty pages, reclaim efficiency or congestion. It may simply be because an extremely large number of processes have entered direct reclaim at the same time. However, it is possible for the situation to persist forever and never reach OOM.

This patch queues processes on a waitqueue when too many pages are isolated. When parallel reclaimers finish shrink_page_list, they wake the waiters to recheck whether too many pages are isolated. The wait on the queue has a timeout as not all sites that isolate pages will do the wakeup. Depending on every site that isolates LRU pages being perfect forever is potentially fragile. The specific wakeups occur for page reclaim and compaction. If too many pages are isolated due to memory failure, hotplug or directly calling migration from a syscall then the waiting processes may wait the full timeout.

Note that the timeout allows the use of waitqueue_active() on the basis that a race will cause the full timeout to be reached due to a missed wakeup. This is relatively harmless and still a massive improvement over unconditionally calling congestion_wait.

Direct reclaimers that cannot isolate pages within the timeout will return to the caller. This is somewhat clunky as it won't return immediately and may go through the other priorities and slab shrinking. Eventually, it'll go through a few iterations of should_reclaim_retry and reach the MAX_RECLAIM_RETRIES limit and consider going OOM.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..3dd617d0c8c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,7 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t isolated_wait;
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..1b1ff6da7401 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
 	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
 
+	/* Page reclaim could have stalled due to isolated pages */
+	if (waitqueue_active(&zone->zone_pgdat->isolated_wait))
+		wake_up(&zone->zone_pgdat->isolated_wait);
+
 	trace_mm_compaction_end(start_pfn, cc->migrate_pfn, cc->free_pfn,
 				end_pfn, sync, ret);
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ff25883c172..d848c9f31bff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->isolated_wait);
 #ifdef CONFIG_COMPACTION
 	init_waitqueue_head(&pgdat->kcompactd_wait);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2281ad310d06..c93f299fbad7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static bool safe_to_isolate(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
 
 	if (current_is_kswapd())
-		return 0;
+		return true;
 
 	if (!sane_reclaim(sc))
-		return 0;
+		return true;
 
 	if (file) {
 		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
@@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
-	return isolated > inactive;
+	return isolated < inactive;
 }
 
 static noinline_for_stack void
@@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	while (!safe_to_isolate(pgdat, file, sc)) {
+		long ret;
+
+		ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
+			safe_to_isolate(pgdat, file, sc), HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+		if (fatal_signal_pending(current)) {
+			nr_reclaimed = SWAP_CLUSTER_MAX;
+			goto out;
+		}
+
+		/*
+		 * If we reached the timeout, this is direct reclaim, and
+		 * pages cannot be isolated then return. If the situation
+		 * persists for a long time then it'll eventually reach
+		 * the no_progress limit in should_reclaim_retry and consider
+		 * going OOM. In this case, do not wake the isolated_wait
+		 * queue as the wakee will still not be able to make progress.
+		 */
+		if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc))
+			return 0;
 	}
 
 	lru_add_drain();
@@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			stat.nr_activate, stat.nr_ref_keep, stat.nr_unmap_fail,
 			sc->priority, file);
+
+out:
+	if (waitqueue_active(&pgdat->isolated_wait))
+		wake_up(&pgdat->isolated_wait);
 	return nr_reclaimed;
 }

-- Mel Gorman SUSE Labs ^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-01-19 10:07 ` Mel Gorman 0 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-19 10:07 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML On Wed, Jan 18, 2017 at 06:29:46PM +0100, Michal Hocko wrote: > On Wed 18-01-17 17:00:10, Mel Gorman wrote: > > > > You don't appear to directly use that information in patch 2. > > > > > > It is used via zone_reclaimable_pages in should_reclaim_retry > > > > > > > Which is still not directly required to avoid the infinite loop. There > > even is a small inherent risk if the too_isolated_condition no longer > > applies at the time should_reclaim_retry is attempted. > > Not really because, if those pages are no longer isolated then they > either have been reclaimed - and NR_FREE_PAGES will increase - or they > have been put back to LRU in which case we will see them in regular LRU > counters. I need to catch the case where there are still too many pages > isolated which would skew should_reclaim_retry watermark check. > We can also rely on the no_progress_loops counter to trigger OOM. It'll take longer but has a lower risk of premature OOM. > > > > The primary > > > > breakout is returning after stalling at least once. You could also avoid > > > > an infinite loop by using a waitqueue that sleeps on too many isolated. > > > > > > That would be tricky on its own. Just consider the report form Tetsuo. > > > Basically all the direct reclamers are looping on too_many_isolated > > > while the kswapd is not making any progres because it is blocked on FS > > > locks which are held by flushers which are making dead slow progress. > > > Some of those direct reclaimers could have gone oom instead and release > > > some memory if we decide so, which we cannot because we are deep down in > > > the reclaim path. 
Waiting for on the reclaimer to increase the ISOLATED > > > counter wouldn't help in this situation. > > > > > > > If it's a waitqueue waking one process at a time, the progress may be > > slow but it'll still exit the loop, attempt the reclaim and then > > potentially OOM if no progress is made. The key is using the waitqueue > > to have a fair queue of processes making progress instead of a > > potentially infinite loop that never meets the exit conditions. > > It is not clear to me who would wake waiters on the queue. You cannot > rely on kswapd to do that as already mentioned. > We can use timeouts to guard against an infinite wait. Besides, updating every single place where pages are put back on the LRU would be fragile and too easy to break. > > > > That would both avoid the clunky congestion_wait() and guarantee forward > > > > progress. If the primary motivation is to avoid an infinite loop with > > > > too_many_isolated then there are ways of handling that without reintroducing > > > > zone-based counters. > > > > > > > > > > Heavy memory pressure on highmem should be spread across the whole node as > > > > > > we no longer are applying the fair zone allocation policy. The processes > > > > > > with highmem requirements will be reclaiming from all zones and when it > > > > > > finishes, it's possible that a lowmem-specific request will be clear to make > > > > > > progress. It's all the same LRU so if there are too many pages isolated, > > > > > > it makes sense to wait regardless of the allocation request. > > > > > > > > > > This is true but I am not sure how it is realated to the patch. > > > > > > > > Because heavy pressure that is enough to trigger too many isolated pages > > > > is unlikely to be specifically targetting a lower zone. > > > > > > Why? Basically any GFP_KERNEL allocation will make lowmem pressure and > > > going OOM on lowmem is not all that unrealistic scenario on 32b systems. 
> > > > > > > If the sole source of pressure is from GFP_KERNEL allocations then the > > isolated counter will also be specific to the lower zones and there is no > > benefit from the patch. > > I believe you are wrong here. Just consider that you have isolated > basically all lowmem pages. too_many_isolated will still happily tell > you to not throttle or back off because NR_INACTIVE_* are way too bigger > than all low mem pages altogether. Or am I still missing your point? > This is a potential risk. It could be accounted for by including the node isolated counters in the calculation but it'll be inherently fuzzy and may stall a lowmem direct reclaimer unnecessarily in the presence of highmem reclaim. > > If there is a combination of highmem and lowmem pressure then the highmem > > reclaimers will also reclaim lowmem memory. > > > > > > There is general > > > > pressure with multiple direct reclaimers being applied. If the system is > > > > under enough pressure with parallel reclaimers to trigger too_many_isolated > > > > checks then the system is grinding already and making little progress. Adding > > > > multiple counters to allow a lowmem reclaimer to potentially make faster > > > > progress is going to be marginal at best. > > > > > > OK, I agree that the situation where highmem blocks lowmem from making > > > progress is much less likely than the other situation described in the > > > changelog when lowmem doesn't get throttled ever. Which is the one I am > > > interested more about. > > > > > > > That is of some concern but could be handled by having too_may_isolated > > take into account if it's a zone-restricted allocation and if so, then > > decrement the LRU counts from the higher zones. Counters already exist > > there. It would not be as strict but it should be sufficient. > > Well, this is what this patch tries to do. Which other counters I can > use to consider only eligible zones when evaluating the number of > isolated pages? 
> The LRU anon/file counters. It'll reduce the number of eligible pages for reclaim. > > > > > Also consider that lowmem throttling in too_many_isolated has only small > > > > > chance to ever work with the node counters because highmem >> lowmem in > > > > > many/most configurations. > > > > > > > > > > > > > While true, it's also not that important. > > > > > > > > > > More importantly, this patch may make things worse and delay reclaim. If > > > > > > this patch allowed a lowmem request to make progress that would have > > > > > > previously stalled, it's going to spend time skipping pages in the LRU > > > > > > instead of letting kswapd and the highmem pressured processes make progress. > > > > > > > > > > I am not sure I understand this part. Say that we have highmem pressure > > > > > which would isolated too many pages from the LRU. > > > > > > > > Which requires multiple direct reclaimers or tiny inactive lists. In the > > > > event there is such highmem pressure, it also means the lower zones are > > > > depleted. > > > > > > But consider a lowmem without highmem pressure. E.g. a heavy parallel > > > fork or any other GFP_KERNEL intensive workload. > > > > > > > Lowmem without highmem pressure means all isolated pages are in the lowmem > > nodes and the per-zone counters are unnecessary. > > But most configurations will have highmem and lowmem zones in the same > node... True but if it's only lowmem pressure it doesn't matter. > > > > OK, I guess we are talking past each other. What I meant to say is that > > > it doesn't really make any difference who is chewing through the LRU to > > > find last few lowmem pages to reclaim. So I do not see much of a > > > difference sleeping and postponing that to the kswapd. > > > > > > That being said, I _believe_ I will need per zone ISOLATED counters in > > > order to make the other patch work reliably and do not declare oom > > > prematurely. Maybe there is some other way around that (hence this RFC). 
> > > Would you be strongly opposed to the patch which would make counters per > > > zone without touching too_many_isolated? > > > > I'm resistent to the per-zone counters in general but it's unfortunate to > > add them just to avoid a potentially infinite loop from isolated pages. > > I am really open to any alternative solutions, of course. This is > the best I could come up with. I will keep thinking but removing > too_many_isolated without considering isolated pages during the oom > detection is just too risky. We can isolate many pages to ignore them. If it's definitely required and is proven to fix the infinite-loop-without-oom workload then I'll back off and withdraw my objections. However, I'd at least like the following untested patch to be considered as an alternative. It has some weaknesses and would be slower to OOM than your patch but it avoids reintroducing zone counters ---8<--- mm, vmscan: Wait on a waitqueue when too many pages are isolated When too many pages are isolated, direct reclaim waits on congestion to clear for up to a tenth of a second. There is no reason to believe that too many pages are isolated due to dirty pages, reclaim efficiency or congestion. It may simply be because an extremely large number of processes have entered direct reclaim at the same time. However, it is possible for the situation to persist forever and never reach OOM. This patch queues processes a waitqueue when too many pages are isolated. When parallel reclaimers finish shrink_page_list, they wake the waiters to recheck whether too many pages are isolated. The wait on the queue has a timeout as not all sites that isolate pages will do the wakeup. Depending on every isolation of LRU pages to be perfect forever is potentially fragile. The specific wakeups occur for page reclaim and compaction. If too many pages are isolated due to memory failure, hotplug or directly calling migration from a syscall then the waiting processes may wait the full timeout. 
Note that the timeout allows the use of waitqueue_active() on the basis
that a race will cause the full timeout to be reached due to a missed
wakeup. This is relatively harmless and still a massive improvement over
unconditionally calling congestion_wait.

Direct reclaimers that cannot isolate pages within the timeout will
consider returning to the caller. This is somewhat clunky as it won't
return immediately and may go through the other priorities and slab
shrinking. Eventually, it'll go through a few iterations of
should_reclaim_retry and reach the MAX_RECLAIM_RETRIES limit and
consider going OOM.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 91f69aa0d581..3dd617d0c8c4 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,7 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+	wait_queue_head_t isolated_wait;
 	struct task_struct *kswapd;	/* Protected by mem_hotplug_begin/end() */
 	int kswapd_order;
diff --git a/mm/compaction.c b/mm/compaction.c
index 43a6cf1dc202..1b1ff6da7401 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned);
 	count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned);
 
+	/* Page reclaim could have stalled due to isolated pages */
+	if (waitqueue_active(&zone->zone_pgdat->isolated_wait))
+		wake_up(&zone->zone_pgdat->isolated_wait);
+
 	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
 				cc->free_pfn, end_pfn, sync, ret);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ff25883c172..d848c9f31bff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #endif
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
+	init_waitqueue_head(&pgdat->isolated_wait);
 #ifdef CONFIG_COMPACTION
 	init_waitqueue_head(&pgdat->kcompactd_wait);
 #endif
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2281ad310d06..c93f299fbad7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page)
  * the LRU list will go small and be scanned faster than necessary, leading to
  * unnecessary swapping, thrashing and OOM.
  */
-static int too_many_isolated(struct pglist_data *pgdat, int file,
+static bool safe_to_isolate(struct pglist_data *pgdat, int file,
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
 
 	if (current_is_kswapd())
-		return 0;
+		return true;
 
-	if (!sane_reclaim(sc))
-		return 0;
+	if (!sane_reclaim(sc))
+		return true;
 
 	if (file) {
 		inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
@@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
-	return isolated > inactive;
+	return isolated < inactive;
 }
 
 static noinline_for_stack void
@@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+	while (!safe_to_isolate(pgdat, file, sc)) {
+		long ret;
+
+		ret = wait_event_interruptible_timeout(pgdat->isolated_wait,
+			safe_to_isolate(pgdat, file, sc), HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
-		if (fatal_signal_pending(current))
-			return SWAP_CLUSTER_MAX;
+		if (fatal_signal_pending(current)) {
+			nr_reclaimed = SWAP_CLUSTER_MAX;
+			goto out;
+		}
+
+		/*
+		 * If we reached the timeout, this is direct reclaim and
+		 * pages cannot be isolated, then return. If the situation
+		 * persists for a long time then it'll eventually reach
+		 * the no_progress limit in should_reclaim_retry and consider
+		 * going OOM. In this case, do not wake the isolated_wait
+		 * queue as the wakee will still not be able to make progress.
+		 */
+		if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc))
+			return 0;
 	}
 
 	lru_add_drain();
@@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 			stat.nr_activate, stat.nr_ref_keep, stat.nr_unmap_fail,
 			sc->priority, file);
+
+out:
+	if (waitqueue_active(&pgdat->isolated_wait))
+		wake_up(&pgdat->isolated_wait);
 	return nr_reclaimed;
 }

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-19 10:07 ` Mel Gorman @ 2017-01-19 11:23 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-19 11:23 UTC (permalink / raw) To: Mel Gorman; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Thu 19-01-17 10:07:55, Mel Gorman wrote:
[...]
> mm, vmscan: Wait on a waitqueue when too many pages are isolated
>
> When too many pages are isolated, direct reclaim waits on congestion to clear
> for up to a tenth of a second. There is no reason to believe that too many
> pages are isolated due to dirty pages, reclaim efficiency or congestion.
> It may simply be because an extremely large number of processes have entered
> direct reclaim at the same time. However, it is possible for the situation
> to persist forever and never reach OOM.
>
> This patch queues processes on a waitqueue when too many pages are isolated.
> When parallel reclaimers finish shrink_page_list, they wake the waiters
> to recheck whether too many pages are isolated.
>
> The wait on the queue has a timeout as not all sites that isolate pages
> will do the wakeup. Depending on every isolation of LRU pages to be perfect
> forever is potentially fragile. The specific wakeups occur for page reclaim
> and compaction. If too many pages are isolated due to memory failure,
> hotplug or directly calling migration from a syscall then the waiting
> processes may wait the full timeout.
>
> Note that the timeout allows the use of waitqueue_active() on the basis
> that a race will cause the full timeout to be reached due to a missed
> wakeup. This is relatively harmless and still a massive improvement over
> unconditionally calling congestion_wait.
>
> Direct reclaimers that cannot isolate pages within the timeout will consider
> returning to the caller. This is somewhat clunky as it won't return immediately
> and may go through the other priorities and slab shrinking.
> Eventually,
> it'll go through a few iterations of should_reclaim_retry and reach the
> MAX_RECLAIM_RETRIES limit and consider going OOM.

I cannot really say I would like this. It's just much more complex than
necessary. I definitely agree that congestion_wait while waiting for
too_many_isolated is a crude hack. This patch doesn't really resolve my
biggest worry, though, that we go OOM with too many pages isolated, as
your patch doesn't alter zone_reclaimable_pages to reflect those numbers.

Anyway, I think both of us are probably overcomplicating things a bit.
Your waitqueue approach is definitely better semantically than the
congestion_wait because we are waiting for a different event than the API
is intended for. On the other hand, a mere schedule_timeout_interruptible
might work equally well in real life. On the other side, I might really
be over-emphasising the role of the NR_ISOLATED* counts. It might turn
out that we can safely ignore them and it won't be the end of the world.
So take the following as a starting point. If we ever see oom reports
with a high number of NR_ISOLATED* pages then we know we have to do
something about it. Those changes would at least be driven by a real
usecase rather than theoretical scenarios.

So what do you think about the following? Tetsuo, would you be willing
to run this patch through your torture testing please?
---
From 47cba23b5b50260b533d7ad57a4c9e6a800d9b20 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 19 Jan 2017 12:11:56 +0100
Subject: [PATCH] mm, vmscan: do not loop on too_many_isolated for ever

Tetsuo Handa has reported [1] that direct reclaimers might get stuck in
the too_many_isolated loop basically for ever because the last few pages
on the LRU lists are isolated by the kswapd which is stuck on fs locks
when doing the pageout. This in turn means that there is nobody to
actually trigger the oom killer and the system is basically unusable.

too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle
direct reclaim when too many pages are isolated already") to prevent
premature oom killer invocations because back then no reclaim progress
could indeed trigger the OOM killer too early. But since the oom
detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the
allocation/reclaim retry loop considers all the reclaimable pages and
throttles the allocation at that layer, so we can loosen the direct
reclaim throttling.

Make the shrink_inactive_list loop over too_many_isolated bounded and
return immediately when the situation hasn't resolved after the first
sleep. Replace congestion_wait by a simple schedule_timeout_interruptible
because we are not really waiting on the IO congestion in this path.

Please note that this patch can theoretically cause the OOM killer to
trigger earlier while there are many pages isolated for the reclaim
which makes progress only very slowly. This would be obvious from the
oom report as the number of isolated pages is printed there. If we ever
hit this, should_reclaim_retry should consider those numbers in the
evaluation in one way or another.

[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a60066d4521b..d07380ba1f9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1718,9 +1718,15 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int file = is_file_lru(lru);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
+	bool stalled = false;
 
 	while (unlikely(too_many_isolated(pgdat, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		if (stalled)
+			return 0;
+
+		/* wait a bit for the reclaimer. */
+		schedule_timeout_interruptible(HZ/10);
+		stalled = true;
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-19 11:23 ` Michal Hocko @ 2017-01-19 13:11 ` Mel Gorman -1 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-19 13:11 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML

On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> On Thu 19-01-17 10:07:55, Mel Gorman wrote:
> [...]
> > mm, vmscan: Wait on a waitqueue when too many pages are isolated
> >
> > When too many pages are isolated, direct reclaim waits on congestion to clear
> > for up to a tenth of a second. There is no reason to believe that too many
> > pages are isolated due to dirty pages, reclaim efficiency or congestion.
> > It may simply be because an extremely large number of processes have entered
> > direct reclaim at the same time. However, it is possible for the situation
> > to persist forever and never reach OOM.
> >
> > This patch queues processes on a waitqueue when too many pages are isolated.
> > When parallel reclaimers finish shrink_page_list, they wake the waiters
> > to recheck whether too many pages are isolated.
> >
> > The wait on the queue has a timeout as not all sites that isolate pages
> > will do the wakeup. Depending on every isolation of LRU pages to be perfect
> > forever is potentially fragile. The specific wakeups occur for page reclaim
> > and compaction. If too many pages are isolated due to memory failure,
> > hotplug or directly calling migration from a syscall then the waiting
> > processes may wait the full timeout.
> >
> > Note that the timeout allows the use of waitqueue_active() on the basis
> > that a race will cause the full timeout to be reached due to a missed
> > wakeup. This is relatively harmless and still a massive improvement over
> > unconditionally calling congestion_wait.
> >
> > Direct reclaimers that cannot isolate pages within the timeout will consider
> > returning to the caller.
> > This is somewhat clunky as it won't return immediately
> > and may go through the other priorities and slab shrinking. Eventually,
> > it'll go through a few iterations of should_reclaim_retry and reach the
> > MAX_RECLAIM_RETRIES limit and consider going OOM.
>
> I cannot really say I would like this. It's just much more complex than
> necessary.

I guess it's a difference in opinion. Mixing per-zone and per-node information is, to me, complex. I liked the waitqueue because it was an example of waiting on a specific event instead of relying completely on time.

> I definitely agree that congestion_wait while waiting for
> too_many_isolated is a crude hack. This patch doesn't really resolve my
> biggest worry, though, that we go OOM with too many pages isolated, as
> your patch doesn't alter zone_reclaimable_pages to reflect those numbers.

Indeed, but such cases are also caught by the no_progress_loop logic to avoid a premature OOM.

> Anyway, I think both of us are probably overcomplicating things a bit.
> Your waitqueue approach is definitely better semantically than the
> congestion_wait because we are waiting for a different event than the API
> is intended for. On the other hand, a mere schedule_timeout_interruptible
> might work equally well in real life. On the other side, I might really
> be over-emphasising the role of the NR_ISOLATED* counts. It might turn
> out that we can safely ignore them and it won't be the end of the world.
> So take the following as a starting point. If we ever see oom reports
> with a high number of NR_ISOLATED* pages then we know we have to do
> something about it. Those changes would at least be driven by a real
> usecase rather than theoretical scenarios.
>
> So what do you think about the following? Tetsuo, would you be willing
> to run this patch through your torture testing please?

I'm fine with treating this as a starting point. Thanks.
-- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-19 13:11 ` Mel Gorman @ 2017-01-20 13:27 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-01-20 13:27 UTC (permalink / raw) To: mgorman, mhocko; +Cc: linux-mm, hannes, linux-kernel

Mel Gorman wrote:
> On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote:
> > So what do you think about the following? Tetsuo, would you be willing
> > to run this patch through your torture testing please?
>
> I'm fine with treating this as a starting point.

OK. So I tried to test this patch, but I failed at the preparation step. There are too many pending mm patches and I'm not sure which patch on which linux-next snapshot I should try.

Also, as another question: the too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c, so why does this patch not touch the loop in the mm/compaction.c part? Is there a guarantee that the problem can be avoided by tweaking only the too_many_isolated() part?

Anyway, I tried the linux-next-20170119 snapshot in order to confirm that my reproducer can still reproduce the problem before trying this patch. But I was not able to reproduce the problem today, for the mm part is changing rapidly and existing reproducers need tuning. And I think that there is a different problem if I tune the reproducer like below (i.e. with the buffer size passed to write()/fsync() increased from 4096).
----------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	static char buffer[10485760] = { }; /* or 1048576 */
	char *buf = NULL;
	unsigned long size;
	unsigned long i;
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
				fsync(fd);
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
----------

Above reproducer sometimes kills all OOM killable processes and the system finally panics. I guess that somebody is abusing TIF_MEMDIE for needless allocations to the level where GFP_ATOMIC allocations start failing. Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170120.txt.xz .

----------
[ 184.482761] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0
(...snipped...)
[ 184.482955] Node 0 active_anon:1418748kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB isolated(anon):0kB isolated(file):132kB mapped:13744kB dirty:25872kB writeback:376kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:95127 all_unreclaimable? yes
[ 184.482956] Node 0 DMA free:7660kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:40kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 184.482959] lowmem_reserve[]: 0 1823 1823 1823
[ 184.482963] Node 0 DMA32 free:44636kB min:44672kB low:55840kB high:67008kB active_anon:1410572kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB writepending:26248kB present:2080640kB managed:1866768kB mlocked:0kB slab_reclaimable:85544kB slab_unreclaimable:128876kB kernel_stack:20496kB pagetables:40712kB bounce:0kB free_pcp:1136kB local_pcp:656kB free_cma:0kB
[ 184.482966] lowmem_reserve[]: 0 0 0 0
[ 184.482970] Node 0 DMA: 9*4kB (UE) 5*8kB (E) 2*16kB (ME) 0*32kB 2*64kB (U) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ME) 0*4096kB = 7660kB
[ 184.482994] Node 0 DMA32: 3845*4kB (UME) 1809*8kB (UME) 600*16kB (UME) 134*32kB (UME) 14*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44636kB
(...snipped...)
[ 187.477371] Node 0 active_anon:1415648kB inactive_anon:13548kB active_file:11452kB inactive_file:79120kB unevictable:0kB isolated(anon):0kB isolated(file):5220kB mapped:13748kB dirty:83484kB writeback:376kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:16058 all_unreclaimable? no
[ 187.477372] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:6976kB unevictable:0kB writepending:7492kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:172kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:64kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 187.477375] lowmem_reserve[]: 0 1823 1823 1823
[ 187.477378] Node 0 DMA32 free:0kB min:44672kB low:55840kB high:67008kB active_anon:1407472kB inactive_anon:13548kB active_file:11452kB inactive_file:71928kB unevictable:0kB writepending:76368kB present:2080640kB managed:1866768kB mlocked:0kB slab_reclaimable:85580kB slab_unreclaimable:128824kB kernel_stack:20496kB pagetables:39460kB bounce:0kB free_pcp:52kB local_pcp:0kB free_cma:0kB
[ 187.477381] lowmem_reserve[]: 0 0 0 0
[ 187.477385] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 187.477394] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
(...snipped...)
[ 318.524868] Node 0 active_anon:7064kB inactive_anon:12088kB active_file:13272kB inactive_file:1520272kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:10276kB dirty:1520264kB writeback:44kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:3542854 all_unreclaimable? yes
[ 318.524869] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:14752kB unevictable:0kB writepending:14808kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:1096kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 318.524872] lowmem_reserve[]: 0 1823 1823 1823
[ 318.524876] Node 0 DMA32 free:0kB min:44672kB low:55840kB high:67008kB active_anon:7064kB inactive_anon:12088kB active_file:13272kB inactive_file:1505460kB unevictable:0kB writepending:1505500kB present:2080640kB managed:1866768kB mlocked:0kB slab_reclaimable:147588kB slab_unreclaimable:99652kB kernel_stack:16512kB pagetables:2016kB bounce:0kB free_pcp:788kB local_pcp:512kB free_cma:0kB
[ 318.524879] lowmem_reserve[]: 0 0 0 0
[ 318.524882] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 318.524893] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 318.524903] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 318.524904] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 318.524905] 386967 total pagecache pages
[ 318.524908] 0 pages in swap cache
[ 318.524909] Swap cache stats: add 0, delete 0, find 0/0
[ 318.524909] Free swap  = 0kB
[ 318.524910] Total swap = 0kB
[ 318.524912] 524157 pages RAM
[ 318.524912] 0 pages HighMem/MovableOnly
[ 318.524913] 53489 pages reserved
[ 318.524914] 0 pages cma reserved
[ 318.524914] 0 pages hwpoisoned
[ 318.524916] Kernel panic - not syncing: Out of memory and no killable processes...
----------

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-01-20 13:27 ` Tetsuo Handa 0 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-01-20 13:27 UTC (permalink / raw) To: mgorman, mhocko; +Cc: linux-mm, hannes, linux-kernel Mel Gorman wrote: > On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote: > > So what do you think about the following? Tetsuo, would you be willing > > to run this patch through your torture testing please? > > I'm fine with treating this as a starting point. OK. So I tried to test this patch but I failed at preparation step. There are too many pending mm patches and I'm not sure which patch on which linux-next snapshot I should try. Also as another question, too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c but why this patch does not touch the loop in mm/compaction.c part? Is there a guarantee that the problem can be avoided by tweaking only too_many_isolated() part? Anyway I tried linux-next-20170119 snapshot in order to confirm that my reproducer can still reproduce the problem before trying this patch. But I was not able to reproduce the problem today, for mm part is changing rapidly and existing reproducers need tuning. And I think that there is a different problem if I tune a reproducer like below (i.e. increased the buffer size to write()/fsync() from 4096). 
---------- #include <stdio.h> #include <stdlib.h> #include <string.h> #include <unistd.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> int main(int argc, char *argv[]) { static char buffer[10485760] = { }; /* or 1048576 */ char *buf = NULL; unsigned long size; unsigned long i; for (i = 0; i < 1024; i++) { if (fork() == 0) { int fd = open("/proc/self/oom_score_adj", O_WRONLY); write(fd, "1000", 4); close(fd); sleep(1); snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid()); fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600); while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) fsync(fd); _exit(0); } } for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { char *cp = realloc(buf, size); if (!cp) { size >>= 1; break; } buf = cp; } sleep(2); /* Will cause OOM due to overcommit */ for (i = 0; i < size; i += 4096) buf[i] = 0; pause(); return 0; } ---------- Above reproducer sometimes kills all OOM killable processes and the system finally panics. I guess that somebody is abusing TIF_MEMDIE for needless allocations to the level where GFP_ATOMIC allocations start failing. Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170120.txt.xz . ---------- [ 184.482761] a.out invoked oom-killer: gfp_mask=0x14280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_score_adj=0 (...snipped...) [ 184.482955] Node 0 active_anon:1418748kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB isolated(anon):0kB isolated(file):132kB mapped:13744kB dirty:25872kB writeback:376kB shmem:0kB shmem_thp: 0kB sh\ mem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:95127 all_unreclaimable? 
yes
[ 184.482956] Node 0 DMA free:7660kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:40kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 184.482959] lowmem_reserve[]: 0 1823 1823 1823
[ 184.482963] Node 0 DMA32 free:44636kB min:44672kB low:55840kB high:67008kB active_anon:1410572kB inactive_anon:13548kB active_file:11448kB inactive_file:26044kB unevictable:0kB writepending:26248kB present:2080640kB managed:1866768kB mlocked:0kB slab_reclaimable:85544kB slab_unreclaimable:128876kB kernel_stack:20496kB pagetables:40712kB bounce:0kB free_pcp:1136kB local_pcp:656kB free_cma:0kB
[ 184.482966] lowmem_reserve[]: 0 0 0 0
[ 184.482970] Node 0 DMA: 9*4kB (UE) 5*8kB (E) 2*16kB (ME) 0*32kB 2*64kB (U) 2*128kB (UE) 2*256kB (UE) 1*512kB (E) 2*1024kB (UE) 2*2048kB (ME) 0*4096kB = 7660kB
[ 184.482994] Node 0 DMA32: 3845*4kB (UME) 1809*8kB (UME) 600*16kB (UME) 134*32kB (UME) 14*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44636kB
(...snipped...)
[ 187.477371] Node 0 active_anon:1415648kB inactive_anon:13548kB active_file:11452kB inactive_file:79120kB unevictable:0kB isolated(anon):0kB isolated(file):5220kB mapped:13748kB dirty:83484kB writeback:376kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 258048kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:16058 all_unreclaimable?
no
[ 187.477372] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:8176kB inactive_anon:0kB active_file:0kB inactive_file:6976kB unevictable:0kB writepending:7492kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:172kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:64kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 187.477375] lowmem_reserve[]: 0 1823 1823 1823
[ 187.477378] Node 0 DMA32 free:0kB min:44672kB low:55840kB high:67008kB active_anon:1407472kB inactive_anon:13548kB active_file:11452kB inactive_file:71928kB unevictable:0kB writepending:76368kB present:2080640kB managed:1866768kB mlocked:0kB slab_reclaimable:85580kB slab_unreclaimable:128824kB kernel_stack:20496kB pagetables:39460kB bounce:0kB free_pcp:52kB local_pcp:0kB free_cma:0kB
[ 187.477381] lowmem_reserve[]: 0 0 0 0
[ 187.477385] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 187.477394] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
(...snipped...)
[ 318.524868] Node 0 active_anon:7064kB inactive_anon:12088kB active_file:13272kB inactive_file:1520272kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:10276kB dirty:1520264kB writeback:44kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 14184kB writeback_tmp:0kB unstable:0kB pages_scanned:3542854 all_unreclaimable?
yes
[ 318.524869] Node 0 DMA free:0kB min:380kB low:472kB high:564kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:14752kB unevictable:0kB writepending:14808kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:1096kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 318.524872] lowmem_reserve[]: 0 1823 1823 1823
[ 318.524876] Node 0 DMA32 free:0kB min:44672kB low:55840kB high:67008kB active_anon:7064kB inactive_anon:12088kB active_file:13272kB inactive_file:1505460kB unevictable:0kB writepending:1505500kB present:2080640kB managed:1866768kB mlocked:0kB slab_reclaimable:147588kB slab_unreclaimable:99652kB kernel_stack:16512kB pagetables:2016kB bounce:0kB free_pcp:788kB local_pcp:512kB free_cma:0kB
[ 318.524879] lowmem_reserve[]: 0 0 0 0
[ 318.524882] Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 318.524893] Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
[ 318.524903] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[ 318.524904] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 318.524905] 386967 total pagecache pages
[ 318.524908] 0 pages in swap cache
[ 318.524909] Swap cache stats: add 0, delete 0, find 0/0
[ 318.524909] Free swap = 0kB
[ 318.524910] Total swap = 0kB
[ 318.524912] 524157 pages RAM
[ 318.524912] 0 pages HighMem/MovableOnly
[ 318.524913] 53489 pages reserved
[ 318.524914] 0 pages cma reserved
[ 318.524914] 0 pages hwpoisoned
[ 318.524916] Kernel panic - not syncing: Out of memory and no killable processes...
----------
^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-20 13:27 ` Tetsuo Handa
@ 2017-01-21 7:42 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-21 7:42 UTC (permalink / raw)
  To: mgorman, mhocko, viro; +Cc: linux-mm, hannes, linux-kernel

Tetsuo Handa wrote:
> And I think that there is a different problem if I tune a reproducer
> like below (i.e. increased the buffer size to write()/fsync() from 4096).
>
> ----------
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
>
> int main(int argc, char *argv[])
> {
> 	static char buffer[10485760] = { }; /* or 1048576 */
> 	char *buf = NULL;
> 	unsigned long size;
> 	unsigned long i;
> 	for (i = 0; i < 1024; i++) {
> 		if (fork() == 0) {
> 			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
> 			write(fd, "1000", 4);
> 			close(fd);
> 			sleep(1);
> 			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
> 			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
> 			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer))
> 				fsync(fd);
> 			_exit(0);
> 		}
> 	}
> 	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
> 		char *cp = realloc(buf, size);
> 		if (!cp) {
> 			size >>= 1;
> 			break;
> 		}
> 		buf = cp;
> 	}
> 	sleep(2);
> 	/* Will cause OOM due to overcommit */
> 	for (i = 0; i < size; i += 4096)
> 		buf[i] = 0;
> 	pause();
> 	return 0;
> }
> ----------
>
> Above reproducer sometimes kills all OOM killable processes and the system
> finally panics. I guess that somebody is abusing TIF_MEMDIE for needless
> allocations to the level where GFP_ATOMIC allocations start failing.

I tracked who is abusing TIF_MEMDIE using the patch below.
----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ea088e1..d9ac53d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3038,7 +3038,7 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
 	static DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
 				      DEFAULT_RATELIMIT_BURST);
 
-	if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
+	if (1 || (gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs) ||
 			debug_guardpage_minorder() > 0)
 		return;
 
@@ -3573,6 +3573,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	int no_progress_loops = 0;
 	unsigned long alloc_start = jiffies;
 	unsigned int stall_timeout = 10 * HZ;
+	bool victim = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3656,8 +3657,10 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	if (gfp_pfmemalloc_allowed(gfp_mask))
+	if (gfp_pfmemalloc_allowed(gfp_mask)) {
 		alloc_flags = ALLOC_NO_WATERMARKS;
+		victim = test_thread_flag(TIF_MEMDIE);
+	}
 
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
@@ -3790,6 +3793,11 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 		warn_alloc(gfp_mask, ac->nodemask,
 				"page allocation failure: order:%u", order);
 got_pg:
+	if (page && victim) {
+		pr_warn("%s(%u): TIF_MEMDIE allocation: order=%d mode=%#x(%pGg)\n",
+			current->comm, current->pid, order, gfp_mask, &gfp_mask);
+		dump_stack();
+	}
 	return page;
 }
----------------------------------------

And I got flood of traces shown below. It seems to be consuming memory
reserves until the size passed to write() request is stored to the page
cache even after OOM-killed.

Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz .
----------------------------------------
[ 202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE)
[ 202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 4.10.0-rc4-next-20170120+ #492
[ 202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
[ 202.315429] Call Trace:
[ 202.316902]  dump_stack+0x85/0xc9
[ 202.318810]  __alloc_pages_slowpath+0xa99/0xd7c
[ 202.320697]  ? node_dirty_ok+0xef/0x130
[ 202.322454]  __alloc_pages_nodemask+0x436/0x4d0
[ 202.324506]  alloc_pages_current+0x97/0x1b0
[ 202.326397]  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
[ 202.328209]  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
[ 202.329989]  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
[ 202.331905]  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
[ 202.333641]  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
[ 202.335377]  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
[ 202.337090]  iomap_apply+0xb3/0x130                  fs/iomap.c:79
[ 202.338721]  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
[ 202.340613]  ? iomap_write_end+0x80/0x80
[ 202.342471]  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
[ 202.344501]  ? remove_wait_queue+0x59/0x60
[ 202.346261]  xfs_file_write_iter+0x90/0x130 [xfs]
[ 202.348082]  __vfs_write+0xe5/0x140
[ 202.349743]  vfs_write+0xc7/0x1f0
[ 202.351214]  ? syscall_trace_enter+0x1d0/0x380
[ 202.353155]  SyS_write+0x58/0xc0
[ 202.354628]  do_syscall_64+0x6c/0x200
[ 202.356100]  entry_SYSCALL64_slow_path+0x25/0x25
----------------------------------------

Do we need to allow access to memory reserves for this allocation?
Or, should the caller check for SIGKILL rather than iterate the loop?

^ permalink raw reply related	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-21 7:42 ` Tetsuo Handa @ 2017-01-25 10:15 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-25 10:15 UTC (permalink / raw) To: Tetsuo Handa, Christoph Hellwig Cc: mgorman, viro, linux-mm, hannes, linux-kernel [Let's add Christoph] The below insane^Wstress test should exercise the OOM killer behavior. On Sat 21-01-17 16:42:42, Tetsuo Handa wrote: > Tetsuo Handa wrote: > > And I think that there is a different problem if I tune a reproducer > > like below (i.e. increased the buffer size to write()/fsync() from 4096). > > > > ---------- > > #include <stdio.h> > > #include <stdlib.h> > > #include <string.h> > > #include <unistd.h> > > #include <sys/types.h> > > #include <sys/stat.h> > > #include <fcntl.h> > > > > int main(int argc, char *argv[]) > > { > > static char buffer[10485760] = { }; /* or 1048576 */ > > char *buf = NULL; > > unsigned long size; > > unsigned long i; > > for (i = 0; i < 1024; i++) { > > if (fork() == 0) { > > int fd = open("/proc/self/oom_score_adj", O_WRONLY); > > write(fd, "1000", 4); > > close(fd); > > sleep(1); > > snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid()); > > fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600); > > while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) > > fsync(fd); > > _exit(0); > > } > > } > > for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) { > > char *cp = realloc(buf, size); > > if (!cp) { > > size >>= 1; > > break; > > } > > buf = cp; > > } > > sleep(2); > > /* Will cause OOM due to overcommit */ > > for (i = 0; i < size; i += 4096) > > buf[i] = 0; > > pause(); > > return 0; > > } > > ---------- > > > > Above reproducer sometimes kills all OOM killable processes and the system > > finally panics. I guess that somebody is abusing TIF_MEMDIE for needless > > allocations to the level where GFP_ATOMIC allocations start failing. [...] 
> And I got flood of traces shown below. It seems to be consuming memory reserves > until the size passed to write() request is stored to the page cache even after > OOM-killed. > > Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170121.txt.xz . > ---------------------------------------- > [ 202.306077] a.out(9789): TIF_MEMDIE allocation: order=0 mode=0x1c2004a(GFP_NOFS|__GFP_HIGHMEM|__GFP_HARDWALL|__GFP_MOVABLE|__GFP_WRITE) > [ 202.309832] CPU: 0 PID: 9789 Comm: a.out Not tainted 4.10.0-rc4-next-20170120+ #492 > [ 202.312323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 > [ 202.315429] Call Trace: > [ 202.316902] dump_stack+0x85/0xc9 > [ 202.318810] __alloc_pages_slowpath+0xa99/0xd7c > [ 202.320697] ? node_dirty_ok+0xef/0x130 > [ 202.322454] __alloc_pages_nodemask+0x436/0x4d0 > [ 202.324506] alloc_pages_current+0x97/0x1b0 > [ 202.326397] __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 > [ 202.328209] pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331 > [ 202.329989] grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773 > [ 202.331905] iomap_write_begin+0x50/0xd0 fs/iomap.c:118 > [ 202.333641] iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190 > [ 202.335377] ? iomap_write_end+0x80/0x80 fs/iomap.c:150 > [ 202.337090] iomap_apply+0xb3/0x130 fs/iomap.c:79 > [ 202.338721] iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243 > [ 202.340613] ? iomap_write_end+0x80/0x80 > [ 202.342471] xfs_file_buffered_aio_write+0x132/0x390 [xfs] > [ 202.344501] ? remove_wait_queue+0x59/0x60 > [ 202.346261] xfs_file_write_iter+0x90/0x130 [xfs] > [ 202.348082] __vfs_write+0xe5/0x140 > [ 202.349743] vfs_write+0xc7/0x1f0 > [ 202.351214] ? syscall_trace_enter+0x1d0/0x380 > [ 202.353155] SyS_write+0x58/0xc0 > [ 202.354628] do_syscall_64+0x6c/0x200 > [ 202.356100] entry_SYSCALL64_slow_path+0x25/0x25 > ---------------------------------------- > > Do we need to allow access to memory reserves for this allocation? 
> Or, should the caller check for SIGKILL rather than iterate the loop?

I think we are missing a check for fatal_signal_pending in
iomap_file_buffered_write. This means that an oom victim can consume the
full memory reserves. What do you think about the following? I haven't
tested this but it mimics generic_perform_write so I guess it should
work.
---
>From d56b54b708d403d1bf39fccb89750bab31c19032 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path

	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write loops to complete
the full request. We need to check for fatal signals and back off with a
short write.
Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/iomap.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..a22672387549 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -238,6 +238,10 @@ iomap_file_buffered_write(struct kiocb *iocb, struct iov_iter *iter,
 	loff_t pos = iocb->ki_pos, ret = 0, written = 0;
 
 	while (iov_iter_count(iter)) {
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
 		ret = iomap_apply(inode, pos, iov_iter_count(iter), IOMAP_WRITE,
 				ops, iter, iomap_write_actor);
 		if (ret <= 0)
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-25 10:15 ` Michal Hocko @ 2017-01-25 10:19 ` Christoph Hellwig -1 siblings, 0 replies; 110+ messages in thread From: Christoph Hellwig @ 2017-01-25 10:19 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, Christoph Hellwig, mgorman, viro, linux-mm, hannes, linux-kernel On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote: > I think we are missing a check for fatal_signal_pending in > iomap_file_buffered_write. This means that an oom victim can consume the > full memory reserves. What do you think about the following? I haven't > tested this but it mimics generic_perform_write so I guess it should > work. Hi Michal, this looks reasonable to me. But we have a few more such loops, maybe it makes sense to move the check into iomap_apply? ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:19 ` Christoph Hellwig
@ 2017-01-25 10:46 ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25 10:46 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > I think we are missing a check for fatal_signal_pending in
> > iomap_file_buffered_write. This means that an oom victim can consume the
> > full memory reserves. What do you think about the following? I haven't
> > tested this but it mimics generic_perform_write so I guess it should
> > work.
>
> Hi Michal,
>
> this looks reasonable to me. But we have a few more such loops,
> maybe it makes sense to move the check into iomap_apply?

I wasn't sure about the expected semantics of iomap_apply, but now that
I've actually checked all the callers I believe all of them should be
able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
iomap_fiemap and iomap_page_mkwrite seem not to follow the standard
pattern of returning the number of written pages or an error, but rather
propagate the error out. From my limited understanding of those code
paths that should just be ok. I was not all that sure about iomap_dio_rw,
which is just too convoluted for me. If that one is OK as well then the
following patch should be indeed better.
---
>From d99c9d4115bed69a5d71281f59c190b0b26627cf Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion.
He has tracked this down to the following path

	__alloc_pages_nodemask+0x436/0x4d0
	alloc_pages_current+0x97/0x1b0
	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
	? iomap_write_end+0x80/0x80             fs/iomap.c:150
	iomap_apply+0xb3/0x130                  fs/iomap.c:79
	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
	? iomap_write_end+0x80/0x80
	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
	? remove_wait_queue+0x59/0x60
	xfs_file_write_iter+0x90/0x130 [xfs]
	__vfs_write+0xe5/0x140
	vfs_write+0xc7/0x1f0
	? syscall_trace_enter+0x1d0/0x380
	SyS_write+0x58/0xc0
	do_syscall_64+0x6c/0x200
	entry_SYSCALL64_slow_path+0x25/0x25

the oom victim has access to all memory reserves to make a forward
progress to exit easier. But iomap_file_buffered_write and other callers
of iomap_apply loop to complete the full request. We need to check for
fatal signals and back off with a short write instead.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/iomap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..a58190f7a3e4 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -46,6 +46,9 @@ iomap_apply(struct inode *inode, loff_t pos, loff_t length, unsigned flags,
 	struct iomap iomap = { 0 };
 	loff_t written = 0, ret;
 
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	/*
 	 * Need to map a range from start position for length bytes. This can
 	 * span multiple pages - it is only guaranteed to return a range of a
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 10:46 ` Michal Hocko
@ 2017-01-25 11:09 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-25 11:09 UTC (permalink / raw)
To: mhocko, hch; +Cc: mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > I think we are missing a check for fatal_signal_pending in
> > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > full memory reserves. What do you think about the following? I haven't
> > > tested this but it mimics generic_perform_write so I guess it should
> > > work.
> >
> > Hi Michal,
> >
> > this looks reasonable to me. But we have a few more such loops,
> > maybe it makes sense to move the check into iomap_apply?
>
> I wasn't sure about the expected semantics of iomap_apply, but now that
> I've actually checked all the callers I believe all of them should be
> able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> iomap_fiemap and iomap_page_mkwrite do not follow the standard pattern
> of returning the number of written pages or an error but rather
> propagate the error out. From my limited understanding of those code
> paths that should be just fine. I was not all that sure about
> iomap_dio_rw, which is just too convoluted for me. If that one is OK as
> well then the following patch should indeed be better.

Is "length" in the

	written = actor(inode, pos, length, data, &iomap);

call guaranteed to be small enough? If it is not guaranteed, don't we
need to check SIGKILL inside the "actor" functions?

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 11:09 ` Tetsuo Handa
@ 2017-01-25 13:00 ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-25 13:00 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > I think we are missing a check for fatal_signal_pending in
> > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > full memory reserves. What do you think about the following? I haven't
> > > > tested this but it mimics generic_perform_write so I guess it should
> > > > work.
> > >
> > > Hi Michal,
> > >
> > > this looks reasonable to me. But we have a few more such loops,
> > > maybe it makes sense to move the check into iomap_apply?
> >
> > I wasn't sure about the expected semantics of iomap_apply, but now that
> > I've actually checked all the callers I believe all of them should be
> > able to handle EINTR just fine. Well, iomap_file_dirty, iomap_zero_range,
> > iomap_fiemap and iomap_page_mkwrite do not follow the standard pattern
> > of returning the number of written pages or an error but rather
> > propagate the error out. From my limited understanding of those code
> > paths that should be just fine. I was not all that sure about
> > iomap_dio_rw, which is just too convoluted for me. If that one is OK as
> > well then the following patch should indeed be better.
>
> Is "length" in the
>
> 	written = actor(inode, pos, length, data, &iomap);
>
> call guaranteed to be small enough? If it is not guaranteed, don't we
> need to check SIGKILL inside the "actor" functions?

You are right! Checking for signals inside iomap_apply doesn't really
solve anything because basically all users do iov_iter_count(). Blee.

So we have loops around iomap_apply, which itself loops inside the
actor. iomap_write_begin seems to be used by most of them, and it is
also where we get the pagecache page, so I guess this should be the
"right" place to put the check in. Things like dax_iomap_actor will
need an explicit check. This is quite unfortunate but I do not see any
better solution. What do you think, Christoph?
---
>From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 25 Jan 2017 11:06:37 +0100
Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals

Tetsuo has noticed that an OOM stress test which performs large write
requests can cause the full memory reserves depletion. He has tracked
this down to the following path

 __alloc_pages_nodemask+0x436/0x4d0
 alloc_pages_current+0x97/0x1b0
 __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
 pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
 grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
 iomap_write_begin+0x50/0xd0             fs/iomap.c:118
 iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
 ? iomap_write_end+0x80/0x80             fs/iomap.c:150
 iomap_apply+0xb3/0x130                  fs/iomap.c:79
 iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
 ? iomap_write_end+0x80/0x80
 xfs_file_buffered_aio_write+0x132/0x390 [xfs]
 ? remove_wait_queue+0x59/0x60
 xfs_file_write_iter+0x90/0x130 [xfs]
 __vfs_write+0xe5/0x140
 vfs_write+0xc7/0x1f0
 ? syscall_trace_enter+0x1d0/0x380
 SyS_write+0x58/0xc0
 do_syscall_64+0x6c/0x200
 entry_SYSCALL64_slow_path+0x25/0x25

The oom victim has access to all memory reserves to make forward
progress to exit easier. But iomap_file_buffered_write and other
callers of iomap_apply loop to complete the full request. We need to
check for fatal signals and back off with a short write instead. As
iomap_apply delegates all the work down to the actor we have to hook
into those. All callers that work with the page cache are calling
iomap_write_begin so we will check for signals there. dax_iomap_actor
has to handle the situation explicitly because it copies data to the
userspace directly. Other callers like iomap_page_mkwrite work on a
single page, and iomap_fiemap_actor does not allocate memory based on
the given len.

Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
Cc: stable # 4.8+
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/dax.c   | 5 +++++
 fs/iomap.c | 3 +++
 2 files changed, 8 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index 413a91db9351..0e263dacf9cf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
 		struct blk_dax_ctl dax = { 0 };
 		ssize_t map_len;
 
+		if (fatal_signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
 		dax.sector = dax_iomap_sector(iomap, pos);
 		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
 		map_len = dax_map_atomic(iomap->bdev, &dax);
diff --git a/fs/iomap.c b/fs/iomap.c
index e57b90b5ff37..691eada58b06 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
 
 	BUG_ON(pos + len > iomap->offset + iomap->length);
 
+	if (fatal_signal_pending(current))
+		return -EINTR;
+
 	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
 	if (!page)
 		return -ENOMEM;
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 13:00 ` Michal Hocko
@ 2017-01-27 14:49 ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-27 14:49 UTC (permalink / raw)
To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

Tetsuo,
before we settle on the proper fix for this issue, could you give the
patch a try and try to reproduce the too_many_isolated() issue, or just
see whether patch [1] has any negative effect on your oom stress
testing?

[1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
[...]
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
>
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
>
>  __alloc_pages_nodemask+0x436/0x4d0
>  alloc_pages_current+0x97/0x1b0
>  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
>  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
>  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
>  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
>  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
>  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
>  iomap_apply+0xb3/0x130                  fs/iomap.c:79
>  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
>  ? iomap_write_end+0x80/0x80
>  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
>  ? remove_wait_queue+0x59/0x60
>  xfs_file_write_iter+0x90/0x130 [xfs]
>  __vfs_write+0xe5/0x140
>  vfs_write+0xc7/0x1f0
>  ? syscall_trace_enter+0x1d0/0x380
>  SyS_write+0x58/0xc0
>  do_syscall_64+0x6c/0x200
>  entry_SYSCALL64_slow_path+0x25/0x25
>
> The oom victim has access to all memory reserves to make forward
> progress to exit easier. But iomap_file_buffered_write and other
> callers of iomap_apply loop to complete the full request. We need to
> check for fatal signals and back off with a short write instead. As
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page, and iomap_fiemap_actor does not allocate memory based on
> the given len.
>
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>  
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
>  		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>  
>  	BUG_ON(pos + len > iomap->offset + iomap->length);
>  
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
>  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> --
> 2.11.0
>
> --
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-27 14:49 ` Michal Hocko
@ 2017-01-28 15:27 ` Tetsuo Handa
  -1 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-01-28 15:27 UTC (permalink / raw)
To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> Tetsuo,
> before we settle on the proper fix for this issue, could you give the
> patch a try and try to reproduce the too_many_isolated() issue, or just
> see whether patch [1] has any negative effect on your oom stress
> testing?
>
> [1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz

I tested with both [1] and the patch below applied on
linux-next-20170125 and the result is at
http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz .

Regarding the patch below, it helped avoid complete memory depletion
with large write() requests. I don't know whether it also helps avoid
complete memory depletion when reading a large amount (in other words,
I don't know whether this check is done for large read() requests). But
I believe that __GFP_KILLABLE (despite the limitation that there are
unkillable waits in the reclaim path) is a better solution compared to
scattering fatal_signal_pending() checks around the callers.

The reason we check SIGKILL here is to avoid allocating more memory
than needed. If we check SIGKILL at the entry point of
__alloc_pages_nodemask() and at the retry: label in
__alloc_pages_slowpath(), we waste 0 pages. Regardless of whether the
OOM killer is invoked, and whether memory can be allocated without a
direct reclaim operation, not allocating memory unless needed (in other
words, allowing the page allocator to fail immediately if the caller
can give up on SIGKILL and SIGKILL is pending) makes sense. It will
reduce the possibility of OOM livelock on CONFIG_MMU=n kernels where
the OOM reaper is not available.

>
> On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> [...]
> > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Wed, 25 Jan 2017 11:06:37 +0100
> > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
> >
> > Tetsuo has noticed that an OOM stress test which performs large write
> > requests can cause the full memory reserves depletion. He has tracked
> > this down to the following path
> >
> >  __alloc_pages_nodemask+0x436/0x4d0
> >  alloc_pages_current+0x97/0x1b0
> >  __page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> >  pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> >  grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> >  iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> >  iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> >  ? iomap_write_end+0x80/0x80             fs/iomap.c:150
> >  iomap_apply+0xb3/0x130                  fs/iomap.c:79
> >  iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> >  ? iomap_write_end+0x80/0x80
> >  xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> >  ? remove_wait_queue+0x59/0x60
> >  xfs_file_write_iter+0x90/0x130 [xfs]
> >  __vfs_write+0xe5/0x140
> >  vfs_write+0xc7/0x1f0
> >  ? syscall_trace_enter+0x1d0/0x380
> >  SyS_write+0x58/0xc0
> >  do_syscall_64+0x6c/0x200
> >  entry_SYSCALL64_slow_path+0x25/0x25
> >
> > The oom victim has access to all memory reserves to make forward
> > progress to exit easier. But iomap_file_buffered_write and other
> > callers of iomap_apply loop to complete the full request. We need to
> > check for fatal signals and back off with a short write instead. As
> > iomap_apply delegates all the work down to the actor we have to hook
> > into those. All callers that work with the page cache are calling
> > iomap_write_begin so we will check for signals there. dax_iomap_actor
> > has to handle the situation explicitly because it copies data to the
> > userspace directly. Other callers like iomap_page_mkwrite work on a
> > single page, and iomap_fiemap_actor does not allocate memory based on
> > the given len.
> >
> > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> > Cc: stable # 4.8+
> > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  fs/dax.c   | 5 +++++
> >  fs/iomap.c | 3 +++
> >  2 files changed, 8 insertions(+)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 413a91db9351..0e263dacf9cf 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
> >  		struct blk_dax_ctl dax = { 0 };
> >  		ssize_t map_len;
> >  
> > +		if (fatal_signal_pending(current)) {
> > +			ret = -EINTR;
> > +			break;
> > +		}
> > +
> >  		dax.sector = dax_iomap_sector(iomap, pos);
> >  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
> >  		map_len = dax_map_atomic(iomap->bdev, &dax);
> > diff --git a/fs/iomap.c b/fs/iomap.c
> > index e57b90b5ff37..691eada58b06 100644
> > --- a/fs/iomap.c
> > +++ b/fs/iomap.c
> > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
> >  
> >  	BUG_ON(pos + len > iomap->offset + iomap->length);
> >  
> > +	if (fatal_signal_pending(current))
> > +		return -EINTR;
> > +
> >  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
> >  	if (!page)
> >  		return -ENOMEM;
> > --
> > 2.11.0

Regarding [1], it helped avoiding the too_many_isolated() issue. I
can't tell whether it has any negative effect, but on the first trial I
got a situation where all allocating threads were blocked on
wait_for_completion() from flush_work() in drain_all_pages() introduced
by "mm, page_alloc: drain per-cpu pages from workqueue context". There
was no warn_alloc() stall warning message afterwards.
---------- [ 540.039842] kworker/1:1: page allocation stalls for 10079ms, order:0, mode:0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null) [ 540.041961] kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null), order=2, oom_score_adj=0 [ 540.041970] kthreadd cpuset=/ mems_allowed=0 [ 540.041984] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.10.0-rc5-next-20170125+ #495 [ 540.041987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 540.041989] Call Trace: [ 540.042008] dump_stack+0x85/0xc9 [ 540.042016] dump_header+0x9f/0x296 [ 540.042028] ? trace_hardirqs_on+0xd/0x10 [ 540.042039] oom_kill_process+0x219/0x400 [ 540.042046] out_of_memory+0x13d/0x580 [ 540.042049] ? out_of_memory+0x20d/0x580 [ 540.042058] __alloc_pages_slowpath+0x951/0xe02 [ 540.042063] ? deactivate_slab+0x1fb/0x690 [ 540.042082] __alloc_pages_nodemask+0x382/0x3d0 [ 540.042091] new_slab+0x450/0x6b0 [ 540.042100] ___slab_alloc+0x3a3/0x620 [ 540.042109] ? copy_process.part.31+0x122/0x2200 [ 540.042116] ? cpuacct_charge+0x38/0x1e0 [ 540.042122] ? copy_process.part.31+0x122/0x2200 [ 540.042129] __slab_alloc+0x46/0x7d [ 540.042135] kmem_cache_alloc_node+0xab/0x3a0 [ 540.042144] copy_process.part.31+0x122/0x2200 [ 540.042150] ? cpuacct_charge+0xf3/0x1e0 [ 540.042153] ? cpuacct_charge+0x38/0x1e0 [ 540.042164] ? kthread_create_on_node+0x70/0x70 [ 540.042168] ? finish_task_switch+0x70/0x240 [ 540.042175] _do_fork+0xf3/0x750 [ 540.042183] ? kthreadd+0x2f2/0x3c0 [ 540.042193] kernel_thread+0x29/0x30 [ 540.042196] kthreadd+0x35a/0x3c0 [ 540.042206] ? ret_from_fork+0x31/0x40 [ 540.042218] ? 
kthread_create_on_cpu+0xb0/0xb0 [ 540.042225] ret_from_fork+0x31/0x40 [ 540.042237] Mem-Info: [ 540.042248] active_anon:170208 inactive_anon:2096 isolated_anon:0 [ 540.042248] active_file:40034 inactive_file:40034 isolated_file:32 [ 540.042248] unevictable:0 dirty:78514 writeback:1568 unstable:0 [ 540.042248] slab_reclaimable:19763 slab_unreclaimable:47744 [ 540.042248] mapped:491 shmem:2162 pagetables:4842 bounce:0 [ 540.042248] free:12698 free_pcp:637 free_cma:0 [ 540.042258] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160136kB inactive_file:160136kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:1964kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:561618 all_unreclaimable? yes [ 540.042260] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 540.042270] lowmem_reserve[]: 0 1443 1443 1443 [ 540.042279] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160132kB inactive_file:160132kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2548kB local_pcp:728kB free_cma:0kB [ 540.042288] lowmem_reserve[]: 0 0 0 0 [ 540.042296] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB [ 540.042330] Node 0 DMA32: 764*4kB (UME) 1122*8kB (UME) 536*16kB (UME) 210*32kB (UME) 107*64kB (UE) 41*128kB (EH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB [ 
540.042363] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 540.042366] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 540.042368] 82262 total pagecache pages [ 540.042371] 0 pages in swap cache [ 540.042374] Swap cache stats: add 0, delete 0, find 0/0 [ 540.042376] Free swap = 0kB [ 540.042377] Total swap = 0kB [ 540.042380] 524157 pages RAM [ 540.042382] 0 pages HighMem/MovableOnly [ 540.042383] 150519 pages reserved [ 540.042384] 0 pages cma reserved [ 540.042386] 0 pages hwpoisoned [ 540.042390] Out of memory: Kill process 10688 (a.out) score 998 or sacrifice child [ 540.042401] Killed process 10688 (a.out) total-vm:14404kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 540.043111] oom_reaper: reaped process 10688 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 540.212629] kworker/1:1 cpuset=/ mems_allowed=0 [ 540.214404] CPU: 1 PID: 51 Comm: kworker/1:1 Not tainted 4.10.0-rc5-next-20170125+ #495 [ 540.216858] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 540.219901] Workqueue: events pcpu_balance_workfn [ 540.221740] Call Trace: [ 540.223154] dump_stack+0x85/0xc9 [ 540.224724] warn_alloc+0x11e/0x1d0 [ 540.226333] __alloc_pages_slowpath+0x3d4/0xe02 [ 540.228160] __alloc_pages_nodemask+0x382/0x3d0 [ 540.229970] pcpu_populate_chunk+0xc2/0x440 [ 540.231724] pcpu_balance_workfn+0x615/0x670 [ 540.233483] ? process_one_work+0x194/0x760 [ 540.235405] process_one_work+0x22b/0x760 [ 540.237133] ? process_one_work+0x194/0x760 [ 540.238943] worker_thread+0x243/0x4b0 [ 540.240588] kthread+0x10f/0x150 [ 540.242125] ? process_one_work+0x760/0x760 [ 540.243865] ? 
kthread_create_on_node+0x70/0x70 [ 540.245631] ret_from_fork+0x31/0x40 [ 540.247278] Mem-Info: [ 540.248572] active_anon:170208 inactive_anon:2096 isolated_anon:0 [ 540.248572] active_file:40163 inactive_file:40049 isolated_file:32 [ 540.248572] unevictable:0 dirty:78514 writeback:1568 unstable:0 [ 540.248572] slab_reclaimable:19763 slab_unreclaimable:47744 [ 540.248572] mapped:522 shmem:2162 pagetables:4842 bounce:0 [ 540.248572] free:12698 free_pcp:500 free_cma:0 [ 540.259735] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160412kB inactive_file:160436kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2088kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:519289 all_unreclaimable? yes [ 540.267919] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 540.276033] lowmem_reserve[]: 0 1443 1443 1443 [ 540.277629] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160408kB inactive_file:160432kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2000kB local_pcp:352kB free_cma:0kB [ 540.286732] lowmem_reserve[]: 0 0 0 0 [ 540.288204] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB [ 540.292593] Node 0 DMA32: 738*4kB (ME) 1125*8kB (ME) 539*16kB (UME) 209*32kB (ME) 106*64kB (E) 42*128kB (UEH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB [ 
540.297228] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 540.299825] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 540.302365] 82400 total pagecache pages [ 540.304010] 0 pages in swap cache [ 540.305535] Swap cache stats: add 0, delete 0, find 0/0 [ 540.307302] Free swap = 0kB [ 540.308600] Total swap = 0kB [ 540.309915] 524157 pages RAM [ 540.311187] 0 pages HighMem/MovableOnly [ 540.312613] 150519 pages reserved [ 540.314026] 0 pages cma reserved [ 540.315325] 0 pages hwpoisoned [ 540.317504] kworker/1:1 invoked oom-killer: gfp_mask=0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 540.320589] kworker/1:1 cpuset=/ mems_allowed=0 [ 540.322213] CPU: 1 PID: 51 Comm: kworker/1:1 Not tainted 4.10.0-rc5-next-20170125+ #495 [ 540.324410] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 540.327138] Workqueue: events pcpu_balance_workfn [ 540.328821] Call Trace: [ 540.330060] dump_stack+0x85/0xc9 [ 540.331449] dump_header+0x9f/0x296 [ 540.332925] ? trace_hardirqs_on+0xd/0x10 [ 540.334436] oom_kill_process+0x219/0x400 [ 540.335963] out_of_memory+0x13d/0x580 [ 540.337615] ? out_of_memory+0x20d/0x580 [ 540.339214] __alloc_pages_slowpath+0x951/0xe02 [ 540.340875] __alloc_pages_nodemask+0x382/0x3d0 [ 540.342544] pcpu_populate_chunk+0xc2/0x440 [ 540.344125] pcpu_balance_workfn+0x615/0x670 [ 540.345729] ? process_one_work+0x194/0x760 [ 540.347301] process_one_work+0x22b/0x760 [ 540.349042] ? process_one_work+0x194/0x760 [ 540.350616] worker_thread+0x243/0x4b0 [ 540.352245] kthread+0x10f/0x150 [ 540.353613] ? process_one_work+0x760/0x760 [ 540.355152] ? 
kthread_create_on_node+0x70/0x70 [ 540.356709] ret_from_fork+0x31/0x40 [ 540.358083] Mem-Info: [ 540.359191] active_anon:170208 inactive_anon:2096 isolated_anon:0 [ 540.359191] active_file:40103 inactive_file:40109 isolated_file:32 [ 540.359191] unevictable:0 dirty:78514 writeback:1568 unstable:0 [ 540.359191] slab_reclaimable:19763 slab_unreclaimable:47744 [ 540.359191] mapped:522 shmem:2162 pagetables:4842 bounce:0 [ 540.359191] free:12698 free_pcp:500 free_cma:0 [ 540.369461] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160412kB inactive_file:160436kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2088kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:519430 all_unreclaimable? yes [ 540.376876] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 540.384224] lowmem_reserve[]: 0 1443 1443 1443 [ 540.385668] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160408kB inactive_file:160432kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2000kB local_pcp:352kB free_cma:0kB [ 540.394066] lowmem_reserve[]: 0 0 0 0 [ 540.395479] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB [ 540.399533] Node 0 DMA32: 738*4kB (ME) 1125*8kB (ME) 539*16kB (UME) 209*32kB (ME) 106*64kB (E) 42*128kB (UEH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB [ 
540.403793] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 540.406130] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 540.408490] 82400 total pagecache pages [ 540.409942] 0 pages in swap cache [ 540.411320] Swap cache stats: add 0, delete 0, find 0/0 [ 540.412992] Free swap = 0kB [ 540.414260] Total swap = 0kB [ 540.415633] 524157 pages RAM [ 540.416877] 0 pages HighMem/MovableOnly [ 540.418307] 150519 pages reserved [ 540.419695] 0 pages cma reserved [ 540.421020] 0 pages hwpoisoned [ 540.422293] Out of memory: Kill process 10689 (a.out) score 998 or sacrifice child [ 540.424450] Killed process 10689 (a.out) total-vm:14404kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 540.430407] oom_reaper: reaped process 10689 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 575.747685] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 242s! [ 575.757497] Showing busy workqueues and worker pools: [ 575.765110] workqueue events: flags=0x0 [ 575.772069] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=26/256 [ 575.780544] pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801) [ 575.827418] workqueue writeback: flags=0x4e [ 575.829234] pwq 256: cpus=0-127 
flags=0x4 nice=0 active=2/256 [ 575.831299] in-flight: 425:wb_workfn wb_workfn [ 575.834155] workqueue xfs-eofblocks/sda1: flags=0xc [ 575.836083] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.838318] in-flight: 123:xfs_eofblocks_worker [xfs] [ 575.840396] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=242s workers=2 manager: 80 [ 575.843446] pool 256: cpus=0-127 flags=0x4 nice=0 hung=35s workers=3 idle: 424 423 [ 605.951087] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 272s! [ 605.961096] Showing busy workqueues and worker pools: [ 605.968703] workqueue events: flags=0x0 [ 605.975212] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=27/256 [ 605.982787] pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801) [ 606.010284] , drain_local_pages_wq BAR(47) [ 606.012955] workqueue writeback: flags=0x4e [ 606.014860] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256 [ 606.016732] in-flight: 425:wb_workfn wb_workfn [ 606.019085] workqueue mpt_poll_0: flags=0x8 [ 606.020678] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 606.022521] pending: mpt_fault_reset_work [mptbase] [ 606.024445] workqueue xfs-eofblocks/sda1: flags=0xc [ 606.026148] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 606.027992] in-flight: 
123:xfs_eofblocks_worker [xfs] [ 606.029904] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=272s workers=2 manager: 80 [ 606.032120] pool 256: cpus=0-127 flags=0x4 nice=0 hung=65s workers=3 idle: 424 423 (...snipped...) [ 908.869406] sysrq: SysRq : Show State [ 908.875534] task PC stack pid father [ 908.883117] systemd D11784 1 0 0x00000000 [ 908.890352] Call Trace: [ 908.893121] __schedule+0x345/0xdd0 [ 908.895830] ? __list_lru_count_one.isra.2+0x22/0x80 [ 908.899036] schedule+0x3d/0x90 [ 908.901616] schedule_timeout+0x287/0x540 [ 908.904485] ? wait_for_completion+0x4c/0x190 [ 908.907488] wait_for_completion+0x12c/0x190 [ 908.910423] ? wake_up_q+0x80/0x80 [ 908.913060] flush_work+0x230/0x310 [ 908.915699] ? flush_work+0x2b4/0x310 [ 908.918382] ? work_busy+0xb0/0xb0 [ 908.920976] drain_all_pages.part.88+0x319/0x390 [ 908.923312] ? drain_local_pages+0x30/0x30 [ 908.924833] __alloc_pages_slowpath+0x4dc/0xe02 [ 908.926380] ? alloc_pages_current+0x193/0x1b0 [ 908.927887] __alloc_pages_nodemask+0x382/0x3d0 [ 908.929406] ? __radix_tree_lookup+0x84/0xf0 [ 908.930879] alloc_pages_current+0x97/0x1b0 [ 908.932333] ? find_get_entry+0x5/0x300 [ 908.933683] __page_cache_alloc+0x15d/0x1a0 [ 908.935069] ? pagecache_get_page+0x2c/0x2b0 [ 908.936447] filemap_fault+0x4df/0x8b0 [ 908.937728] ? filemap_fault+0x373/0x8b0 [ 908.939078] ? xfs_ilock+0x22c/0x360 [xfs] [ 908.940393] ? xfs_filemap_fault+0x64/0x1e0 [xfs] [ 908.941775] ? down_read_nested+0x7b/0xc0 [ 908.943046] ? xfs_ilock+0x22c/0x360 [xfs] [ 908.944290] xfs_filemap_fault+0x6c/0x1e0 [xfs] [ 908.945587] __do_fault+0x1e/0xa0 [ 908.946647] ? _raw_spin_unlock+0x27/0x40 [ 908.947823] handle_mm_fault+0xd75/0x10d0 [ 908.948954] ? handle_mm_fault+0x5e/0x10d0 [ 908.950079] __do_page_fault+0x24a/0x530 [ 908.951158] do_page_fault+0x30/0x80 [ 908.952199] page_fault+0x28/0x30 (...snipped...) 
[ 909.537512] kswapd0 D11112 68 2 0x00000000 [ 909.538860] Call Trace: [ 909.539675] __schedule+0x345/0xdd0 [ 909.540670] schedule+0x3d/0x90 [ 909.541619] rwsem_down_read_failed+0x10e/0x1a0 [ 909.542827] ? xfs_map_blocks+0x98/0x5a0 [xfs] [ 909.543992] call_rwsem_down_read_failed+0x18/0x30 [ 909.545218] down_read_nested+0xaf/0xc0 [ 909.546316] ? xfs_ilock+0x154/0x360 [xfs] [ 909.547519] xfs_ilock+0x154/0x360 [xfs] [ 909.548608] xfs_map_blocks+0x98/0x5a0 [xfs] [ 909.549754] xfs_do_writepage+0x215/0x920 [xfs] [ 909.550954] ? clear_page_dirty_for_io+0xb4/0x310 [ 909.552188] xfs_vm_writepage+0x3b/0x70 [xfs] [ 909.553340] pageout.isra.54+0x1a4/0x460 [ 909.554428] shrink_page_list+0xa86/0xcf0 [ 909.555529] shrink_inactive_list+0x1d3/0x680 [ 909.556680] ? shrink_active_list+0x44f/0x590 [ 909.557829] shrink_node_memcg+0x535/0x7f0 [ 909.558952] ? mem_cgroup_iter+0x14d/0x720 [ 909.560050] shrink_node+0xe1/0x310 [ 909.561043] kswapd+0x362/0x9b0 [ 909.561976] kthread+0x10f/0x150 [ 909.562974] ? mem_cgroup_shrink_node+0x3b0/0x3b0 [ 909.564199] ? kthread_create_on_node+0x70/0x70 [ 909.565375] ret_from_fork+0x31/0x40 (...snipped...) [ 998.658049] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 665s! 
[ 998.667526] Showing busy workqueues and worker pools: [ 998.673851] workqueue events: flags=0x0 [ 998.676147] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=28/256 [ 998.678935] pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801) [ 998.705187] , drain_local_pages_wq BAR(47), drain_local_pages_wq BAR(10805) [ 998.707558] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 998.709548] pending: e1000_watchdog [e1000], vmstat_shepherd [ 998.711593] workqueue events_power_efficient: flags=0x80 [ 998.713479] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 998.715399] pending: neigh_periodic_work [ 998.717075] workqueue writeback: flags=0x4e [ 998.718656] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256 [ 998.720587] in-flight: 425:wb_workfn wb_workfn [ 998.723062] workqueue mpt_poll_0: flags=0x8 [ 998.724712] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 998.726601] pending: mpt_fault_reset_work [mptbase] [ 998.728548] workqueue xfs-eofblocks/sda1: flags=0xc [ 998.730292] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 998.732178] in-flight: 123:xfs_eofblocks_worker [xfs] [ 998.733997] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=665s workers=2 manager: 80 [ 998.736251] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=2 
manager: 53 idle: 10804 [ 998.738634] pool 256: cpus=0-127 flags=0x4 nice=0 hung=458s workers=3 idle: 424 423 ---------- So, you believed that the too_many_isolated() issue was the only problem that could prevent a reasonable return to the page allocator [2]. But the reality is that we are about to introduce a new problem without knowing all the possibilities that can prevent a reasonable return to the page allocator. So, would you please please please accept the asynchronous watchdog [3]? I said "the cause of an allocation stall might be running out of idle workqueue threads" in that post, and I think the lockup above is exactly this case. We cannot be careful enough to prove otherwise; we will forever have the possibility of failing to warn as long as we depend only on a synchronous watchdog. [2] http://lkml.kernel.org/r/201701141910.ACF73418.OJHFVFStQOOMFL@I-love.SAKURA.ne.jp [3] http://lkml.kernel.org/r/201701261928.DIG05227.OtOVFHOJMFLSQF@I-love.SAKURA.ne.jp ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-01-28 15:27 ` Tetsuo Handa 0 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-01-28 15:27 UTC (permalink / raw) To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > Tetsuo, > before we settle on the proper fix for this issue, could you give the > patch a try and try to reproduce the too_many_isolated() issue or > just see whether patch [1] has any negative effect on your oom stress > testing? > > [1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz I tested with both [1] and the patch below applied on linux-next-20170125, and the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz . Regarding the patch below, it helped avoid complete memory depletion with large write() requests. I don't know whether it helps avoid complete memory depletion when reading a large amount (in other words, I don't know whether this check is done for large read() requests). But I believe that __GFP_KILLABLE (despite the limitation that there are unkillable waits in the reclaim path) is a better solution than scattering fatal_signal_pending() checks around the callers. The reason we check SIGKILL here is to avoid allocating more memory than needed. If we check SIGKILL at the entry point of __alloc_pages_nodemask() and at the retry: label in __alloc_pages_slowpath(), we waste zero pages. Regardless of whether the OOM killer is invoked, or whether memory can be allocated without a direct reclaim operation, not allocating memory unless it is needed (in other words, allowing the page allocator to fail immediately when the caller can give up on SIGKILL and SIGKILL is pending) makes sense. It will reduce the possibility of an OOM livelock on CONFIG_MMU=n kernels, where the OOM reaper is not available. > > On Wed 25-01-17 14:00:14, Michal Hocko wrote: > [...] 
> > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001 > > From: Michal Hocko <mhocko@suse.com> > > Date: Wed, 25 Jan 2017 11:06:37 +0100 > > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals > > > > Tetsuo has noticed that an OOM stress test which performs large write > > requests can cause the full memory reserves depletion. He has tracked > > this down to the following path > > __alloc_pages_nodemask+0x436/0x4d0 > > alloc_pages_current+0x97/0x1b0 > > __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 > > pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331 > > grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773 > > iomap_write_begin+0x50/0xd0 fs/iomap.c:118 > > iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190 > > ? iomap_write_end+0x80/0x80 fs/iomap.c:150 > > iomap_apply+0xb3/0x130 fs/iomap.c:79 > > iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243 > > ? iomap_write_end+0x80/0x80 > > xfs_file_buffered_aio_write+0x132/0x390 [xfs] > > ? remove_wait_queue+0x59/0x60 > > xfs_file_write_iter+0x90/0x130 [xfs] > > __vfs_write+0xe5/0x140 > > vfs_write+0xc7/0x1f0 > > ? syscall_trace_enter+0x1d0/0x380 > > SyS_write+0x58/0xc0 > > do_syscall_64+0x6c/0x200 > > entry_SYSCALL64_slow_path+0x25/0x25 > > > > the oom victim has access to all memory reserves to make a forward > > progress to exit easier. But iomap_file_buffered_write and other callers > > of iomap_apply loop to complete the full request. We need to check for > > fatal signals and back off with a short write instead. As the > > iomap_apply delegates all the work down to the actor we have to hook > > into those. All callers that work with the page cache are calling > > iomap_write_begin so we will check for signals there. dax_iomap_actor > > has to handle the situation explicitly because it copies data to the > > userspace directly. Other callers like iomap_page_mkwrite work on a > > single page or iomap_fiemap_actor do not allocate memory based on the > > given len. 
> > > > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") > > Cc: stable # 4.8+ > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > --- > > fs/dax.c | 5 +++++ > > fs/iomap.c | 3 +++ > > 2 files changed, 8 insertions(+) > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 413a91db9351..0e263dacf9cf 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > struct blk_dax_ctl dax = { 0 }; > > ssize_t map_len; > > > > + if (fatal_signal_pending(current)) { > > + ret = -EINTR; > > + break; > > + } > > + > > dax.sector = dax_iomap_sector(iomap, pos); > > dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK; > > map_len = dax_map_atomic(iomap->bdev, &dax); > > diff --git a/fs/iomap.c b/fs/iomap.c > > index e57b90b5ff37..691eada58b06 100644 > > --- a/fs/iomap.c > > +++ b/fs/iomap.c > > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags, > > > > BUG_ON(pos + len > iomap->offset + iomap->length); > > > > + if (fatal_signal_pending(current)) > > + return -EINTR; > > + > > page = grab_cache_page_write_begin(inode->i_mapping, index, flags); > > if (!page) > > return -ENOMEM; > > -- > > 2.11.0 Regarding [1], it helped avoid the too_many_isolated() issue. I can't tell whether it has any negative effects, but on the first trial I hit a state where all allocating threads were blocked on wait_for_completion() from flush_work() in drain_all_pages(), introduced by "mm, page_alloc: drain per-cpu pages from workqueue context". There were no warn_alloc() stall warning messages afterwards. 
---------- [ 540.039842] kworker/1:1: page allocation stalls for 10079ms, order:0, mode:0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null) [ 540.041961] kthreadd invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=(null), order=2, oom_score_adj=0 [ 540.041970] kthreadd cpuset=/ mems_allowed=0 [ 540.041984] CPU: 3 PID: 2 Comm: kthreadd Not tainted 4.10.0-rc5-next-20170125+ #495 [ 540.041987] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 540.041989] Call Trace: [ 540.042008] dump_stack+0x85/0xc9 [ 540.042016] dump_header+0x9f/0x296 [ 540.042028] ? trace_hardirqs_on+0xd/0x10 [ 540.042039] oom_kill_process+0x219/0x400 [ 540.042046] out_of_memory+0x13d/0x580 [ 540.042049] ? out_of_memory+0x20d/0x580 [ 540.042058] __alloc_pages_slowpath+0x951/0xe02 [ 540.042063] ? deactivate_slab+0x1fb/0x690 [ 540.042082] __alloc_pages_nodemask+0x382/0x3d0 [ 540.042091] new_slab+0x450/0x6b0 [ 540.042100] ___slab_alloc+0x3a3/0x620 [ 540.042109] ? copy_process.part.31+0x122/0x2200 [ 540.042116] ? cpuacct_charge+0x38/0x1e0 [ 540.042122] ? copy_process.part.31+0x122/0x2200 [ 540.042129] __slab_alloc+0x46/0x7d [ 540.042135] kmem_cache_alloc_node+0xab/0x3a0 [ 540.042144] copy_process.part.31+0x122/0x2200 [ 540.042150] ? cpuacct_charge+0xf3/0x1e0 [ 540.042153] ? cpuacct_charge+0x38/0x1e0 [ 540.042164] ? kthread_create_on_node+0x70/0x70 [ 540.042168] ? finish_task_switch+0x70/0x240 [ 540.042175] _do_fork+0xf3/0x750 [ 540.042183] ? kthreadd+0x2f2/0x3c0 [ 540.042193] kernel_thread+0x29/0x30 [ 540.042196] kthreadd+0x35a/0x3c0 [ 540.042206] ? ret_from_fork+0x31/0x40 [ 540.042218] ? 
kthread_create_on_cpu+0xb0/0xb0 [ 540.042225] ret_from_fork+0x31/0x40 [ 540.042237] Mem-Info: [ 540.042248] active_anon:170208 inactive_anon:2096 isolated_anon:0 [ 540.042248] active_file:40034 inactive_file:40034 isolated_file:32 [ 540.042248] unevictable:0 dirty:78514 writeback:1568 unstable:0 [ 540.042248] slab_reclaimable:19763 slab_unreclaimable:47744 [ 540.042248] mapped:491 shmem:2162 pagetables:4842 bounce:0 [ 540.042248] free:12698 free_pcp:637 free_cma:0 [ 540.042258] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160136kB inactive_file:160136kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:1964kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:561618 all_unreclaimable? yes [ 540.042260] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 540.042270] lowmem_reserve[]: 0 1443 1443 1443 [ 540.042279] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160132kB inactive_file:160132kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2548kB local_pcp:728kB free_cma:0kB [ 540.042288] lowmem_reserve[]: 0 0 0 0 [ 540.042296] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB [ 540.042330] Node 0 DMA32: 764*4kB (UME) 1122*8kB (UME) 536*16kB (UME) 210*32kB (UME) 107*64kB (UE) 41*128kB (EH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB [ 
540.042363] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 540.042366] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 540.042368] 82262 total pagecache pages [ 540.042371] 0 pages in swap cache [ 540.042374] Swap cache stats: add 0, delete 0, find 0/0 [ 540.042376] Free swap = 0kB [ 540.042377] Total swap = 0kB [ 540.042380] 524157 pages RAM [ 540.042382] 0 pages HighMem/MovableOnly [ 540.042383] 150519 pages reserved [ 540.042384] 0 pages cma reserved [ 540.042386] 0 pages hwpoisoned [ 540.042390] Out of memory: Kill process 10688 (a.out) score 998 or sacrifice child [ 540.042401] Killed process 10688 (a.out) total-vm:14404kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 540.043111] oom_reaper: reaped process 10688 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 540.212629] kworker/1:1 cpuset=/ mems_allowed=0 [ 540.214404] CPU: 1 PID: 51 Comm: kworker/1:1 Not tainted 4.10.0-rc5-next-20170125+ #495 [ 540.216858] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 540.219901] Workqueue: events pcpu_balance_workfn [ 540.221740] Call Trace: [ 540.223154] dump_stack+0x85/0xc9 [ 540.224724] warn_alloc+0x11e/0x1d0 [ 540.226333] __alloc_pages_slowpath+0x3d4/0xe02 [ 540.228160] __alloc_pages_nodemask+0x382/0x3d0 [ 540.229970] pcpu_populate_chunk+0xc2/0x440 [ 540.231724] pcpu_balance_workfn+0x615/0x670 [ 540.233483] ? process_one_work+0x194/0x760 [ 540.235405] process_one_work+0x22b/0x760 [ 540.237133] ? process_one_work+0x194/0x760 [ 540.238943] worker_thread+0x243/0x4b0 [ 540.240588] kthread+0x10f/0x150 [ 540.242125] ? process_one_work+0x760/0x760 [ 540.243865] ? 
kthread_create_on_node+0x70/0x70 [ 540.245631] ret_from_fork+0x31/0x40 [ 540.247278] Mem-Info: [ 540.248572] active_anon:170208 inactive_anon:2096 isolated_anon:0 [ 540.248572] active_file:40163 inactive_file:40049 isolated_file:32 [ 540.248572] unevictable:0 dirty:78514 writeback:1568 unstable:0 [ 540.248572] slab_reclaimable:19763 slab_unreclaimable:47744 [ 540.248572] mapped:522 shmem:2162 pagetables:4842 bounce:0 [ 540.248572] free:12698 free_pcp:500 free_cma:0 [ 540.259735] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160412kB inactive_file:160436kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2088kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:519289 all_unreclaimable? yes [ 540.267919] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 540.276033] lowmem_reserve[]: 0 1443 1443 1443 [ 540.277629] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160408kB inactive_file:160432kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2000kB local_pcp:352kB free_cma:0kB [ 540.286732] lowmem_reserve[]: 0 0 0 0 [ 540.288204] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB [ 540.292593] Node 0 DMA32: 738*4kB (ME) 1125*8kB (ME) 539*16kB (UME) 209*32kB (ME) 106*64kB (E) 42*128kB (UEH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB [ 
540.297228] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 540.299825] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 540.302365] 82400 total pagecache pages [ 540.304010] 0 pages in swap cache [ 540.305535] Swap cache stats: add 0, delete 0, find 0/0 [ 540.307302] Free swap = 0kB [ 540.308600] Total swap = 0kB [ 540.309915] 524157 pages RAM [ 540.311187] 0 pages HighMem/MovableOnly [ 540.312613] 150519 pages reserved [ 540.314026] 0 pages cma reserved [ 540.315325] 0 pages hwpoisoned [ 540.317504] kworker/1:1 invoked oom-killer: gfp_mask=0x14001c2(GFP_KERNEL|__GFP_HIGHMEM|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 540.320589] kworker/1:1 cpuset=/ mems_allowed=0 [ 540.322213] CPU: 1 PID: 51 Comm: kworker/1:1 Not tainted 4.10.0-rc5-next-20170125+ #495 [ 540.324410] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 540.327138] Workqueue: events pcpu_balance_workfn [ 540.328821] Call Trace: [ 540.330060] dump_stack+0x85/0xc9 [ 540.331449] dump_header+0x9f/0x296 [ 540.332925] ? trace_hardirqs_on+0xd/0x10 [ 540.334436] oom_kill_process+0x219/0x400 [ 540.335963] out_of_memory+0x13d/0x580 [ 540.337615] ? out_of_memory+0x20d/0x580 [ 540.339214] __alloc_pages_slowpath+0x951/0xe02 [ 540.340875] __alloc_pages_nodemask+0x382/0x3d0 [ 540.342544] pcpu_populate_chunk+0xc2/0x440 [ 540.344125] pcpu_balance_workfn+0x615/0x670 [ 540.345729] ? process_one_work+0x194/0x760 [ 540.347301] process_one_work+0x22b/0x760 [ 540.349042] ? process_one_work+0x194/0x760 [ 540.350616] worker_thread+0x243/0x4b0 [ 540.352245] kthread+0x10f/0x150 [ 540.353613] ? process_one_work+0x760/0x760 [ 540.355152] ? 
kthread_create_on_node+0x70/0x70 [ 540.356709] ret_from_fork+0x31/0x40 [ 540.358083] Mem-Info: [ 540.359191] active_anon:170208 inactive_anon:2096 isolated_anon:0 [ 540.359191] active_file:40103 inactive_file:40109 isolated_file:32 [ 540.359191] unevictable:0 dirty:78514 writeback:1568 unstable:0 [ 540.359191] slab_reclaimable:19763 slab_unreclaimable:47744 [ 540.359191] mapped:522 shmem:2162 pagetables:4842 bounce:0 [ 540.359191] free:12698 free_pcp:500 free_cma:0 [ 540.369461] Node 0 active_anon:680832kB inactive_anon:8384kB active_file:160412kB inactive_file:160436kB unevictable:0kB isolated(anon):0kB isolated(file):128kB mapped:2088kB dirty:314056kB writeback:6272kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 217088kB anon_thp: 8648kB writeback_tmp:0kB unstable:0kB pages_scanned:519430 all_unreclaimable? yes [ 540.376876] Node 0 DMA free:6248kB min:476kB low:592kB high:708kB active_anon:9492kB inactive_anon:0kB active_file:4kB inactive_file:4kB unevictable:0kB writepending:8kB present:15988kB managed:15904kB mlocked:0kB slab_reclaimable:48kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:28kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB [ 540.384224] lowmem_reserve[]: 0 1443 1443 1443 [ 540.385668] Node 0 DMA32 free:44544kB min:44576kB low:55720kB high:66864kB active_anon:671340kB inactive_anon:8384kB active_file:160408kB inactive_file:160432kB unevictable:0kB writepending:320320kB present:2080640kB managed:1478648kB mlocked:0kB slab_reclaimable:79004kB slab_unreclaimable:190944kB kernel_stack:12240kB pagetables:19340kB bounce:0kB free_pcp:2000kB local_pcp:352kB free_cma:0kB [ 540.394066] lowmem_reserve[]: 0 0 0 0 [ 540.395479] Node 0 DMA: 2*4kB (UM) 0*8kB 2*16kB (UE) 4*32kB (UME) 3*64kB (ME) 2*128kB (UM) 2*256kB (UE) 2*512kB (ME) 2*1024kB (UE) 1*2048kB (E) 0*4096kB = 6248kB [ 540.399533] Node 0 DMA32: 738*4kB (ME) 1125*8kB (ME) 539*16kB (UME) 209*32kB (ME) 106*64kB (E) 42*128kB (UEH) 20*256kB (UME) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 44544kB [ 
540.403793] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB [ 540.406130] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB [ 540.408490] 82400 total pagecache pages [ 540.409942] 0 pages in swap cache [ 540.411320] Swap cache stats: add 0, delete 0, find 0/0 [ 540.412992] Free swap = 0kB [ 540.414260] Total swap = 0kB [ 540.415633] 524157 pages RAM [ 540.416877] 0 pages HighMem/MovableOnly [ 540.418307] 150519 pages reserved [ 540.419695] 0 pages cma reserved [ 540.421020] 0 pages hwpoisoned [ 540.422293] Out of memory: Kill process 10689 (a.out) score 998 or sacrifice child [ 540.424450] Killed process 10689 (a.out) total-vm:14404kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 540.430407] oom_reaper: reaped process 10689 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 575.747685] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 242s! [ 575.757497] Showing busy workqueues and worker pools: [ 575.765110] workqueue events: flags=0x0 [ 575.772069] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=26/256 [ 575.780544] pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801) [ 575.827418] workqueue writeback: flags=0x4e [ 575.829234] pwq 256: cpus=0-127 
flags=0x4 nice=0 active=2/256 [ 575.831299] in-flight: 425:wb_workfn wb_workfn [ 575.834155] workqueue xfs-eofblocks/sda1: flags=0xc [ 575.836083] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.838318] in-flight: 123:xfs_eofblocks_worker [xfs] [ 575.840396] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=242s workers=2 manager: 80 [ 575.843446] pool 256: cpus=0-127 flags=0x4 nice=0 hung=35s workers=3 idle: 424 423 [ 605.951087] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 272s! [ 605.961096] Showing busy workqueues and worker pools: [ 605.968703] workqueue events: flags=0x0 [ 605.975212] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=27/256 [ 605.982787] pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801) [ 606.010284] , drain_local_pages_wq BAR(47) [ 606.012955] workqueue writeback: flags=0x4e [ 606.014860] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256 [ 606.016732] in-flight: 425:wb_workfn wb_workfn [ 606.019085] workqueue mpt_poll_0: flags=0x8 [ 606.020678] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 606.022521] pending: mpt_fault_reset_work [mptbase] [ 606.024445] workqueue xfs-eofblocks/sda1: flags=0xc [ 606.026148] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 606.027992] in-flight: 
123:xfs_eofblocks_worker [xfs] [ 606.029904] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=272s workers=2 manager: 80 [ 606.032120] pool 256: cpus=0-127 flags=0x4 nice=0 hung=65s workers=3 idle: 424 423 (...snipped...) [ 908.869406] sysrq: SysRq : Show State [ 908.875534] task PC stack pid father [ 908.883117] systemd D11784 1 0 0x00000000 [ 908.890352] Call Trace: [ 908.893121] __schedule+0x345/0xdd0 [ 908.895830] ? __list_lru_count_one.isra.2+0x22/0x80 [ 908.899036] schedule+0x3d/0x90 [ 908.901616] schedule_timeout+0x287/0x540 [ 908.904485] ? wait_for_completion+0x4c/0x190 [ 908.907488] wait_for_completion+0x12c/0x190 [ 908.910423] ? wake_up_q+0x80/0x80 [ 908.913060] flush_work+0x230/0x310 [ 908.915699] ? flush_work+0x2b4/0x310 [ 908.918382] ? work_busy+0xb0/0xb0 [ 908.920976] drain_all_pages.part.88+0x319/0x390 [ 908.923312] ? drain_local_pages+0x30/0x30 [ 908.924833] __alloc_pages_slowpath+0x4dc/0xe02 [ 908.926380] ? alloc_pages_current+0x193/0x1b0 [ 908.927887] __alloc_pages_nodemask+0x382/0x3d0 [ 908.929406] ? __radix_tree_lookup+0x84/0xf0 [ 908.930879] alloc_pages_current+0x97/0x1b0 [ 908.932333] ? find_get_entry+0x5/0x300 [ 908.933683] __page_cache_alloc+0x15d/0x1a0 [ 908.935069] ? pagecache_get_page+0x2c/0x2b0 [ 908.936447] filemap_fault+0x4df/0x8b0 [ 908.937728] ? filemap_fault+0x373/0x8b0 [ 908.939078] ? xfs_ilock+0x22c/0x360 [xfs] [ 908.940393] ? xfs_filemap_fault+0x64/0x1e0 [xfs] [ 908.941775] ? down_read_nested+0x7b/0xc0 [ 908.943046] ? xfs_ilock+0x22c/0x360 [xfs] [ 908.944290] xfs_filemap_fault+0x6c/0x1e0 [xfs] [ 908.945587] __do_fault+0x1e/0xa0 [ 908.946647] ? _raw_spin_unlock+0x27/0x40 [ 908.947823] handle_mm_fault+0xd75/0x10d0 [ 908.948954] ? handle_mm_fault+0x5e/0x10d0 [ 908.950079] __do_page_fault+0x24a/0x530 [ 908.951158] do_page_fault+0x30/0x80 [ 908.952199] page_fault+0x28/0x30 (...snipped...) 
[ 909.537512] kswapd0 D11112 68 2 0x00000000 [ 909.538860] Call Trace: [ 909.539675] __schedule+0x345/0xdd0 [ 909.540670] schedule+0x3d/0x90 [ 909.541619] rwsem_down_read_failed+0x10e/0x1a0 [ 909.542827] ? xfs_map_blocks+0x98/0x5a0 [xfs] [ 909.543992] call_rwsem_down_read_failed+0x18/0x30 [ 909.545218] down_read_nested+0xaf/0xc0 [ 909.546316] ? xfs_ilock+0x154/0x360 [xfs] [ 909.547519] xfs_ilock+0x154/0x360 [xfs] [ 909.548608] xfs_map_blocks+0x98/0x5a0 [xfs] [ 909.549754] xfs_do_writepage+0x215/0x920 [xfs] [ 909.550954] ? clear_page_dirty_for_io+0xb4/0x310 [ 909.552188] xfs_vm_writepage+0x3b/0x70 [xfs] [ 909.553340] pageout.isra.54+0x1a4/0x460 [ 909.554428] shrink_page_list+0xa86/0xcf0 [ 909.555529] shrink_inactive_list+0x1d3/0x680 [ 909.556680] ? shrink_active_list+0x44f/0x590 [ 909.557829] shrink_node_memcg+0x535/0x7f0 [ 909.558952] ? mem_cgroup_iter+0x14d/0x720 [ 909.560050] shrink_node+0xe1/0x310 [ 909.561043] kswapd+0x362/0x9b0 [ 909.561976] kthread+0x10f/0x150 [ 909.562974] ? mem_cgroup_shrink_node+0x3b0/0x3b0 [ 909.564199] ? kthread_create_on_node+0x70/0x70 [ 909.565375] ret_from_fork+0x31/0x40 (...snipped...) [ 998.658049] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 665s! 
[ 998.667526] Showing busy workqueues and worker pools: [ 998.673851] workqueue events: flags=0x0 [ 998.676147] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=28/256 [ 998.678935] pending: free_work, vmpressure_work_fn, drain_local_pages_wq BAR(9811), vmw_fb_dirty_flush [vmwgfx], drain_local_pages_wq BAR(2506), drain_local_pages_wq BAR(812), drain_local_pages_wq BAR(2466), drain_local_pages_wq BAR(2485), drain_local_pages_wq BAR(3714), drain_local_pages_wq BAR(2862), drain_local_pages_wq BAR(827), drain_local_pages_wq BAR(527), drain_local_pages_wq BAR(9779), drain_local_pages_wq BAR(2484), drain_local_pages_wq BAR(932), drain_local_pages_wq BAR(2492), drain_local_pages_wq BAR(9820), drain_local_pages_wq BAR(811), drain_local_pages_wq BAR(1), drain_local_pages_wq BAR(2521), drain_local_pages_wq BAR(565), drain_local_pages_wq BAR(10420), drain_local_pages_wq BAR(9824), drain_local_pages_wq BAR(9749), drain_local_pages_wq BAR(2), drain_local_pages_wq BAR(9801) [ 998.705187] , drain_local_pages_wq BAR(47), drain_local_pages_wq BAR(10805) [ 998.707558] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 998.709548] pending: e1000_watchdog [e1000], vmstat_shepherd [ 998.711593] workqueue events_power_efficient: flags=0x80 [ 998.713479] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 998.715399] pending: neigh_periodic_work [ 998.717075] workqueue writeback: flags=0x4e [ 998.718656] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256 [ 998.720587] in-flight: 425:wb_workfn wb_workfn [ 998.723062] workqueue mpt_poll_0: flags=0x8 [ 998.724712] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 998.726601] pending: mpt_fault_reset_work [mptbase] [ 998.728548] workqueue xfs-eofblocks/sda1: flags=0xc [ 998.730292] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 998.732178] in-flight: 123:xfs_eofblocks_worker [xfs] [ 998.733997] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=665s workers=2 manager: 80 [ 998.736251] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=0s workers=2 
manager: 53 idle: 10804 [ 998.738634] pool 256: cpus=0-127 flags=0x4 nice=0 hung=458s workers=3 idle: 424 423 ---------- So, you believed that the too_many_isolated() issue is the only problem that can prevent a reasonable return to the page allocator [2]. But in reality we are about to introduce a new problem without knowing all the possibilities that can prevent a reasonable return to the page allocator. So, would you please accept the asynchronous watchdog [3]? I said "the cause of an allocation stall might be running out of idle workqueue threads" in that post, and I think the lockup above is exactly this case. We can never be careful enough to prove otherwise; as long as we depend only on a synchronous watchdog, we will always have the possibility of failing to warn. [2] http://lkml.kernel.org/r/201701141910.ACF73418.OJHFVFStQOOMFL@I-love.SAKURA.ne.jp [3] http://lkml.kernel.org/r/201701261928.DIG05227.OtOVFHOJMFLSQF@I-love.SAKURA.ne.jp ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-28 15:27 ` Tetsuo Handa @ 2017-01-30 8:55 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-30 8:55 UTC (permalink / raw) To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: > Michal Hocko wrote: > > Tetsuo, > > before we settle on the proper fix for this issue, could you give the > > patch a try and try to reproduce the too_many_isolated() issue or > > just see whether patch [1] has any negative effect on your oom stress > > testing? > > > > [1] http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz > > I tested with both [1] and below patch applied on linux-next-20170125 and > the result is at http://I-love.SAKURA.ne.jp/tmp/serial-20170128.txt.xz . > > Regarding below patch, it helped avoiding complete memory depletion with > large write() request. I don't know whether below patch helps avoiding > complete memory depletion when reading large amount (in other words, I > don't know whether this check is done for large read() request). It's not AFAICS. do_generic_file_read doesn't do the fatal_signal_pending check. > But > I believe that __GFP_KILLABLE (despite the limitation that there are > unkillable waits in the reclaim path) is better solution compared to > scattering around fatal_signal_pending() in the callers. The reason > we check SIGKILL here is to avoid allocating memory more than needed. > If we check SIGKILL in the entry point of __alloc_pages_nodemask() and > retry: label in __alloc_pages_slowpath(), we waste 0 page. Regardless > of whether the OOM killer is invoked, whether memory can be allocated > without direct reclaim operation, not allocating memory unless needed > (in other words, allow page allocator fail immediately if the caller > can give up on SIGKILL and SIGKILL is pending) makes sense. 
It will > reduce possibility of OOM livelock on CONFIG_MMU=n kernels where the > OOM reaper is not available. I am not really convinced this is a good idea. Putting aside the fuzzy semantics of __GFP_KILLABLE, we would have to use this flag in all potentially allocating places in the read/write paths, and then it is just easier to do the explicit checks in the loops around those allocations. > > On Wed 25-01-17 14:00:14, Michal Hocko wrote: > > [...] > > > From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001 > > > From: Michal Hocko <mhocko@suse.com> > > > Date: Wed, 25 Jan 2017 11:06:37 +0100 > > > Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals > > > > > > Tetsuo has noticed that an OOM stress test which performs large write > > > requests can cause the full memory reserves depletion. He has tracked > > > this down to the following path > > > __alloc_pages_nodemask+0x436/0x4d0 > > > alloc_pages_current+0x97/0x1b0 > > > __page_cache_alloc+0x15d/0x1a0 mm/filemap.c:728 > > > pagecache_get_page+0x5a/0x2b0 mm/filemap.c:1331 > > > grab_cache_page_write_begin+0x23/0x40 mm/filemap.c:2773 > > > iomap_write_begin+0x50/0xd0 fs/iomap.c:118 > > > iomap_write_actor+0xb5/0x1a0 fs/iomap.c:190 > > > ? iomap_write_end+0x80/0x80 fs/iomap.c:150 > > > iomap_apply+0xb3/0x130 fs/iomap.c:79 > > > iomap_file_buffered_write+0x68/0xa0 fs/iomap.c:243 > > > ? iomap_write_end+0x80/0x80 > > > xfs_file_buffered_aio_write+0x132/0x390 [xfs] > > > ? remove_wait_queue+0x59/0x60 > > > xfs_file_write_iter+0x90/0x130 [xfs] > > > __vfs_write+0xe5/0x140 > > > vfs_write+0xc7/0x1f0 > > > ? syscall_trace_enter+0x1d0/0x380 > > > SyS_write+0x58/0xc0 > > > do_syscall_64+0x6c/0x200 > > > entry_SYSCALL64_slow_path+0x25/0x25 > > > > > > the oom victim has access to all memory reserves to make a forward > > > progress to exit easier. But iomap_file_buffered_write and other callers > > > of iomap_apply loop to complete the full request. 
We need to check for > > > fatal signals and back off with a short write instead. As the > > > iomap_apply delegates all the work down to the actor we have to hook > > > into those. All callers that work with the page cache are calling > > > iomap_write_begin so we will check for signals there. dax_iomap_actor > > > has to handle the situation explicitly because it copies data to the > > > userspace directly. Other callers like iomap_page_mkwrite work on a > > > single page or iomap_fiemap_actor do not allocate memory based on the > > > given len. > > > > > > Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path") > > > Cc: stable # 4.8+ > > > Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > > --- > > > fs/dax.c | 5 +++++ > > > fs/iomap.c | 3 +++ > > > 2 files changed, 8 insertions(+) > > > > > > diff --git a/fs/dax.c b/fs/dax.c > > > index 413a91db9351..0e263dacf9cf 100644 > > > --- a/fs/dax.c > > > +++ b/fs/dax.c > > > @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data, > > > struct blk_dax_ctl dax = { 0 }; > > > ssize_t map_len; > > > > > > + if (fatal_signal_pending(current)) { > > > + ret = -EINTR; > > > + break; > > > + } > > > + > > > dax.sector = dax_iomap_sector(iomap, pos); > > > dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK; > > > map_len = dax_map_atomic(iomap->bdev, &dax); > > > diff --git a/fs/iomap.c b/fs/iomap.c > > > index e57b90b5ff37..691eada58b06 100644 > > > --- a/fs/iomap.c > > > +++ b/fs/iomap.c > > > @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags, > > > > > > BUG_ON(pos + len > iomap->offset + iomap->length); > > > > > > + if (fatal_signal_pending(current)) > > > + return -EINTR; > > > + > > > page = grab_cache_page_write_begin(inode->i_mapping, index, flags); > > > if (!page) > > > return -ENOMEM; > > > -- > > > 2.11.0 > > Regarding [1], it helped 
avoiding the too_many_isolated() issue. I can't > tell whether it has any negative effect, but I got on the first trial that > all allocating threads are blocked on wait_for_completion() from flush_work() > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from > workqueue context". There was no warn_alloc() stall warning message afterwards. That patch is buggy and there is a follow up [1] which is not sitting in the mmotm (and thus linux-next) yet. I didn't get to review it properly and I cannot say I would be too happy about using WQ from the page allocator. I believe even the follow up needs to have a WQ_RECLAIM WQ. [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net Thanks for your testing! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-30 8:55 ` Michal Hocko @ 2017-02-02 10:14 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-02 10:14 UTC (permalink / raw) To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel On Mon 30-01-17 09:55:46, Michal Hocko wrote: > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: [...] > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't > > tell whether it has any negative effect, but I got on the first trial that > > all allocating threads are blocked on wait_for_completion() from flush_work() > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from > > workqueue context". There was no warn_alloc() stall warning message afterwards. > > That patch is buggy and there is a follow up [1] which is not sitting in the > mmotm (and thus linux-next) yet. I didn't get to review it properly and > I cannot say I would be too happy about using WQ from the page > allocator. I believe even the follow up needs to have a WQ_RECLAIM WQ. > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net Did you get a chance to test with this follow-up patch? It would be interesting to see whether an OOM situation can still starve the waiter. The current linux-next should contain this patch. Thanks! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
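The "WQ_RECLAIM WQ" referred to in this exchange is the WQ_MEM_RECLAIM flag of the workqueue API: a workqueue created with it gets a pre-allocated rescuer thread, so queued work can still make forward progress when memory pressure prevents spawning new workers. A kernel-style sketch of what a dedicated drain workqueue could look like (the name mm_drain_wq is hypothetical and this is not the actual follow-up patch; it only illustrates the flag being argued for):

```c
/* Kernel code sketch, not a standalone program.  A workqueue used
 * from the page allocator's reclaim path must be created with
 * WQ_MEM_RECLAIM: without a rescuer thread, running the queued work
 * may itself require a worker allocation and deadlock under OOM. */
#include <linux/workqueue.h>

static struct workqueue_struct *mm_drain_wq;   /* hypothetical name */

static int __init mm_drain_wq_init(void)
{
        mm_drain_wq = alloc_workqueue("mm_drain_wq", WQ_MEM_RECLAIM, 0);
        if (!mm_drain_wq)
                return -ENOMEM;
        return 0;
}

/* drain_all_pages() would then queue its per-cpu drain work here
 * instead of on the system_wq:
 *
 *      queue_work_on(cpu, mm_drain_wq, &per_cpu_drain_work);
 */
```

Sharing the vmstat workqueue, as Tetsuo suggests below, would achieve the same guarantee since it is already created with WQ_MEM_RECLAIM.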
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-02 10:14 ` Michal Hocko @ 2017-02-03 10:57 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-03 10:57 UTC (permalink / raw) To: mhocko Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > On Mon 30-01-17 09:55:46, Michal Hocko wrote: > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: > [...] > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't > > > tell whether it has any negative effect, but I got on the first trial that > > > all allocating threads are blocked on wait_for_completion() from flush_work() > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from > > > workqueue context". There was no warn_alloc() stall warning message afterwards. > > > > That patch is buggy and there is a follow up [1] which is not sitting in the > > mmotm (and thus linux-next) yet. I didn't get to review it properly and > > I cannot say I would be too happy about using WQ from the page > > allocator. I believe even the follow up needs to have a WQ_RECLAIM WQ. > > > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net > > Did you get a chance to test with this follow-up patch? It would be > interesting to see whether an OOM situation can still starve the waiter. > The current linux-next should contain this patch. Thanks! So far I can't reproduce problems except the two listed below (the cond_resched() trap in printk() and the IDLE priority trap are excluded from the list). But I agree that the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically possible that an allocation request which can trigger the OOM killer waits for the system_wq while there is already a work item on the system_wq which is looping forever inside the page allocator without triggering the OOM killer. Maybe the follow up patch can share the vmstat WQ? 
(1) I got an assertion failure. [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867 [ 972.125085] ------------[ cut here ]------------ [ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs] [ 972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect [ 972.163630] sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase [ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498 [ 972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 972.178381] Call Trace: [ 972.180003] dump_stack+0x85/0xc9 [ 972.181682] __warn+0xd1/0xf0 [ 972.183374] warn_slowpath_null+0x1d/0x20 [ 972.185223] asswarn+0x33/0x40 [xfs] [ 972.186950] xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs] [ 972.189055] xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs] [ 972.191263] ? 
xfs_ilock+0x1c9/0x360 [xfs] [ 972.193414] xfs_file_iomap_begin+0x880/0x1140 [xfs] [ 972.195300] ? iomap_write_end+0x80/0x80 [ 972.196980] iomap_apply+0x6c/0x130 [ 972.198539] iomap_file_buffered_write+0x68/0xa0 [ 972.200316] ? iomap_write_end+0x80/0x80 [ 972.201950] xfs_file_buffered_aio_write+0x132/0x390 [xfs] [ 972.203868] ? _raw_spin_unlock+0x27/0x40 [ 972.205470] xfs_file_write_iter+0x90/0x130 [xfs] [ 972.207167] __vfs_write+0xe5/0x140 [ 972.208752] vfs_write+0xc7/0x1f0 [ 972.210233] ? syscall_trace_enter+0x1d0/0x380 [ 972.211809] SyS_write+0x58/0xc0 [ 972.213166] do_int80_syscall_32+0x6c/0x1f0 [ 972.214676] entry_INT80_compat+0x38/0x50 [ 972.216168] RIP: 0023:0x8048076 [ 972.217494] RSP: 002b:00000000ff997020 EFLAGS: 00000202 ORIG_RAX: 0000000000000004 [ 972.219635] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000 [ 972.221679] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000 [ 972.223774] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 972.225905] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 972.227946] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 972.230064] ---[ end trace d498098daec56c11 ]--- [ 984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 984.224191] vmtoolsd cpuset=/ mems_allowed=0 [ 984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: G W 4.10.0-rc6-next-20170202 #498 (2) I got a lockdep warning. (A new false positive?) 
[ 243.036975] ===================================================== [ 243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected [ 243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted [ 243.054619] ----------------------------------------------------- [ 243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: [ 243.060310] (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80 [ 243.063462] [ 243.063462] and this task is already holding: [ 243.066851] (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.069949] which would create a new lock dependency: [ 243.072143] (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++} [ 243.074789] [ 243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock: [ 243.078735] (&xfs_dir_ilock_class){++++-.} [ 243.078739] [ 243.078739] ... which became RECLAIM_FS-irq-safe at: [ 243.084175] [ 243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 [ 243.087257] [ 243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.090027] [ 243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.092838] [ 243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.095453] [ 243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] [ 243.098083] [ 243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] [ 243.100668] [ 243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] [ 243.103191] [ 243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] [ 243.105710] [ 243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 [ 243.107947] [ 243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 [ 243.110133] [ 243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320 [ 243.112262] [ 243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10 [ 243.114323] [ 243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140 [ 243.116448] [ 243.116452] 
[<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.118692] [ 243.118692] to a RECLAIM_FS-irq-unsafe lock: [ 243.120636] (cpu_hotplug.dep_map){++++++} [ 243.120638] [ 243.120638] ... which became RECLAIM_FS-irq-unsafe at: [ 243.124021] ... [ 243.124022] [ 243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 [ 243.127033] [ 243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 [ 243.129228] [ 243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 [ 243.131534] [ 243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0 [ 243.133850] [ 243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90 [ 243.136113] [ 243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 [ 243.138319] [ 243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0 [ 243.140479] [ 243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0 [ 243.142484] [ 243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.144716] [ 243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.146684] [ 243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.148755] [ 243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.150932] [ 243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.153088] [ 243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.155135] [ 243.155135] other info that might help us debug this: [ 243.155135] [ 243.157724] Possible interrupt unsafe locking scenario: [ 243.157724] [ 243.159877] CPU0 CPU1 [ 243.161047] ---- ---- [ 243.162210] lock(cpu_hotplug.dep_map); [ 243.163279] local_irq_disable(); [ 243.164669] lock(&xfs_dir_ilock_class); [ 243.166148] lock(cpu_hotplug.dep_map); [ 243.167653] <Interrupt> [ 243.168594] lock(&xfs_dir_ilock_class); [ 243.169694] [ 243.169694] *** DEADLOCK *** [ 243.169694] [ 243.171864] 3 locks held by awk/8767: [ 243.172872] #0: (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90 [ 243.174791] #1: (&xfs_dir_ilock_class){++++-.}, at: 
[<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.176899] #2: (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] drain_all_pages.part.80+0x1a/0x320 [ 243.178875] [ 243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock: [ 243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 { [ 243.182610] HARDIRQ-ON-W at: [ 243.183603] [ 243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 [ 243.186056] [ 243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.188419] [ 243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.190909] [ 243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.193257] [ 243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.195795] [ 243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.198204] [ 243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.200570] [ 243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.203086] [ 243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.205417] [ 243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.207711] [ 243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.210092] [ 243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180 [ 243.212427] [ 243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40 [ 243.214664] [ 243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 [ 243.217045] [ 243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 [ 243.219501] [ 243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 [ 243.222056] [ 243.222058] [<ffffffff81266767>] do_execve+0x27/0x30 [ 243.224471] [ 243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30 [ 243.226787] [ 243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.229178] [ 243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.231695] HARDIRQ-ON-R at: [ 243.232709] [ 243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 [ 
243.235161] [ 243.235164] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.237547] [ 243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 [ 243.239930] [ 243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.242353] [ 243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.244978] [ 243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.247493] [ 243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.249910] [ 243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.252407] [ 243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 [ 243.254747] [ 243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 [ 243.257126] [ 243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 [ 243.259495] [ 243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90 [ 243.261804] [ 243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.264184] [ 243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.266595] [ 243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.268984] [ 243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 243.271702] SOFTIRQ-ON-W at: [ 243.272726] [ 243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.275109] [ 243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.277426] [ 243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.279790] [ 243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.282192] [ 243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.284794] [ 243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.287259] [ 243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.289735] [ 243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.292205] [ 243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.294555] [ 243.294558] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.296897] [ 243.296900] 
[<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.299242] [ 243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180 [ 243.301754] [ 243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40 [ 243.304037] [ 243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 [ 243.306531] [ 243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 [ 243.308976] [ 243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 [ 243.311506] [ 243.311508] [<ffffffff81266767>] do_execve+0x27/0x30 [ 243.313777] [ 243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30 [ 243.316067] [ 243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.318429] [ 243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.320884] SOFTIRQ-ON-R at: [ 243.321860] [ 243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.324251] [ 243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.326601] [ 243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 [ 243.328966] [ 243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.331384] [ 243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.333978] [ 243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.336492] [ 243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.338926] [ 243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.341365] [ 243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 [ 243.343694] [ 243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 [ 243.346074] [ 243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 [ 243.348443] [ 243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90 [ 243.350753] [ 243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.353240] [ 243.353244] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.355581] [ 243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.358015] [ 243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 
243.360586] IN-RECLAIM_FS-W at: [ 243.361628] [ 243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 [ 243.364273] [ 243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.366710] [ 243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.369153] [ 243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.371597] [ 243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] [ 243.374339] [ 243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] [ 243.377009] [ 243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] [ 243.379659] [ 243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] [ 243.382349] [ 243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 [ 243.384907] [ 243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 [ 243.387690] [ 243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320 [ 243.390148] [ 243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10 [ 243.392517] [ 243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140 [ 243.394851] [ 243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.397246] INITIAL USE at: [ 243.398227] [ 243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 [ 243.400646] [ 243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.402997] [ 243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 [ 243.405351] [ 243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.407778] [ 243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.410364] [ 243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.412989] [ 243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.415416] [ 243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.417871] [ 243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 [ 243.420641] [ 243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 [ 243.423039] [ 243.423041] [<ffffffff8126d55c>] 
link_path_walk+0x18c/0x580 [ 243.425553] [ 243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90 [ 243.427891] [ 243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.430249] [ 243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.432586] [ 243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.434839] [ 243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 243.437343] } [ 243.438115] ... key at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs] [ 243.440082] ... acquired at: [ 243.441047] [ 243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 [ 243.443169] [ 243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 [ 243.445366] [ 243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.447471] [ 243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.449601] [ 243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 [ 243.452123] [ 243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 [ 243.454264] [ 243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 [ 243.456596] [ 243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 [ 243.458774] [ 243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 [ 243.460952] [ 243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 [ 243.463199] [ 243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 [ 243.465482] [ 243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] [ 243.467754] [ 243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] [ 243.470083] [ 243.470101] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] [ 243.472427] [ 243.472445] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs] [ 243.474705] [ 243.474726] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.476933] [ 243.476954] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.479178] [ 243.479180] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.481350] [ 243.481352] 
[<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.483907] [ 243.483910] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.486070] [ 243.486073] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.488334] [ 243.488338] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.490476] [ 243.490480] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.492619] [ 243.492623] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.494864] [ 243.495618] [ 243.495618] the dependencies between the lock to be acquired [ 243.495619] and RECLAIM_FS-irq-unsafe lock: [ 243.498973] -> (cpu_hotplug.dep_map){++++++} ops: 838 { [ 243.500297] HARDIRQ-ON-W at: [ 243.501292] [ 243.501295] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 [ 243.503718] [ 243.503719] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.506059] [ 243.506061] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0 [ 243.508471] [ 243.508473] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0 [ 243.510708] [ 243.510709] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.512997] [ 243.512999] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.515556] [ 243.515561] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.517807] [ 243.517810] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.520271] [ 243.520275] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.522538] [ 243.522540] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.524833] HARDIRQ-ON-R at: [ 243.525801] [ 243.525803] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 [ 243.528152] [ 243.528153] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.530416] [ 243.530419] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.532696] [ 243.532698] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0 [ 243.535039] [ 243.535041] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5 [ 243.537451] [ 243.537453] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2 [ 243.539744] [ 243.539746] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c [ 243.542186] [ 243.542188] [<ffffffff81f3e724>] 
x86_64_start_kernel+0x14c/0x16f [ 243.544603] [ 243.544605] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 243.547245] SOFTIRQ-ON-W at: [ 243.548241] [ 243.548243] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.550559] [ 243.550561] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.552841] [ 243.552842] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0 [ 243.555186] [ 243.555187] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0 [ 243.557404] [ 243.557405] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.559654] [ 243.559656] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.561824] [ 243.561827] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.564048] [ 243.564050] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.566455] [ 243.566457] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.568731] [ 243.568733] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.571014] SOFTIRQ-ON-R at: [ 243.571975] [ 243.571976] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.574328] [ 243.574330] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.576610] [ 243.576612] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.579161] [ 243.579165] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0 [ 243.581537] [ 243.581539] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5 [ 243.583982] [ 243.583984] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2 [ 243.586304] [ 243.586306] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c [ 243.588819] [ 243.588821] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f [ 243.591227] [ 243.591229] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 243.593507] RECLAIM_FS-ON-W at: [ 243.594519] [ 243.594520] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 [ 243.596888] [ 243.596895] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 [ 243.599331] [ 243.599334] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 [ 243.601872] [ 243.601874] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0 [ 243.604460] [ 243.604461] [<ffffffff810ba7a1>] 
smpboot_create_threads+0x61/0x90 [ 243.606950] [ 243.606952] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 [ 243.609463] [ 243.609465] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0 [ 243.612282] [ 243.612285] [<ffffffff810900f4>] _cpu_up+0x84/0xf0 [ 243.614604] [ 243.614606] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.616929] [ 243.616930] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.619208] [ 243.619211] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.621518] [ 243.621520] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.624018] [ 243.624020] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.626374] [ 243.626376] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.628771] RECLAIM_FS-ON-R at: [ 243.629802] [ 243.629803] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 [ 243.632201] [ 243.632203] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 [ 243.634692] [ 243.634695] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 [ 243.637277] [ 243.637279] [<ffffffff8100cbb4>] allocate_shared_regs+0x24/0x70 [ 243.639777] [ 243.639779] [<ffffffff8100cc32>] intel_pmu_cpu_prepare+0x32/0x140 [ 243.643062] [ 243.643066] [<ffffffff810053db>] x86_pmu_prepare_cpu+0x3b/0x40 [ 243.645553] [ 243.645556] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 [ 243.648095] [ 243.648097] [<ffffffff8108f29c>] cpuhp_issue_call+0xec/0x160 [ 243.650536] [ 243.650539] [<ffffffff8108f6bb>] __cpuhp_setup_state+0x13b/0x1a0 [ 243.653126] [ 243.653130] [<ffffffff81f427e9>] init_hw_perf_events+0x402/0x5b6 [ 243.655652] [ 243.655655] [<ffffffff8100217c>] do_one_initcall+0x4c/0x1b0 [ 243.658127] [ 243.658130] [<ffffffff81f3f333>] kernel_init_freeable+0x155/0x2a7 [ 243.660653] [ 243.660656] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.663048] [ 243.663050] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.665436] INITIAL USE at: [ 243.666403] [ 243.666405] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 [ 243.668790] [ 243.668791] [<ffffffff810f1840>] 
lock_acquire+0xe0/0x2a0 [ 243.671093] [ 243.671095] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.673455] [ 243.673458] [<ffffffff8108f5be>] __cpuhp_setup_state+0x3e/0x1a0 [ 243.676126] [ 243.676130] [<ffffffff81f7660e>] page_alloc_init+0x23/0x3a [ 243.678510] [ 243.678512] [<ffffffff81f3eebe>] start_kernel+0x1a2/0x4c2 [ 243.680851] [ 243.680853] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c [ 243.683367] [ 243.683369] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f [ 243.685812] [ 243.685815] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 243.688133] } [ 243.688907] ... key at: [<ffffffff81c56848>] cpu_hotplug+0x108/0x140 [ 243.690542] ... acquired at: [ 243.691514] [ 243.691517] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 [ 243.693655] [ 243.693656] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 [ 243.695820] [ 243.695822] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.697926] [ 243.697929] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.700042] [ 243.700044] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 [ 243.702285] [ 243.702286] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 [ 243.704405] [ 243.704407] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 [ 243.706721] [ 243.706724] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 [ 243.708867] [ 243.708870] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 [ 243.711000] [ 243.711002] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 [ 243.713211] [ 243.713213] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 [ 243.715366] [ 243.715410] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] [ 243.717625] [ 243.717644] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] [ 243.719889] [ 243.719918] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] [ 243.722224] [ 243.722242] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs] [ 243.724493] [ 243.724514] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.726690] [ 243.726710] [<ffffffffa02a1fcb>] 
xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.728933] [ 243.728936] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.731064] [ 243.731066] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.733192] [ 243.733194] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.735312] [ 243.735315] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.737523] [ 243.737527] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.739577] [ 243.739579] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.741702] [ 243.741706] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.743932] [ 243.744661] [ 243.744661] stack backtrace: [ 243.746302] CPU: 1 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46 [ 243.747963] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 243.750166] Call Trace: [ 243.751071] dump_stack+0x85/0xc9 [ 243.752110] check_usage+0x4f9/0x680 [ 243.753188] check_irq_usage+0x4a/0xb0 [ 243.754280] __lock_acquire+0x1364/0x1bb0 [ 243.755410] lock_acquire+0xe0/0x2a0 [ 243.756467] ? get_online_cpus+0x32/0x80 [ 243.757580] get_online_cpus+0x58/0x80 [ 243.758664] ? get_online_cpus+0x32/0x80 [ 243.759764] drain_all_pages.part.80+0x27/0x320 [ 243.760972] drain_all_pages+0x19/0x20 [ 243.762039] __alloc_pages_nodemask+0x784/0x1630 [ 243.763249] ? rcu_read_lock_sched_held+0x91/0xa0 [ 243.764466] ? __alloc_pages_nodemask+0x2e6/0x1630 [ 243.765689] ? mark_held_locks+0x71/0x90 [ 243.766780] ? cache_grow_begin+0x4ac/0x630 [ 243.767912] cache_grow_begin+0xcf/0x630 [ 243.768985] ? ____cache_alloc_node+0x1bf/0x240 [ 243.770173] fallback_alloc+0x1e5/0x290 [ 243.771233] ____cache_alloc_node+0x235/0x240 [ 243.772403] ? 
kmem_zone_alloc+0x91/0x120 [xfs] [ 243.773576] kmem_cache_alloc+0x26c/0x3e0 [ 243.774671] kmem_zone_alloc+0x91/0x120 [xfs] [ 243.775816] xfs_da_state_alloc+0x15/0x20 [xfs] [ 243.776989] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] [ 243.778188] xfs_dir_lookup+0x1a5/0x1c0 [xfs] [ 243.779327] xfs_lookup+0x7f/0x250 [xfs] [ 243.780394] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.781466] lookup_open+0x54c/0x790 [ 243.782440] path_openat+0x55a/0xa90 [ 243.783412] do_filp_open+0x8c/0x100 [ 243.784377] ? _raw_spin_unlock+0x22/0x30 [ 243.785418] ? __alloc_fd+0xf2/0x210 [ 243.786378] do_sys_open+0x13a/0x200 [ 243.787361] SyS_open+0x19/0x20 [ 243.788252] do_syscall_64+0x67/0x1f0 [ 243.789228] entry_SYSCALL64_slow_path+0x25/0x25 [ 243.790347] RIP: 0033:0x7fcf8dda06c7 [ 243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 [ 243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7 [ 243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08 [ 243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000 [ 243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000 [ 243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890 [ 253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 253.546121] awk cpuset=/ mems_allowed=0 [ 253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46 ^ permalink raw reply [flat|nested] 110+ messages in thread
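Stepping back from the raw logs: the failure mode this thread is named after — direct reclaimers spinning in too_many_isolated() with no retry bound — can be modeled abstractly. The sketch below is a hypothetical simulation, not kernel code (the real loop sleeps in congestion_wait() between checks); it only illustrates the cover letter's point that a bounded loop which bails out to the caller lets the allocator's own retry logic re-evaluate, instead of looping inside reclaim for ever.

```python
# Hypothetical model of the too_many_isolated() pattern discussed in this
# thread; names are illustrative, not kernel APIs.

ISOLATED_LIMIT = 100

def reclaim_bounded(nr_isolated, max_retries=10):
    """Bounded variant: give up after max_retries so the caller (the
    allocator slow path) can re-evaluate its reclaim-retry decisions."""
    retries = 0
    while nr_isolated() > ISOLATED_LIMIT:
        retries += 1
        if retries >= max_retries:
            return False        # bail out instead of looping forever
    return True

# A stuck counter: isolated pages are never put back (e.g. the isolating
# tasks are blocked), so an unbounded version would spin here indefinitely.
assert reclaim_bounded(lambda: ISOLATED_LIMIT + 1) is False

# A draining counter: progress is made and reclaim proceeds normally.
state = {"n": ISOLATED_LIMIT + 3}
def draining():
    state["n"] -= 1
    return state["n"]
assert reclaim_bounded(draining) is True
```

The design choice the model captures is the one the cover letter argues for: retry decisions belong as high in the allocator call chain as possible, so no single layer below it should contain an unbound wait.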
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-02-03 10:57 ` Tetsuo Handa 0 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-03 10:57 UTC (permalink / raw) To: mhocko Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > On Mon 30-01-17 09:55:46, Michal Hocko wrote: > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: > [...] > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't > > > tell whether it has any negative effect, but I got on the first trial that > > > all allocating threads are blocked on wait_for_completion() from flush_work() > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from > > > workqueue context". There was no warn_alloc() stall warning message afterwords. > > > > That patch is buggy and there is a follow up [1] which is not sitting in the > > mmotm (and thus linux-next) yet. I didn't get to review it properly and > > I cannot say I would be too happy about using WQ from the page > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ. > > > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net > > Did you get chance to test with this follow up patch? It would be > interesting to see whether OOM situation can still starve the waiter. > The current linux-next should contain this patch. So far I can't reproduce problems except the two listed below (the cond_resched() trap in printk() and the IDLE priority trap are excluded from the list). But I agree that the follow up patch needs to use a WQ_RECLAIM WQ. It is theoretically possible that an allocation request which can trigger the OOM killer waits on the system_wq while a work item already queued on the system_wq is looping forever inside the page allocator without triggering the OOM killer. Maybe the follow up patch can share the vmstat WQ? (1) I got an assertion failure. 
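The theoretical deadlock described above — a flush on the shared system_wq starving behind another work item that never finishes — can be sketched with ordinary threads. This is a hypothetical model, not kernel code; `worker`, `wq`, and `reclaim_wq` are illustrative stand-ins for a single-threaded workqueue and a dedicated WQ_MEM_RECLAIM-style queue.

```python
# Hypothetical model of the workqueue dependency: one queued work "loops
# forever in the page allocator", so a later flush on a second work queued
# behind it never completes on a shared, single-threaded queue.
import queue
import threading

def worker(wq):
    """Minimal single-threaded workqueue: run queued callables in order."""
    while True:
        fn = wq.get()
        fn()

wq = queue.Queue()                       # stand-in for system_wq
threading.Thread(target=worker, args=(wq,), daemon=True).start()

stuck = threading.Event()                # never set: the work loops forever
drained = threading.Event()

wq.put(stuck.wait)                       # work item stuck in the allocator
wq.put(drained.set)                      # the drain work flush_work() awaits

# The flush can never finish: the only worker is occupied by the stuck work.
assert not drained.wait(timeout=0.2)

# A dedicated reclaim queue (second worker) breaks the dependency, the way
# a WQ_MEM_RECLAIM workqueue with its rescuer thread would.
reclaim_wq = queue.Queue()
threading.Thread(target=worker, args=(reclaim_wq,), daemon=True).start()
reclaim_wq.put(drained.set)
assert drained.wait(timeout=2)
```

The second queue removes the shared-worker dependency the same way a rescuer-backed workqueue would: the drain work no longer has to wait behind work items that may block inside the allocator.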
[ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867 [ 972.125085] ------------[ cut here ]------------ [ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs] [ 972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect [ 972.163630] sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase [ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498 [ 972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 972.178381] Call Trace: [ 972.180003] dump_stack+0x85/0xc9 [ 972.181682] __warn+0xd1/0xf0 [ 972.183374] warn_slowpath_null+0x1d/0x20 [ 972.185223] asswarn+0x33/0x40 [xfs] [ 972.186950] xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs] [ 972.189055] xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs] [ 972.191263] ? 
xfs_ilock+0x1c9/0x360 [xfs] [ 972.193414] xfs_file_iomap_begin+0x880/0x1140 [xfs] [ 972.195300] ? iomap_write_end+0x80/0x80 [ 972.196980] iomap_apply+0x6c/0x130 [ 972.198539] iomap_file_buffered_write+0x68/0xa0 [ 972.200316] ? iomap_write_end+0x80/0x80 [ 972.201950] xfs_file_buffered_aio_write+0x132/0x390 [xfs] [ 972.203868] ? _raw_spin_unlock+0x27/0x40 [ 972.205470] xfs_file_write_iter+0x90/0x130 [xfs] [ 972.207167] __vfs_write+0xe5/0x140 [ 972.208752] vfs_write+0xc7/0x1f0 [ 972.210233] ? syscall_trace_enter+0x1d0/0x380 [ 972.211809] SyS_write+0x58/0xc0 [ 972.213166] do_int80_syscall_32+0x6c/0x1f0 [ 972.214676] entry_INT80_compat+0x38/0x50 [ 972.216168] RIP: 0023:0x8048076 [ 972.217494] RSP: 002b:00000000ff997020 EFLAGS: 00000202 ORIG_RAX: 0000000000000004 [ 972.219635] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000 [ 972.221679] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000 [ 972.223774] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 972.225905] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 972.227946] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 972.230064] ---[ end trace d498098daec56c11 ]--- [ 984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 [ 984.224191] vmtoolsd cpuset=/ mems_allowed=0 [ 984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: G W 4.10.0-rc6-next-20170202 #498 (2) I got a lockdep warning. (A new false positive?) 
[ 243.036975] =====================================================
[ 243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected
[ 243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted
[ 243.054619] -----------------------------------------------------
[ 243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire:
[ 243.060310]  (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80
[ 243.063462]
[ 243.063462] and this task is already holding:
[ 243.066851]  (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs]
[ 243.069949] which would create a new lock dependency:
[ 243.072143]  (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++}
[ 243.074789]
[ 243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock:
[ 243.078735]  (&xfs_dir_ilock_class){++++-.}
[ 243.078739]
[ 243.078739] ... which became RECLAIM_FS-irq-safe at:
[ 243.084175] [ 243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0
[ 243.087257] [ 243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0
[ 243.090027] [ 243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0
[ 243.092838] [ 243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs]
[ 243.095453] [ 243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs]
[ 243.098083] [ 243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs]
[ 243.100668] [ 243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs]
[ 243.103191] [ 243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs]
[ 243.105710] [ 243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190
[ 243.107947] [ 243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710
[ 243.110133] [ 243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320
[ 243.112262] [ 243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10
[ 243.114323] [ 243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140
[ 243.116448] [ 243.116452] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[ 243.118692]
[ 243.118692] to a RECLAIM_FS-irq-unsafe lock:
[ 243.120636]  (cpu_hotplug.dep_map){++++++}
[ 243.120638]
[ 243.120638] ... which became RECLAIM_FS-irq-unsafe at:
[ 243.124021] ...
[ 243.124022] [ 243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90
[ 243.127033] [ 243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110
[ 243.129228] [ 243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410
[ 243.131534] [ 243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0
[ 243.133850] [ 243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90
[ 243.136113] [ 243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70
[ 243.138319] [ 243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0
[ 243.140479] [ 243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0
[ 243.142484] [ 243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0
[ 243.144716] [ 243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10
[ 243.146684] [ 243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141
[ 243.148755] [ 243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7
[ 243.150932] [ 243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100
[ 243.153088] [ 243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40
[ 243.155135]
[ 243.155135] other info that might help us debug this:
[ 243.155135]
[ 243.157724] Possible interrupt unsafe locking scenario:
[ 243.157724]
[ 243.159877]        CPU0                    CPU1
[ 243.161047]        ----                    ----
[ 243.162210]   lock(cpu_hotplug.dep_map);
[ 243.163279]                                local_irq_disable();
[ 243.164669]                                lock(&xfs_dir_ilock_class);
[ 243.166148]                                lock(cpu_hotplug.dep_map);
[ 243.167653]  <Interrupt>
[ 243.168594]    lock(&xfs_dir_ilock_class);
[ 243.169694]
[ 243.169694] *** DEADLOCK ***
[ 243.169694]
[ 243.171864] 3 locks held by awk/8767:
[ 243.172872]  #0:  (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90
[ 243.174791]  #1:  (&xfs_dir_ilock_class){++++-.}, at:
[<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.176899] #2: (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] drain_all_pages.part.80+0x1a/0x320 [ 243.178875] [ 243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock: [ 243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 { [ 243.182610] HARDIRQ-ON-W at: [ 243.183603] [ 243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 [ 243.186056] [ 243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.188419] [ 243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.190909] [ 243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.193257] [ 243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.195795] [ 243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.198204] [ 243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.200570] [ 243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.203086] [ 243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.205417] [ 243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.207711] [ 243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.210092] [ 243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180 [ 243.212427] [ 243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40 [ 243.214664] [ 243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 [ 243.217045] [ 243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 [ 243.219501] [ 243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 [ 243.222056] [ 243.222058] [<ffffffff81266767>] do_execve+0x27/0x30 [ 243.224471] [ 243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30 [ 243.226787] [ 243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.229178] [ 243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.231695] HARDIRQ-ON-R at: [ 243.232709] [ 243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 [ 
243.235161] [ 243.235164] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.237547] [ 243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 [ 243.239930] [ 243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.242353] [ 243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.244978] [ 243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.247493] [ 243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.249910] [ 243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.252407] [ 243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 [ 243.254747] [ 243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 [ 243.257126] [ 243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 [ 243.259495] [ 243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90 [ 243.261804] [ 243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.264184] [ 243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.266595] [ 243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.268984] [ 243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 243.271702] SOFTIRQ-ON-W at: [ 243.272726] [ 243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.275109] [ 243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.277426] [ 243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.279790] [ 243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.282192] [ 243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.284794] [ 243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.287259] [ 243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.289735] [ 243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.292205] [ 243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.294555] [ 243.294558] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.296897] [ 243.296900] 
[<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.299242] [ 243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180 [ 243.301754] [ 243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40 [ 243.304037] [ 243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 [ 243.306531] [ 243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 [ 243.308976] [ 243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 [ 243.311506] [ 243.311508] [<ffffffff81266767>] do_execve+0x27/0x30 [ 243.313777] [ 243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30 [ 243.316067] [ 243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.318429] [ 243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.320884] SOFTIRQ-ON-R at: [ 243.321860] [ 243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.324251] [ 243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.326601] [ 243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 [ 243.328966] [ 243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.331384] [ 243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.333978] [ 243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.336492] [ 243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.338926] [ 243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.341365] [ 243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 [ 243.343694] [ 243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 [ 243.346074] [ 243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 [ 243.348443] [ 243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90 [ 243.350753] [ 243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.353240] [ 243.353244] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.355581] [ 243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.358015] [ 243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 
243.360586] IN-RECLAIM_FS-W at: [ 243.361628] [ 243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 [ 243.364273] [ 243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.366710] [ 243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 [ 243.369153] [ 243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] [ 243.371597] [ 243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] [ 243.374339] [ 243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] [ 243.377009] [ 243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] [ 243.379659] [ 243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] [ 243.382349] [ 243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 [ 243.384907] [ 243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 [ 243.387690] [ 243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320 [ 243.390148] [ 243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10 [ 243.392517] [ 243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140 [ 243.394851] [ 243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.397246] INITIAL USE at: [ 243.398227] [ 243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 [ 243.400646] [ 243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.402997] [ 243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 [ 243.405351] [ 243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] [ 243.407778] [ 243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] [ 243.410364] [ 243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] [ 243.412989] [ 243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.415416] [ 243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.417871] [ 243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 [ 243.420641] [ 243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 [ 243.423039] [ 243.423041] [<ffffffff8126d55c>] 
link_path_walk+0x18c/0x580 [ 243.425553] [ 243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90 [ 243.427891] [ 243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.430249] [ 243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.432586] [ 243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.434839] [ 243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 243.437343] } [ 243.438115] ... key at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs] [ 243.440082] ... acquired at: [ 243.441047] [ 243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 [ 243.443169] [ 243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 [ 243.445366] [ 243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.447471] [ 243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.449601] [ 243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 [ 243.452123] [ 243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 [ 243.454264] [ 243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 [ 243.456596] [ 243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 [ 243.458774] [ 243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 [ 243.460952] [ 243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 [ 243.463199] [ 243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 [ 243.465482] [ 243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] [ 243.467754] [ 243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] [ 243.470083] [ 243.470101] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] [ 243.472427] [ 243.472445] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs] [ 243.474705] [ 243.474726] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.476933] [ 243.476954] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.479178] [ 243.479180] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.481350] [ 243.481352] 
[<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.483907] [ 243.483910] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.486070] [ 243.486073] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.488334] [ 243.488338] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.490476] [ 243.490480] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.492619] [ 243.492623] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.494864] [ 243.495618] [ 243.495618] the dependencies between the lock to be acquired [ 243.495619] and RECLAIM_FS-irq-unsafe lock: [ 243.498973] -> (cpu_hotplug.dep_map){++++++} ops: 838 { [ 243.500297] HARDIRQ-ON-W at: [ 243.501292] [ 243.501295] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 [ 243.503718] [ 243.503719] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.506059] [ 243.506061] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0 [ 243.508471] [ 243.508473] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0 [ 243.510708] [ 243.510709] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.512997] [ 243.512999] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.515556] [ 243.515561] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.517807] [ 243.517810] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.520271] [ 243.520275] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.522538] [ 243.522540] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.524833] HARDIRQ-ON-R at: [ 243.525801] [ 243.525803] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 [ 243.528152] [ 243.528153] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.530416] [ 243.530419] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.532696] [ 243.532698] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0 [ 243.535039] [ 243.535041] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5 [ 243.537451] [ 243.537453] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2 [ 243.539744] [ 243.539746] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c [ 243.542186] [ 243.542188] [<ffffffff81f3e724>] 
x86_64_start_kernel+0x14c/0x16f [ 243.544603] [ 243.544605] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 243.547245] SOFTIRQ-ON-W at: [ 243.548241] [ 243.548243] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.550559] [ 243.550561] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.552841] [ 243.552842] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0 [ 243.555186] [ 243.555187] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0 [ 243.557404] [ 243.557405] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.559654] [ 243.559656] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.561824] [ 243.561827] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.564048] [ 243.564050] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.566455] [ 243.566457] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.568731] [ 243.568733] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.571014] SOFTIRQ-ON-R at: [ 243.571975] [ 243.571976] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 [ 243.574328] [ 243.574330] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.576610] [ 243.576612] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.579161] [ 243.579165] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0 [ 243.581537] [ 243.581539] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5 [ 243.583982] [ 243.583984] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2 [ 243.586304] [ 243.586306] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c [ 243.588819] [ 243.588821] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f [ 243.591227] [ 243.591229] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 243.593507] RECLAIM_FS-ON-W at: [ 243.594519] [ 243.594520] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 [ 243.596888] [ 243.596895] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 [ 243.599331] [ 243.599334] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 [ 243.601872] [ 243.601874] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0 [ 243.604460] [ 243.604461] [<ffffffff810ba7a1>] 
smpboot_create_threads+0x61/0x90 [ 243.606950] [ 243.606952] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 [ 243.609463] [ 243.609465] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0 [ 243.612282] [ 243.612285] [<ffffffff810900f4>] _cpu_up+0x84/0xf0 [ 243.614604] [ 243.614606] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 [ 243.616929] [ 243.616930] [<ffffffff8109023e>] cpu_up+0xe/0x10 [ 243.619208] [ 243.619211] [<ffffffff81f6f446>] smp_init+0xd5/0x141 [ 243.621518] [ 243.621520] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 [ 243.624018] [ 243.624020] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.626374] [ 243.626376] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.628771] RECLAIM_FS-ON-R at: [ 243.629802] [ 243.629803] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 [ 243.632201] [ 243.632203] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 [ 243.634692] [ 243.634695] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 [ 243.637277] [ 243.637279] [<ffffffff8100cbb4>] allocate_shared_regs+0x24/0x70 [ 243.639777] [ 243.639779] [<ffffffff8100cc32>] intel_pmu_cpu_prepare+0x32/0x140 [ 243.643062] [ 243.643066] [<ffffffff810053db>] x86_pmu_prepare_cpu+0x3b/0x40 [ 243.645553] [ 243.645556] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 [ 243.648095] [ 243.648097] [<ffffffff8108f29c>] cpuhp_issue_call+0xec/0x160 [ 243.650536] [ 243.650539] [<ffffffff8108f6bb>] __cpuhp_setup_state+0x13b/0x1a0 [ 243.653126] [ 243.653130] [<ffffffff81f427e9>] init_hw_perf_events+0x402/0x5b6 [ 243.655652] [ 243.655655] [<ffffffff8100217c>] do_one_initcall+0x4c/0x1b0 [ 243.658127] [ 243.658130] [<ffffffff81f3f333>] kernel_init_freeable+0x155/0x2a7 [ 243.660653] [ 243.660656] [<ffffffff817048e9>] kernel_init+0x9/0x100 [ 243.663048] [ 243.663050] [<ffffffff81715081>] ret_from_fork+0x31/0x40 [ 243.665436] INITIAL USE at: [ 243.666403] [ 243.666405] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 [ 243.668790] [ 243.668791] [<ffffffff810f1840>] 
lock_acquire+0xe0/0x2a0 [ 243.671093] [ 243.671095] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.673455] [ 243.673458] [<ffffffff8108f5be>] __cpuhp_setup_state+0x3e/0x1a0 [ 243.676126] [ 243.676130] [<ffffffff81f7660e>] page_alloc_init+0x23/0x3a [ 243.678510] [ 243.678512] [<ffffffff81f3eebe>] start_kernel+0x1a2/0x4c2 [ 243.680851] [ 243.680853] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c [ 243.683367] [ 243.683369] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f [ 243.685812] [ 243.685815] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 243.688133] } [ 243.688907] ... key at: [<ffffffff81c56848>] cpu_hotplug+0x108/0x140 [ 243.690542] ... acquired at: [ 243.691514] [ 243.691517] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 [ 243.693655] [ 243.693656] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 [ 243.695820] [ 243.695822] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 [ 243.697926] [ 243.697929] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 [ 243.700042] [ 243.700044] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 [ 243.702285] [ 243.702286] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 [ 243.704405] [ 243.704407] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 [ 243.706721] [ 243.706724] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 [ 243.708867] [ 243.708870] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 [ 243.711000] [ 243.711002] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 [ 243.713211] [ 243.713213] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 [ 243.715366] [ 243.715410] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] [ 243.717625] [ 243.717644] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] [ 243.719889] [ 243.719918] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] [ 243.722224] [ 243.722242] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs] [ 243.724493] [ 243.724514] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] [ 243.726690] [ 243.726710] [<ffffffffa02a1fcb>] 
xfs_vn_lookup+0x6b/0xb0 [xfs] [ 243.728933] [ 243.728936] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 [ 243.731064] [ 243.731066] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 [ 243.733192] [ 243.733194] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 [ 243.735312] [ 243.735315] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 [ 243.737523] [ 243.737527] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 [ 243.739577] [ 243.739579] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 [ 243.741702] [ 243.741706] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a [ 243.743932] [ 243.744661] [ 243.744661] stack backtrace: [ 243.746302] CPU: 1 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46 [ 243.747963] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 [ 243.750166] Call Trace: [ 243.751071] dump_stack+0x85/0xc9 [ 243.752110] check_usage+0x4f9/0x680 [ 243.753188] check_irq_usage+0x4a/0xb0 [ 243.754280] __lock_acquire+0x1364/0x1bb0 [ 243.755410] lock_acquire+0xe0/0x2a0 [ 243.756467] ? get_online_cpus+0x32/0x80 [ 243.757580] get_online_cpus+0x58/0x80 [ 243.758664] ? get_online_cpus+0x32/0x80 [ 243.759764] drain_all_pages.part.80+0x27/0x320 [ 243.760972] drain_all_pages+0x19/0x20 [ 243.762039] __alloc_pages_nodemask+0x784/0x1630 [ 243.763249] ? rcu_read_lock_sched_held+0x91/0xa0 [ 243.764466] ? __alloc_pages_nodemask+0x2e6/0x1630 [ 243.765689] ? mark_held_locks+0x71/0x90 [ 243.766780] ? cache_grow_begin+0x4ac/0x630 [ 243.767912] cache_grow_begin+0xcf/0x630 [ 243.768985] ? ____cache_alloc_node+0x1bf/0x240 [ 243.770173] fallback_alloc+0x1e5/0x290 [ 243.771233] ____cache_alloc_node+0x235/0x240 [ 243.772403] ? 
kmem_zone_alloc+0x91/0x120 [xfs]
[ 243.773576] kmem_cache_alloc+0x26c/0x3e0
[ 243.774671] kmem_zone_alloc+0x91/0x120 [xfs]
[ 243.775816] xfs_da_state_alloc+0x15/0x20 [xfs]
[ 243.776989] xfs_dir2_node_lookup+0x53/0x2b0 [xfs]
[ 243.778188] xfs_dir_lookup+0x1a5/0x1c0 [xfs]
[ 243.779327] xfs_lookup+0x7f/0x250 [xfs]
[ 243.780394] xfs_vn_lookup+0x6b/0xb0 [xfs]
[ 243.781466] lookup_open+0x54c/0x790
[ 243.782440] path_openat+0x55a/0xa90
[ 243.783412] do_filp_open+0x8c/0x100
[ 243.784377] ? _raw_spin_unlock+0x22/0x30
[ 243.785418] ? __alloc_fd+0xf2/0x210
[ 243.786378] do_sys_open+0x13a/0x200
[ 243.787361] SyS_open+0x19/0x20
[ 243.788252] do_syscall_64+0x67/0x1f0
[ 243.789228] entry_SYSCALL64_slow_path+0x25/0x25
[ 243.790347] RIP: 0033:0x7fcf8dda06c7
[ 243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002
[ 243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7
[ 243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08
[ 243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000
[ 243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000
[ 243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890
[ 253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
[ 253.546121] awk cpuset=/ mems_allowed=0
[ 253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46
^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 10:57 ` Tetsuo Handa
@ 2017-02-03 14:41 ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-03 14:41 UTC (permalink / raw)
To: Tetsuo Handa
Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Mon 30-01-17 09:55:46, Michal Hocko wrote:
> > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote:
> > [...]
> > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't
> > > > tell whether it has any negative effect, but on the first trial I got
> > > > all allocating threads blocked on wait_for_completion() from flush_work()
> > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from
> > > > workqueue context". There was no warn_alloc() stall warning message afterwards.
> > >
> > > That patch is buggy and there is a follow up [1] which is not sitting in the
> > > mmotm (and thus linux-next) yet. I didn't get to review it properly and
> > > I cannot say I would be too happy about using a WQ from the page
> > > allocator. I believe even the follow up needs to use a WQ_RECLAIM WQ.
> > >
> > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
> >
> > Did you get a chance to test with this follow up patch? It would be
> > interesting to see whether an OOM situation can still starve the waiter.
> > The current linux-next should contain this patch.
>
> So far I can't reproduce problems except the two listed below (the
> cond_resched() trap in printk() and the IDLE priority trap are excluded
> from the list). But I agree that the follow up patch needs to use a
> WQ_RECLAIM WQ. It is theoretically possible that an allocation request
> which can trigger the OOM killer waits for the system_wq while there is
> already a work item in the system_wq which is looping forever inside the
> page allocator without triggering the OOM killer.

Well, this shouldn't happen AFAICS, because a new worker would be
requested, that request would certainly need memory, and that allocation
would in turn trigger the OOM killer. On the other hand I agree that it
would be safer not to depend on a memory allocation from within the page
allocator.

> Maybe the follow up patch can share the vmstat WQ?

Yes, this would be an option.
-- 
Michal Hocko
SUSE Labs
^ permalink raw reply	[flat|nested] 110+ messages in thread
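[Editorial note: the forward-progress argument in the exchange above — that a work item on a rescuer-backed workqueue (the actual kernel flag is WQ_MEM_RECLAIM) can always run, while an ordinary queue may first have to spawn a worker, i.e. perform a memory allocation that can block behind reclaim — can be sketched as a toy userspace model. Everything below is illustrative only; the struct and function names are invented and are not kernel API.]

```c
#include <stdbool.h>

/* Toy model of workqueue forward progress. A rescuer-backed queue
 * keeps a pre-allocated rescuer thread around, so a queued item can
 * always start executing. An ordinary queue with no idle worker must
 * create one first, and that creation needs a memory allocation which
 * can itself get stuck behind the reclaim the item was meant to help. */
struct wq_model {
    bool has_rescuer;  /* pre-allocated rescuer thread exists */
    bool worker_idle;  /* a worker thread is already available */
};

/* Can a newly queued item start running without allocating memory? */
bool can_make_progress(const struct wq_model *wq, bool alloc_can_succeed)
{
    if (wq->worker_idle)
        return true;            /* existing worker picks the item up */
    if (wq->has_rescuer)
        return true;            /* rescuer guarantees execution */
    return alloc_can_succeed;   /* must spawn a worker: needs memory */
}
```

Under this model, a drain work item queued on the shared system_wq with no idle workers and no memory available makes no progress, which is exactly the starvation Tetsuo worries about; a rescuer-backed queue does not have that dependency.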
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 10:57 ` Tetsuo Handa
@ 2017-02-03 14:50 ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-03 14:50 UTC (permalink / raw)
To: Tetsuo Handa
Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel,
	Darrick J. Wong, linux-xfs

[Let's CC more xfs people]

On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
[...]
> (1) I got an assertion failure.

I suspect this is a result of
http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
I have no idea what the assert means though.

>
> [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> [ 972.125085] ------------[ cut here ]------------
> [ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
> [ 972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
> [ 972.163630] sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
> [ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
> [ 972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> [ 972.178381] Call Trace:
> [ 972.180003] dump_stack+0x85/0xc9
> [ 972.181682] __warn+0xd1/0xf0
> [ 972.183374] warn_slowpath_null+0x1d/0x20
> [ 972.185223] asswarn+0x33/0x40 [xfs]
> [ 972.186950] xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs]
> [ 972.189055] xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs]
> [ 972.191263] ? xfs_ilock+0x1c9/0x360 [xfs]
> [ 972.193414] xfs_file_iomap_begin+0x880/0x1140 [xfs]
> [ 972.195300] ? iomap_write_end+0x80/0x80
> [ 972.196980] iomap_apply+0x6c/0x130
> [ 972.198539] iomap_file_buffered_write+0x68/0xa0
> [ 972.200316] ? iomap_write_end+0x80/0x80
> [ 972.201950] xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> [ 972.203868] ? _raw_spin_unlock+0x27/0x40
> [ 972.205470] xfs_file_write_iter+0x90/0x130 [xfs]
> [ 972.207167] __vfs_write+0xe5/0x140
> [ 972.208752] vfs_write+0xc7/0x1f0
> [ 972.210233] ? syscall_trace_enter+0x1d0/0x380
> [ 972.211809] SyS_write+0x58/0xc0
> [ 972.213166] do_int80_syscall_32+0x6c/0x1f0
> [ 972.214676] entry_INT80_compat+0x38/0x50
> [ 972.216168] RIP: 0023:0x8048076
> [ 972.217494] RSP: 002b:00000000ff997020 EFLAGS: 00000202 ORIG_RAX: 0000000000000004
> [ 972.219635] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000
> [ 972.221679] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000
> [ 972.223774] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
> [ 972.225905] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> [ 972.227946] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 972.230064] ---[ end trace d498098daec56c11 ]---
> [ 984.210890] vmtoolsd invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0
> [ 984.224191] vmtoolsd cpuset=/ mems_allowed=0
> [ 984.231022] CPU: 0 PID: 689 Comm: vmtoolsd Tainted: G W 4.10.0-rc6-next-20170202 #498
-- 
Michal Hocko
SUSE Labs
^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 14:50 ` Michal Hocko
@ 2017-02-03 17:24 ` Brian Foster
  1 sibling, 0 replies; 110+ messages in thread
From: Brian Foster @ 2017-02-03 17:24 UTC (permalink / raw)
To: Michal Hocko
Cc: Tetsuo Handa, david, dchinner, hch, mgorman, viro, linux-mm,
    hannes, linux-kernel, Darrick J. Wong, linux-xfs

On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> [Let's CC more xfs people]
>
> On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> [...]
> > (1) I got an assertion failure.
>
> I suspect this is a result of
> http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> I have no idea what the assert means though.
>
> >
> > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867

Indirect block reservation underrun on delayed allocation extent merge.
These extra blocks are used for the inode bmap btree when a delalloc
extent is converted to physical blocks. We're in a case where we expect
to only ever free excess blocks due to a merge of extents with
independent reservations, but a situation occurs where we actually need
blocks and hence the assert fails. This can occur if an extent is merged
with one that has a reservation less than the expected worst case
reservation for its size (due to previous extent splits from hole
punches, for example). Therefore, I think the core expectation that
xfs_bmap_add_extent_hole_delay() will always have enough blocks
pre-reserved is invalid.

Can you describe the workload that reproduces this? FWIW, I think the
way xfs_bmap_add_extent_hole_delay() currently works is likely broken
and have a couple patches to fix up indlen reservation that I haven't
posted yet.
The diff that deals with this particular bit is appended. Care to give
that a try?

Brian

> > [ 972.125085] ------------[ cut here ]------------
> > [ 972.129261] WARNING: CPU: 0 PID: 6280 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs]
> > [ 972.136146] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack coretemp crct10dif_pclmul ppdev crc32_pclmul ghash_clmulni_intel ip_set nfnetlink ebtable_nat aesni_intel crypto_simd cryptd ebtable_broute glue_helper vmw_balloon bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 pcspkr nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sg parport_pc parport shpchp i2c_piix4 vmw_vsock_vmci_transport vsock vmw_vmci ip_tables xfs libcrc32c sr_mod cdrom ata_generic sd_mod pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect
> > [ 972.163630] sysimgblt fb_sys_fops ttm drm ata_piix ahci libahci mptspi scsi_transport_spi mptscsih e1000 libata i2c_core mptbase
> > [ 972.172535] CPU: 0 PID: 6280 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498
> > [ 972.175126] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015
> > [ 972.178381] Call Trace:
...
---8<---
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bfc00de..d2e48ed 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2809,7 +2809,8 @@ xfs_bmap_add_extent_hole_delay(
 		oldlen = startblockval(left.br_startblock) +
 			startblockval(new->br_startblock) +
 			startblockval(right.br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
 			nullstartblock((int)newlen));
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2830,7 +2831,8 @@
 		xfs_bmbt_set_blockcount(xfs_iext_get_ext(ifp, *idx), temp);
 		oldlen = startblockval(left.br_startblock) +
 			startblockval(new->br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_startblock(xfs_iext_get_ext(ifp, *idx),
 			nullstartblock((int)newlen));
 		trace_xfs_bmap_post_update(ip, *idx, state, _THIS_IP_);
@@ -2846,7 +2848,8 @@
 		temp = new->br_blockcount + right.br_blockcount;
 		oldlen = startblockval(new->br_startblock) +
 			startblockval(right.br_startblock);
-		newlen = xfs_bmap_worst_indlen(ip, temp);
+		newlen = XFS_FILBLKS_MIN(xfs_bmap_worst_indlen(ip, temp),
+					 oldlen);
 		xfs_bmbt_set_allf(xfs_iext_get_ext(ifp, *idx), new->br_startoff,
 			nullstartblock((int)newlen), temp, right.br_state);

^ permalink raw reply related	[flat|nested] 110+ messages in thread
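[To illustrate the failure mode the XFS_FILBLKS_MIN() clamp addresses, here is a hypothetical userspace-only sketch. None of this is XFS code: the type, the fixed worst-case ratio, and both function names are invented for illustration. The point is only that when one side of a merge was previously split, its remaining reservation can sit below worst case, so the recomputed worst case for the merged extent can exceed the blocks actually reserved; clamping guarantees the merge only ever frees blocks.]

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t filblks_t;	/* stand-in for xfs_filblks_t */

/* Invented worst-case indlen model: the real xfs_bmap_worst_indlen()
 * walks the bmap btree geometry; a fixed ratio is enough here. */
filblks_t worst_indlen(filblks_t len)
{
	return len / 64 + 1;
}

/* Recompute the indlen reservation when two delalloc extents merge.
 * Without the clamp, newlen can exceed oldlen when one side is
 * under-reserved (e.g. after an earlier split), which is exactly the
 * "oldlen > newlen" assertion failure quoted above. */
filblks_t merged_indlen(filblks_t llen, filblks_t lind,
			filblks_t rlen, filblks_t rind)
{
	filblks_t oldlen = lind + rind;			/* actually reserved */
	filblks_t newlen = worst_indlen(llen + rlen);	/* recomputed worst case */

	return newlen < oldlen ? newlen : oldlen;	/* XFS_FILBLKS_MIN() */
}
```

For example, merging a 64-block extent whose reservation was reduced to 0 by an earlier split with a 64-block extent holding 2 reserved blocks gives oldlen = 2 but a recomputed worst case of 3; the clamp returns 2 instead of demanding a block the merge never reserved.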
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-03 17:24 ` Brian Foster
@ 2017-02-06  6:29 ` Tetsuo Handa
  1 sibling, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-06 6:29 UTC (permalink / raw)
To: bfoster, mhocko
Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes,
    linux-kernel, darrick.wong, linux-xfs

Brian Foster wrote:
> On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > [Let's CC more xfs people]
> >
> > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > [...]
> > > (1) I got an assertion failure.
> >
> > I suspect this is a result of
> > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > I have no idea what the assert means though.
> >
> > >
> > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
>
> Indirect block reservation underrun on delayed allocation extent merge.
> These extra blocks are used for the inode bmap btree when a delalloc
> extent is converted to physical blocks. We're in a case where we expect
> to only ever free excess blocks due to a merge of extents with
> independent reservations, but a situation occurs where we actually need
> blocks and hence the assert fails. This can occur if an extent is merged
> with one that has a reservation less than the expected worst case
> reservation for its size (due to previous extent splits from hole
> punches, for example). Therefore, I think the core expectation that
> xfs_bmap_add_extent_hole_delay() will always have enough blocks
> pre-reserved is invalid.
>
> Can you describe the workload that reproduces this? FWIW, I think the
> way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> and have a couple patches to fix up indlen reservation that I haven't
> posted yet. The diff that deals with this particular bit is appended.
> Care to give that a try?

The workload is to write to a single file on XFS from 10 processes, as
demonstrated at
http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
using a "while :; do ./oom-write; done" loop on a VM with 4 CPUs / 2048MB RAM.
With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.

^ permalink raw reply	[flat|nested] 110+ messages in thread
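[For readers without the linked post, the concurrent-append shape of that workload can be sketched as a small C helper. This is a hypothetical reduction, not Tetsuo's oom-write program: the path, process count, and write sizes are placeholders, and unlike the real reproducer it does not drive the system into OOM.]

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork 'nproc' children that each append 'writes' 4KiB buffers to one
 * shared file, then return the resulting file size (-1 on error).
 * O_APPEND makes every write an atomic append, so the final size is
 * nproc * writes * 4096 regardless of interleaving. */
long run_writers(const char *path, int nproc, int writes)
{
	char buf[4096];
	struct stat st;

	memset(buf, 'x', sizeof(buf));
	unlink(path);			/* start from an empty file */

	for (int i = 0; i < nproc; i++) {
		if (fork() == 0) {
			int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
			if (fd < 0)
				_exit(1);
			for (int j = 0; j < writes; j++)
				if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
					_exit(1);
			close(fd);
			_exit(0);
		}
	}
	for (int i = 0; i < nproc; i++)
		wait(NULL);

	return stat(path, &st) == 0 ? (long)st.st_size : -1;
}
```

The actual reproducer wraps something like this in the `while :; do ./oom-write; done` loop while also exhausting memory, so writers keep getting OOM-killed mid-write and restarted against the same XFS file.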
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06  6:29 ` Tetsuo Handa
@ 2017-02-06 14:35 ` Brian Foster
  1 sibling, 0 replies; 110+ messages in thread
From: Brian Foster @ 2017-02-06 14:35 UTC (permalink / raw)
To: Tetsuo Handa
Cc: mhocko, david, dchinner, hch, mgorman, viro, linux-mm, hannes,
    linux-kernel, darrick.wong, linux-xfs

On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> Brian Foster wrote:
> > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > [Let's CC more xfs people]
> > >
> > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > [...]
> > > > (1) I got an assertion failure.
> > >
> > > I suspect this is a result of
> > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > > I have no idea what the assert means though.
> > >
> > > >
> > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> >
> > Indirect block reservation underrun on delayed allocation extent merge.
> > These extra blocks are used for the inode bmap btree when a delalloc
> > extent is converted to physical blocks. We're in a case where we expect
> > to only ever free excess blocks due to a merge of extents with
> > independent reservations, but a situation occurs where we actually need
> > blocks and hence the assert fails. This can occur if an extent is merged
> > with one that has a reservation less than the expected worst case
> > reservation for its size (due to previous extent splits from hole
> > punches, for example). Therefore, I think the core expectation that
> > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > pre-reserved is invalid.
> >
> > Can you describe the workload that reproduces this? FWIW, I think the
> > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > and have a couple patches to fix up indlen reservation that I haven't
> > posted yet. The diff that deals with this particular bit is appended.
> > Care to give that a try?
>
> The workload is to write to a single file on XFS from 10 processes demonstrated at
> http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
>

Thanks for testing. Well, that's an interesting workload. I couldn't
reproduce on a few quick tries in a similarly configured vm.

Normally I'd expect to see this kind of thing on a hole punching
workload or dealing with large, sparse files that make use of
speculative preallocation (post-eof blocks allocated in anticipation of
file extending writes). I'm wondering if what is happening here is that
the appending writes and file closes due to oom kills are generating
speculative preallocs and prealloc truncates, respectively, and that
causes prealloc extents at the eof boundary to be split up and then
re-merged by surviving appending writers.

/tmp/file _is_ on an XFS filesystem in your test, correct? If so, and if
you still have the output file from a test that reproduced, could you
get the 'xfs_io -c "fiemap -v" <file>' output?

I suppose another possibility is that prealloc occurs, write failure(s)
lead to extent splits via unmapping the target range of the write, and
then surviving writers generate the warning on a delalloc extent merge.

Brian

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-02-06 14:35 ` Brian Foster
@ 2017-02-06 14:42 ` Michal Hocko
  1 sibling, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-02-06 14:42 UTC (permalink / raw)
To: Brian Foster
Cc: Tetsuo Handa, david, dchinner, hch, mgorman, viro, linux-mm,
    hannes, linux-kernel, darrick.wong, linux-xfs

On Mon 06-02-17 09:35:33, Brian Foster wrote:
> On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote:
> > Brian Foster wrote:
> > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote:
> > > > [Let's CC more xfs people]
> > > >
> > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote:
> > > > [...]
> > > > > (1) I got an assertion failure.
> > > >
> > > > I suspect this is a result of
> > > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org
> > > > I have no idea what the assert means though.
> > > >
> > > > >
> > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB
> > > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> > > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867
> > >
> > > Indirect block reservation underrun on delayed allocation extent merge.
> > > These extra blocks are used for the inode bmap btree when a delalloc
> > > extent is converted to physical blocks. We're in a case where we expect
> > > to only ever free excess blocks due to a merge of extents with
> > > independent reservations, but a situation occurs where we actually need
> > > blocks and hence the assert fails. This can occur if an extent is merged
> > > with one that has a reservation less than the expected worst case
> > > reservation for its size (due to previous extent splits from hole
> > > punches, for example). Therefore, I think the core expectation that
> > > xfs_bmap_add_extent_hole_delay() will always have enough blocks
> > > pre-reserved is invalid.
> > >
> > > Can you describe the workload that reproduces this? FWIW, I think the
> > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken
> > > and have a couple patches to fix up indlen reservation that I haven't
> > > posted yet. The diff that deals with this particular bit is appended.
> > > Care to give that a try?
> >
> > The workload is to write to a single file on XFS from 10 processes demonstrated at
> > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp
> > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM.
> > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures.
> >
>
> Thanks for testing. Well, that's an interesting workload. I couldn't
> reproduce on a few quick tries in a similarly configured vm.
>
> Normally I'd expect to see this kind of thing on a hole punching
> workload or dealing with large, sparse files that make use of
> speculative preallocation (post-eof blocks allocated in anticipation of
> file extending writes). I'm wondering if what is happening here is that
> the appending writes and file closes due to oom kills are generating
> speculative preallocs and prealloc truncates, respectively, and that
> causes prealloc extents at the eof boundary to be split up and then
> re-merged by surviving appending writers.

Can those preallocs be affected by
http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org ?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-06 14:42 ` Michal Hocko @ 2017-02-06 15:47 ` Brian Foster -1 siblings, 0 replies; 110+ messages in thread From: Brian Foster @ 2017-02-06 15:47 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel, darrick.wong, linux-xfs On Mon, Feb 06, 2017 at 03:42:22PM +0100, Michal Hocko wrote: > On Mon 06-02-17 09:35:33, Brian Foster wrote: > > On Mon, Feb 06, 2017 at 03:29:24PM +0900, Tetsuo Handa wrote: > > > Brian Foster wrote: > > > > On Fri, Feb 03, 2017 at 03:50:09PM +0100, Michal Hocko wrote: > > > > > [Let's CC more xfs people] > > > > > > > > > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > > > > > [...] > > > > > > (1) I got an assertion failure. > > > > > > > > > > I suspect this is a result of > > > > > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org > > > > > I have no idea what the assert means though. > > > > > > > > > > > > > > > > > [ 969.626518] Killed process 6262 (oom-write) total-vm:2166856kB, anon-rss:1128732kB, file-rss:4kB, shmem-rss:0kB > > > > > > [ 969.958307] oom_reaper: reaped process 6262 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB > > > > > > [ 972.114644] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867 > > > > > > > > Indirect block reservation underrun on delayed allocation extent merge. > > > > These are extra blocks are used for the inode bmap btree when a delalloc > > > > extent is converted to physical blocks. We're in a case where we expect > > > > to only ever free excess blocks due to a merge of extents with > > > > independent reservations, but a situation occurs where we actually need > > > > blocks and hence the assert fails. 
This can occur if an extent is merged > > > > with one that has a reservation less than the expected worst case > > > > reservation for its size (due to previous extent splits due to hole > > > > punches, for example). Therefore, I think the core expectation that > > > > xfs_bmap_add_extent_hole_delay() will always have enough blocks > > > > pre-reserved is invalid. > > > > > > > > Can you describe the workload that reproduces this? FWIW, I think the > > > > way xfs_bmap_add_extent_hole_delay() currently works is likely broken > > > > and have a couple patches to fix up indlen reservation that I haven't > > > > posted yet. The diff that deals with this particular bit is appended. > > > > Care to give that a try? > > > > > > The workload is to write to a single file on XFS from 10 processes demonstrated at > > > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp > > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM. > > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures. > > > > > > > Thanks for testing. Well, that's an interesting workload. I couldn't > > reproduce on a few quick tries in a similarly configured vm. > > > > Normally I'd expect to see this kind of thing on a hole punching > > workload or dealing with large, sparse files that make use of > > speculative preallocation (post-eof blocks allocated in anticipation of > > file extending writes). I'm wondering if what is happening here is that > > the appending writes and file closes due to oom kills are generating > > speculative preallocs and prealloc truncates, respectively, and that > > causes prealloc extents at the eof boundary to be split up and then > > re-merged by surviving appending writers. > > Can those preallocs be affected by > http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.org ? > Hmm, I wouldn't expect that to make much of a difference wrt to the core problem. 
The prealloc is created on a file extending write that requires block
allocation (we basically just tack on extra blocks to an extending
alloc based on some heuristics like the size of the file and the
previous extent). Whether that allocation occurs on one iomap iteration
or another due to a short write and retry, I wouldn't expect to matter
that much.

I suppose it could change the behavior of a specialized workload
though. E.g., if it caused a write() call to return quicker and thus
lead to a file close(). We do use file release as an indication that
prealloc will not be used and can reclaim it at that point (presumably
causing an extent split with pre-eof blocks).

Brian

> --
> Michal Hocko
> SUSE Labs
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-06 14:35 ` Brian Foster @ 2017-02-07 10:30 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-07 10:30 UTC (permalink / raw) To: bfoster Cc: mhocko, david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel, darrick.wong, linux-xfs Brian Foster wrote: > > The workload is to write to a single file on XFS from 10 processes demonstrated at > > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM. > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures. > > > > Thanks for testing. Well, that's an interesting workload. I couldn't > reproduce on a few quick tries in a similarly configured vm. It takes 10 to 15 minutes. Maybe some size threshold involved? > /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if > you still have the output file from a test that reproduced, could you > get the 'xfs_io -c "fiemap -v" <file>' output? Here it is. 
[ 720.199748] 0 pages HighMem/MovableOnly [ 720.199749] 150524 pages reserved [ 720.199749] 0 pages cma reserved [ 720.199750] 0 pages hwpoisoned [ 722.187335] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867 [ 722.201784] ------------[ cut here ]------------ [ 722.205940] WARNING: CPU: 0 PID: 4877 at fs/xfs/xfs_message.c:105 asswarn+0x33/0x40 [xfs] [ 722.212333] Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter coretemp crct10dif_pclmul vmw_vsock_vmci_transport crc32_pclmul ghash_clmulni_intel vsock aesni_intel crypto_simd cryptd glue_helper ppdev vmw_balloon pcspkr sg parport_pc i2c_piix4 shpchp vmw_vmci parport ip_tables xfs libcrc32c sd_mod sr_mod cdrom ata_generic pata_acpi crc32c_intel serio_raw vmwgfx drm_kms_helper syscopyarea sysfillrect [ 722.243207] sysimgblt fb_sys_fops mptspi scsi_transport_spi ata_piix ahci ttm mptscsih libahci drm libata mptbase e1000 i2c_core [ 722.247704] CPU: 0 PID: 4877 Comm: write Not tainted 4.10.0-rc6-next-20170202 #498 [ 722.250612] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 722.254089] Call Trace: [ 722.255751] dump_stack+0x85/0xc9 [ 722.257650] __warn+0xd1/0xf0 [ 722.259420] warn_slowpath_null+0x1d/0x20 [ 722.261434] asswarn+0x33/0x40 [xfs] [ 722.263356] xfs_bmap_add_extent_hole_delay+0xb7f/0xdf0 [xfs] [ 722.265695] xfs_bmapi_reserve_delalloc+0x297/0x440 [xfs] [ 722.267792] ? xfs_ilock+0x1c9/0x360 [xfs] [ 722.269559] xfs_file_iomap_begin+0x880/0x1140 [xfs] [ 722.271606] ? 
iomap_write_end+0x80/0x80 [ 722.273377] iomap_apply+0x6c/0x130 [ 722.274969] iomap_file_buffered_write+0x68/0xa0 [ 722.276702] ? iomap_write_end+0x80/0x80 [ 722.278311] xfs_file_buffered_aio_write+0x132/0x390 [xfs] [ 722.280394] ? _raw_spin_unlock+0x27/0x40 [ 722.282247] xfs_file_write_iter+0x90/0x130 [xfs] [ 722.284257] __vfs_write+0xe5/0x140 [ 722.285924] vfs_write+0xc7/0x1f0 [ 722.287536] ? syscall_trace_enter+0x1d0/0x380 [ 722.289490] SyS_write+0x58/0xc0 [ 722.291025] do_int80_syscall_32+0x6c/0x1f0 [ 722.292671] entry_INT80_compat+0x38/0x50 [ 722.294298] RIP: 0023:0x8048076 [ 722.295684] RSP: 002b:00000000ffedf840 EFLAGS: 00000202 ORIG_RAX: 0000000000000004 [ 722.298075] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 0000000008048000 [ 722.300516] RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000 [ 722.302902] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [ 722.305278] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [ 722.307567] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 722.309792] ---[ end trace 5b7012eeb84093b7 ]--- [ 732.650867] oom_reaper: reaped process 4876 (oom-write), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB # ls -l /tmp/file -rw------- 1 kumaneko kumaneko 43426648064 Feb 7 19:25 /tmp/file # xfs_io -c "fiemap -v" /tmp/file /tmp/file: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS 0: [0..262015]: 358739712..359001727 262016 0x0 1: [262016..524159]: 367651920..367914063 262144 0x0 2: [524160..1048447]: 385063864..385588151 524288 0x0 3: [1048448..1238031]: 463702512..463892095 189584 0x0 4: [1238032..3335167]: 448234520..450331655 2097136 0x0 5: [3335168..4769775]: 36165320..37599927 1434608 0x0 6: [4769776..6897175]: 31677984..33805383 2127400 0x0 7: [6897176..15285759]: 450331656..458720239 8388584 0x0 8: [15285760..18520255]: 237497528..240732023 3234496 0x0 9: [18520256..21063607]: 229750248..232293599 2543352 0x0 10: [21063608..25257855]: 240732024..244926271 4194248 
0x0 11: [25257856..29452159]: 179523440..183717743 4194304 0x0 12: [29452160..30380031]: 171930952..172858823 927872 0x0 13: [30380032..31428607]: 185220160..186268735 1048576 0x0 14: [31428608..32667751]: 232293600..233532743 1239144 0x0 15: [32667752..38474351]: 172858824..178665423 5806600 0x0 16: [38474352..39157119]: 188137184..188819951 682768 0x0 17: [39157120..40205695]: 234837584..235886159 1048576 0x0 18: [40205696..42302847]: 33805384..35902535 2097152 0x0 19: [42302848..44188591]: 37599928..39485671 1885744 0x0 20: [44188592..45112703]: 446735416..447659527 924112 0x0 21: [45112704..45436343]: 445337184..445660823 323640 0x0 22: [45436344..45960575]: 447659528..448183759 524232 0x0 23: [45960576..46484863]: 463892096..464416383 524288 0x0 24: [46484864..47533439]: 445660824..446709399 1048576 0x0 25: [47533440..48541959]: 233532744..234541263 1008520 0x0 26: [48541960..49294175]: 523533736..524285951 752216 0x0 27: [49294176..49630591]: 376080552..376416967 336416 0x0 28: [49630592..50154879]: 129846752..130371039 524288 0x0 29: [50154880..50844383]: 244926272..245615775 689504 0x0 30: [50844384..51203455]: 250812112..251171183 359072 0x0 31: [51203456..51727743]: 259555424..260079711 524288 0x0 32: [51727744..52295239]: 187350752..187918247 567496 0x0 33: [52295240..52776319]: 188819952..189301031 481080 0x0 34: [52776320..53300607]: 206841040..207365327 524288 0x0 35: [53300608..53824775]: 386221504..386745671 524168 0x0 36: [53824776..54348935]: 113736928..114261087 524160 0x0 37: [54348936..54854007]: 228911704..229416775 505072 0x0 38: [54854008..54905983]: 228760200..228812175 51976 0x0 39: [54905984..54971519]: 228597920..228663455 65536 0x0 40: [54971520..55364735]: 178998696..179391911 393216 0x0 41: [55364736..55868119]: 392669176..393172559 503384 0x0 42: [55868120..56370663]: 382896800..383399343 502544 0x0 43: [56370664..56836311]: 464416384..464882031 465648 0x0 44: [56836312..57085055]: 458720240..458968983 248744 0x0 45: 
[57085056..57548743]: 92768112..93231799 463688 0x0 46: [57548744..57871487]: 102724384..103047127 322744 0x0 47: [57871488..58304623]: 124278664..124711799 433136 0x0 48: [58304624..58526847]: 124712024..124934247 222224 0x0 49: [58526848..58788991]: 125635832..125897975 262144 0x0 50: [58788992..59203767]: 508031384..508446159 414776 0x0 51: [59203768..59602871]: 109812624..110211727 399104 0x0 52: [59602872..59992183]: 385736856..386126167 389312 0x0 53: [59992184..60381311]: 237108384..237497511 389128 0x0 54: [60381312..60756863]: 506355968..506731519 375552 0x0 55: [60756864..61127487]: 186268736..186639359 370624 0x0 56: [61127488..61490767]: 112848304..113211583 363280 0x0 57: [61490768..61541503]: 113214200..113264935 50736 0x0 58: [61541504..61904775]: 112246776..112610047 363272 0x0 59: [61904776..62246247]: 106328512..106669983 341472 0x0 60: [62246248..62571991]: 126075640..126401383 325744 0x0 61: [62571992..62895759]: 108921744..109245511 323768 0x0 62: [62895760..63219159]: 380153016..380476415 323400 0x0 63: [63219160..63442047]: 381056248..381279135 222888 0x0 64: [63442048..63704191]: 379768072..380030215 262144 0x0 65: [63704192..64026847]: 108328888..108651543 322656 0x0 66: [64026848..64342415]: 251387232..251702799 315568 0x0 67: [64342416..64651407]: 183717744..184026735 308992 0x0 68: [64651408..64947983]: 384092440..384389015 296576 0x0 69: [64947984..65145983]: 381775560..381973559 198000 0x0 70: [65145984..65408127]: 186914504..187176647 262144 0x0 71: [65408128..65447943]: 125328232..125368047 39816 0x0 72: [65447944..65690599]: 372579112..372821767 242656 0x0 73: [65690600..65929863]: 130429664..130668927 239264 0x0 74: [65929864..66168935]: 120951784..121190855 239072 0x0 75: [66168936..66402279]: 372845976..373079319 233344 0x0 76: [66402280..66633199]: 113372616..113603535 230920 0x0 77: [66633200..66859943]: 115982256..116208999 226744 0x0 78: [66859944..67082127]: 127187600..127409783 222184 0x0 79: [67082128..67217407]: 
127636680..127771959 135280 0x0 80: [67217408..67280095]: 129510736..129573423 62688 0x0 81: [67280096..67499063]: 119220288..119439255 218968 0x0 82: [67499064..67717935]: 507320248..507539119 218872 0x0 83: [67717936..67936119]: 129292544..129510727 218184 0x0 84: [67936120..68153903]: 125368048..125585831 217784 0x0 85: [68153904..68370703]: 117784232..118001031 216800 0x0 86: [68370704..68586039]: 121997008..122212343 215336 0x0 87: [68586040..68798855]: 379191840..379404655 212816 0x0 88: [68798856..68983935]: 378690808..378875887 185080 0x0 89: [68983936..69196727]: 90790848..91003639 212792 0x0 90: [69196728..69409287]: 123091672..123304231 212560 0x0 91: [69409288..69621503]: 377436856..377649071 212216 0x0 92: [69621504..69828847]: 128990088..129197431 207344 0x0 93: [69828848..70035391]: 497270968..497477511 206544 0x0 94: [70035392..70241111]: 391898048..392103767 205720 0x0 95: [70241112..70446207]: 260716672..260921767 205096 0x0 96: [70446208..70507647]: 260079712..260141151 61440 0x0 97: [70507648..70704255]: 245836040..246032647 196608 0x0 98: [70704256..70906591]: 107009096..107211431 202336 0x0 99: [70906592..71108807]: 389471224..389673439 202216 0x0 100: [71108808..71309703]: 224305904..224506799 200896 0x0 101: [71309704..71509487]: 388524632..388724415 199784 0x0 102: [71509488..71707119]: 87983688..88181319 197632 0x0 103: [71707120..71903015]: 236195680..236391575 195896 0x0 104: [71903016..72098791]: 389000248..389196023 195776 0x0 105: [72098792..72294471]: 386931872..387127551 195680 0x0 106: [72294472..72342655]: 387127560..387175743 48184 0x0 107: [72342656..72408191]: 388031464..388096999 65536 0x0 108: [72408192..72539263]: 388194472..388325543 131072 0x0 109: [72539264..72562039]: 369903992..369926767 22776 0x0 110: [72562040..72753639]: 506916880..507108479 191600 0x0 111: [72753640..72945143]: 360577376..360768879 191504 0x0 112: [72945144..73136575]: 246426760..246618191 191432 0x0 113: [73136576..73326047]: 116629288..116818759 
189472 0x0 114: [73326048..73515047]: 392203096..392392095 189000 0x0 115: [73515048..73699967]: 223549160..223734079 184920 0x0 116: [73699968..73883879]: 118860856..119044767 183912 0x0 117: [73883880..74067175]: 506143208..506326503 183296 0x0 118: [74067176..74249703]: 507108800..507291327 182528 0x0 119: [74249704..74401335]: 258917640..259069271 151632 0x0 120: [74401336..74583135]: 122742560..122924359 181800 0x0 121: [74583136..74764223]: 374250096..374431183 181088 0x0 122: [74764224..74945271]: 91175800..91356847 181048 0x0 123: [74945272..75124183]: 362484776..362663687 178912 0x0 124: [75124184..75302615]: 223086192..223264623 178432 0x0 125: [75302616..75479279]: 359280032..359456695 176664 0x0 126: [75479280..75655559]: 63083912..63260191 176280 0x0 127: [75655560..75831487]: 384469152..384645079 175928 0x0 128: [75831488..76006815]: 381459584..381634911 175328 0x0 129: [76006816..76181255]: 110626376..110800815 174440 0x0 130: [76181256..76355399]: 380785616..380959759 174144 0x0 131: [76355400..76527527]: 362768136..362940263 172128 0x0 132: [76527528..76698695]: 122571384..122742551 171168 0x0 133: [76698696..76868951]: 382399576..382569831 170256 0x0 134: [76868952..77039095]: 388353776..388523919 170144 0x0 135: [77039096..77209183]: 120236192..120406279 170088 0x0 136: [77209184..77379183]: 383464120..383634119 170000 0x0 137: [77379184..77548655]: 369926768..370096239 169472 0x0 138: [77548656..77717663]: 88823232..88992239 169008 0x0 139: [77717664..77884951]: 365878672..366045959 167288 0x0 140: [77884952..77897079]: 366445360..366457487 12128 0x0 141: [77897080..78063423]: 391500528..391666871 166344 0x0 142: [78063424..78229407]: 107876400..108042383 165984 0x0 143: [78229408..78395135]: 358573976..358739703 165728 0x0 144: [78395136..78560703]: 117078480..117244047 165568 0x0 145: [78560704..78726063]: 257377088..257542447 165360 0x0 146: [78726064..78889519]: 389678704..389842159 163456 0x0 147: [78889520..79052607]: 225850112..226013199 
163088 0x0 148: [79052608..79215111]: 359822880..359985383 162504 0x0 149: [79215112..79376559]: 357914720..358076167 161448 0x0 150: [79376560..79538007]: 115473264..115634711 161448 0x0 151: [79538008..79698815]: 112610056..112770863 160808 0x0 152: [79698816..79857631]: 258732456..258891271 158816 0x0 153: [79857632..80015807]: 388725328..388883503 158176 0x0 154: [80015808..80173583]: 93847144..94004919 157776 0x0 155: [80173584..80331295]: 362940272..363097983 157712 0x0 156: [80331296..80488727]: 252008432..252165863 157432 0x0 157: [80488728..80646055]: 118387696..118545023 157328 0x0 158: [80646056..80803239]: 111368744..111525927 157184 0x0 159: [80803240..80960055]: 129573424..129730239 156816 0x0 160: [80960056..81116863]: 497936416..498093223 156808 0x0 161: [81116864..81272623]: 492109560..492265319 155760 0x0 162: [81272624..81427695]: 114554072..114709143 155072 0x0 163: [81427696..81582519]: 106854264..107009087 154824 0x0 164: [81582520..81735503]: 220700824..220853807 152984 0x0 165: [81735504..81887807]: 490724024..490876327 152304 0x0 166: [81887808..82038863]: 122393688..122544743 151056 0x0 167: [82038864..82189151]: 91659448..91809735 150288 0x0 168: [82189152..82337287]: 85811104..85959239 148136 0x0 169: [82337288..82484743]: 235886160..236033615 147456 0x0 170: [82484744..82631943]: 117486472..117633671 147200 0x0 171: [82631944..82777887]: 491753616..491899559 145944 0x0 172: [82777888..82923799]: 94927544..95073455 145912 0x0 173: [82923800..83068527]: 373754864..373899591 144728 0x0 174: [83068528..83116375]: 373980848..374028695 47848 0x0 175: [83116376..83261039]: 361766120..361910783 144664 0x0 176: [83261040..83404007]: 374431192..374574159 142968 0x0 177: [83404008..83546815]: 484667976..484810783 142808 0x0 178: [83546816..83689279]: 251702808..251845271 142464 0x0 179: [83689280..83831711]: 90474240..90616671 142432 0x0 180: [83831712..83972959]: 109362776..109504023 141248 0x0 181: [83972960..84113743]: 377296064..377436847 
140784 0x0 182: [84113744..84254303]: 378416056..378556615 140560 0x0 183: [84254304..84393663]: 89517888..89657247 139360 0x0 184: [84393664..84532831]: 376569640..376708807 139168 0x0 185: [84532832..84671975]: 108725224..108864367 139144 0x0 186: [84671976..84810807]: 109637664..109776495 138832 0x0 187: [84810808..84901119]: 110211736..110302047 90312 0x1
127636680..127771959 135280 0x0 80: [67217408..67280095]: 129510736..129573423 62688 0x0 81: [67280096..67499063]: 119220288..119439255 218968 0x0 82: [67499064..67717935]: 507320248..507539119 218872 0x0 83: [67717936..67936119]: 129292544..129510727 218184 0x0 84: [67936120..68153903]: 125368048..125585831 217784 0x0 85: [68153904..68370703]: 117784232..118001031 216800 0x0 86: [68370704..68586039]: 121997008..122212343 215336 0x0 87: [68586040..68798855]: 379191840..379404655 212816 0x0 88: [68798856..68983935]: 378690808..378875887 185080 0x0 89: [68983936..69196727]: 90790848..91003639 212792 0x0 90: [69196728..69409287]: 123091672..123304231 212560 0x0 91: [69409288..69621503]: 377436856..377649071 212216 0x0 92: [69621504..69828847]: 128990088..129197431 207344 0x0 93: [69828848..70035391]: 497270968..497477511 206544 0x0 94: [70035392..70241111]: 391898048..392103767 205720 0x0 95: [70241112..70446207]: 260716672..260921767 205096 0x0 96: [70446208..70507647]: 260079712..260141151 61440 0x0 97: [70507648..70704255]: 245836040..246032647 196608 0x0 98: [70704256..70906591]: 107009096..107211431 202336 0x0 99: [70906592..71108807]: 389471224..389673439 202216 0x0 100: [71108808..71309703]: 224305904..224506799 200896 0x0 101: [71309704..71509487]: 388524632..388724415 199784 0x0 102: [71509488..71707119]: 87983688..88181319 197632 0x0 103: [71707120..71903015]: 236195680..236391575 195896 0x0 104: [71903016..72098791]: 389000248..389196023 195776 0x0 105: [72098792..72294471]: 386931872..387127551 195680 0x0 106: [72294472..72342655]: 387127560..387175743 48184 0x0 107: [72342656..72408191]: 388031464..388096999 65536 0x0 108: [72408192..72539263]: 388194472..388325543 131072 0x0 109: [72539264..72562039]: 369903992..369926767 22776 0x0 110: [72562040..72753639]: 506916880..507108479 191600 0x0 111: [72753640..72945143]: 360577376..360768879 191504 0x0 112: [72945144..73136575]: 246426760..246618191 191432 0x0 113: [73136576..73326047]: 116629288..116818759 
189472 0x0 114: [73326048..73515047]: 392203096..392392095 189000 0x0 115: [73515048..73699967]: 223549160..223734079 184920 0x0 116: [73699968..73883879]: 118860856..119044767 183912 0x0 117: [73883880..74067175]: 506143208..506326503 183296 0x0 118: [74067176..74249703]: 507108800..507291327 182528 0x0 119: [74249704..74401335]: 258917640..259069271 151632 0x0 120: [74401336..74583135]: 122742560..122924359 181800 0x0 121: [74583136..74764223]: 374250096..374431183 181088 0x0 122: [74764224..74945271]: 91175800..91356847 181048 0x0 123: [74945272..75124183]: 362484776..362663687 178912 0x0 124: [75124184..75302615]: 223086192..223264623 178432 0x0 125: [75302616..75479279]: 359280032..359456695 176664 0x0 126: [75479280..75655559]: 63083912..63260191 176280 0x0 127: [75655560..75831487]: 384469152..384645079 175928 0x0 128: [75831488..76006815]: 381459584..381634911 175328 0x0 129: [76006816..76181255]: 110626376..110800815 174440 0x0 130: [76181256..76355399]: 380785616..380959759 174144 0x0 131: [76355400..76527527]: 362768136..362940263 172128 0x0 132: [76527528..76698695]: 122571384..122742551 171168 0x0 133: [76698696..76868951]: 382399576..382569831 170256 0x0 134: [76868952..77039095]: 388353776..388523919 170144 0x0 135: [77039096..77209183]: 120236192..120406279 170088 0x0 136: [77209184..77379183]: 383464120..383634119 170000 0x0 137: [77379184..77548655]: 369926768..370096239 169472 0x0 138: [77548656..77717663]: 88823232..88992239 169008 0x0 139: [77717664..77884951]: 365878672..366045959 167288 0x0 140: [77884952..77897079]: 366445360..366457487 12128 0x0 141: [77897080..78063423]: 391500528..391666871 166344 0x0 142: [78063424..78229407]: 107876400..108042383 165984 0x0 143: [78229408..78395135]: 358573976..358739703 165728 0x0 144: [78395136..78560703]: 117078480..117244047 165568 0x0 145: [78560704..78726063]: 257377088..257542447 165360 0x0 146: [78726064..78889519]: 389678704..389842159 163456 0x0 147: [78889520..79052607]: 225850112..226013199 
163088 0x0 148: [79052608..79215111]: 359822880..359985383 162504 0x0 149: [79215112..79376559]: 357914720..358076167 161448 0x0 150: [79376560..79538007]: 115473264..115634711 161448 0x0 151: [79538008..79698815]: 112610056..112770863 160808 0x0 152: [79698816..79857631]: 258732456..258891271 158816 0x0 153: [79857632..80015807]: 388725328..388883503 158176 0x0 154: [80015808..80173583]: 93847144..94004919 157776 0x0 155: [80173584..80331295]: 362940272..363097983 157712 0x0 156: [80331296..80488727]: 252008432..252165863 157432 0x0 157: [80488728..80646055]: 118387696..118545023 157328 0x0 158: [80646056..80803239]: 111368744..111525927 157184 0x0 159: [80803240..80960055]: 129573424..129730239 156816 0x0 160: [80960056..81116863]: 497936416..498093223 156808 0x0 161: [81116864..81272623]: 492109560..492265319 155760 0x0 162: [81272624..81427695]: 114554072..114709143 155072 0x0 163: [81427696..81582519]: 106854264..107009087 154824 0x0 164: [81582520..81735503]: 220700824..220853807 152984 0x0 165: [81735504..81887807]: 490724024..490876327 152304 0x0 166: [81887808..82038863]: 122393688..122544743 151056 0x0 167: [82038864..82189151]: 91659448..91809735 150288 0x0 168: [82189152..82337287]: 85811104..85959239 148136 0x0 169: [82337288..82484743]: 235886160..236033615 147456 0x0 170: [82484744..82631943]: 117486472..117633671 147200 0x0 171: [82631944..82777887]: 491753616..491899559 145944 0x0 172: [82777888..82923799]: 94927544..95073455 145912 0x0 173: [82923800..83068527]: 373754864..373899591 144728 0x0 174: [83068528..83116375]: 373980848..374028695 47848 0x0 175: [83116376..83261039]: 361766120..361910783 144664 0x0 176: [83261040..83404007]: 374431192..374574159 142968 0x0 177: [83404008..83546815]: 484667976..484810783 142808 0x0 178: [83546816..83689279]: 251702808..251845271 142464 0x0 179: [83689280..83831711]: 90474240..90616671 142432 0x0 180: [83831712..83972959]: 109362776..109504023 141248 0x0 181: [83972960..84113743]: 377296064..377436847 
140784 0x0 182: [84113744..84254303]: 378416056..378556615 140560 0x0 183: [84254304..84393663]: 89517888..89657247 139360 0x0 184: [84393664..84532831]: 376569640..376708807 139168 0x0 185: [84532832..84671975]: 108725224..108864367 139144 0x0 186: [84671976..84810807]: 109637664..109776495 138832 0x0 187: [84810808..84901119]: 110211736..110302047 90312 0x1 ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-07 10:30 ` Tetsuo Handa @ 2017-02-07 16:54 ` Brian Foster -1 siblings, 0 replies; 110+ messages in thread From: Brian Foster @ 2017-02-07 16:54 UTC (permalink / raw) To: Tetsuo Handa Cc: mhocko, david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel, darrick.wong, linux-xfs On Tue, Feb 07, 2017 at 07:30:54PM +0900, Tetsuo Handa wrote: > Brian Foster wrote: > > > The workload is to write to a single file on XFS from 10 processes demonstrated at > > > http://lkml.kernel.org/r/201512052133.IAE00551.LSOQFtMFFVOHOJ@I-love.SAKURA.ne.jp > > > using "while :; do ./oom-write; done" loop on a VM with 4CPUs / 2048MB RAM. > > > With this XFS_FILBLKS_MIN() change applied, I no longer hit assertion failures. > > > > > > > Thanks for testing. Well, that's an interesting workload. I couldn't > > reproduce on a few quick tries in a similarly configured vm. > > It takes 10 to 15 minutes. Maybe some size threshold involved? > > > /tmp/file _is_ on an XFS filesystem in your test, correct? If so and if > > you still have the output file from a test that reproduced, could you > > get the 'xfs_io -c "fiemap -v" <file>' output? > > Here it is. > > [ 720.199748] 0 pages HighMem/MovableOnly > [ 720.199749] 150524 pages reserved > [ 720.199749] 0 pages cma reserved > [ 720.199750] 0 pages hwpoisoned > [ 722.187335] XFS: Assertion failed: oldlen > newlen, file: fs/xfs/libxfs/xfs_bmap.c, line: 2867 > [ 722.201784] ------------[ cut here ]------------ ... > > # ls -l /tmp/file > -rw------- 1 kumaneko kumaneko 43426648064 Feb 7 19:25 /tmp/file > # xfs_io -c "fiemap -v" /tmp/file > /tmp/file: > EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS > 0: [0..262015]: 358739712..359001727 262016 0x0 ... > 187: [84810808..84901119]: 110211736..110302047 90312 0x1 Ok, from the size of the file I realized that I missed you were running in a loop the first time around. 
I tried playing with it some more and still haven't been able to reproduce. Anyways, the patch intended to fix this has been reviewed[1] and queued for the next release, so it's probably not a big deal since you've already verified it. Thanks again. Brian [1] http://www.spinics.net/lists/linux-xfs/msg04083.html ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-03 10:57 ` Tetsuo Handa @ 2017-02-03 14:55 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-03 14:55 UTC (permalink / raw) To: Tetsuo Handa Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, Peter Zijlstra [CC Petr] On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: [...] > (2) I got a lockdep warning. (A new false positive?) Yes, I suspect this is a false positive. I do not see how we can deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which means that a potential recursion to the page allocator during draining would just bail out on the trylock. Maybe I am misinterpreting the report though. > [ 243.036975] ===================================================== > [ 243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected > [ 243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted > [ 243.054619] ----------------------------------------------------- > [ 243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: > [ 243.060310] (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80 > [ 243.063462] > [ 243.063462] and this task is already holding: > [ 243.066851] (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.069949] which would create a new lock dependency: > [ 243.072143] (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++} > [ 243.074789] > [ 243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock: > [ 243.078735] (&xfs_dir_ilock_class){++++-.} > [ 243.078739] > [ 243.078739] ... 
which became RECLAIM_FS-irq-safe at: > [ 243.084175] > [ 243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 > [ 243.087257] > [ 243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.090027] > [ 243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.092838] > [ 243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.095453] > [ 243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] > [ 243.098083] > [ 243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] > [ 243.100668] > [ 243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] > [ 243.103191] > [ 243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] > [ 243.105710] > [ 243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 > [ 243.107947] > [ 243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 > [ 243.110133] > [ 243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320 > [ 243.112262] > [ 243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10 > [ 243.114323] > [ 243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140 > [ 243.116448] > [ 243.116452] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.118692] > [ 243.118692] to a RECLAIM_FS-irq-unsafe lock: > [ 243.120636] (cpu_hotplug.dep_map){++++++} > [ 243.120638] > [ 243.120638] ... which became RECLAIM_FS-irq-unsafe at: > [ 243.124021] ... 
> [ 243.124022] > [ 243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 > [ 243.127033] > [ 243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 > [ 243.129228] > [ 243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 > [ 243.131534] > [ 243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0 > [ 243.133850] > [ 243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90 > [ 243.136113] > [ 243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 > [ 243.138319] > [ 243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0 > [ 243.140479] > [ 243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0 > [ 243.142484] > [ 243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 > [ 243.144716] > [ 243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10 > [ 243.146684] > [ 243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141 > [ 243.148755] > [ 243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 > [ 243.150932] > [ 243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100 > [ 243.153088] > [ 243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.155135] > [ 243.155135] other info that might help us debug this: > [ 243.155135] > [ 243.157724] Possible interrupt unsafe locking scenario: > [ 243.157724] > [ 243.159877] CPU0 CPU1 > [ 243.161047] ---- ---- > [ 243.162210] lock(cpu_hotplug.dep_map); > [ 243.163279] local_irq_disable(); > [ 243.164669] lock(&xfs_dir_ilock_class); > [ 243.166148] lock(cpu_hotplug.dep_map); > [ 243.167653] <Interrupt> > [ 243.168594] lock(&xfs_dir_ilock_class); > [ 243.169694] > [ 243.169694] *** DEADLOCK *** > [ 243.169694] > [ 243.171864] 3 locks held by awk/8767: > [ 243.172872] #0: (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90 > [ 243.174791] #1: (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.176899] #2: (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] 
drain_all_pages.part.80+0x1a/0x320 > [ 243.178875] > [ 243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock: > [ 243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 { > [ 243.182610] HARDIRQ-ON-W at: > [ 243.183603] > [ 243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 > [ 243.186056] > [ 243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.188419] > [ 243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.190909] > [ 243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.193257] > [ 243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.195795] > [ 243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.198204] > [ 243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.200570] > [ 243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.203086] > [ 243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 > [ 243.205417] > [ 243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 > [ 243.207711] > [ 243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.210092] > [ 243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180 > [ 243.212427] > [ 243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40 > [ 243.214664] > [ 243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 > [ 243.217045] > [ 243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 > [ 243.219501] > [ 243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 > [ 243.222056] > [ 243.222058] [<ffffffff81266767>] do_execve+0x27/0x30 > [ 243.224471] > [ 243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30 > [ 243.226787] > [ 243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 > [ 243.229178] > [ 243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a > [ 243.231695] HARDIRQ-ON-R at: > [ 243.232709] > [ 243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 > [ 243.235161] > [ 243.235164] 
[<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.237547] > [ 243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 > [ 243.239930] > [ 243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.242353] > [ 243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.244978] > [ 243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.247493] > [ 243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.249910] > [ 243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.252407] > [ 243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 > [ 243.254747] > [ 243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 > [ 243.257126] > [ 243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 > [ 243.259495] > [ 243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90 > [ 243.261804] > [ 243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.264184] > [ 243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 > [ 243.266595] > [ 243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.268984] > [ 243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 > [ 243.271702] SOFTIRQ-ON-W at: > [ 243.272726] > [ 243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 > [ 243.275109] > [ 243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.277426] > [ 243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.279790] > [ 243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.282192] > [ 243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.284794] > [ 243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.287259] > [ 243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.289735] > [ 243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.292205] > [ 243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 > [ 243.294555] > [ 243.294558] 
[<ffffffff8126e2fa>] path_openat+0x55a/0xa90 > [ 243.296897] > [ 243.296900] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.299242] > [ 243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180 > [ 243.301754] > [ 243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40 > [ 243.304037] > [ 243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 > [ 243.306531] > [ 243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 > [ 243.308976] > [ 243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 > [ 243.311506] > [ 243.311508] [<ffffffff81266767>] do_execve+0x27/0x30 > [ 243.313777] > [ 243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30 > [ 243.316067] > [ 243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 > [ 243.318429] > [ 243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a > [ 243.320884] SOFTIRQ-ON-R at: > [ 243.321860] > [ 243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 > [ 243.324251] > [ 243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.326601] > [ 243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 > [ 243.328966] > [ 243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.331384] > [ 243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.333978] > [ 243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.336492] > [ 243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.338926] > [ 243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.341365] > [ 243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 > [ 243.343694] > [ 243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 > [ 243.346074] > [ 243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 > [ 243.348443] > [ 243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90 > [ 243.350753] > [ 243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.353240] > [ 243.353244] [<ffffffff8125c0ea>] 
do_sys_open+0x13a/0x200 > [ 243.355581] > [ 243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.358015] > [ 243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 > [ 243.360586] IN-RECLAIM_FS-W at: > [ 243.361628] > [ 243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 > [ 243.364273] > [ 243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.366710] > [ 243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.369153] > [ 243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.371597] > [ 243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] > [ 243.374339] > [ 243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] > [ 243.377009] > [ 243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] > [ 243.379659] > [ 243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] > [ 243.382349] > [ 243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 > [ 243.384907] > [ 243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 > [ 243.387690] > [ 243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320 > [ 243.390148] > [ 243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10 > [ 243.392517] > [ 243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140 > [ 243.394851] > [ 243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.397246] INITIAL USE at: > [ 243.398227] > [ 243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 > [ 243.400646] > [ 243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.402997] > [ 243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 > [ 243.405351] > [ 243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.407778] > [ 243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.410364] > [ 243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.412989] > [ 243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.415416] > [ 
243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.417871] > [ 243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 > [ 243.420641] > [ 243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 > [ 243.423039] > [ 243.423041] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 > [ 243.425553] > [ 243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90 > [ 243.427891] > [ 243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.430249] > [ 243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 > [ 243.432586] > [ 243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.434839] > [ 243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 > [ 243.437343] } > [ 243.438115] ... key at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs] > [ 243.440082] ... acquired at: > [ 243.441047] > [ 243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 > [ 243.443169] > [ 243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 > [ 243.445366] > [ 243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.447471] > [ 243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 > [ 243.449601] > [ 243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 > [ 243.452123] > [ 243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 > [ 243.454264] > [ 243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 > [ 243.456596] > [ 243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 > [ 243.458774] > [ 243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 > [ 243.460952] > [ 243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 > [ 243.463199] > [ 243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 > [ 243.465482] > [ 243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] > [ 243.467754] > [ 243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] > [ 243.470083] > [ 243.470101] [<ffffffffa025f333>] 
xfs_dir2_node_lookup+0x53/0x2b0 [xfs] > [ 243.472427] > [ 243.472445] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs] > [ 243.474705] > [ 243.474726] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.476933] > [ 243.476954] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.479178] > [ 243.479180] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 > [ 243.481350] > [ 243.481352] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 > [ 243.483907] > [ 243.483910] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.486070] > [ 243.486073] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 > [ 243.488334] > [ 243.488338] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.490476] > [ 243.490480] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 > [ 243.492619] > [ 243.492623] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a > [ 243.494864] > [ 243.495618] > [ 243.495618] the dependencies between the lock to be acquired > [ 243.495619] and RECLAIM_FS-irq-unsafe lock: > [ 243.498973] -> (cpu_hotplug.dep_map){++++++} ops: 838 { > [ 243.500297] HARDIRQ-ON-W at: > [ 243.501292] > [ 243.501295] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 > [ 243.503718] > [ 243.503719] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.506059] > [ 243.506061] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0 > [ 243.508471] > [ 243.508473] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0 > [ 243.510708] > [ 243.510709] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 > [ 243.512997] > [ 243.512999] [<ffffffff8109023e>] cpu_up+0xe/0x10 > [ 243.515556] > [ 243.515561] [<ffffffff81f6f446>] smp_init+0xd5/0x141 > [ 243.517807] > [ 243.517810] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 > [ 243.520271] > [ 243.520275] [<ffffffff817048e9>] kernel_init+0x9/0x100 > [ 243.522538] > [ 243.522540] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.524833] HARDIRQ-ON-R at: > [ 243.525801] > [ 243.525803] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 > [ 243.528152] > [ 243.528153] [<ffffffff810f1840>] 
lock_acquire+0xe0/0x2a0 > [ 243.530416] > [ 243.530419] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 > [ 243.532696] > [ 243.532698] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0 > [ 243.535039] > [ 243.535041] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5 > [ 243.537451] > [ 243.537453] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2 > [ 243.539744] > [ 243.539746] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c > [ 243.542186] > [ 243.542188] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f > [ 243.544603] > [ 243.544605] [<ffffffff810001c4>] verify_cpu+0x0/0xfc > [ 243.547245] SOFTIRQ-ON-W at: > [ 243.548241] > [ 243.548243] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 > [ 243.550559] > [ 243.550561] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.552841] > [ 243.552842] [<ffffffff8108ff5e>] cpu_hotplug_begin+0x6e/0xe0 > [ 243.555186] > [ 243.555187] [<ffffffff8109009d>] _cpu_up+0x2d/0xf0 > [ 243.557404] > [ 243.557405] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 > [ 243.559654] > [ 243.559656] [<ffffffff8109023e>] cpu_up+0xe/0x10 > [ 243.561824] > [ 243.561827] [<ffffffff81f6f446>] smp_init+0xd5/0x141 > [ 243.564048] > [ 243.564050] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 > [ 243.566455] > [ 243.566457] [<ffffffff817048e9>] kernel_init+0x9/0x100 > [ 243.568731] > [ 243.568733] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.571014] SOFTIRQ-ON-R at: > [ 243.571975] > [ 243.571976] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 > [ 243.574328] > [ 243.574330] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.576610] > [ 243.576612] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 > [ 243.579161] > [ 243.579165] [<ffffffff811ec375>] kmem_cache_create+0x35/0x2d0 > [ 243.581537] > [ 243.581539] [<ffffffff81f87d4a>] debug_objects_mem_init+0x48/0x5c5 > [ 243.583982] > [ 243.583984] [<ffffffff81f3f108>] start_kernel+0x3ec/0x4c2 > [ 243.586304] > [ 243.586306] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c 
> [ 243.588819] > [ 243.588821] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f > [ 243.591227] > [ 243.591229] [<ffffffff810001c4>] verify_cpu+0x0/0xfc > [ 243.593507] RECLAIM_FS-ON-W at: > [ 243.594519] > [ 243.594520] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 > [ 243.596888] > [ 243.596895] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 > [ 243.599331] > [ 243.599334] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 > [ 243.601872] > [ 243.601874] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0 > [ 243.604460] > [ 243.604461] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90 > [ 243.606950] > [ 243.606952] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 > [ 243.609463] > [ 243.609465] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0 > [ 243.612282] > [ 243.612285] [<ffffffff810900f4>] _cpu_up+0x84/0xf0 > [ 243.614604] > [ 243.614606] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 > [ 243.616929] > [ 243.616930] [<ffffffff8109023e>] cpu_up+0xe/0x10 > [ 243.619208] > [ 243.619211] [<ffffffff81f6f446>] smp_init+0xd5/0x141 > [ 243.621518] > [ 243.621520] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 > [ 243.624018] > [ 243.624020] [<ffffffff817048e9>] kernel_init+0x9/0x100 > [ 243.626374] > [ 243.626376] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.628771] RECLAIM_FS-ON-R at: > [ 243.629802] > [ 243.629803] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 > [ 243.632201] > [ 243.632203] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 > [ 243.634692] > [ 243.634695] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 > [ 243.637277] > [ 243.637279] [<ffffffff8100cbb4>] allocate_shared_regs+0x24/0x70 > [ 243.639777] > [ 243.639779] [<ffffffff8100cc32>] intel_pmu_cpu_prepare+0x32/0x140 > [ 243.643062] > [ 243.643066] [<ffffffff810053db>] x86_pmu_prepare_cpu+0x3b/0x40 > [ 243.645553] > [ 243.645556] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 > [ 243.648095] > [ 243.648097] 
[<ffffffff8108f29c>] cpuhp_issue_call+0xec/0x160 > [ 243.650536] > [ 243.650539] [<ffffffff8108f6bb>] __cpuhp_setup_state+0x13b/0x1a0 > [ 243.653126] > [ 243.653130] [<ffffffff81f427e9>] init_hw_perf_events+0x402/0x5b6 > [ 243.655652] > [ 243.655655] [<ffffffff8100217c>] do_one_initcall+0x4c/0x1b0 > [ 243.658127] > [ 243.658130] [<ffffffff81f3f333>] kernel_init_freeable+0x155/0x2a7 > [ 243.660653] > [ 243.660656] [<ffffffff817048e9>] kernel_init+0x9/0x100 > [ 243.663048] > [ 243.663050] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.665436] INITIAL USE at: > [ 243.666403] > [ 243.666405] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 > [ 243.668790] > [ 243.668791] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.671093] > [ 243.671095] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 > [ 243.673455] > [ 243.673458] [<ffffffff8108f5be>] __cpuhp_setup_state+0x3e/0x1a0 > [ 243.676126] > [ 243.676130] [<ffffffff81f7660e>] page_alloc_init+0x23/0x3a > [ 243.678510] > [ 243.678512] [<ffffffff81f3eebe>] start_kernel+0x1a2/0x4c2 > [ 243.680851] > [ 243.680853] [<ffffffff81f3e5d6>] x86_64_start_reservations+0x2a/0x2c > [ 243.683367] > [ 243.683369] [<ffffffff81f3e724>] x86_64_start_kernel+0x14c/0x16f > [ 243.685812] > [ 243.685815] [<ffffffff810001c4>] verify_cpu+0x0/0xfc > [ 243.688133] } > [ 243.688907] ... key at: [<ffffffff81c56848>] cpu_hotplug+0x108/0x140 > [ 243.690542] ... 
acquired at: > [ 243.691514] > [ 243.691517] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 > [ 243.693655] > [ 243.693656] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 > [ 243.695820] > [ 243.695822] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.697926] > [ 243.697929] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 > [ 243.700042] > [ 243.700044] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 > [ 243.702285] > [ 243.702286] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 > [ 243.704405] > [ 243.704407] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 > [ 243.706721] > [ 243.706724] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 > [ 243.708867] > [ 243.708870] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 > [ 243.711000] > [ 243.711002] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 > [ 243.713211] > [ 243.713213] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 > [ 243.715366] > [ 243.715410] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] > [ 243.717625] > [ 243.717644] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] > [ 243.719889] > [ 243.719918] [<ffffffffa025f333>] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] > [ 243.722224] > [ 243.722242] [<ffffffffa0255ac5>] xfs_dir_lookup+0x1a5/0x1c0 [xfs] > [ 243.724493] > [ 243.724514] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.726690] > [ 243.726710] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.728933] > [ 243.728936] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 > [ 243.731064] > [ 243.731066] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 > [ 243.733192] > [ 243.733194] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.735312] > [ 243.735315] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 > [ 243.737523] > [ 243.737527] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.739577] > [ 243.739579] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 > [ 243.741702] > [ 243.741706] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a > [ 243.743932] > [ 
243.744661] > [ 243.744661] stack backtrace: > [ 243.746302] CPU: 1 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46 > [ 243.747963] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013 > [ 243.750166] Call Trace: > [ 243.751071] dump_stack+0x85/0xc9 > [ 243.752110] check_usage+0x4f9/0x680 > [ 243.753188] check_irq_usage+0x4a/0xb0 > [ 243.754280] __lock_acquire+0x1364/0x1bb0 > [ 243.755410] lock_acquire+0xe0/0x2a0 > [ 243.756467] ? get_online_cpus+0x32/0x80 > [ 243.757580] get_online_cpus+0x58/0x80 > [ 243.758664] ? get_online_cpus+0x32/0x80 > [ 243.759764] drain_all_pages.part.80+0x27/0x320 > [ 243.760972] drain_all_pages+0x19/0x20 > [ 243.762039] __alloc_pages_nodemask+0x784/0x1630 > [ 243.763249] ? rcu_read_lock_sched_held+0x91/0xa0 > [ 243.764466] ? __alloc_pages_nodemask+0x2e6/0x1630 > [ 243.765689] ? mark_held_locks+0x71/0x90 > [ 243.766780] ? cache_grow_begin+0x4ac/0x630 > [ 243.767912] cache_grow_begin+0xcf/0x630 > [ 243.768985] ? ____cache_alloc_node+0x1bf/0x240 > [ 243.770173] fallback_alloc+0x1e5/0x290 > [ 243.771233] ____cache_alloc_node+0x235/0x240 > [ 243.772403] ? kmem_zone_alloc+0x91/0x120 [xfs] > [ 243.773576] kmem_cache_alloc+0x26c/0x3e0 > [ 243.774671] kmem_zone_alloc+0x91/0x120 [xfs] > [ 243.775816] xfs_da_state_alloc+0x15/0x20 [xfs] > [ 243.776989] xfs_dir2_node_lookup+0x53/0x2b0 [xfs] > [ 243.778188] xfs_dir_lookup+0x1a5/0x1c0 [xfs] > [ 243.779327] xfs_lookup+0x7f/0x250 [xfs] > [ 243.780394] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.781466] lookup_open+0x54c/0x790 > [ 243.782440] path_openat+0x55a/0xa90 > [ 243.783412] do_filp_open+0x8c/0x100 > [ 243.784377] ? _raw_spin_unlock+0x22/0x30 > [ 243.785418] ? 
__alloc_fd+0xf2/0x210 > [ 243.786378] do_sys_open+0x13a/0x200 > [ 243.787361] SyS_open+0x19/0x20 > [ 243.788252] do_syscall_64+0x67/0x1f0 > [ 243.789228] entry_SYSCALL64_slow_path+0x25/0x25 > [ 243.790347] RIP: 0033:0x7fcf8dda06c7 > [ 243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 > [ 243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7 > [ 243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08 > [ 243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000 > [ 243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000 > [ 243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890 > [ 253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 > [ 253.546121] awk cpuset=/ mems_allowed=0 > [ 253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-02-03 14:55 ` Michal Hocko 0 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-03 14:55 UTC (permalink / raw) To: Tetsuo Handa Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, Peter Zijlstra [CC Petr] On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: [...] > (2) I got a lockdep warning. (A new false positive?) Yes, I suspect this is a false positive. I do not see how we can deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which means that a potential recursion to the page allocator during draining would just bail out on the trylock. Maybe I am misinterpreting the report though. > [ 243.036975] ===================================================== > [ 243.042976] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected > [ 243.051211] 4.10.0-rc6-next-20170202 #46 Not tainted > [ 243.054619] ----------------------------------------------------- > [ 243.057395] awk/8767 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: > [ 243.060310] (cpu_hotplug.dep_map){++++++}, at: [<ffffffff8108ddf2>] get_online_cpus+0x32/0x80 > [ 243.063462] > [ 243.063462] and this task is already holding: > [ 243.066851] (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.069949] which would create a new lock dependency: > [ 243.072143] (&xfs_dir_ilock_class){++++-.} -> (cpu_hotplug.dep_map){++++++} > [ 243.074789] > [ 243.074789] but this new dependency connects a RECLAIM_FS-irq-safe lock: > [ 243.078735] (&xfs_dir_ilock_class){++++-.} > [ 243.078739] > [ 243.078739] ... 
which became RECLAIM_FS-irq-safe at: > [ 243.084175] > [ 243.084180] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 > [ 243.087257] > [ 243.087261] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.090027] > [ 243.090033] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.092838] > [ 243.092888] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.095453] > [ 243.095485] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] > [ 243.098083] > [ 243.098109] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] > [ 243.100668] > [ 243.100692] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] > [ 243.103191] > [ 243.103221] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] > [ 243.105710] > [ 243.105714] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 > [ 243.107947] > [ 243.107950] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 > [ 243.110133] > [ 243.110135] [<ffffffff811d876d>] shrink_node+0x23d/0x320 > [ 243.112262] > [ 243.112264] [<ffffffff811d9e24>] kswapd+0x354/0xa10 > [ 243.114323] > [ 243.114326] [<ffffffff810b5caa>] kthread+0x10a/0x140 > [ 243.116448] > [ 243.116452] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.118692] > [ 243.118692] to a RECLAIM_FS-irq-unsafe lock: > [ 243.120636] (cpu_hotplug.dep_map){++++++} > [ 243.120638] > [ 243.120638] ... which became RECLAIM_FS-irq-unsafe at: > [ 243.124021] ... 
> [ 243.124022] > [ 243.124820] [<ffffffff810ef051>] mark_held_locks+0x71/0x90 > [ 243.127033] > [ 243.127035] [<ffffffff810f3405>] lockdep_trace_alloc+0xc5/0x110 > [ 243.129228] > [ 243.129231] [<ffffffff8122f8ca>] kmem_cache_alloc_node_trace+0x4a/0x410 > [ 243.131534] > [ 243.131536] [<ffffffff810ba350>] __smpboot_create_thread.part.3+0x30/0xf0 > [ 243.133850] > [ 243.133852] [<ffffffff810ba7a1>] smpboot_create_threads+0x61/0x90 > [ 243.136113] > [ 243.136119] [<ffffffff8108e2cb>] cpuhp_invoke_callback+0xbb/0xb70 > [ 243.138319] > [ 243.138320] [<ffffffff8108fc82>] cpuhp_up_callbacks+0x32/0xb0 > [ 243.140479] > [ 243.140480] [<ffffffff810900f4>] _cpu_up+0x84/0xf0 > [ 243.142484] > [ 243.142485] [<ffffffff810901e4>] do_cpu_up+0x84/0xd0 > [ 243.144716] > [ 243.144719] [<ffffffff8109023e>] cpu_up+0xe/0x10 > [ 243.146684] > [ 243.146687] [<ffffffff81f6f446>] smp_init+0xd5/0x141 > [ 243.148755] > [ 243.148758] [<ffffffff81f3f35b>] kernel_init_freeable+0x17d/0x2a7 > [ 243.150932] > [ 243.150936] [<ffffffff817048e9>] kernel_init+0x9/0x100 > [ 243.153088] > [ 243.153092] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.155135] > [ 243.155135] other info that might help us debug this: > [ 243.155135] > [ 243.157724] Possible interrupt unsafe locking scenario: > [ 243.157724] > [ 243.159877] CPU0 CPU1 > [ 243.161047] ---- ---- > [ 243.162210] lock(cpu_hotplug.dep_map); > [ 243.163279] local_irq_disable(); > [ 243.164669] lock(&xfs_dir_ilock_class); > [ 243.166148] lock(cpu_hotplug.dep_map); > [ 243.167653] <Interrupt> > [ 243.168594] lock(&xfs_dir_ilock_class); > [ 243.169694] > [ 243.169694] *** DEADLOCK *** > [ 243.169694] > [ 243.171864] 3 locks held by awk/8767: > [ 243.172872] #0: (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff8126e2dc>] path_openat+0x53c/0xa90 > [ 243.174791] #1: (&xfs_dir_ilock_class){++++-.}, at: [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.176899] #2: (pcpu_drain_mutex){+.+...}, at: [<ffffffff811bf39a>] 
drain_all_pages.part.80+0x1a/0x320 > [ 243.178875] > [ 243.178875] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock: > [ 243.181262] -> (&xfs_dir_ilock_class){++++-.} ops: 17348 { > [ 243.182610] HARDIRQ-ON-W at: > [ 243.183603] > [ 243.183606] [<ffffffff810efd84>] __lock_acquire+0x794/0x1bb0 > [ 243.186056] > [ 243.186059] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.188419] > [ 243.188422] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.190909] > [ 243.190941] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.193257] > [ 243.193281] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.195795] > [ 243.195814] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.198204] > [ 243.198227] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.200570] > [ 243.200593] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.203086] > [ 243.203089] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 > [ 243.205417] > [ 243.205420] [<ffffffff8126e2fa>] path_openat+0x55a/0xa90 > [ 243.207711] > [ 243.207713] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.210092] > [ 243.210095] [<ffffffff81263c41>] do_open_execat+0x71/0x180 > [ 243.212427] > [ 243.212429] [<ffffffff812641b6>] open_exec+0x26/0x40 > [ 243.214664] > [ 243.214668] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 > [ 243.217045] > [ 243.217048] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 > [ 243.219501] > [ 243.219503] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 > [ 243.222056] > [ 243.222058] [<ffffffff81266767>] do_execve+0x27/0x30 > [ 243.224471] > [ 243.224475] [<ffffffff812669c0>] SyS_execve+0x20/0x30 > [ 243.226787] > [ 243.226790] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 > [ 243.229178] > [ 243.229182] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a > [ 243.231695] HARDIRQ-ON-R at: > [ 243.232709] > [ 243.232712] [<ffffffff810ef8c0>] __lock_acquire+0x2d0/0x1bb0 > [ 243.235161] > [ 243.235164] 
[<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.237547] > [ 243.237551] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 > [ 243.239930] > [ 243.239962] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.242353] > [ 243.242385] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.244978] > [ 243.244998] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.247493] > [ 243.247515] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.249910] > [ 243.249930] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.252407] > [ 243.252412] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 > [ 243.254747] > [ 243.254750] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 > [ 243.257126] > [ 243.257128] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 > [ 243.259495] > [ 243.259497] [<ffffffff8126de41>] path_openat+0xa1/0xa90 > [ 243.261804] > [ 243.261806] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.264184] > [ 243.264188] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 > [ 243.266595] > [ 243.266599] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.268984] > [ 243.268989] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 > [ 243.271702] SOFTIRQ-ON-W at: > [ 243.272726] > [ 243.272729] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 > [ 243.275109] > [ 243.275111] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.277426] > [ 243.277429] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.279790] > [ 243.279823] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.282192] > [ 243.282216] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.284794] > [ 243.284816] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.287259] > [ 243.287284] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.289735] > [ 243.289763] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.292205] > [ 243.292208] [<ffffffff8126ce2c>] lookup_open+0x54c/0x790 > [ 243.294555] > [ 243.294558] 
[<ffffffff8126e2fa>] path_openat+0x55a/0xa90 > [ 243.296897] > [ 243.296900] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.299242] > [ 243.299244] [<ffffffff81263c41>] do_open_execat+0x71/0x180 > [ 243.301754] > [ 243.301759] [<ffffffff812641b6>] open_exec+0x26/0x40 > [ 243.304037] > [ 243.304042] [<ffffffff812c43ee>] load_elf_binary+0x2be/0x15f0 > [ 243.306531] > [ 243.306534] [<ffffffff812644b0>] search_binary_handler+0x80/0x1e0 > [ 243.308976] > [ 243.308979] [<ffffffff812663ca>] do_execveat_common.isra.40+0x68a/0xa00 > [ 243.311506] > [ 243.311508] [<ffffffff81266767>] do_execve+0x27/0x30 > [ 243.313777] > [ 243.313779] [<ffffffff812669c0>] SyS_execve+0x20/0x30 > [ 243.316067] > [ 243.316070] [<ffffffff81003c17>] do_syscall_64+0x67/0x1f0 > [ 243.318429] > [ 243.318434] [<ffffffff81714ec9>] return_from_SYSCALL_64+0x0/0x7a > [ 243.320884] SOFTIRQ-ON-R at: > [ 243.321860] > [ 243.321862] [<ffffffff810ef8ed>] __lock_acquire+0x2fd/0x1bb0 > [ 243.324251] > [ 243.324252] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.326601] > [ 243.326604] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 > [ 243.328966] > [ 243.328998] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.331384] > [ 243.331407] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.333978] > [ 243.334001] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.336492] > [ 243.336516] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.338926] > [ 243.338948] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.341365] > [ 243.341368] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 > [ 243.343694] > [ 243.343696] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 > [ 243.346074] > [ 243.346076] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 > [ 243.348443] > [ 243.348444] [<ffffffff8126de41>] path_openat+0xa1/0xa90 > [ 243.350753] > [ 243.350755] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.353240] > [ 243.353244] [<ffffffff8125c0ea>] 
do_sys_open+0x13a/0x200 > [ 243.355581] > [ 243.355583] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.358015] > [ 243.358019] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 > [ 243.360586] IN-RECLAIM_FS-W at: > [ 243.361628] > [ 243.361630] [<ffffffff810ef934>] __lock_acquire+0x344/0x1bb0 > [ 243.364273] > [ 243.364275] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.366710] > [ 243.366713] [<ffffffff810ea7e9>] down_write_nested+0x59/0xc0 > [ 243.369153] > [ 243.369182] [<ffffffffa02a4b2e>] xfs_ilock+0x14e/0x290 [xfs] > [ 243.371597] > [ 243.371619] [<ffffffffa02986a5>] xfs_reclaim_inode+0x135/0x340 [xfs] > [ 243.374339] > [ 243.374366] [<ffffffffa0298b7a>] xfs_reclaim_inodes_ag+0x2ca/0x4f0 [xfs] > [ 243.377009] > [ 243.377032] [<ffffffffa029af9e>] xfs_reclaim_inodes_nr+0x2e/0x40 [xfs] > [ 243.379659] > [ 243.379686] [<ffffffffa02b32c4>] xfs_fs_free_cached_objects+0x14/0x20 [xfs] > [ 243.382349] > [ 243.382352] [<ffffffff81261dbc>] super_cache_scan+0x17c/0x190 > [ 243.384907] > [ 243.384911] [<ffffffff811d375a>] shrink_slab+0x29a/0x710 > [ 243.387690] > [ 243.387693] [<ffffffff811d876d>] shrink_node+0x23d/0x320 > [ 243.390148] > [ 243.390150] [<ffffffff811d9e24>] kswapd+0x354/0xa10 > [ 243.392517] > [ 243.392520] [<ffffffff810b5caa>] kthread+0x10a/0x140 > [ 243.394851] > [ 243.394853] [<ffffffff81715081>] ret_from_fork+0x31/0x40 > [ 243.397246] INITIAL USE at: > [ 243.398227] > [ 243.398229] [<ffffffff810ef960>] __lock_acquire+0x370/0x1bb0 > [ 243.400646] > [ 243.400648] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.402997] > [ 243.402999] [<ffffffff810ea672>] down_read_nested+0x52/0xb0 > [ 243.405351] > [ 243.405397] [<ffffffffa02a4af4>] xfs_ilock+0x114/0x290 [xfs] > [ 243.407778] > [ 243.407799] [<ffffffffa02a4c9b>] xfs_ilock_data_map_shared+0x2b/0x30 [xfs] > [ 243.410364] > [ 243.410390] [<ffffffffa02559f4>] xfs_dir_lookup+0xd4/0x1c0 [xfs] > [ 243.412989] > [ 243.413011] [<ffffffffa02a62ff>] xfs_lookup+0x7f/0x250 [xfs] > [ 243.415416] > [ 
243.415437] [<ffffffffa02a1fcb>] xfs_vn_lookup+0x6b/0xb0 [xfs] > [ 243.417871] > [ 243.417874] [<ffffffff8126902e>] lookup_slow+0x12e/0x220 > [ 243.420641] > [ 243.420644] [<ffffffff8126d2c6>] walk_component+0x1a6/0x2b0 > [ 243.423039] > [ 243.423041] [<ffffffff8126d55c>] link_path_walk+0x18c/0x580 > [ 243.425553] > [ 243.425555] [<ffffffff8126de41>] path_openat+0xa1/0xa90 > [ 243.427891] > [ 243.427892] [<ffffffff8126f9ec>] do_filp_open+0x8c/0x100 > [ 243.430249] > [ 243.430251] [<ffffffff8125c0ea>] do_sys_open+0x13a/0x200 > [ 243.432586] > [ 243.432588] [<ffffffff8125c1c9>] SyS_open+0x19/0x20 > [ 243.434839] > [ 243.434843] [<ffffffff81714e01>] entry_SYSCALL_64_fastpath+0x1f/0xc2 > [ 243.437343] } > [ 243.438115] ... key at: [<ffffffffa031dfcc>] xfs_dir_ilock_class+0x0/0xfffffffffffc3f6e [xfs] > [ 243.440082] ... acquired at: > [ 243.441047] > [ 243.441049] [<ffffffff810ee7ea>] check_irq_usage+0x4a/0xb0 > [ 243.443169] > [ 243.443171] [<ffffffff810f0954>] __lock_acquire+0x1364/0x1bb0 > [ 243.445366] > [ 243.445368] [<ffffffff810f1840>] lock_acquire+0xe0/0x2a0 > [ 243.447471] > [ 243.447474] [<ffffffff8108de18>] get_online_cpus+0x58/0x80 > [ 243.449601] > [ 243.449604] [<ffffffff811bf3a7>] drain_all_pages.part.80+0x27/0x320 > [ 243.452123] > [ 243.452125] [<ffffffff811c2039>] drain_all_pages+0x19/0x20 > [ 243.454264] > [ 243.454266] [<ffffffff811c4854>] __alloc_pages_nodemask+0x784/0x1630 > [ 243.456596] > [ 243.456599] [<ffffffff8122e1bf>] cache_grow_begin+0xcf/0x630 > [ 243.458774] > [ 243.458776] [<ffffffff8122eb45>] fallback_alloc+0x1e5/0x290 > [ 243.460952] > [ 243.460955] [<ffffffff8122e955>] ____cache_alloc_node+0x235/0x240 > [ 243.463199] > [ 243.463201] [<ffffffff8122f30c>] kmem_cache_alloc+0x26c/0x3e0 > [ 243.465482] > [ 243.465510] [<ffffffffa02b9211>] kmem_zone_alloc+0x91/0x120 [xfs] > [ 243.467754] > [ 243.467774] [<ffffffffa024e2f5>] xfs_da_state_alloc+0x15/0x20 [xfs] > [ 243.470083] > [ 243.470101] [<ffffffffa025f333>] 
__alloc_fd+0xf2/0x210 > [ 243.786378] do_sys_open+0x13a/0x200 > [ 243.787361] SyS_open+0x19/0x20 > [ 243.788252] do_syscall_64+0x67/0x1f0 > [ 243.789228] entry_SYSCALL64_slow_path+0x25/0x25 > [ 243.790347] RIP: 0033:0x7fcf8dda06c7 > [ 243.791299] RSP: 002b:00007ffd883327b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 > [ 243.792895] RAX: ffffffffffffffda RBX: 00007ffd883328a8 RCX: 00007fcf8dda06c7 > [ 243.794424] RDX: 00007fcf8dfa9148 RSI: 0000000000080000 RDI: 00007fcf8dfa6b08 > [ 243.795949] RBP: 00007ffd88332810 R08: 00007ffd88332890 R09: 0000000000000000 > [ 243.797480] R10: 00007fcf8dfa6b08 R11: 0000000000000246 R12: 0000000000000000 > [ 243.799002] R13: 0000000000000000 R14: 0000000000000000 R15: 00007ffd88332890 > [ 253.543441] awk invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null), order=0, oom_score_adj=0 > [ 253.546121] awk cpuset=/ mems_allowed=0 > [ 253.547233] CPU: 3 PID: 8767 Comm: awk Not tainted 4.10.0-rc6-next-20170202 #46 -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-03 14:55 ` Michal Hocko @ 2017-02-05 10:43 ` Tetsuo Handa 0 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-05 10:43 UTC (permalink / raw) To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, peterz Michal Hocko wrote: > [CC Petr] > > On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > [...] > > (2) I got a lockdep warning. (A new false positive?) > > Yes, I suspect this is a false positive. I do not see how we can > deadlock. __alloc_pages_direct_reclaim calls drain_all_pages(NULL) which > means that a potential recursion to the page allocator during draining > would just bail out on the trylock. Maybe I am misinterpreting the > report though. > I got the same warning with ext4. Maybe we need to check this carefully. [ 511.215743] ===================================================== [ 511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected [ 511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted [ 511.221689] ----------------------------------------------------- [ 511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: [ 511.225533] (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80 [ 511.227795] [ 511.227795] and this task is already holding: [ 511.230082] (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590 [ 511.232592] which would create a new lock dependency: [ 511.234192] (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++} [ 511.235966] [ 511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock: [ 511.238563] (jbd2_handle){++++-.} [ 511.238564] [ 511.238564] ... 
which became RECLAIM_FS-irq-safe at: [ 511.242078] [ 511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640 [ 511.244492] [ 511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.246694] [ 511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0 [ 511.249323] [ 511.249328] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90 [ 511.252069] [ 511.252074] [<ffffffff813592d6>] ext4_evict_inode+0x356/0x760 [ 511.254753] [ 511.254757] [<ffffffff812c9f61>] evict+0xd1/0x1a0 [ 511.257062] [ 511.257065] [<ffffffff812ca07d>] dispose_list+0x4d/0x80 [ 511.259531] [ 511.259535] [<ffffffff812cb3da>] prune_icache_sb+0x5a/0x80 [ 511.261953] [ 511.261957] [<ffffffff812acf41>] super_cache_scan+0x141/0x190 [ 511.264540] [ 511.264545] [<ffffffff812102ef>] shrink_slab+0x29f/0x6d0 [ 511.267165] [ 511.267171] [<ffffffff812154aa>] shrink_node+0x2fa/0x310 [ 511.269455] [ 511.269459] [<ffffffff812169d2>] kswapd+0x362/0x9b0 [ 511.271831] [ 511.271834] [<ffffffff810ca72f>] kthread+0x10f/0x150 [ 511.274031] [ 511.274035] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.276216] [ 511.276216] to a RECLAIM_FS-irq-unsafe lock: [ 511.278128] (cpu_hotplug.dep_map){++++++} [ 511.278130] [ 511.278130] ... which became RECLAIM_FS-irq-unsafe at: [ 511.281809] ... 
[ 511.281811] [ 511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90 [ 511.284852] [ 511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0 [ 511.287215] [ 511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0 [ 511.289751] [ 511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0 [ 511.292326] [ 511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90 [ 511.295025] [ 511.295030] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0 [ 511.299245] [ 511.299253] [<ffffffff810a2b57>] cpuhp_up_callbacks+0x37/0xb0 [ 511.301889] [ 511.301894] [<ffffffff810a37b9>] _cpu_up+0x89/0xf0 [ 511.304270] [ 511.304275] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0 [ 511.306428] [ 511.306431] [<ffffffff810a38e3>] cpu_up+0x13/0x20 [ 511.308533] [ 511.308535] [<ffffffff821eeee3>] smp_init+0x6b/0xcc [ 511.310710] [ 511.310713] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac [ 511.313232] [ 511.313235] [<ffffffff81841b3e>] kernel_init+0xe/0x110 [ 511.315616] [ 511.315620] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.317867] [ 511.317867] other info that might help us debug this: [ 511.317867] [ 511.320920] Possible interrupt unsafe locking scenario: [ 511.320920] [ 511.323218] CPU0 CPU1 [ 511.324622] ---- ---- [ 511.325973] lock(cpu_hotplug.dep_map); [ 511.327246] local_irq_disable(); [ 511.328870] lock(jbd2_handle); [ 511.330483] lock(cpu_hotplug.dep_map); [ 511.332259] <Interrupt> [ 511.333187] lock(jbd2_handle); [ 511.334304] [ 511.334304] *** DEADLOCK *** [ 511.334304] [ 511.336749] 4 locks held by a.out/49302: [ 511.338129] #0: (sb_writers#8){.+.+.+}, at: [<ffffffff812d11d4>] mnt_want_write+0x24/0x50 [ 511.340768] #1: (&type->i_mutex_dir_key#3){++++++}, at: [<ffffffff812ba06b>] path_openat+0x60b/0xd50 [ 511.343744] #2: (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590 [ 511.345743] #3: (pcpu_drain_mutex){+.+...}, at: [<ffffffff811fc96f>] drain_all_pages.part.89+0x1f/0x2c0 [ 511.348605] [ 
511.348605] the dependencies between RECLAIM_FS-irq-safe lock and the holding lock: [ 511.351336] -> (jbd2_handle){++++-.} ops: 203220 { [ 511.352768] HARDIRQ-ON-W at: [ 511.353827] [ 511.353833] [<ffffffff8110906e>] __lock_acquire+0x9de/0x1640 [ 511.356489] [ 511.356492] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.359063] [ 511.359067] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0 [ 511.361905] [ 511.361908] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90 [ 511.364560] [ 511.364563] [<ffffffff8134bec7>] ext4_sync_file+0x2e7/0x5e0 [ 511.367362] [ 511.367367] [<ffffffff812e74ad>] vfs_fsync_range+0x3d/0xb0 [ 511.369950] [ 511.369953] [<ffffffff812e757d>] do_fsync+0x3d/0x70 [ 511.372400] [ 511.372402] [<ffffffff812e7840>] SyS_fsync+0x10/0x20 [ 511.374821] [ 511.374824] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200 [ 511.377422] [ 511.377425] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a [ 511.380273] HARDIRQ-ON-R at: [ 511.381791] [ 511.381815] [<ffffffff8110896d>] __lock_acquire+0x2dd/0x1640 [ 511.384693] [ 511.384697] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.387195] [ 511.387198] [<ffffffff813a8c65>] start_this_handle+0x225/0x590 [ 511.389888] [ 511.389891] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340 [ 511.392522] [ 511.392525] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240 [ 511.395341] [ 511.395344] [<ffffffff8134af58>] ext4_file_open+0x188/0x230 [ 511.397886] [ 511.397889] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340 [ 511.400727] [ 511.400730] [<ffffffff812a6922>] vfs_open+0x52/0x80 [ 511.403297] [ 511.403301] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50 [ 511.405752] [ 511.405755] [<ffffffff812bba51>] do_filp_open+0x91/0x100 [ 511.408229] [ 511.408231] [<ffffffff812a6d44>] do_sys_open+0x124/0x210 [ 511.410820] [ 511.410822] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20 [ 511.413158] [ 511.413161] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 511.416074] SOFTIRQ-ON-W at: [ 511.417069] [ 
511.417073] [<ffffffff81108996>] __lock_acquire+0x306/0x1640 [ 511.419681] [ 511.419684] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.422516] [ 511.422520] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0 [ 511.425157] [ 511.425160] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90 [ 511.427862] [ 511.427865] [<ffffffff8134bec7>] ext4_sync_file+0x2e7/0x5e0 [ 511.430379] [ 511.430382] [<ffffffff812e74ad>] vfs_fsync_range+0x3d/0xb0 [ 511.433412] [ 511.433418] [<ffffffff812e757d>] do_fsync+0x3d/0x70 [ 511.436064] [ 511.436067] [<ffffffff812e7840>] SyS_fsync+0x10/0x20 [ 511.438498] [ 511.438502] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200 [ 511.441519] [ 511.441524] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a [ 511.444325] SOFTIRQ-ON-R at: [ 511.445358] [ 511.445362] [<ffffffff81108996>] __lock_acquire+0x306/0x1640 [ 511.448298] [ 511.448312] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.451096] [ 511.451100] [<ffffffff813a8c65>] start_this_handle+0x225/0x590 [ 511.453784] [ 511.453786] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340 [ 511.456659] [ 511.456664] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240 [ 511.459638] [ 511.459643] [<ffffffff8134af58>] ext4_file_open+0x188/0x230 [ 511.462384] [ 511.462389] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340 [ 511.465550] [ 511.465558] [<ffffffff812a6922>] vfs_open+0x52/0x80 [ 511.468141] [ 511.468145] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50 [ 511.470816] [ 511.470819] [<ffffffff812bba51>] do_filp_open+0x91/0x100 [ 511.473441] [ 511.473443] [<ffffffff812a6d44>] do_sys_open+0x124/0x210 [ 511.476079] [ 511.476081] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20 [ 511.478584] [ 511.478587] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 511.481394] IN-RECLAIM_FS-W at: [ 511.482680] [ 511.482691] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640 [ 511.485262] [ 511.485264] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.487862] [ 511.487865] [<ffffffff813b3525>] 
jbd2_log_wait_commit+0x55/0x1d0 [ 511.490707] [ 511.490710] [<ffffffff813b59b1>] jbd2_complete_transaction+0x71/0x90 [ 511.493524] [ 511.493527] [<ffffffff813592d6>] ext4_evict_inode+0x356/0x760 [ 511.496251] [ 511.496255] [<ffffffff812c9f61>] evict+0xd1/0x1a0 [ 511.498817] [ 511.498821] [<ffffffff812ca07d>] dispose_list+0x4d/0x80 [ 511.501361] [ 511.501364] [<ffffffff812cb3da>] prune_icache_sb+0x5a/0x80 [ 511.504069] [ 511.504072] [<ffffffff812acf41>] super_cache_scan+0x141/0x190 [ 511.506890] [ 511.506895] [<ffffffff812102ef>] shrink_slab+0x29f/0x6d0 [ 511.509465] [ 511.509467] [<ffffffff812154aa>] shrink_node+0x2fa/0x310 [ 511.512228] [ 511.512233] [<ffffffff812169d2>] kswapd+0x362/0x9b0 [ 511.514724] [ 511.514728] [<ffffffff810ca72f>] kthread+0x10f/0x150 [ 511.517264] [ 511.517269] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.519827] INITIAL USE at: [ 511.520829] [ 511.520833] [<ffffffff811089ff>] __lock_acquire+0x36f/0x1640 [ 511.523377] [ 511.523380] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.525781] [ 511.525784] [<ffffffff813a8c65>] start_this_handle+0x225/0x590 [ 511.528372] [ 511.528375] [<ffffffff813a9639>] jbd2__journal_start+0xe9/0x340 [ 511.531138] [ 511.531141] [<ffffffff8138adaa>] __ext4_journal_start_sb+0x9a/0x240 [ 511.533905] [ 511.533908] [<ffffffff8134af58>] ext4_file_open+0x188/0x230 [ 511.536467] [ 511.536471] [<ffffffff812a53cb>] do_dentry_open+0x22b/0x340 [ 511.538990] [ 511.538992] [<ffffffff812a6922>] vfs_open+0x52/0x80 [ 511.541457] [ 511.541461] [<ffffffff812b9f02>] path_openat+0x4a2/0xd50 [ 511.544036] [ 511.544039] [<ffffffff812bba51>] do_filp_open+0x91/0x100 [ 511.546642] [ 511.546644] [<ffffffff812a6d44>] do_sys_open+0x124/0x210 [ 511.549354] [ 511.549370] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20 [ 511.551781] [ 511.551784] [<ffffffff81852f41>] entry_SYSCALL_64_fastpath+0x1f/0xc2 [ 511.554410] } [ 511.555145] ... key at: [<ffffffff8335b518>] jbd2_trans_commit_key.48870+0x0/0x8 [ 511.557051] ... 
acquired at: [ 511.558047] [ 511.558050] [<ffffffff81107d0a>] check_irq_usage+0x4a/0xb0 [ 511.560268] [ 511.560270] [<ffffffff8110950b>] __lock_acquire+0xe7b/0x1640 [ 511.562536] [ 511.562538] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.564779] [ 511.564783] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80 [ 511.567230] [ 511.567234] [<ffffffff811fc97c>] drain_all_pages.part.89+0x2c/0x2c0 [ 511.569585] [ 511.569588] [<ffffffff812a1cfb>] __alloc_pages_slowpath+0x509/0xe36 [ 511.572289] [ 511.572292] [<ffffffff812018a2>] __alloc_pages_nodemask+0x382/0x3d0 [ 511.574744] [ 511.574747] [<ffffffff81265077>] alloc_pages_current+0x97/0x1b0 [ 511.577103] [ 511.577106] [<ffffffff811f22fd>] __page_cache_alloc+0x15d/0x1a0 [ 511.579483] [ 511.579486] [<ffffffff811f494a>] pagecache_get_page+0x5a/0x2b0 [ 511.581935] [ 511.581940] [<ffffffff812eca32>] __getblk_gfp+0x112/0x390 [ 511.584220] [ 511.584223] [<ffffffff813514ca>] __ext4_get_inode_loc+0x10a/0x560 [ 511.586627] [ 511.586630] [<ffffffff81353e50>] ext4_get_inode_loc+0x20/0x30 [ 511.589802] [ 511.589808] [<ffffffff81355ec6>] ext4_reserve_inode_write+0x26/0x90 [ 511.592471] [ 511.592476] [<ffffffff81355fbe>] ext4_mark_inode_dirty+0x8e/0x390 [ 511.594926] [ 511.594930] [<ffffffff8138325a>] ext4_ext_tree_init+0x3a/0x40 [ 511.597306] [ 511.597308] [<ffffffff8134eaaa>] __ext4_new_inode+0x12da/0x1540 [ 511.599962] [ 511.599969] [<ffffffff81363602>] ext4_create+0xd2/0x1a0 [ 511.602484] [ 511.602489] [<ffffffff812b9903>] lookup_open+0x653/0x7b0 [ 511.604699] [ 511.604701] [<ffffffff812ba086>] path_openat+0x626/0xd50 [ 511.606890] [ 511.606893] [<ffffffff812bba51>] do_filp_open+0x91/0x100 [ 511.609097] [ 511.609099] [<ffffffff812a6d44>] do_sys_open+0x124/0x210 [ 511.611346] [ 511.611348] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20 [ 511.613431] [ 511.613434] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200 [ 511.615967] [ 511.615979] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a [ 511.618303] [ 511.619062] [ 511.619062] the 
dependencies between the lock to be acquired [ 511.619063] and RECLAIM_FS-irq-unsafe lock: [ 511.622794] -> (cpu_hotplug.dep_map){++++++} ops: 1130 { [ 511.624286] HARDIRQ-ON-W at: [ 511.625479] [ 511.625485] [<ffffffff8110906e>] __lock_acquire+0x9de/0x1640 [ 511.627957] [ 511.627959] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.630609] [ 511.630612] [<ffffffff810a3603>] cpu_hotplug_begin+0x73/0xe0 [ 511.633682] [ 511.633697] [<ffffffff810a3762>] _cpu_up+0x32/0xf0 [ 511.636022] [ 511.636024] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0 [ 511.638397] [ 511.638399] [<ffffffff810a38e3>] cpu_up+0x13/0x20 [ 511.640852] [ 511.640866] [<ffffffff821eeee3>] smp_init+0x6b/0xcc [ 511.643507] [ 511.643511] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac [ 511.646002] [ 511.646005] [<ffffffff81841b3e>] kernel_init+0xe/0x110 [ 511.648600] [ 511.648611] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.651115] HARDIRQ-ON-R at: [ 511.652080] [ 511.652084] [<ffffffff8110896d>] __lock_acquire+0x2dd/0x1640 [ 511.654554] [ 511.654557] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.656983] [ 511.656986] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80 [ 511.659442] [ 511.659445] [<ffffffff8122a55a>] kmem_cache_create+0x3a/0x2d0 [ 511.662336] [ 511.662342] [<ffffffff821fd151>] numa_policy_init+0x43/0x24a [ 511.665117] [ 511.665121] [<ffffffff821c313c>] start_kernel+0x3f6/0x4d6 [ 511.667566] [ 511.667568] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c [ 511.670245] [ 511.670247] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f [ 511.673050] [ 511.673054] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 511.675400] SOFTIRQ-ON-W at: [ 511.676405] [ 511.676408] [<ffffffff81108996>] __lock_acquire+0x306/0x1640 [ 511.679556] [ 511.679563] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.683155] [ 511.683164] [<ffffffff810a3603>] cpu_hotplug_begin+0x73/0xe0 [ 511.686224] [ 511.686231] [<ffffffff810a3762>] _cpu_up+0x32/0xf0 [ 511.689073] [ 511.689078] [<ffffffff810a38a5>] 
do_cpu_up+0x85/0xb0 [ 511.691573] [ 511.691575] [<ffffffff810a38e3>] cpu_up+0x13/0x20 [ 511.694007] [ 511.694010] [<ffffffff821eeee3>] smp_init+0x6b/0xcc [ 511.696524] [ 511.696528] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac [ 511.699401] [ 511.699405] [<ffffffff81841b3e>] kernel_init+0xe/0x110 [ 511.701956] [ 511.701959] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.704520] SOFTIRQ-ON-R at: [ 511.705530] [ 511.705534] [<ffffffff81108996>] __lock_acquire+0x306/0x1640 [ 511.708036] [ 511.708038] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.710516] [ 511.710518] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80 [ 511.713771] [ 511.713780] [<ffffffff8122a55a>] kmem_cache_create+0x3a/0x2d0 [ 511.716681] [ 511.716688] [<ffffffff821fd151>] numa_policy_init+0x43/0x24a [ 511.719450] [ 511.719455] [<ffffffff821c313c>] start_kernel+0x3f6/0x4d6 [ 511.722114] [ 511.722117] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c [ 511.724864] [ 511.724866] [<ffffffff821c2724>] x86_64_start_kernel+0x14c/0x16f [ 511.727552] [ 511.727555] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 511.729936] RECLAIM_FS-ON-W at: [ 511.731059] [ 511.731063] [<ffffffff81108141>] mark_held_locks+0x71/0x90 [ 511.733851] [ 511.733857] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0 [ 511.736601] [ 511.736604] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0 [ 511.739325] [ 511.739329] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0 [ 511.742499] [ 511.742503] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90 [ 511.745233] [ 511.745236] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0 [ 511.747909] [ 511.747911] [<ffffffff810a2b57>] cpuhp_up_callbacks+0x37/0xb0 [ 511.750604] [ 511.750606] [<ffffffff810a37b9>] _cpu_up+0x89/0xf0 [ 511.753180] [ 511.753182] [<ffffffff810a38a5>] do_cpu_up+0x85/0xb0 [ 511.755982] [ 511.755986] [<ffffffff810a38e3>] cpu_up+0x13/0x20 [ 511.758565] [ 511.758568] [<ffffffff821eeee3>] smp_init+0x6b/0xcc [ 511.761138] [ 
511.761141] [<ffffffff821c3399>] kernel_init_freeable+0x17d/0x2ac [ 511.763877] [ 511.763881] [<ffffffff81841b3e>] kernel_init+0xe/0x110 [ 511.766703] [ 511.766709] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.769522] RECLAIM_FS-ON-R at: [ 511.770730] [ 511.770735] [<ffffffff81108141>] mark_held_locks+0x71/0x90 [ 511.773324] [ 511.773327] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0 [ 511.775897] [ 511.775900] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0 [ 511.778659] [ 511.778663] [<ffffffff8100d199>] allocate_shared_regs+0x29/0x70 [ 511.781485] [ 511.781488] [<ffffffff8100d217>] intel_pmu_cpu_prepare+0x37/0x140 [ 511.784574] [ 511.784578] [<ffffffff81005410>] x86_pmu_prepare_cpu+0x40/0x50 [ 511.787169] [ 511.787172] [<ffffffff810a2239>] cpuhp_invoke_callback+0x229/0x9e0 [ 511.789906] [ 511.789909] [<ffffffff810a2e42>] cpuhp_issue_call+0xe2/0x140 [ 511.792625] [ 511.792628] [<ffffffff810a321d>] __cpuhp_setup_state+0x12d/0x190 [ 511.795441] [ 511.795446] [<ffffffff821c59b1>] init_hw_perf_events+0x402/0x5b6 [ 511.798187] [ 511.798190] [<ffffffff81002191>] do_one_initcall+0x51/0x1c0 [ 511.801133] [ 511.801139] [<ffffffff821c3371>] kernel_init_freeable+0x155/0x2ac [ 511.803812] [ 511.803816] [<ffffffff81841b3e>] kernel_init+0xe/0x110 [ 511.806381] [ 511.806385] [<ffffffff818531c1>] ret_from_fork+0x31/0x40 [ 511.808849] INITIAL USE at: [ 511.809876] [ 511.809881] [<ffffffff811089ff>] __lock_acquire+0x36f/0x1640 [ 511.812607] [ 511.812610] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.815088] [ 511.815092] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80 [ 511.817776] [ 511.817779] [<ffffffff810a3133>] __cpuhp_setup_state+0x43/0x190 [ 511.820394] [ 511.820397] [<ffffffff821f756b>] page_alloc_init+0x23/0x3a [ 511.823000] [ 511.823003] [<ffffffff821c2ee8>] start_kernel+0x1a2/0x4d6 [ 511.825495] [ 511.825497] [<ffffffff821c25d6>] x86_64_start_reservations+0x2a/0x2c [ 511.828158] [ 511.828160] [<ffffffff821c2724>] 
x86_64_start_kernel+0x14c/0x16f [ 511.830986] [ 511.830991] [<ffffffff810001c4>] verify_cpu+0x0/0xfc [ 511.833452] } [ 511.834219] ... key at: [<ffffffff81e59b08>] cpu_hotplug+0x108/0x140 [ 511.835931] ... acquired at: [ 511.836924] [ 511.836927] [<ffffffff81107d0a>] check_irq_usage+0x4a/0xb0 [ 511.839589] [ 511.839593] [<ffffffff8110950b>] __lock_acquire+0xe7b/0x1640 [ 511.842158] [ 511.842162] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 [ 511.844452] [ 511.844454] [<ffffffff810a149d>] get_online_cpus+0x5d/0x80 [ 511.846668] [ 511.846671] [<ffffffff811fc97c>] drain_all_pages.part.89+0x2c/0x2c0 [ 511.849257] [ 511.849264] [<ffffffff812a1cfb>] __alloc_pages_slowpath+0x509/0xe36 [ 511.852127] [ 511.852132] [<ffffffff812018a2>] __alloc_pages_nodemask+0x382/0x3d0 [ 511.854545] [ 511.854549] [<ffffffff81265077>] alloc_pages_current+0x97/0x1b0 [ 511.856942] [ 511.856946] [<ffffffff811f22fd>] __page_cache_alloc+0x15d/0x1a0 [ 511.859259] [ 511.859262] [<ffffffff811f494a>] pagecache_get_page+0x5a/0x2b0 [ 511.861595] [ 511.861598] [<ffffffff812eca32>] __getblk_gfp+0x112/0x390 [ 511.863893] [ 511.863897] [<ffffffff813514ca>] __ext4_get_inode_loc+0x10a/0x560 [ 511.866538] [ 511.866542] [<ffffffff81353e50>] ext4_get_inode_loc+0x20/0x30 [ 511.868929] [ 511.868932] [<ffffffff81355ec6>] ext4_reserve_inode_write+0x26/0x90 [ 511.871579] [ 511.871584] [<ffffffff81355fbe>] ext4_mark_inode_dirty+0x8e/0x390 [ 511.874088] [ 511.874092] [<ffffffff8138325a>] ext4_ext_tree_init+0x3a/0x40 [ 511.876398] [ 511.876400] [<ffffffff8134eaaa>] __ext4_new_inode+0x12da/0x1540 [ 511.878735] [ 511.878737] [<ffffffff81363602>] ext4_create+0xd2/0x1a0 [ 511.881170] [ 511.881174] [<ffffffff812b9903>] lookup_open+0x653/0x7b0 [ 511.883841] [ 511.883848] [<ffffffff812ba086>] path_openat+0x626/0xd50 [ 511.886058] [ 511.886061] [<ffffffff812bba51>] do_filp_open+0x91/0x100 [ 511.888285] [ 511.888288] [<ffffffff812a6d44>] do_sys_open+0x124/0x210 [ 511.890642] [ 511.890644] [<ffffffff812a6e4e>] SyS_open+0x1e/0x20 
[ 511.892781] [ 511.892784] [<ffffffff81003c3c>] do_syscall_64+0x6c/0x200 [ 511.895050] [ 511.895053] [<ffffffff81853009>] return_from_SYSCALL_64+0x0/0x7a [ 511.897382] [ 511.898165] [ 511.898165] stack backtrace: [ 511.900033] CPU: 0 PID: 49302 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500 [ 511.901974] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 511.904851] Call Trace: [ 511.905789] dump_stack+0x85/0xc9 [ 511.906854] check_usage+0x4ba/0x4d0 [ 511.907984] ? delayacct_end+0x56/0x60 [ 511.909136] check_irq_usage+0x4a/0xb0 [ 511.910318] __lock_acquire+0xe7b/0x1640 [ 511.911470] ? delayacct_end+0x56/0x60 [ 511.912607] lock_acquire+0xc9/0x250 [ 511.913703] ? get_online_cpus+0x37/0x80 [ 511.914888] get_online_cpus+0x5d/0x80 [ 511.916137] ? get_online_cpus+0x37/0x80 [ 511.917287] drain_all_pages.part.89+0x2c/0x2c0 [ 511.918539] __alloc_pages_slowpath+0x509/0xe36 [ 511.919889] __alloc_pages_nodemask+0x382/0x3d0 [ 511.921673] ? sched_clock_cpu+0x11/0xc0 [ 511.922919] alloc_pages_current+0x97/0x1b0 [ 511.924123] __page_cache_alloc+0x15d/0x1a0 [ 511.925252] pagecache_get_page+0x5a/0x2b0 [ 511.926392] __getblk_gfp+0x112/0x390 [ 511.927524] __ext4_get_inode_loc+0x10a/0x560 [ 511.928723] ? ext4_ext_tree_init+0x3a/0x40 [ 511.929900] ext4_get_inode_loc+0x20/0x30 [ 511.931008] ext4_reserve_inode_write+0x26/0x90 [ 511.932370] ? ext4_ext_tree_init+0x3a/0x40 [ 511.933582] ext4_mark_inode_dirty+0x8e/0x390 [ 511.934807] ext4_ext_tree_init+0x3a/0x40 [ 511.935919] __ext4_new_inode+0x12da/0x1540 [ 511.937093] ext4_create+0xd2/0x1a0 [ 511.938106] lookup_open+0x653/0x7b0 [ 511.939108] ? __wake_up+0x23/0x50 [ 511.940131] ? sched_clock+0x9/0x10 [ 511.941184] path_openat+0x626/0xd50 [ 511.942194] do_filp_open+0x91/0x100 [ 511.943164] ? _raw_spin_unlock+0x27/0x40 [ 511.944335] ? 
__alloc_fd+0xf7/0x210 [ 511.945350] do_sys_open+0x124/0x210 [ 511.946333] SyS_open+0x1e/0x20 [ 511.947189] do_syscall_64+0x6c/0x200 [ 511.948208] entry_SYSCALL64_slow_path+0x25/0x25 [ 511.949587] RIP: 0033:0x7feb6a026a10 [ 511.950555] RSP: 002b:00007ffce3579c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000002 [ 511.952261] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007feb6a026a10 [ 511.953864] RDX: 0000000000000180 RSI: 0000000000004441 RDI: 00000000006010c0 [ 511.955566] RBP: 0000000000000000 R08: 00007feb69f86938 R09: 000000000000000f [ 511.957231] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040083b [ 511.958864] R13: 00007ffce3579d90 R14: 0000000000000000 R15: 0000000000000000 The one below is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT? [ 257.781715] Out of memory: Kill process 5171 (a.out) score 842 or sacrifice child [ 257.784726] Killed process 5171 (a.out) total-vm:2177096kB, anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB [ 257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN) [ 257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500 [ 257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 [ 257.794700] Call Trace: [ 257.795690] dump_stack+0x85/0xc9 [ 257.797224] __alloc_pages_slowpath+0xacb/0xe36 [ 257.798612] __alloc_pages_nodemask+0x382/0x3d0 [ 257.799942] alloc_pages_current+0x97/0x1b0 [ 257.801236] __get_free_pages+0x14/0x50 [ 257.802546] __tlb_remove_page_size+0x70/0xd0 [ 257.803810] unmap_page_range+0x74b/0xa80 [ 257.804992] unmap_single_vma+0x81/0xf0 [ 257.806131] unmap_vmas+0x41/0x60 [ 257.807179] exit_mmap+0x97/0x150 [ 257.808282] ? __khugepaged_exit+0xe5/0x130 [ 257.809594] mmput+0x80/0x150 [ 257.810566] do_exit+0x2c0/0xd70 [ 257.811609] do_group_exit+0x4c/0xc0 [ 257.813035] get_signal+0x35f/0x9b0 [ 257.814199] do_signal+0x37/0x730 [ 257.815215] ? 
mutex_unlock+0x12/0x20 [ 257.816285] ? pagefault_out_of_memory+0x75/0x80 [ 257.817872] ? mm_fault_error+0x65/0x152 [ 257.819027] ? exit_to_usermode_loop+0x26/0x92 [ 257.820277] exit_to_usermode_loop+0x51/0x92 [ 257.821480] prepare_exit_to_usermode+0x7f/0x90 [ 257.822756] retint_user+0x8/0x23 [ 257.823755] RIP: 0033:0x400780 [ 257.824717] RSP: 002b:00007ffce4497640 EFLAGS: 00010206 [ 257.826061] RAX: 000000005a1de000 RBX: 0000000080000000 RCX: 00007f11b8887650 [ 257.827774] RDX: 0000000000000000 RSI: 00007ffce4497460 RDI: 00007ffce4497460 [ 257.829770] RBP: 00007f10b89be010 R08: 00007ffce4497570 R09: 00007ffce44973b0 [ 257.831714] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000007 [ 257.833447] R13: 00007f10b89be010 R14: 0000000000000000 R15: 0000000000000000 ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-05 10:43 ` Tetsuo Handa @ 2017-02-06 10:34 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-06 10:34 UTC (permalink / raw) To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel, peterz On Sun 05-02-17 19:43:07, Tetsuo Handa wrote: [...] > Below one is also a loop. Maybe we can add __GFP_NOMEMALLOC to GFP_NOWAIT ? No, GFP_NOWAIT is just too generic to use this flag. > [ 257.781715] Out of memory: Kill process 5171 (a.out) score 842 or sacrifice child > [ 257.784726] Killed process 5171 (a.out) total-vm:2177096kB, anon-rss:1476488kB, file-rss:4kB, shmem-rss:0kB > [ 257.787691] a.out(5171): TIF_MEMDIE allocation: order=0 mode=0x1000200(GFP_NOWAIT|__GFP_NOWARN) > [ 257.789789] CPU: 3 PID: 5171 Comm: a.out Not tainted 4.10.0-rc6-next-20170202+ #500 > [ 257.791784] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/02/2015 > [ 257.794700] Call Trace: > [ 257.795690] dump_stack+0x85/0xc9 > [ 257.797224] __alloc_pages_slowpath+0xacb/0xe36 > [ 257.798612] __alloc_pages_nodemask+0x382/0x3d0 > [ 257.799942] alloc_pages_current+0x97/0x1b0 > [ 257.801236] __get_free_pages+0x14/0x50 > [ 257.802546] __tlb_remove_page_size+0x70/0xd0 This is bound to MAX_GATHER_BATCH_COUNT which shouldn't be a lot of pages (20 or so). We could add __GFP_NOMEMALLOC into tlb_next_batch but I am not entirely convinced it is really necessary. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-05 10:43 ` Tetsuo Handa @ 2017-02-06 10:39 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-06 10:39 UTC (permalink / raw) To: Tetsuo Handa, peterz; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel On Sun 05-02-17 19:43:07, Tetsuo Handa wrote: > Michal Hocko wrote: > I got same warning with ext4. Maybe we need to check carefully. > > [ 511.215743] ===================================================== > [ 511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected > [ 511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted > [ 511.221689] ----------------------------------------------------- > [ 511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: > [ 511.225533] (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80 > [ 511.227795] > [ 511.227795] and this task is already holding: > [ 511.230082] (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590 > [ 511.232592] which would create a new lock dependency: > [ 511.234192] (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++} > [ 511.235966] > [ 511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock: > [ 511.238563] (jbd2_handle){++++-.} > [ 511.238564] > [ 511.238564] ... which became RECLAIM_FS-irq-safe at: > [ 511.242078] > [ 511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640 > [ 511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 > [ 511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0 [...] > [ 511.276216] to a RECLAIM_FS-irq-unsafe lock: > [ 511.278128] (cpu_hotplug.dep_map){++++++} > [ 511.278130] > [ 511.278130] ... which became RECLAIM_FS-irq-unsafe at: > [ 511.281809] ... 
> [ 511.281811] > [ 511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90 > [ 511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0 > [ 511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0 > [ 511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0 > [ 511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90 [...] > [ 511.317867] other info that might help us debug this: > [ 511.317867] > [ 511.320920] Possible interrupt unsafe locking scenario: > [ 511.320920] > [ 511.323218] CPU0 CPU1 > [ 511.324622] ---- ---- > [ 511.325973] lock(cpu_hotplug.dep_map); > [ 511.327246] local_irq_disable(); > [ 511.328870] lock(jbd2_handle); > [ 511.330483] lock(cpu_hotplug.dep_map); > [ 511.332259] <Interrupt> > [ 511.333187] lock(jbd2_handle); Peter, is there any way to tell lockdep that this is in fact reclaim safe? The direct reclaim only does the trylock and backs off, so we cannot deadlock here. Or am I misinterpreting the trace? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-06 10:39 ` Michal Hocko @ 2017-02-07 21:12 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-07 21:12 UTC (permalink / raw) To: Tetsuo Handa, peterz; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel On Mon 06-02-17 11:39:18, Michal Hocko wrote: > On Sun 05-02-17 19:43:07, Tetsuo Handa wrote: > > Michal Hocko wrote: > > I got same warning with ext4. Maybe we need to check carefully. > > > > [ 511.215743] ===================================================== > > [ 511.218003] WARNING: RECLAIM_FS-safe -> RECLAIM_FS-unsafe lock order detected > > [ 511.220031] 4.10.0-rc6-next-20170202+ #500 Not tainted > > [ 511.221689] ----------------------------------------------------- > > [ 511.223579] a.out/49302 [HC0[0]:SC0[0]:HE1:SE1] is trying to acquire: > > [ 511.225533] (cpu_hotplug.dep_map){++++++}, at: [<ffffffff810a1477>] get_online_cpus+0x37/0x80 > > [ 511.227795] > > [ 511.227795] and this task is already holding: > > [ 511.230082] (jbd2_handle){++++-.}, at: [<ffffffff813a8be7>] start_this_handle+0x1a7/0x590 > > [ 511.232592] which would create a new lock dependency: > > [ 511.234192] (jbd2_handle){++++-.} -> (cpu_hotplug.dep_map){++++++} > > [ 511.235966] > > [ 511.235966] but this new dependency connects a RECLAIM_FS-irq-safe lock: > > [ 511.238563] (jbd2_handle){++++-.} > > [ 511.238564] > > [ 511.238564] ... which became RECLAIM_FS-irq-safe at: > > [ 511.242078] > > [ 511.242084] [<ffffffff811089db>] __lock_acquire+0x34b/0x1640 > > [ 511.244495] [<ffffffff8110a119>] lock_acquire+0xc9/0x250 > > [ 511.246697] [<ffffffff813b3525>] jbd2_log_wait_commit+0x55/0x1d0 > [...] > > [ 511.276216] to a RECLAIM_FS-irq-unsafe lock: > > [ 511.278128] (cpu_hotplug.dep_map){++++++} > > [ 511.278130] > > [ 511.278130] ... which became RECLAIM_FS-irq-unsafe at: > > [ 511.281809] ... 
> > [ 511.281811] > > [ 511.282598] [<ffffffff81108141>] mark_held_locks+0x71/0x90 > > [ 511.284854] [<ffffffff8110ab6f>] lockdep_trace_alloc+0x6f/0xd0 > > [ 511.287218] [<ffffffff812744c8>] kmem_cache_alloc_node_trace+0x48/0x3b0 > > [ 511.289755] [<ffffffff810cfa65>] __smpboot_create_thread.part.2+0x35/0xf0 > > [ 511.292329] [<ffffffff810d0026>] smpboot_create_threads+0x66/0x90 > [...] > > [ 511.317867] other info that might help us debug this: > > [ 511.317867] > > [ 511.320920] Possible interrupt unsafe locking scenario: > > [ 511.320920] > > [ 511.323218] CPU0 CPU1 > > [ 511.324622] ---- ---- > > [ 511.325973] lock(cpu_hotplug.dep_map); > > [ 511.327246] local_irq_disable(); > > [ 511.328870] lock(jbd2_handle); > > [ 511.330483] lock(cpu_hotplug.dep_map); > > [ 511.332259] <Interrupt> > > [ 511.333187] lock(jbd2_handle); > > Peter, is there any way how to tell the lockdep that this is in fact > reclaim safe? The direct reclaim only does the trylock and backs off so > we cannot deadlock here. > > Or am I misinterpreting the trace? This is moot - http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-07 21:12 ` Michal Hocko @ 2017-02-08 9:24 ` Peter Zijlstra -1 siblings, 0 replies; 110+ messages in thread From: Peter Zijlstra @ 2017-02-08 9:24 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, hch, mgorman, viro, linux-mm, hannes, linux-kernel On Tue, Feb 07, 2017 at 10:12:12PM +0100, Michal Hocko wrote: > This is moot - http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org Thanks! I was just about to go stare at it in more detail. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-03 10:57 ` Tetsuo Handa @ 2017-02-21 9:40 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-21 9:40 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel On Fri 03-02-17 19:57:39, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Mon 30-01-17 09:55:46, Michal Hocko wrote: > > > On Sun 29-01-17 00:27:27, Tetsuo Handa wrote: > > [...] > > > > Regarding [1], it helped avoiding the too_many_isolated() issue. I can't > > > > tell whether it has any negative effect, but I got on the first trial that > > > > all allocating threads are blocked on wait_for_completion() from flush_work() > > > > in drain_all_pages() introduced by "mm, page_alloc: drain per-cpu pages from > > > > workqueue context". There was no warn_alloc() stall warning message afterwords. > > > > > > That patch is buggy and there is a follow up [1] which is not sitting in the > > > mmotm (and thus linux-next) yet. I didn't get to review it properly and > > > I cannot say I would be too happy about using WQ from the page > > > allocator. I believe even the follow up needs to have WQ_RECLAIM WQ. > > > > > > [1] http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net > > > > Did you get chance to test with this follow up patch? It would be > > interesting to see whether OOM situation can still starve the waiter. > > The current linux-next should contain this patch. > > So far I can't reproduce problems except two listed below (cond_resched() trap > in printk() and IDLE priority trap are excluded from the list). OK, so it seems that all the distractions are handled now and linux-next should provide a reasonable base for testing. You said you weren't able to reproduce the original long stalls on too_many_isolated(). 
I would still be interested to see those oom reports and any potential anomalies in the isolated counts before I send the patch for inclusion, so your further testing would be more than appreciated. Stalls longer than 10s without any previous occurrences would also be interesting.

Thanks!
--
Michal Hocko
SUSE Labs

^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-21 9:40 ` Michal Hocko @ 2017-02-21 14:35 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-21 14:35 UTC (permalink / raw) To: mhocko Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > OK, so it seems that all the distractions are handled now and linux-next > should provide a reasonable base for testing. You said you weren't able > to reproduce the original long stalls on too_many_isolated(). I would be > still interested to see those oom reports and potential anomalies in the > isolated counts before I send the patch for inclusion so your further > testing would be more than appreciated. Also stalls > 10s without any > previous occurrences would be interesting. I confirmed that linux-next-20170221 with kmallocwd applied can reproduce infinite too_many_isolated() loop problem. Please send your patches to linux-next. Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170221.txt.xz . ---------------------------------------- [ 1160.162013] Out of memory: Kill process 7523 (a.out) score 998 or sacrifice child [ 1160.164422] Killed process 7523 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 1160.169699] oom_reaper: reaped process 7523 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 1209.781787] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896 [ 1209.790966] MemAlloc: kswapd0(67) flags=0xa60840 switches=51139 uninterruptible [ 1209.799726] kswapd0 D10936 67 2 0x00000000 [ 1209.807326] Call Trace: [ 1209.812581] __schedule+0x336/0xe00 [ 1209.818599] schedule+0x3d/0x90 [ 1209.823907] schedule_timeout+0x26a/0x510 [ 1209.827218] ? trace_hardirqs_on+0xd/0x10 [ 1209.830535] __down_common+0xfb/0x131 [ 1209.833801] ? 
_xfs_buf_find+0x2cb/0xc10 [xfs] [ 1209.837372] __down+0x1d/0x1f [ 1209.840331] down+0x41/0x50 [ 1209.843243] xfs_buf_lock+0x64/0x370 [xfs] [ 1209.846597] _xfs_buf_find+0x2cb/0xc10 [xfs] [ 1209.850031] ? _xfs_buf_find+0xa4/0xc10 [xfs] [ 1209.853514] xfs_buf_get_map+0x2a/0x480 [xfs] [ 1209.855831] xfs_buf_read_map+0x2c/0x400 [xfs] [ 1209.857388] ? free_debug_processing+0x27d/0x2af [ 1209.859037] xfs_trans_read_buf_map+0x186/0x830 [xfs] [ 1209.860707] xfs_read_agf+0xc8/0x2b0 [xfs] [ 1209.862184] xfs_alloc_read_agf+0x7a/0x300 [xfs] [ 1209.863728] ? xfs_alloc_space_available+0x7b/0x120 [xfs] [ 1209.865385] xfs_alloc_fix_freelist+0x3bc/0x490 [xfs] [ 1209.866974] ? __radix_tree_lookup+0x84/0xf0 [ 1209.868374] ? xfs_perag_get+0x1a0/0x310 [xfs] [ 1209.869798] ? xfs_perag_get+0x5/0x310 [xfs] [ 1209.871288] xfs_alloc_vextent+0x161/0xda0 [xfs] [ 1209.872757] xfs_bmap_btalloc+0x46c/0x8b0 [xfs] [ 1209.874182] ? save_stack_trace+0x1b/0x20 [ 1209.875542] xfs_bmap_alloc+0x17/0x30 [xfs] [ 1209.876847] xfs_bmapi_write+0x74e/0x11d0 [xfs] [ 1209.878190] xfs_iomap_write_allocate+0x199/0x3a0 [xfs] [ 1209.879632] xfs_map_blocks+0x2cc/0x5a0 [xfs] [ 1209.880909] xfs_do_writepage+0x215/0x920 [xfs] [ 1209.882255] ? clear_page_dirty_for_io+0xb4/0x310 [ 1209.883598] xfs_vm_writepage+0x3b/0x70 [xfs] [ 1209.884841] pageout.isra.54+0x1a4/0x460 [ 1209.886210] shrink_page_list+0xa86/0xcf0 [ 1209.887441] shrink_inactive_list+0x1c5/0x660 [ 1209.888682] shrink_node_memcg+0x535/0x7f0 [ 1209.889975] ? mem_cgroup_iter+0x14d/0x720 [ 1209.891197] shrink_node+0xe1/0x310 [ 1209.892288] kswapd+0x362/0x9b0 [ 1209.893308] kthread+0x10f/0x150 [ 1209.894383] ? mem_cgroup_shrink_node+0x3b0/0x3b0 [ 1209.895703] ? 
kthread_create_on_node+0x70/0x70 [ 1209.896956] ret_from_fork+0x31/0x40 [ 1209.898117] MemAlloc: systemd-journal(526) flags=0x400900 switches=33248 seq=121659 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52772 uninterruptible [ 1209.902154] systemd-journal D11240 526 1 0x00000000 [ 1209.903642] Call Trace: [ 1209.904574] __schedule+0x336/0xe00 [ 1209.905734] schedule+0x3d/0x90 [ 1209.906817] schedule_timeout+0x20d/0x510 [ 1209.908025] ? prepare_to_wait+0x2b/0xc0 [ 1209.909268] ? lock_timer_base+0xa0/0xa0 [ 1209.910460] io_schedule_timeout+0x1e/0x50 [ 1209.911681] congestion_wait+0x86/0x260 [ 1209.912853] ? remove_wait_queue+0x60/0x60 [ 1209.914115] shrink_inactive_list+0x5b4/0x660 [ 1209.915385] ? __list_lru_count_one.isra.2+0x22/0x80 [ 1209.916768] shrink_node_memcg+0x535/0x7f0 [ 1209.918173] shrink_node+0xe1/0x310 [ 1209.919288] do_try_to_free_pages+0xe1/0x300 [ 1209.920548] try_to_free_pages+0x131/0x3f0 [ 1209.921827] __alloc_pages_slowpath+0x3ec/0xd95 [ 1209.923137] __alloc_pages_nodemask+0x3e4/0x460 [ 1209.924454] ? __radix_tree_lookup+0x84/0xf0 [ 1209.925790] alloc_pages_current+0x97/0x1b0 [ 1209.927021] ? find_get_entry+0x5/0x300 [ 1209.928189] __page_cache_alloc+0x15d/0x1a0 [ 1209.929471] ? pagecache_get_page+0x2c/0x2b0 [ 1209.930716] filemap_fault+0x4df/0x8b0 [ 1209.931867] ? filemap_fault+0x373/0x8b0 [ 1209.933111] ? xfs_ilock+0x22c/0x360 [xfs] [ 1209.934510] ? xfs_filemap_fault+0x64/0x1e0 [xfs] [ 1209.935857] ? down_read_nested+0x7b/0xc0 [ 1209.937123] ? xfs_ilock+0x22c/0x360 [xfs] [ 1209.938373] xfs_filemap_fault+0x6c/0x1e0 [xfs] [ 1209.939691] __do_fault+0x1e/0xa0 [ 1209.940807] ? _raw_spin_unlock+0x27/0x40 [ 1209.942002] __handle_mm_fault+0xbb1/0xf40 [ 1209.943228] ? mutex_unlock+0x12/0x20 [ 1209.944410] ? devkmsg_read+0x15c/0x330 [ 1209.945912] handle_mm_fault+0x16b/0x390 [ 1209.947297] ? 
handle_mm_fault+0x49/0x390 [ 1209.948868] __do_page_fault+0x24a/0x530 [ 1209.950351] do_page_fault+0x30/0x80 [ 1209.951615] page_fault+0x28/0x30 [ 1209.952724] RIP: 0033:0x556f398d623f [ 1209.953834] RSP: 002b:00007fff1da75710 EFLAGS: 00010206 [ 1209.955273] RAX: 0000556f3b12b9d0 RBX: 0000000000000009 RCX: 0000000000000020 [ 1209.957117] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 [ 1209.958849] RBP: 00007fff1da759b0 R08: 0000000000000000 R09: 0000000000000000 [ 1209.960659] R10: 00000000ffffffc0 R11: 00007fdc0df4ef10 R12: 00007fff1da75f30 [ 1209.962397] R13: 00007fff1da78810 R14: 0000000000000009 R15: 0000000000000006 [ 1209.964204] MemAlloc: auditd(563) flags=0x400900 switches=6443 seq=774 gfp=0x142134a(GFP_NOFS|__GFP_HIGHMEM|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL|__GFP_MOVABLE) order=0 delay=16511 uninterruptible [ 1209.969005] auditd D12280 563 1 0x00000000 [ 1209.970503] Call Trace: [ 1209.971436] __schedule+0x336/0xe00 [ 1209.972621] schedule+0x3d/0x90 [ 1209.973696] schedule_timeout+0x20d/0x510 [ 1209.974910] ? prepare_to_wait+0x2b/0xc0 [ 1209.976155] ? lock_timer_base+0xa0/0xa0 [ 1209.977350] io_schedule_timeout+0x1e/0x50 [ 1209.978597] congestion_wait+0x86/0x260 [ 1209.979795] ? remove_wait_queue+0x60/0x60 [ 1209.981020] shrink_inactive_list+0x5b4/0x660 [ 1209.982290] ? __list_lru_count_one.isra.2+0x22/0x80 [ 1209.983748] shrink_node_memcg+0x535/0x7f0 [ 1209.985041] ? mem_cgroup_iter+0x14d/0x720 [ 1209.986267] shrink_node+0xe1/0x310 [ 1209.987424] do_try_to_free_pages+0xe1/0x300 [ 1209.988705] try_to_free_pages+0x131/0x3f0 [ 1209.989935] __alloc_pages_slowpath+0x3ec/0xd95 [ 1209.991274] __alloc_pages_nodemask+0x3e4/0x460 [ 1209.992601] alloc_pages_current+0x97/0x1b0 [ 1209.993845] __page_cache_alloc+0x15d/0x1a0 [ 1209.995120] __do_page_cache_readahead+0x118/0x410 [ 1209.996535] ? __do_page_cache_readahead+0x191/0x410 [ 1209.997946] filemap_fault+0x35f/0x8b0 [ 1209.999199] ? xfs_ilock+0x22c/0x360 [xfs] [ 1210.000473] ? 
xfs_filemap_fault+0x64/0x1e0 [xfs] [ 1210.001843] ? down_read_nested+0x7b/0xc0 [ 1210.003184] ? xfs_ilock+0x22c/0x360 [xfs] [ 1210.004471] xfs_filemap_fault+0x6c/0x1e0 [xfs] [ 1210.005792] __do_fault+0x1e/0xa0 [ 1210.006925] __handle_mm_fault+0xbb1/0xf40 [ 1210.008241] ? ep_poll+0x2ea/0x3b0 [ 1210.009373] handle_mm_fault+0x16b/0x390 [ 1210.010572] ? handle_mm_fault+0x49/0x390 [ 1210.011818] __do_page_fault+0x24a/0x530 [ 1210.013059] ? wake_up_q+0x80/0x80 [ 1210.014176] do_page_fault+0x30/0x80 [ 1210.015367] page_fault+0x28/0x30 [ 1210.016473] RIP: 0033:0x7fcb0c838d13 [ 1210.017635] RSP: 002b:00007ffe275b95a0 EFLAGS: 00010293 [ 1210.019120] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fcb0c838d13 [ 1210.020867] RDX: 0000000000000040 RSI: 0000559240b08d40 RDI: 0000000000000009 [ 1210.022769] RBP: 0000000000000000 R08: 00000000000cf8ba R09: 0000000000000001 [ 1210.024530] R10: 000000000000e95f R11: 0000000000000293 R12: 000055923fbe5e60 [ 1210.026308] R13: 0000000000000000 R14: 0000000000000000 R15: 000055923fbe5e60 [ 1210.028961] MemAlloc: vmtoolsd(723) flags=0x400900 switches=36213 seq=120979 gfp=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD) order=0 delay=52811 uninterruptible [ 1210.032683] vmtoolsd D11240 723 1 0x00000080 [ 1210.034316] Call Trace: [ 1210.035340] __schedule+0x336/0xe00 [ 1210.036444] schedule+0x3d/0x90 [ 1210.037462] schedule_timeout+0x20d/0x510 [ 1210.038694] ? prepare_to_wait+0x2b/0xc0 [ 1210.039849] ? lock_timer_base+0xa0/0xa0 [ 1210.041005] io_schedule_timeout+0x1e/0x50 [ 1210.042435] congestion_wait+0x86/0x260 [ 1210.043575] ? remove_wait_queue+0x60/0x60 [ 1210.044763] shrink_inactive_list+0x5b4/0x660 [ 1210.046058] ? 
__list_lru_count_one.isra.2+0x22/0x80 [ 1210.047419] shrink_node_memcg+0x535/0x7f0 [ 1210.048609] shrink_node+0xe1/0x310 [ 1210.049688] do_try_to_free_pages+0xe1/0x300 [ 1210.051183] try_to_free_pages+0x131/0x3f0 [ 1210.052421] __alloc_pages_slowpath+0x3ec/0xd95 [ 1210.053717] __alloc_pages_nodemask+0x3e4/0x460 [ 1210.055025] ? __radix_tree_lookup+0x84/0xf0 [ 1210.056264] alloc_pages_current+0x97/0x1b0 [ 1210.057466] ? find_get_entry+0x5/0x300 [ 1210.058695] __page_cache_alloc+0x15d/0x1a0 [ 1210.059894] ? pagecache_get_page+0x2c/0x2b0 [ 1210.061128] filemap_fault+0x4df/0x8b0 [ 1210.062340] ? filemap_fault+0x373/0x8b0 [ 1210.063545] ? xfs_ilock+0x22c/0x360 [xfs] [ 1210.064766] ? xfs_filemap_fault+0x64/0x1e0 [xfs] [ 1210.066135] ? down_read_nested+0x7b/0xc0 [ 1210.067405] ? xfs_ilock+0x22c/0x360 [xfs] [ 1210.068706] xfs_filemap_fault+0x6c/0x1e0 [xfs] [ 1210.070021] __do_fault+0x1e/0xa0 [ 1210.071102] __handle_mm_fault+0xbb1/0xf40 [ 1210.072296] handle_mm_fault+0x16b/0x390 [ 1210.073509] ? handle_mm_fault+0x49/0x390 [ 1210.074683] __do_page_fault+0x24a/0x530 [ 1210.075872] do_page_fault+0x30/0x80 [ 1210.076974] page_fault+0x28/0x30 [ 1210.078090] RIP: 0033:0x7f12e9fd6420 [ 1210.079193] RSP: 002b:00007ffee98ba498 EFLAGS: 00010202 [ 1210.080605] RAX: 00007f12de02e0fe RBX: 00007ffee98ba4b0 RCX: 00007ffee98ba590 [ 1210.082383] RDX: 00007f12de02e0fe RSI: 0000000000000001 RDI: 00007ffee98ba4b0 [ 1210.084177] RBP: 0000000000000080 R08: 0000000000000000 R09: 000000000000000a [ 1210.086134] R10: 00007f12eb61a010 R11: 0000000000000000 R12: 0000000000000080 [ 1210.087850] R13: 0000000000000000 R14: 00007f12ea006770 R15: 00005580adf3abc0 (...snipped...) [ 1210.640170] MemAlloc: a.out(7523) flags=0x420040 switches=90 uninterruptible dying victim [ 1210.642426] a.out D11496 7523 7376 0x00100084 [ 1210.643999] Call Trace: [ 1210.644921] __schedule+0x336/0xe00 [ 1210.646007] ? 
trace_hardirqs_on_thunk+0x1a/0x1c [ 1210.647328] schedule+0x3d/0x90 [ 1210.648441] schedule_timeout+0x26a/0x510 [ 1210.649619] ? trace_hardirqs_on+0xd/0x10 [ 1210.650792] __down_common+0xfb/0x131 [ 1210.652188] ? _xfs_buf_find+0x2cb/0xc10 [xfs] [ 1210.653480] __down+0x1d/0x1f [ 1210.654483] down+0x41/0x50 [ 1210.655462] xfs_buf_lock+0x64/0x370 [xfs] [ 1210.656618] _xfs_buf_find+0x2cb/0xc10 [xfs] [ 1210.657823] ? _xfs_buf_find+0xa4/0xc10 [xfs] [ 1210.659028] xfs_buf_get_map+0x2a/0x480 [xfs] [ 1210.660284] xfs_buf_read_map+0x2c/0x400 [xfs] [ 1210.661490] ? del_timer_sync+0xb5/0xe0 [ 1210.662630] xfs_trans_read_buf_map+0x186/0x830 [xfs] [ 1210.664009] xfs_read_agf+0xc8/0x2b0 [xfs] [ 1210.665171] xfs_alloc_read_agf+0x7a/0x300 [xfs] [ 1210.666441] ? xfs_alloc_space_available+0x7b/0x120 [xfs] [ 1210.667923] xfs_alloc_fix_freelist+0x3bc/0x490 [xfs] [ 1210.669402] ? __radix_tree_lookup+0x84/0xf0 [ 1210.670645] ? xfs_perag_get+0x1a0/0x310 [xfs] [ 1210.671949] ? xfs_perag_get+0x5/0x310 [xfs] [ 1210.673145] xfs_alloc_vextent+0x161/0xda0 [xfs] [ 1210.674402] xfs_bmap_btalloc+0x46c/0x8b0 [xfs] [ 1210.675774] ? save_stack_trace+0x1b/0x20 [ 1210.676961] xfs_bmap_alloc+0x17/0x30 [xfs] [ 1210.678202] xfs_bmapi_write+0x74e/0x11d0 [xfs] [ 1210.679544] xfs_iomap_write_allocate+0x199/0x3a0 [xfs] [ 1210.680995] xfs_map_blocks+0x2cc/0x5a0 [xfs] [ 1210.682245] xfs_do_writepage+0x215/0x920 [xfs] [ 1210.683742] ? clear_page_dirty_for_io+0xb4/0x310 [ 1210.685125] write_cache_pages+0x2cb/0x6b0 [ 1210.686408] ? xfs_map_blocks+0x5a0/0x5a0 [xfs] [ 1210.687774] ? xfs_vm_writepages+0x48/0xa0 [xfs] [ 1210.689111] xfs_vm_writepages+0x6b/0xa0 [xfs] [ 1210.690529] do_writepages+0x21/0x40 [ 1210.691680] __filemap_fdatawrite_range+0xc6/0x100 [ 1210.693021] filemap_write_and_wait_range+0x2d/0x70 [ 1210.694444] xfs_file_fsync+0x8b/0x310 [xfs] [ 1210.695728] vfs_fsync_range+0x3d/0xb0 [ 1210.696874] ? 
__do_page_fault+0x272/0x530 [ 1210.698102] do_fsync+0x3d/0x70 [ 1210.699200] SyS_fsync+0x10/0x20 [ 1210.700267] do_syscall_64+0x6c/0x200 [ 1210.701498] entry_SYSCALL64_slow_path+0x25/0x25 [ 1210.702861] RIP: 0033:0x7f504b072d30 [ 1210.704014] RSP: 002b:00007fffcb8f7898 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 1210.705994] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f504b072d30 [ 1210.707857] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003 [ 1210.709647] RBP: 0000000000000003 R08: 00007f504afcc938 R09: 000000000000000e [ 1210.711632] R10: 00007fffcb8f7620 R11: 0000000000000246 R12: 0000000000400912 [ 1210.713520] R13: 00007fffcb8f79a0 R14: 0000000000000000 R15: 0000000000000000 (...snipped...) [ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896 [ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896 (...snipped...) [ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896 [ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896 (...snipped...) [ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896 [ 1289.804580] sysrq: SysRq : Kill All Tasks ---------------------------------------- ^ permalink raw reply [flat|nested] 110+ messages in thread
__list_lru_count_one.isra.2+0x22/0x80 [ 1210.047419] shrink_node_memcg+0x535/0x7f0 [ 1210.048609] shrink_node+0xe1/0x310 [ 1210.049688] do_try_to_free_pages+0xe1/0x300 [ 1210.051183] try_to_free_pages+0x131/0x3f0 [ 1210.052421] __alloc_pages_slowpath+0x3ec/0xd95 [ 1210.053717] __alloc_pages_nodemask+0x3e4/0x460 [ 1210.055025] ? __radix_tree_lookup+0x84/0xf0 [ 1210.056264] alloc_pages_current+0x97/0x1b0 [ 1210.057466] ? find_get_entry+0x5/0x300 [ 1210.058695] __page_cache_alloc+0x15d/0x1a0 [ 1210.059894] ? pagecache_get_page+0x2c/0x2b0 [ 1210.061128] filemap_fault+0x4df/0x8b0 [ 1210.062340] ? filemap_fault+0x373/0x8b0 [ 1210.063545] ? xfs_ilock+0x22c/0x360 [xfs] [ 1210.064766] ? xfs_filemap_fault+0x64/0x1e0 [xfs] [ 1210.066135] ? down_read_nested+0x7b/0xc0 [ 1210.067405] ? xfs_ilock+0x22c/0x360 [xfs] [ 1210.068706] xfs_filemap_fault+0x6c/0x1e0 [xfs] [ 1210.070021] __do_fault+0x1e/0xa0 [ 1210.071102] __handle_mm_fault+0xbb1/0xf40 [ 1210.072296] handle_mm_fault+0x16b/0x390 [ 1210.073509] ? handle_mm_fault+0x49/0x390 [ 1210.074683] __do_page_fault+0x24a/0x530 [ 1210.075872] do_page_fault+0x30/0x80 [ 1210.076974] page_fault+0x28/0x30 [ 1210.078090] RIP: 0033:0x7f12e9fd6420 [ 1210.079193] RSP: 002b:00007ffee98ba498 EFLAGS: 00010202 [ 1210.080605] RAX: 00007f12de02e0fe RBX: 00007ffee98ba4b0 RCX: 00007ffee98ba590 [ 1210.082383] RDX: 00007f12de02e0fe RSI: 0000000000000001 RDI: 00007ffee98ba4b0 [ 1210.084177] RBP: 0000000000000080 R08: 0000000000000000 R09: 000000000000000a [ 1210.086134] R10: 00007f12eb61a010 R11: 0000000000000000 R12: 0000000000000080 [ 1210.087850] R13: 0000000000000000 R14: 00007f12ea006770 R15: 00005580adf3abc0 (...snipped...) [ 1210.640170] MemAlloc: a.out(7523) flags=0x420040 switches=90 uninterruptible dying victim [ 1210.642426] a.out D11496 7523 7376 0x00100084 [ 1210.643999] Call Trace: [ 1210.644921] __schedule+0x336/0xe00 [ 1210.646007] ? 
trace_hardirqs_on_thunk+0x1a/0x1c [ 1210.647328] schedule+0x3d/0x90 [ 1210.648441] schedule_timeout+0x26a/0x510 [ 1210.649619] ? trace_hardirqs_on+0xd/0x10 [ 1210.650792] __down_common+0xfb/0x131 [ 1210.652188] ? _xfs_buf_find+0x2cb/0xc10 [xfs] [ 1210.653480] __down+0x1d/0x1f [ 1210.654483] down+0x41/0x50 [ 1210.655462] xfs_buf_lock+0x64/0x370 [xfs] [ 1210.656618] _xfs_buf_find+0x2cb/0xc10 [xfs] [ 1210.657823] ? _xfs_buf_find+0xa4/0xc10 [xfs] [ 1210.659028] xfs_buf_get_map+0x2a/0x480 [xfs] [ 1210.660284] xfs_buf_read_map+0x2c/0x400 [xfs] [ 1210.661490] ? del_timer_sync+0xb5/0xe0 [ 1210.662630] xfs_trans_read_buf_map+0x186/0x830 [xfs] [ 1210.664009] xfs_read_agf+0xc8/0x2b0 [xfs] [ 1210.665171] xfs_alloc_read_agf+0x7a/0x300 [xfs] [ 1210.666441] ? xfs_alloc_space_available+0x7b/0x120 [xfs] [ 1210.667923] xfs_alloc_fix_freelist+0x3bc/0x490 [xfs] [ 1210.669402] ? __radix_tree_lookup+0x84/0xf0 [ 1210.670645] ? xfs_perag_get+0x1a0/0x310 [xfs] [ 1210.671949] ? xfs_perag_get+0x5/0x310 [xfs] [ 1210.673145] xfs_alloc_vextent+0x161/0xda0 [xfs] [ 1210.674402] xfs_bmap_btalloc+0x46c/0x8b0 [xfs] [ 1210.675774] ? save_stack_trace+0x1b/0x20 [ 1210.676961] xfs_bmap_alloc+0x17/0x30 [xfs] [ 1210.678202] xfs_bmapi_write+0x74e/0x11d0 [xfs] [ 1210.679544] xfs_iomap_write_allocate+0x199/0x3a0 [xfs] [ 1210.680995] xfs_map_blocks+0x2cc/0x5a0 [xfs] [ 1210.682245] xfs_do_writepage+0x215/0x920 [xfs] [ 1210.683742] ? clear_page_dirty_for_io+0xb4/0x310 [ 1210.685125] write_cache_pages+0x2cb/0x6b0 [ 1210.686408] ? xfs_map_blocks+0x5a0/0x5a0 [xfs] [ 1210.687774] ? xfs_vm_writepages+0x48/0xa0 [xfs] [ 1210.689111] xfs_vm_writepages+0x6b/0xa0 [xfs] [ 1210.690529] do_writepages+0x21/0x40 [ 1210.691680] __filemap_fdatawrite_range+0xc6/0x100 [ 1210.693021] filemap_write_and_wait_range+0x2d/0x70 [ 1210.694444] xfs_file_fsync+0x8b/0x310 [xfs] [ 1210.695728] vfs_fsync_range+0x3d/0xb0 [ 1210.696874] ? 
__do_page_fault+0x272/0x530 [ 1210.698102] do_fsync+0x3d/0x70 [ 1210.699200] SyS_fsync+0x10/0x20 [ 1210.700267] do_syscall_64+0x6c/0x200 [ 1210.701498] entry_SYSCALL64_slow_path+0x25/0x25 [ 1210.702861] RIP: 0033:0x7f504b072d30 [ 1210.704014] RSP: 002b:00007fffcb8f7898 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [ 1210.705994] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f504b072d30 [ 1210.707857] RDX: 000000000000000a RSI: 0000000000000000 RDI: 0000000000000003 [ 1210.709647] RBP: 0000000000000003 R08: 00007f504afcc938 R09: 000000000000000e [ 1210.711632] R10: 00007fffcb8f7620 R11: 0000000000000246 R12: 0000000000400912 [ 1210.713520] R13: 00007fffcb8f79a0 R14: 0000000000000000 R15: 0000000000000000 (...snipped...) [ 1212.195351] MemAlloc-Info: stalling=32 dying=1 exiting=0 victim=1 oom_count=45896 [ 1242.551629] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896 (...snipped...) [ 1245.149165] MemAlloc-Info: stalling=36 dying=1 exiting=0 victim=1 oom_count=45896 [ 1275.319189] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896 (...snipped...) [ 1278.241813] MemAlloc-Info: stalling=40 dying=1 exiting=0 victim=1 oom_count=45896 [ 1289.804580] sysrq: SysRq : Kill All Tasks ---------------------------------------- ^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-21 14:35 ` Tetsuo Handa @ 2017-02-21 15:53 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-21 15:53 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel On Tue 21-02-17 23:35:07, Tetsuo Handa wrote: > Michal Hocko wrote: > > OK, so it seems that all the distractions are handled now and linux-next > > should provide a reasonable base for testing. You said you weren't able > > to reproduce the original long stalls on too_many_isolated(). I would be > > still interested to see those oom reports and potential anomalies in the > > isolated counts before I send the patch for inclusion so your further > > testing would be more than appreciated. Also stalls > 10s without any > > previous occurrences would be interesting. > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce > infinite too_many_isolated() loop problem. Please send your patches to linux-next. So I assume that you didn't see the lockup with the patch applied and the OOM killer has resolved the situation by killing other tasks, right? Can I assume your Tested-by? Thanks for your testing! -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-21 15:53 ` Michal Hocko @ 2017-02-22 2:02 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-22 2:02 UTC (permalink / raw) To: mhocko Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > OK, so it seems that all the distractions are handled now and linux-next > > > should provide a reasonable base for testing. You said you weren't able > > > to reproduce the original long stalls on too_many_isolated(). I would be > > > still interested to see those oom reports and potential anomalies in the > > > isolated counts before I send the patch for inclusion so your further > > > testing would be more than appreciated. Also stalls > 10s without any > > > previous occurrences would be interesting. > > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce > > infinite too_many_isolated() loop problem. Please send your patches to linux-next. > > So I assume that you didn't see the lockup with the patch applied and > the OOM killer has resolved the situation by killing other tasks, right? > Can I assume your Tested-by? No. I tested linux-next-20170221, which does not include your patch. I didn't test linux-next-20170221 with your patch applied. Your patch will avoid the infinite too_many_isolated() loop problem in shrink_inactive_list(). But different workloads need to be tested by other people. Thus, I suggest you send your patches to linux-next without my testing. > > Thanks for your testing! > -- > Michal Hocko > SUSE Labs > ^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-22 2:02 ` Tetsuo Handa @ 2017-02-22 7:54 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-02-22 7:54 UTC (permalink / raw) To: Tetsuo Handa Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel On Wed 22-02-17 11:02:21, Tetsuo Handa wrote: > Michal Hocko wrote: > > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote: > > > Michal Hocko wrote: > > > > OK, so it seems that all the distractions are handled now and linux-next > > > > should provide a reasonable base for testing. You said you weren't able > > > > to reproduce the original long stalls on too_many_isolated(). I would be > > > > still interested to see those oom reports and potential anomalies in the > > > > isolated counts before I send the patch for inclusion so your further > > > > testing would be more than appreciated. Also stalls > 10s without any > > > > previous occurrences would be interesting. > > > > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce > > > infinite too_many_isolated() loop problem. Please send your patches to linux-next. > > > > So I assume that you didn't see the lockup with the patch applied and > > the OOM killer has resolved the situation by killing other tasks, right? > > Can I assume your Tested-by? > > No. I tested linux-next-20170221 which does not include your patch. > I didn't test linux-next-20170221 with your patch applied. Your patch will > avoid infinite too_many_isolated() loop problem in shrink_inactive_list(). > But we need to test different workloads by other people. Thus, I suggest > you to send your patches to linux-next without my testing. I will send the patch to Andrew later after merge window closes. It would be really helpful, though, to see how it handles your workload which is known to reproduce the oom starvation. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-02-22 7:54 ` Michal Hocko @ 2017-02-26 6:30 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-02-26 6:30 UTC (permalink / raw) To: mhocko Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > On Wed 22-02-17 11:02:21, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote: > > > > Michal Hocko wrote: > > > > > OK, so it seems that all the distractions are handled now and linux-next > > > > > should provide a reasonable base for testing. You said you weren't able > > > > > to reproduce the original long stalls on too_many_isolated(). I would be > > > > > still interested to see those oom reports and potential anomalies in the > > > > > isolated counts before I send the patch for inclusion so your further > > > > > testing would be more than appreciated. Also stalls > 10s without any > > > > > previous occurrences would be interesting. > > > > > > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce > > > > infinite too_many_isolated() loop problem. Please send your patches to linux-next. > > > > > > So I assume that you didn't see the lockup with the patch applied and > > > the OOM killer has resolved the situation by killing other tasks, right? > > > Can I assume your Tested-by? > > > > No. I tested linux-next-20170221 which does not include your patch. > > I didn't test linux-next-20170221 with your patch applied. Your patch will > > avoid infinite too_many_isolated() loop problem in shrink_inactive_list(). > > But we need to test different workloads by other people. Thus, I suggest > > you to send your patches to linux-next without my testing. > > I will send the patch to Andrew later after merge window closes. It > would be really helpful, though, to see how it handles your workload > which is known to reproduce the oom starvation. 
I tested http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz on top of linux-next-20170221 with kmallocwd applied. I did not hit too_many_isolated() loop problem. But I hit an "unable to invoke the OOM killer due to !__GFP_FS allocation" lockup problem shown below. Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170226.txt.xz . ---------- [ 444.281177] Killed process 9477 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 444.287046] oom_reaper: reaped process 9477 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 484.810225] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 38s! [ 484.812907] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 41s! [ 484.815546] Showing busy workqueues and worker pools: [ 484.817595] workqueue events: flags=0x0 [ 484.819456] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256 [ 484.821666] pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx] [ 484.824356] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256 [ 484.826582] pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000] [ 484.829091] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 484.831325] in-flight: 7418:rht_deferred_worker [ 484.833336] pending: rht_deferred_worker [ 484.835346] workqueue events_long: flags=0x0 [ 484.837343] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 484.839566] pending: gc_worker [nf_conntrack] [ 484.841691] workqueue events_power_efficient: flags=0x80 [ 484.843873] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 484.846103] pending: fb_flashcursor [ 484.847928] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 484.850149] pending: neigh_periodic_work, neigh_periodic_work [ 484.852403] workqueue events_freezable_power_: flags=0x84 [ 484.854534] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.856666] in-flight: 27:disk_events_workfn [ 484.858621] workqueue writeback: flags=0x4e [ 484.860347] pwq 256: cpus=0-127 
flags=0x4 nice=0 active=2/256 [ 484.862415] in-flight: 8444:wb_workfn wb_workfn [ 484.864602] workqueue vmstat: flags=0xc [ 484.866291] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.868307] pending: vmstat_update [ 484.869876] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 484.871864] pending: vmstat_update [ 484.874058] workqueue mpt_poll_0: flags=0x8 [ 484.875698] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.877602] pending: mpt_fault_reset_work [mptbase] [ 484.879502] workqueue xfs-buf/sda1: flags=0xc [ 484.881148] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 [ 484.883011] pending: xfs_buf_ioend_work [xfs] [ 484.884706] workqueue xfs-data/sda1: flags=0xc [ 484.886367] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY [ 484.888410] in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs] [ 484.893483] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 484.902636] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=21/256 MAYDAY [ 484.904848] in-flight: 535:xfs_end_io [xfs], 7416:xfs_end_io [xfs], 7415:xfs_end_io [xfs], 65:xfs_end_io [xfs] [ 484.907863] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 484.916767] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 MAYDAY [ 484.919024] in-flight: 5357:xfs_end_io [xfs], 193:xfs_end_io [xfs], 52:xfs_end_io 
[xfs], 5358:xfs_end_io [xfs] [ 484.922143] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 484.924291] in-flight: 2486:xfs_end_io [xfs] [ 484.926248] workqueue xfs-reclaim/sda1: flags=0xc [ 484.928216] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.930362] pending: xfs_reclaim_worker [xfs] [ 484.932312] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3 6387 [ 484.934766] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=38s workers=6 manager: 19 [ 484.937206] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=41s workers=6 manager: 157 [ 484.939629] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=41s workers=4 manager: 10499 [ 484.942303] pool 256: cpus=0-127 flags=0x4 nice=0 hung=38s workers=3 idle: 425 426 [ 518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307 (...snipped...) [ 518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible [ 518.902095] kswapd0 D10776 69 2 0x00000000 [ 518.903784] Call Trace: [ 518.904849] __schedule+0x336/0xe00 [ 518.906118] schedule+0x3d/0x90 [ 518.907314] io_schedule+0x16/0x40 [ 518.908622] __xfs_iflock+0x129/0x140 [xfs] [ 518.910027] ? autoremove_wake_function+0x60/0x60 [ 518.911559] xfs_reclaim_inode+0x162/0x440 [xfs] [ 518.913068] xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs] [ 518.914611] ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs] [ 518.916148] ? trace_hardirqs_on+0xd/0x10 [ 518.917465] ? try_to_wake_up+0x59/0x7a0 [ 518.918758] ? wake_up_process+0x15/0x20 [ 518.920067] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [ 518.921560] xfs_fs_free_cached_objects+0x19/0x20 [xfs] [ 518.923114] super_cache_scan+0x181/0x190 [ 518.924435] shrink_slab+0x29f/0x6d0 [ 518.925683] shrink_node+0x2fa/0x310 [ 518.926909] kswapd+0x362/0x9b0 [ 518.928061] kthread+0x10f/0x150 [ 518.929218] ? mem_cgroup_shrink_node+0x3b0/0x3b0 [ 518.930953] ? kthread_create_on_node+0x70/0x70 [ 518.932380] ret_from_fork+0x31/0x40 (...snipped...) 
[ 553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507 [ 575.432697] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 129s! [ 575.435276] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 131s! [ 575.437863] Showing busy workqueues and worker pools: [ 575.439837] workqueue events: flags=0x0 [ 575.441605] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256 [ 575.443717] pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx], check_corruption [ 575.446622] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256 [ 575.448763] pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000] [ 575.451173] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 575.453323] in-flight: 7418:rht_deferred_worker [ 575.455243] pending: rht_deferred_worker [ 575.457100] workqueue events_long: flags=0x0 [ 575.458960] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 575.461099] pending: gc_worker [nf_conntrack] [ 575.463043] workqueue events_power_efficient: flags=0x80 [ 575.465110] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 575.467252] pending: fb_flashcursor [ 575.468966] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 575.471109] pending: neigh_periodic_work, neigh_periodic_work [ 575.473289] workqueue events_freezable_power_: flags=0x84 [ 575.475378] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.477526] in-flight: 27:disk_events_workfn [ 575.479489] workqueue writeback: flags=0x4e [ 575.481257] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256 [ 575.483368] in-flight: 8444:wb_workfn wb_workfn [ 575.485505] workqueue vmstat: flags=0xc [ 575.487196] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 575.489242] pending: vmstat_update [ 575.491403] workqueue mpt_poll_0: flags=0x8 [ 575.493106] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.495115] pending: mpt_fault_reset_work [mptbase] [ 575.497086] workqueue xfs-buf/sda1: flags=0xc [ 575.498764] pwq 2: 
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
@ 2017-02-26  6:30 ` Tetsuo Handa
  0 siblings, 0 replies; 110+ messages in thread
From: Tetsuo Handa @ 2017-02-26 6:30 UTC (permalink / raw)
To: mhocko
Cc: david, dchinner, hch, mgorman, viro, linux-mm, hannes, linux-kernel

Michal Hocko wrote:
> On Wed 22-02-17 11:02:21, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 21-02-17 23:35:07, Tetsuo Handa wrote:
> > > > Michal Hocko wrote:
> > > > > OK, so it seems that all the distractions are handled now and linux-next
> > > > > should provide a reasonable base for testing. You said you weren't able
> > > > > to reproduce the original long stalls on too_many_isolated(). I would be
> > > > > still interested to see those oom reports and potential anomalies in the
> > > > > isolated counts before I send the patch for inclusion so your further
> > > > > testing would be more than appreciated. Also stalls > 10s without any
> > > > > previous occurrences would be interesting.
> > > >
> > > > I confirmed that linux-next-20170221 with kmallocwd applied can reproduce
> > > > infinite too_many_isolated() loop problem. Please send your patches to linux-next.
> > >
> > > So I assume that you didn't see the lockup with the patch applied and
> > > the OOM killer has resolved the situation by killing other tasks, right?
> > > Can I assume your Tested-by?
> >
> > No. I tested linux-next-20170221 which does not include your patch.
> > I didn't test linux-next-20170221 with your patch applied. Your patch will
> > avoid infinite too_many_isolated() loop problem in shrink_inactive_list().
> > But we need to test different workloads by other people. Thus, I suggest
> > you to send your patches to linux-next without my testing.
>
> I will send the patch to Andrew later after merge window closes. It
> would be really helpful, though, to see how it handles your workload
> which is known to reproduce the oom starvation.
I tested http://lkml.kernel.org/r/20170119112336.GN30786@dhcp22.suse.cz on top of linux-next-20170221 with kmallocwd applied. I did not hit too_many_isolated() loop problem. But I hit an "unable to invoke the OOM killer due to !__GFP_FS allocation" lockup problem shown below. Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20170226.txt.xz . ---------- [ 444.281177] Killed process 9477 (a.out) total-vm:4168kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB [ 444.287046] oom_reaper: reaped process 9477 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB [ 484.810225] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 38s! [ 484.812907] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 41s! [ 484.815546] Showing busy workqueues and worker pools: [ 484.817595] workqueue events: flags=0x0 [ 484.819456] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256 [ 484.821666] pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx] [ 484.824356] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256 [ 484.826582] pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000] [ 484.829091] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 484.831325] in-flight: 7418:rht_deferred_worker [ 484.833336] pending: rht_deferred_worker [ 484.835346] workqueue events_long: flags=0x0 [ 484.837343] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 484.839566] pending: gc_worker [nf_conntrack] [ 484.841691] workqueue events_power_efficient: flags=0x80 [ 484.843873] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 484.846103] pending: fb_flashcursor [ 484.847928] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 484.850149] pending: neigh_periodic_work, neigh_periodic_work [ 484.852403] workqueue events_freezable_power_: flags=0x84 [ 484.854534] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.856666] in-flight: 27:disk_events_workfn [ 484.858621] workqueue writeback: flags=0x4e [ 484.860347] pwq 256: cpus=0-127 
flags=0x4 nice=0 active=2/256 [ 484.862415] in-flight: 8444:wb_workfn wb_workfn [ 484.864602] workqueue vmstat: flags=0xc [ 484.866291] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.868307] pending: vmstat_update [ 484.869876] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 484.871864] pending: vmstat_update [ 484.874058] workqueue mpt_poll_0: flags=0x8 [ 484.875698] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.877602] pending: mpt_fault_reset_work [mptbase] [ 484.879502] workqueue xfs-buf/sda1: flags=0xc [ 484.881148] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 [ 484.883011] pending: xfs_buf_ioend_work [xfs] [ 484.884706] workqueue xfs-data/sda1: flags=0xc [ 484.886367] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY [ 484.888410] in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs] [ 484.893483] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 484.902636] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=21/256 MAYDAY [ 484.904848] in-flight: 535:xfs_end_io [xfs], 7416:xfs_end_io [xfs], 7415:xfs_end_io [xfs], 65:xfs_end_io [xfs] [ 484.907863] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 484.916767] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 MAYDAY [ 484.919024] in-flight: 5357:xfs_end_io [xfs], 193:xfs_end_io [xfs], 52:xfs_end_io 
[xfs], 5358:xfs_end_io [xfs] [ 484.922143] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 484.924291] in-flight: 2486:xfs_end_io [xfs] [ 484.926248] workqueue xfs-reclaim/sda1: flags=0xc [ 484.928216] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 484.930362] pending: xfs_reclaim_worker [xfs] [ 484.932312] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3 6387 [ 484.934766] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=38s workers=6 manager: 19 [ 484.937206] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=41s workers=6 manager: 157 [ 484.939629] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=41s workers=4 manager: 10499 [ 484.942303] pool 256: cpus=0-127 flags=0x4 nice=0 hung=38s workers=3 idle: 425 426 [ 518.090012] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=8441307 (...snipped...) [ 518.900038] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible [ 518.902095] kswapd0 D10776 69 2 0x00000000 [ 518.903784] Call Trace: [ 518.904849] __schedule+0x336/0xe00 [ 518.906118] schedule+0x3d/0x90 [ 518.907314] io_schedule+0x16/0x40 [ 518.908622] __xfs_iflock+0x129/0x140 [xfs] [ 518.910027] ? autoremove_wake_function+0x60/0x60 [ 518.911559] xfs_reclaim_inode+0x162/0x440 [xfs] [ 518.913068] xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs] [ 518.914611] ? xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs] [ 518.916148] ? trace_hardirqs_on+0xd/0x10 [ 518.917465] ? try_to_wake_up+0x59/0x7a0 [ 518.918758] ? wake_up_process+0x15/0x20 [ 518.920067] xfs_reclaim_inodes_nr+0x33/0x40 [xfs] [ 518.921560] xfs_fs_free_cached_objects+0x19/0x20 [xfs] [ 518.923114] super_cache_scan+0x181/0x190 [ 518.924435] shrink_slab+0x29f/0x6d0 [ 518.925683] shrink_node+0x2fa/0x310 [ 518.926909] kswapd+0x362/0x9b0 [ 518.928061] kthread+0x10f/0x150 [ 518.929218] ? mem_cgroup_shrink_node+0x3b0/0x3b0 [ 518.930953] ? kthread_create_on_node+0x70/0x70 [ 518.932380] ret_from_fork+0x31/0x40 (...snipped...) 
[ 553.070829] MemAlloc-Info: stalling=184 dying=1 exiting=0 victim=1 oom_count=10318507 [ 575.432697] BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 129s! [ 575.435276] BUG: workqueue lockup - pool cpus=2 node=0 flags=0x0 nice=0 stuck for 131s! [ 575.437863] Showing busy workqueues and worker pools: [ 575.439837] workqueue events: flags=0x0 [ 575.441605] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256 [ 575.443717] pending: vmpressure_work_fn, vmstat_shepherd, vmw_fb_dirty_flush [vmwgfx], check_corruption [ 575.446622] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256 [ 575.448763] pending: drain_local_pages_wq BAR(9595), e1000_watchdog [e1000] [ 575.451173] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 575.453323] in-flight: 7418:rht_deferred_worker [ 575.455243] pending: rht_deferred_worker [ 575.457100] workqueue events_long: flags=0x0 [ 575.458960] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 575.461099] pending: gc_worker [nf_conntrack] [ 575.463043] workqueue events_power_efficient: flags=0x80 [ 575.465110] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256 [ 575.467252] pending: fb_flashcursor [ 575.468966] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=2/256 [ 575.471109] pending: neigh_periodic_work, neigh_periodic_work [ 575.473289] workqueue events_freezable_power_: flags=0x84 [ 575.475378] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.477526] in-flight: 27:disk_events_workfn [ 575.479489] workqueue writeback: flags=0x4e [ 575.481257] pwq 256: cpus=0-127 flags=0x4 nice=0 active=2/256 [ 575.483368] in-flight: 8444:wb_workfn wb_workfn [ 575.485505] workqueue vmstat: flags=0xc [ 575.487196] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 [ 575.489242] pending: vmstat_update [ 575.491403] workqueue mpt_poll_0: flags=0x8 [ 575.493106] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.495115] pending: mpt_fault_reset_work [mptbase] [ 575.497086] workqueue xfs-buf/sda1: flags=0xc [ 575.498764] pwq 2: 
cpus=1 node=0 flags=0x0 nice=0 active=1/1 [ 575.500654] pending: xfs_buf_ioend_work [xfs] [ 575.502372] workqueue xfs-data/sda1: flags=0xc [ 575.504024] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=27/256 MAYDAY [ 575.506060] in-flight: 5356:xfs_end_io [xfs], 451(RESCUER):xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs] xfs_end_io [xfs], 10498:xfs_end_io [xfs], 6386:xfs_end_io [xfs] [ 575.511096] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 575.520157] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=21/256 MAYDAY [ 575.522340] in-flight: 535:xfs_end_io [xfs], 7416:xfs_end_io [xfs], 7415:xfs_end_io [xfs], 65:xfs_end_io [xfs] [ 575.525387] pending: xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs], xfs_end_io [xfs] [ 575.534089] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256 MAYDAY [ 575.536407] in-flight: 5357:xfs_end_io [xfs], 193:xfs_end_io [xfs], 52:xfs_end_io [xfs], 5358:xfs_end_io [xfs] [ 575.539496] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 [ 575.541648] in-flight: 2486:xfs_end_io [xfs] [ 575.543591] workqueue xfs-reclaim/sda1: flags=0xc [ 575.545540] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 [ 575.547675] pending: xfs_reclaim_worker [xfs] [ 575.549719] workqueue xfs-log/sda1: flags=0x1c [ 575.551591] pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=1/256 [ 575.553750] pending: xfs_log_worker [xfs] [ 575.555552] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 3 6387 [ 
575.557979] pool 2: cpus=1 node=0 flags=0x0 nice=0 hung=129s workers=6 manager: 19 [ 575.560399] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=131s workers=6 manager: 157 [ 575.562843] pool 6: cpus=3 node=0 flags=0x0 nice=0 hung=132s workers=4 manager: 10499 [ 575.565450] pool 256: cpus=0-127 flags=0x4 nice=0 hung=129s workers=3 idle: 425 426 (...snipped...) [ 616.394649] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=13908219 (...snipped...) [ 642.266252] MemAlloc-Info: stalling=186 dying=1 exiting=0 victim=1 oom_count=15180673 (...snipped...) [ 702.412189] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=18732529 (...snipped...) [ 736.787879] MemAlloc-Info: stalling=187 dying=1 exiting=0 victim=1 oom_count=20565244 (...snipped...) [ 800.715759] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=24411576 (...snipped...) [ 837.571405] MemAlloc-Info: stalling=188 dying=1 exiting=0 victim=1 oom_count=26463562 (...snipped...) [ 899.021495] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=30144879 (...snipped...) [ 936.282709] MemAlloc-Info: stalling=189 dying=1 exiting=0 victim=1 oom_count=32129234 (...snipped...) [ 997.328119] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=35657983 (...snipped...) [ 1033.977265] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=37659912 (...snipped...) [ 1095.630961] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40639677 (...snipped...) [ 1095.632984] MemAlloc: kswapd0(69) flags=0xa40840 switches=23883 uninterruptible [ 1095.632985] kswapd0 D10776 69 2 0x00000000 [ 1095.632988] Call Trace: [ 1095.632991] __schedule+0x336/0xe00 [ 1095.632994] schedule+0x3d/0x90 [ 1095.632996] io_schedule+0x16/0x40 [ 1095.633017] __xfs_iflock+0x129/0x140 [xfs] [ 1095.633021] ? autoremove_wake_function+0x60/0x60 [ 1095.633051] xfs_reclaim_inode+0x162/0x440 [xfs] [ 1095.633072] xfs_reclaim_inodes_ag+0x2cf/0x4f0 [xfs] [ 1095.633106] ? 
xfs_reclaim_inodes_ag+0xf2/0x4f0 [xfs]
[ 1095.633114] ? trace_hardirqs_on+0xd/0x10
[ 1095.633116] ? try_to_wake_up+0x59/0x7a0
[ 1095.633120] ? wake_up_process+0x15/0x20
[ 1095.633156] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
[ 1095.633178] xfs_fs_free_cached_objects+0x19/0x20 [xfs]
[ 1095.633180] super_cache_scan+0x181/0x190
[ 1095.633183] shrink_slab+0x29f/0x6d0
[ 1095.633189] shrink_node+0x2fa/0x310
[ 1095.633193] kswapd+0x362/0x9b0
[ 1095.633200] kthread+0x10f/0x150
[ 1095.633201] ? mem_cgroup_shrink_node+0x3b0/0x3b0
[ 1095.633202] ? kthread_create_on_node+0x70/0x70
[ 1095.633205] ret_from_fork+0x31/0x40
(...snipped...)
[ 1095.821248] MemAlloc-Info: stalling=190 dying=1 exiting=0 victim=1 oom_count=40646791
(...snipped...)
[ 1125.236970] sysrq: SysRq : Resetting
[ 1125.238669] ACPI MEMORY or I/O RESET_REG.
----------

The switches= value (which is "struct task_struct"->nvcsw + "struct task_struct"->nivcsw) of kswapd0(69) remained 23883, which means that kswapd0 was waiting forever at

----------
void
__xfs_iflock(
	struct xfs_inode	*ip)
{
	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IFLOCK_BIT);
	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IFLOCK_BIT);

	do {
		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
		if (xfs_isiflocked(ip))
			io_schedule(); /***** <= This location. *****/
	} while (!xfs_iflock_nowait(ip));
	finish_wait(wq, &wait.wait);
}
----------

while the oom_count= value (which is the number of times out_of_memory() was called) kept increasing over time without a "Killed process " message ever being emitted. The reproducer I used is shown below.
----------
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <poll.h>

static char use_delay = 0;

static void sigcld_handler(int unused)
{
	use_delay = 1;
}

int main(int argc, char *argv[])
{
	static char buffer[4096] = { };
	char *buf = NULL;
	unsigned long size;
	int i;

	signal(SIGCLD, sigcld_handler);
	for (i = 0; i < 1024; i++) {
		if (fork() == 0) {
			int fd = open("/proc/self/oom_score_adj", O_WRONLY);
			write(fd, "1000", 4);
			close(fd);
			sleep(1);
			if (!i)
				pause();
			snprintf(buffer, sizeof(buffer), "/tmp/file.%u", getpid());
			fd = open(buffer, O_WRONLY | O_CREAT | O_APPEND, 0600);
			while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer)) {
				poll(NULL, 0, 10);
				fsync(fd);
			}
			_exit(0);
		}
	}
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(2);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
----------

^ permalink raw reply	[flat|nested] 110+ messages in thread
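The stall reported above is the failure mode this series targets: reclaim paths that retry for as long as too_many_isolated() stays true, with no way for the caller to re-evaluate. The direction argued for in the cover letter — bound the retries inside reclaim and push the retry decision up to the allocator — can be sketched as a toy userspace model. All names below are hypothetical stand-ins, not the actual kernel code:

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the kernel's too_many_isolated(): pretend
 * the isolated-page count drains by one every time we back off.
 */
static int isolated = 5;

static bool too_many_isolated(void)
{
	return isolated > 0;
}

static void backoff(void)
{
	/* in the kernel this would be a congestion_wait()-style sleep */
	isolated--;
}

/*
 * Bounded variant of the reclaim throttle: instead of spinning on
 * too_many_isolated() indefinitely, give up after max_retries and let
 * the caller (the allocator's retry logic) re-evaluate the situation.
 */
static int shrink_with_bounded_retries(int max_retries)
{
	int tries;

	for (tries = 0; tries < max_retries; tries++) {
		if (!too_many_isolated())
			return 0;	/* safe to proceed with reclaim */
		backoff();
	}
	return -1;	/* still congested: bail out rather than loop */
}
```

A caller limited to three retries first sees -1 back and can take stock higher up the stack (including noticing a pending OOM kill), instead of being stuck inside reclaim until the isolated pages drain.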
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-25 13:00 ` Michal Hocko
@ 2017-01-31 11:58 ` Michal Hocko
  -1 siblings, 0 replies; 110+ messages in thread
From: Michal Hocko @ 2017-01-31 11:58 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel

On Wed 25-01-17 14:00:14, Michal Hocko wrote:
> On Wed 25-01-17 20:09:31, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Wed 25-01-17 11:19:57, Christoph Hellwig wrote:
> > > > On Wed, Jan 25, 2017 at 11:15:17AM +0100, Michal Hocko wrote:
> > > > > I think we are missing a check for fatal_signal_pending in
> > > > > iomap_file_buffered_write. This means that an oom victim can consume the
> > > > > full memory reserves. What do you think about the following? I haven't
> > > > > tested this but it mimics generic_perform_write so I guess it should
> > > > > work.
> > > >
> > > > Hi Michal,
> > > >
> > > > this looks reasonable to me. But we have a few more such loops,
> > > > maybe it makes sense to move the check into iomap_apply?
> > > I wasn't sure about the expected semantic of iomap_apply but now that
> > > I've actually checked all the callers I believe all of them should be
> > > able to handle EINTR just fine. Well iomap_file_dirty, iomap_zero_range,
> > > iomap_fiemap and iomap_page_mkwrite seem do not follow the standard
> > > pattern to return the number of written pages or an error but it rather
> > > propagates the error out. From my limited understanding of those code
> > > paths that should just be ok. I was not all that sure about iomap_dio_rw
> > > that is just too convoluted for me. If that one is OK as well then
> > > the following patch should be indeed better.
> >
> > Is "length" in
> >
> > 	written = actor(inode, pos, length, data, &iomap);
> >
> > call guaranteed to be small enough? If not guaranteed,
> > don't we need to check SIGKILL inside "actor" functions?
>
> You are right! Checking for signals inside iomap_apply doesn't really
> solve anything because basically all users do iov_iter_count(). Blee. So
> we have loops around iomap_apply which itself loops inside the actor.
> iomap_write_begin seems to be used by most of them which is also where we
> get the pagecache page so I guess this should be the "right" place to
> put the check in. Things like dax_iomap_actor will need an explicit check.
> This is quite unfortunate but I do not see any better solution.
> What do you think Christoph?

What do you think Christoph? I have an additional patch to handle
do_generic_file_read and a similar one to back off in
__vmalloc_area_node. I would like to post them all in one series but I
would like to know that this one is OK before I do that. Thanks!

> ---
> From 362da5cac527146a341300c2ca441245c16043e8 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Wed, 25 Jan 2017 11:06:37 +0100
> Subject: [PATCH] fs: break out of iomap_file_buffered_write on fatal signals
>
> Tetsuo has noticed that an OOM stress test which performs large write
> requests can cause the full memory reserves depletion. He has tracked
> this down to the following path
> 	__alloc_pages_nodemask+0x436/0x4d0
> 	alloc_pages_current+0x97/0x1b0
> 	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
> 	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
> 	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
> 	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
> 	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
> 	? iomap_write_end+0x80/0x80             fs/iomap.c:150
> 	iomap_apply+0xb3/0x130                  fs/iomap.c:79
> 	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
> 	? iomap_write_end+0x80/0x80
> 	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
> 	? remove_wait_queue+0x59/0x60
> 	xfs_file_write_iter+0x90/0x130 [xfs]
> 	__vfs_write+0xe5/0x140
> 	vfs_write+0xc7/0x1f0
> 	? syscall_trace_enter+0x1d0/0x380
> 	SyS_write+0x58/0xc0
> 	do_syscall_64+0x6c/0x200
> 	entry_SYSCALL64_slow_path+0x25/0x25
>
> the oom victim has access to all memory reserves to make a forward
> progress to exit easier. But iomap_file_buffered_write and other callers
> of iomap_apply loop to complete the full request. We need to check for
> fatal signals and back off with a short write instead. As the
> iomap_apply delegates all the work down to the actor we have to hook
> into those. All callers that work with the page cache are calling
> iomap_write_begin so we will check for signals there. dax_iomap_actor
> has to handle the situation explicitly because it copies data to the
> userspace directly. Other callers like iomap_page_mkwrite work on a
> single page or iomap_fiemap_actor do not allocate memory based on the
> given len.
>
> Fixes: 68a9f5e7007c ("xfs: implement iomap based buffered write path")
> Cc: stable # 4.8+
> Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/dax.c   | 5 +++++
>  fs/iomap.c | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 413a91db9351..0e263dacf9cf 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1033,6 +1033,11 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
>  		struct blk_dax_ctl dax = { 0 };
>  		ssize_t map_len;
>
> +		if (fatal_signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
>  		dax.sector = dax_iomap_sector(iomap, pos);
>  		dax.size = (length + offset + PAGE_SIZE - 1) & PAGE_MASK;
>  		map_len = dax_map_atomic(iomap->bdev, &dax);
> diff --git a/fs/iomap.c b/fs/iomap.c
> index e57b90b5ff37..691eada58b06 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -114,6 +114,9 @@ iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
>
>  	BUG_ON(pos + len > iomap->offset + iomap->length);
>
> +	if (fatal_signal_pending(current))
> +		return -EINTR;
> +
>  	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
>  	if (!page)
>  		return -ENOMEM;
> --
> 2.11.0
>
>
> --
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 110+ messages in thread
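The shape of the fix above — check for a fatal signal at the point where each pagecache page is obtained, and return a short count instead of looping to completion — has a simple userspace analogue. In the sketch below, fatal_signal and chunked_write() are hypothetical stand-ins for fatal_signal_pending(current) and the iomap write loop; it illustrates the pattern, not the kernel code:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4096	/* stand-in for one pagecache page */

/* Hypothetical stand-in for fatal_signal_pending(current). */
static bool fatal_signal;

static char src_buf[3 * CHUNK], dst_buf[3 * CHUNK];

/*
 * Copy in page-sized chunks, bailing out with whatever was already
 * copied once a fatal signal is seen.  The caller then observes a
 * short write (or an error when nothing was copied) and can exit,
 * mirroring the fatal_signal_pending() check in iomap_write_begin().
 */
static size_t chunked_write(char *dst, const char *src, size_t len)
{
	size_t done = 0;

	while (done < len) {
		size_t chunk = len - done < CHUNK ? len - done : CHUNK;

		if (fatal_signal)
			break;		/* back off with a short write */
		memcpy(dst + done, src + done, chunk);
		done += chunk;
	}
	return done;
}
```

Without such a check the loop runs until len is exhausted even after the task has been OOM-killed, which is exactly how a large write request could deplete the memory reserves.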
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone
  2017-01-31 11:58 ` Michal Hocko
@ 2017-01-31 12:51 ` Christoph Hellwig
  -1 siblings, 0 replies; 110+ messages in thread
From: Christoph Hellwig @ 2017-01-31 12:51 UTC (permalink / raw)
To: Michal Hocko
Cc: Christoph Hellwig, Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel

On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote:
> What do you think Christoph? I have an additional patch to handle
> do_generic_file_read and a similar one to back off in
> __vmalloc_area_node. I would like to post them all in one series but I
> would like to know that this one is OK before I do that.

Well, that patch you posted is okay, but you probably need additional
ones for the other interesting users of iomap_apply.

^ permalink raw reply	[flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-31 12:51 ` Christoph Hellwig @ 2017-01-31 13:21 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-31 13:21 UTC (permalink / raw) To: Christoph Hellwig Cc: Tetsuo Handa, mgorman, viro, linux-mm, hannes, linux-kernel On Tue 31-01-17 13:51:40, Christoph Hellwig wrote: > On Tue, Jan 31, 2017 at 12:58:46PM +0100, Michal Hocko wrote: > > What do you think Christoph? I have an additional patch to handle > > do_generic_file_read and a similar one to back off in > > __vmalloc_area_node. I would like to post them all in one series but I > > would like to know that this one is OK before I do that. > > Well, that patch you posted is okay, but you probably need additional > ones for the other interesting users of iomap_apply. I have checked all of them, I guess/hope. Which one do you have in mind? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-25 10:15 ` Michal Hocko @ 2017-01-25 10:33 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-01-25 10:33 UTC (permalink / raw) To: mhocko, hch; +Cc: mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > I think we are missing a check for fatal_signal_pending in > iomap_file_buffered_write. This means that an oom victim can consume the > full memory reserves. What do you think about the following? I haven't > tested this but it mimics generic_perform_write so I guess it should > work. Looks OK to me. I was worried about #define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */ which forbids (!?) aborting the loop. But it seems that this flag is no longer checked (i.e. set but not used). So, everybody should be ready for a short write, although I don't know whether exofs / hfs / hfsplus are doing appropriate error handling. ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-25 10:33 ` Tetsuo Handa @ 2017-01-25 12:34 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-25 12:34 UTC (permalink / raw) To: Tetsuo Handa; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel On Wed 25-01-17 19:33:59, Tetsuo Handa wrote: > Michal Hocko wrote: > > I think we are missing a check for fatal_signal_pending in > > iomap_file_buffered_write. This means that an oom victim can consume the > > full memory reserves. What do you think about the following? I haven't > > tested this but it mimics generic_perform_write so I guess it should > > work. > > Looks OK to me. I worried > > #define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */ > > which forbids (!?) aborting the loop. But it seems that this flag is > no longer checked (i.e. set but not used). So, everybody should be ready > for short write, although I don't know whether exofs / hfs / hfsplus are > doing appropriate error handling. Those were using the generic implementation before and that handles this case AFAICS. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-25 12:34 ` Michal Hocko @ 2017-01-25 13:13 ` Tetsuo Handa -1 siblings, 0 replies; 110+ messages in thread From: Tetsuo Handa @ 2017-01-25 13:13 UTC (permalink / raw) To: mhocko; +Cc: hch, mgorman, viro, linux-mm, hannes, linux-kernel Michal Hocko wrote: > On Wed 25-01-17 19:33:59, Tetsuo Handa wrote: > > Michal Hocko wrote: > > > I think we are missing a check for fatal_signal_pending in > > > iomap_file_buffered_write. This means that an oom victim can consume the > > > full memory reserves. What do you think about the following? I haven't > > > tested this but it mimics generic_perform_write so I guess it should > > > work. > > > > Looks OK to me. I worried > > > > #define AOP_FLAG_UNINTERRUPTIBLE 0x0001 /* will not do a short write */ > > > > which forbids (!?) aborting the loop. But it seems that this flag is > > no longer checked (i.e. set but not used). So, everybody should be ready > > for short write, although I don't know whether exofs / hfs / hfsplus are > > doing appropriate error handling. > > Those were using generic implementation before and that handles this > case AFAICS. What I wanted to say is: "We can remove AOP_FLAG_UNINTERRUPTIBLE completely because grep does not find that flag used in condition check, can't we?". ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-20 13:27 ` Tetsuo Handa @ 2017-01-25 9:53 ` Michal Hocko -1 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-25 9:53 UTC (permalink / raw) To: Tetsuo Handa; +Cc: mgorman, linux-mm, hannes, linux-kernel On Fri 20-01-17 22:27:27, Tetsuo Handa wrote: > Mel Gorman wrote: > > On Thu, Jan 19, 2017 at 12:23:36PM +0100, Michal Hocko wrote: > > > So what do you think about the following? Tetsuo, would you be willing > > > to run this patch through your torture testing please? > > > > I'm fine with treating this as a starting point. > > OK. So I tried to test this patch but I failed at preparation step. > There are too many pending mm patches and I'm not sure which patch on > which linux-next snapshot I should try. The current linux-next should be good to test. It contains all patches sitting in the mmotm tree. If you want a more stable base then you can use the mmotm git tree (git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git #since-4.9 or its #auto-latest alias) > Also as another question, > too_many_isolated() loop exists in both mm/vmscan.c and mm/compaction.c > but why this patch does not touch the loop in mm/compaction.c part? I am not yet convinced that compaction suffers from the same problem. Compaction backs off much sooner so that path shouldn't get into a pathological situation AFAICS. I might be wrong here but I think we should start with the reclaim path first. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
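For reference, the throttling check under discussion can be modelled in a few lines. This is a hedged, simplified userspace rendition of the too_many_isolated() heuristic from mm/vmscan.c as quoted later in the thread; the real function reads per-node vmstat counters and takes a scan_control, and the GFP flag values below are illustrative, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values; the real __GFP_IO/__GFP_FS differ. */
#define GFP_IO (1u << 0)
#define GFP_FS (1u << 1)

/* Simplified model of the too_many_isolated() heuristic: direct reclaim
 * backs off once it has isolated more pages from an LRU list than remain
 * on it. Callers that may enter the FS and do IO are throttled earlier
 * (inactive >>= 3) so that GFP_NOFS/GFP_NOIO reclaimers, which cannot
 * wait for them, keep some headroom and avoid a circular deadlock. */
static bool too_many_isolated(unsigned long inactive, unsigned long isolated,
			      unsigned int gfp_mask)
{
	if ((gfp_mask & (GFP_IO | GFP_FS)) == (GFP_IO | GFP_FS))
		inactive >>= 3;

	return isolated > inactive;
}
```

With 100 inactive pages, a GFP_IO|GFP_FS caller is throttled once more than 12 pages are isolated, while a GFP_NOFS caller may isolate up to 100.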
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-19 10:07 ` Mel Gorman @ 2017-01-20 6:42 ` Hillf Danton -1 siblings, 0 replies; 110+ messages in thread From: Hillf Danton @ 2017-01-20 6:42 UTC (permalink / raw) To: 'Mel Gorman', 'Michal Hocko' Cc: linux-mm, 'Johannes Weiner', 'Tetsuo Handa', 'LKML' On Thursday, January 19, 2017 6:08 PM Mel Gorman wrote: > > If it's definitely required and is proven to fix the > infinite-loop-without-oom workload then I'll back off and withdraw my > objections. However, I'd at least like the following untested patch to > be considered as an alternative. It has some weaknesses and would be > slower to OOM than your patch but it avoids reintroducing zone counters > > ---8<--- > mm, vmscan: Wait on a waitqueue when too many pages are isolated > > When too many pages are isolated, direct reclaim waits on congestion to clear > for up to a tenth of a second. There is no reason to believe that too many > pages are isolated due to dirty pages, reclaim efficiency or congestion. > It may simply be because an extremely large number of processes have entered > direct reclaim at the same time. However, it is possible for the situation > to persist forever and never reach OOM. > > This patch queues processes on a waitqueue when too many pages are isolated. > When parallel reclaimers finish shrink_page_list, they wake the waiters > to recheck whether too many pages are isolated. > > The wait on the queue has a timeout as not all sites that isolate pages > will do the wakeup. Depending on every isolation of LRU pages to be perfect > forever is potentially fragile. The specific wakeups occur for page reclaim > and compaction. If too many pages are isolated due to memory failure, > hotplug or directly calling migration from a syscall then the waiting > processes may wait the full timeout.
> > Note that the timeout allows the use of waitqueue_active() on the basis > that a race will cause the full timeout to be reached due to a missed > wakeup. This is relatively harmless and still a massive improvement over > unconditionally calling congestion_wait. > > Direct reclaimers that cannot isolate pages within the timeout will consider > returning to the caller. This is somewhat clunky as it won't return immediately > and may go through the other priorities and slab shrinking. Eventually, > it'll go through a few iterations of should_reclaim_retry and reach the > MAX_RECLAIM_RETRIES limit and consider going OOM. > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 91f69aa0d581..3dd617d0c8c4 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -628,6 +628,7 @@ typedef struct pglist_data { > int node_id; > wait_queue_head_t kswapd_wait; > wait_queue_head_t pfmemalloc_wait; > + wait_queue_head_t isolated_wait; > struct task_struct *kswapd; /* Protected by > mem_hotplug_begin/end() */ > int kswapd_order; > diff --git a/mm/compaction.c b/mm/compaction.c > index 43a6cf1dc202..1b1ff6da7401 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1634,6 +1634,10 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro > count_compact_events(COMPACTMIGRATE_SCANNED, cc->total_migrate_scanned); > count_compact_events(COMPACTFREE_SCANNED, cc->total_free_scanned); > > + /* Page reclaim could have stalled due to isolated pages */ > + if (waitqueue_active(&zone->zone_pgdat->isolated_wait)) > + wake_up(&zone->zone_pgdat->isolated_wait); > + > trace_mm_compaction_end(start_pfn, cc->migrate_pfn, > cc->free_pfn, end_pfn, sync, ret); > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 8ff25883c172..d848c9f31bff 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -5823,6 +5823,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat) > #endif >
init_waitqueue_head(&pgdat->kswapd_wait); > init_waitqueue_head(&pgdat->pfmemalloc_wait); > + init_waitqueue_head(&pgdat->isolated_wait); > #ifdef CONFIG_COMPACTION > init_waitqueue_head(&pgdat->kcompactd_wait); > #endif > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2281ad310d06..c93f299fbad7 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page) > * the LRU list will go small and be scanned faster than necessary, leading to > * unnecessary swapping, thrashing and OOM. > */ > -static int too_many_isolated(struct pglist_data *pgdat, int file, > +static bool safe_to_isolate(struct pglist_data *pgdat, int file, > struct scan_control *sc) I prefer the current function name. > { > unsigned long inactive, isolated; > > if (current_is_kswapd()) > - return 0; > + return true; > > - if (!sane_reclaim(sc)) > - return 0; > + if (sane_reclaim(sc)) > + return true; We only need a one-line change. > > if (file) { > inactive = node_page_state(pgdat, NR_INACTIVE_FILE); > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, > if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) > inactive >>= 3; > > - return isolated > inactive; > + return isolated < inactive; > } > > static noinline_for_stack void > @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > - while (unlikely(too_many_isolated(pgdat, file, sc))) { > - congestion_wait(BLK_RW_ASYNC, HZ/10); > + while (!safe_to_isolate(pgdat, file, sc)) { > + long ret; > + > + ret = wait_event_interruptible_timeout(pgdat->isolated_wait, > + safe_to_isolate(pgdat, file, sc), HZ/10); > > /* We are about to die and free our memory. Return now. 
*/ > - if (fatal_signal_pending(current)) > - return SWAP_CLUSTER_MAX; > + if (fatal_signal_pending(current)) { > + nr_reclaimed = SWAP_CLUSTER_MAX; > + goto out; > + } > + > + /* > + * If we reached the timeout, this is direct reclaim, and > + * pages cannot be isolated then return. If the situation Please add something that we would rather shrink slab than go another round of nap. > + * persists for a long time then it'll eventually reach > + * the no_progress limit in should_reclaim_retry and consider > + * going OOM. In this case, do not wake the isolated_wait > + * queue as the wakee will still not be able to make progress. > + */ > + if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc)) > + return 0; > } > > lru_add_drain(); > @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > stat.nr_activate, stat.nr_ref_keep, > stat.nr_unmap_fail, > sc->priority, file); > + > +out: > + if (waitqueue_active(&pgdat->isolated_wait)) > + wake_up(&pgdat->isolated_wait); > return nr_reclaimed; > } > Is it also needed to check isolated_wait active before kswapd takes nap? thanks Hillf ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone 2017-01-20 6:42 ` Hillf Danton @ 2017-01-20 9:25 ` Mel Gorman -1 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-20 9:25 UTC (permalink / raw) To: Hillf Danton Cc: 'Michal Hocko', linux-mm, 'Johannes Weiner', 'Tetsuo Handa', 'LKML' On Fri, Jan 20, 2017 at 02:42:24PM +0800, Hillf Danton wrote: > > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page) > > * the LRU list will go small and be scanned faster than necessary, leading to > > * unnecessary swapping, thrashing and OOM. > > */ > > -static int too_many_isolated(struct pglist_data *pgdat, int file, > > +static bool safe_to_isolate(struct pglist_data *pgdat, int file, > > struct scan_control *sc) > > I prefer the current function name. > The restructure is to work with the waitqueue API. > > { > > unsigned long inactive, isolated; > > > > if (current_is_kswapd()) > > - return 0; > > + return true; > > > > - if (!sane_reclaim(sc)) > > - return 0; > > + if (sane_reclaim(sc)) > > + return true; > > We only need a one-line change. It's bool so the conversion is made to bool while it's being changed anyway.
> > > > if (file) { > > inactive = node_page_state(pgdat, NR_INACTIVE_FILE); > > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, > > if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) > > inactive >>= 3; > > > > - return isolated > inactive; > > + return isolated < inactive; > > } > > > > static noinline_for_stack void > > @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > > > - while (unlikely(too_many_isolated(pgdat, file, sc))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + while (!safe_to_isolate(pgdat, file, sc)) { > > + long ret; > > + > > + ret = wait_event_interruptible_timeout(pgdat->isolated_wait, > > + safe_to_isolate(pgdat, file, sc), HZ/10); > > > > /* We are about to die and free our memory. Return now. */ > > - if (fatal_signal_pending(current)) > > - return SWAP_CLUSTER_MAX; > > + if (fatal_signal_pending(current)) { > > + nr_reclaimed = SWAP_CLUSTER_MAX; > > + goto out; > > + } > > + > > + /* > > + * If we reached the timeout, this is direct reclaim, and > > + * pages cannot be isolated then return. If the situation > > Please add something that we would rather shrink slab than go > another round of nap. > That's not necessarily true or even a good idea. It could result in excessive slab shrinking that is no longer in proportion to LRU scanning and increased contention within shrinkers. > > + * persists for a long time then it'll eventually reach > > + * the no_progress limit in should_reclaim_retry and consider > > + * going OOM. In this case, do not wake the isolated_wait > > + * queue as the wakee will still not be able to make progress. 
> > + */ > > + if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc)) > > + return 0; > > } > > > > lru_add_drain(); > > @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > > stat.nr_activate, stat.nr_ref_keep, > > stat.nr_unmap_fail, > > sc->priority, file); > > + > > +out: > > + if (waitqueue_active(&pgdat->isolated_wait)) > > + wake_up(&pgdat->isolated_wait); > > return nr_reclaimed; > > } > > > Is it also needed to check isolated_wait active before kswapd > takes nap? > No because this is where pages were isolated and there is no putback event that would justify waking the queue. There is a race between waitqueue_active() and going to sleep that we rely on the timeout to recover from. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 1/2] mm, vmscan: account the number of isolated pages per zone @ 2017-01-20 9:25 ` Mel Gorman 0 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-20 9:25 UTC (permalink / raw) To: Hillf Danton Cc: 'Michal Hocko', linux-mm, 'Johannes Weiner', 'Tetsuo Handa', 'LKML' On Fri, Jan 20, 2017 at 02:42:24PM +0800, Hillf Danton wrote: > > @@ -1603,16 +1603,16 @@ int isolate_lru_page(struct page *page) > > * the LRU list will go small and be scanned faster than necessary, leading to > > * unnecessary swapping, thrashing and OOM. > > */ > > -static int too_many_isolated(struct pglist_data *pgdat, int file, > > +static bool safe_to_isolate(struct pglist_data *pgdat, int file, > > struct scan_control *sc) > > I prefer the current function name. > The restructure is to work with the waitqueue API. > > { > > unsigned long inactive, isolated; > > > > if (current_is_kswapd()) > > - return 0; > > + return true; > > > > - if (!sane_reclaim(sc)) > > - return 0; > > + if (sane_reclaim(sc)) > > + return true; > > We only need a one-line change. It's bool so the conversion is made to bool while it's being changed anyway. 
> > > > if (file) { > > inactive = node_page_state(pgdat, NR_INACTIVE_FILE); > > @@ -1630,7 +1630,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file, > > if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) > > inactive >>= 3; > > > > - return isolated > inactive; > > + return isolated < inactive; > > } > > > > static noinline_for_stack void > > @@ -1719,12 +1719,28 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > > struct pglist_data *pgdat = lruvec_pgdat(lruvec); > > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > > > - while (unlikely(too_many_isolated(pgdat, file, sc))) { > > - congestion_wait(BLK_RW_ASYNC, HZ/10); > > + while (!safe_to_isolate(pgdat, file, sc)) { > > + long ret; > > + > > + ret = wait_event_interruptible_timeout(pgdat->isolated_wait, > > + safe_to_isolate(pgdat, file, sc), HZ/10); > > > > /* We are about to die and free our memory. Return now. */ > > - if (fatal_signal_pending(current)) > > - return SWAP_CLUSTER_MAX; > > + if (fatal_signal_pending(current)) { > > + nr_reclaimed = SWAP_CLUSTER_MAX; > > + goto out; > > + } > > + > > + /* > > + * If we reached the timeout, this is direct reclaim, and > > + * pages cannot be isolated then return. If the situation > > Please add something that we would rather shrink slab than go > another round of nap. > That's not necessarily true or even a good idea. It could result in excessive slab shrinking that is no longer in proportion to LRU scanning and increased contention within shrinkers. > > + * persists for a long time then it'll eventually reach > > + * the no_progress limit in should_reclaim_retry and consider > > + * going OOM. In this case, do not wake the isolated_wait > > + * queue as the wakee will still not be able to make progress. 
> > + */ > > + if (!ret && !current_is_kswapd() && !safe_to_isolate(pgdat, file, sc)) > > + return 0; > > } > > > > lru_add_drain(); > > @@ -1839,6 +1855,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > > stat.nr_activate, stat.nr_ref_keep, > > stat.nr_unmap_fail, > > sc->priority, file); > > + > > +out: > > + if (waitqueue_active(&pgdat->isolated_wait)) > > + wake_up(&pgdat->isolated_wait); > > return nr_reclaimed; > > } > > > Is it also needed to check isolated_wait active before kswapd > takes nap? > No because this is where pages were isolated and there is no putback event that would justify waking the queue. There is a race between waitqueue_active() and going to sleep that we rely on the timeout to recover from. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread
* [RFC PATCH 2/2] mm, vmscan: do not loop on too_many_isolated for ever 2017-01-18 13:44 ` Michal Hocko @ 2017-01-18 13:44 ` Michal Hocko 0 siblings, 0 replies; 110+ messages in thread From: Michal Hocko @ 2017-01-18 13:44 UTC (permalink / raw) To: linux-mm; +Cc: Mel Gorman, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko From: Michal Hocko <mhocko@suse.com> Tetsuo Handa has reported [1] that direct reclaimers might get stuck in the too_many_isolated loop basically for ever because the last few pages on the LRU lists are isolated by the kswapd which is stuck on fs locks when doing the pageout. This in turn means that there is nobody to actually trigger the oom killer and the system is basically unusable. too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle direct reclaim when too many pages are isolated already") to prevent premature oom killer invocations because back then no reclaim progress could indeed trigger the OOM killer too early. But since the oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") the allocation/reclaim retry loop considers all the reclaimable pages including those which are isolated - see 9f6c399ddc36 ("mm, vmscan: consider isolated pages in zone_reclaimable_pages") so we can loosen the direct reclaim throttling and instead rely on should_reclaim_retry logic which is the proper layer to control how to throttle and retry reclaim attempts. Move the too_many_isolated check outside shrink_inactive_list because in fact the active list might theoretically see too many isolated pages as well. 
[1] http://lkml.kernel.org/r/201602092349.ACG81273.OSVtMJQHLOFOFF@I-love.SAKURA.ne.jp Signed-off-by: Michal Hocko <mhocko@suse.com> --- mm/vmscan.c | 37 +++++++++++++++++++++++++++---------- 1 file changed, 27 insertions(+), 10 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 4b1ed1b1f1db..9f6be3b10ff0 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -204,10 +204,12 @@ unsigned long zone_reclaimable_pages(struct zone *zone) unsigned long nr; nr = zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_FILE) + - zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE); + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_FILE) + + zone_page_state_snapshot(zone, NR_ZONE_ISOLATED_FILE); if (get_nr_swap_pages() > 0) nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) + - zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON); + zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON) + + zone_page_state_snapshot(zone, NR_ZONE_ISOLATED_ANON); return nr; } @@ -1728,14 +1730,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, struct pglist_data *pgdat = lruvec_pgdat(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; - while (unlikely(too_many_isolated(pgdat, lru, sc))) { - congestion_wait(BLK_RW_ASYNC, HZ/10); - - /* We are about to die and free our memory. Return now. */ - if (fatal_signal_pending(current)) - return SWAP_CLUSTER_MAX; - } - lru_add_drain(); if (!sc->may_unmap) @@ -2083,6 +2077,29 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file, static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan, struct lruvec *lruvec, struct scan_control *sc) { + int stalled = false; + + /* We are about to die and free our memory. Return now. */ + if (fatal_signal_pending(current)) + return SWAP_CLUSTER_MAX; + + /* + * throttle direct reclaimers but do not loop for ever. We rely + * on should_reclaim_retry to not allow pre-mature OOM when + * there are too many pages under reclaim. 
+ */ + while (too_many_isolated(lruvec_pgdat(lruvec), lru, sc)) { + if (stalled) + return 0; + + /* + * TODO we should wait on a different event here - do the wake up + * after we decrement NR_ZONE_ISOLATED_* + */ + congestion_wait(BLK_RW_ASYNC, HZ/10); + stalled = true; + } + if (is_active_lru(lru)) { if (inactive_list_is_low(lruvec, is_file_lru(lru), sc, true)) shrink_active_list(nr_to_scan, lruvec, sc, lru); -- 2.11.0 ^ permalink raw reply related [flat|nested] 110+ messages in thread
* Re: [RFC PATCH 2/2] mm, vmscan: do not loop on too_many_isolated for ever 2017-01-18 13:44 ` Michal Hocko @ 2017-01-18 14:50 ` Mel Gorman -1 siblings, 0 replies; 110+ messages in thread From: Mel Gorman @ 2017-01-18 14:50 UTC (permalink / raw) To: Michal Hocko; +Cc: linux-mm, Johannes Weiner, Tetsuo Handa, LKML, Michal Hocko On Wed, Jan 18, 2017 at 02:44:53PM +0100, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > Tetsuo Handa has reported [1] that direct reclaimers might get stuck in > too_many_isolated loop basically for ever because the last few pages on > the LRU lists are isolated by the kswapd which is stuck on fs locks when > doing the pageout. This in turn means that there is nobody to actually > trigger the oom killer and the system is basically unusable. > > too_many_isolated has been introduced by 35cd78156c49 ("vmscan: throttle > direct reclaim when too many pages are isolated already") to prevent > from pre-mature oom killer invocations because back then no reclaim > progress could indeed trigger the OOM killer too early. But since the > oom detection rework 0a0337e0d1d1 ("mm, oom: rework oom detection") > the allocation/reclaim retry loop considers all the reclaimable pages > including those which are isolated - see 9f6c399ddc36 ("mm, vmscan: > consider isolated pages in zone_reclaimable_pages") so we can loosen > the direct reclaim throttling and instead rely on should_reclaim_retry > logic which is the proper layer to control how to throttle and retry > reclaim attempts. > > Move the too_many_isolated check outside shrink_inactive_list because > in fact active list might theoretically see too many isolated pages as > well. > No major objections in general. It's a bit odd you have a while loop for something that will only loop once. As for the TODO, one approach would be to use a waitqueue when too many pages are isolated. Wake them one at a time when isolated pages drops below the threshold. 
-- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 110+ messages in thread