* Re: [Question] Should direct reclaim time be bounded?
       [not found] <20190712054732.7264-1-hdanton@sina.com>
@ 2019-07-13  1:11 ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2019-07-13  1:11 UTC (permalink / raw)
  To: Hillf Danton, Mel Gorman
  Cc: Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel, Johannes Weiner

On 7/11/19 10:47 PM, Hillf Danton wrote:
>
> On Thu, 11 Jul 2019 02:42:56 +0800 Mike Kravetz wrote:
>>
>> It is quite easy to hit the condition where:
>> nr_reclaimed == 0 && nr_scanned == 0 is true, but we skip the previous test
>>
> Then skipping check of __GFP_RETRY_MAYFAIL makes no sense in your case.
> It is restored in respin below.
>
>> and the compaction check:
>> sc->nr_reclaimed < pages_for_compaction &&
>> inactive_lru_pages > pages_for_compaction
>> is true, so we return true before the below check of costly_fg_reclaim
>>
> This check is placed after COMPACT_SUCCESS; the latter is used to
> replace sc->nr_reclaimed < pages_for_compaction.
>
> And dryrun detection is added based on the result of last round of
> shrinking of inactive pages, particularly when their number is large
> enough.
>

Thanks Hillf.

This change does appear to eliminate the issue with stalls caused by
should_continue_reclaim returning true too often. I need to think some
more about exactly what is impacted by the change.

With this change, the problem moves to compaction, with
should_compact_retry returning true too often. It is the same behavior
seen when I simply removed the __GFP_RETRY_MAYFAIL special casing in
should_continue_reclaim.

At Mel's suggestion, I removed the compaction_zonelist_suitable() call
from should_compact_retry. This eliminated the compaction stalls.
Thanks Mel.

With both changes, stalls appear to be eliminated. This is promising.
I'll try to refine these approaches and continue testing.
-- 
Mike Kravetz

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2571,18 +2571,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> 		return false;
> 	}
> 
> -	/*
> -	 * If we have not reclaimed enough pages for compaction and the
> -	 * inactive lists are large enough, continue reclaiming
> -	 */
> -	pages_for_compaction = compact_gap(sc->order);
> -	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -	if (get_nr_swap_pages() > 0)
> -		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> -	if (sc->nr_reclaimed < pages_for_compaction &&
> -			inactive_lru_pages > pages_for_compaction)
> -		return true;
> -
> 	/* If compaction would go ahead or the allocation would succeed, stop */
> 	for (z = 0; z <= sc->reclaim_idx; z++) {
> 		struct zone *zone = &pgdat->node_zones[z];
> @@ -2598,7 +2586,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> 			;
> 		}
> 	}
> -	return true;
> +
> +	/*
> +	 * If we have not reclaimed enough pages for compaction and the
> +	 * inactive lists are large enough, continue reclaiming
> +	 */
> +	pages_for_compaction = compact_gap(sc->order);
> +	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> +	if (get_nr_swap_pages() > 0)
> +		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +	return inactive_lru_pages > pages_for_compaction &&
> +		/*
> +		 * avoid dryrun with plenty of inactive pages
> +		 */
> +		nr_scanned && nr_reclaimed;
> }
> 
> static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> --

^ permalink raw reply	[flat|nested] 17+ messages in thread
* [Question] Should direct reclaim time be bounded?
@ 2019-04-23  4:07 Mike Kravetz
  2019-04-23  7:19 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-04-23  4:07 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Andrea Arcangeli, Mel Gorman, Vlastimil Babka, Johannes Weiner

I was looking into an issue on our distro kernel where allocation of huge
pages via "echo X > /proc/sys/vm/nr_hugepages" was taking a LONG time. In
this particular case, we were actually allocating huge pages VERY slowly,
at the rate of about one every 30 seconds. I don't want to talk about the
code in our distro kernel, but the situation that caused this issue exists
upstream and appears to be worse there.

One thing to note is that hugetlb page allocation can really stress the
page allocator. The routine alloc_pool_huge_page is of special concern.

/*
 * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
 * manner.
 */
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
	struct page *page;
	int nr_nodes, node;
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}

	if (!page)
		return 0;

	put_page(page); /* free it into the hugepage allocator */

	return 1;
}

This routine is called for each huge page the user wants to allocate. If
they do "echo 4096 > nr_hugepages", this is called 4096 times.

alloc_fresh_huge_page() will eventually call __alloc_pages_nodemask with
__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN in addition to __GFP_THISNODE.

That for_each_node_mask_to_alloc() macro is hugetlbfs specific and attempts
to allocate huge pages in a round robin fashion. When asked to allocate a
huge page, it first tries the 'next_nid_to_alloc'. If that fails, it goes
to the next allowed node.
This is 'documented' in kernel docs as: "On a NUMA platform, the kernel will attempt to distribute the huge page pool over all the set of allowed nodes specified by the NUMA memory policy of the task that modifies nr_hugepages. The default for the allowed nodes--when the task has default memory policy--is all on-line nodes with memory. Allowed nodes with insufficient available, contiguous memory for a huge page will be silently skipped when allocating persistent huge pages. See the discussion below of the interaction of task memory policy, cpusets and per node attributes with the allocation and freeing of persistent huge pages. The success or failure of huge page allocation depends on the amount of physically contiguous memory that is present in system at the time of the allocation attempt. If the kernel is unable to allocate huge pages from some nodes in a NUMA system, it will attempt to make up the difference by allocating extra pages on other nodes with sufficient available contiguous memory, if any." However, consider the case of a 2 node system where: node 0 has 2GB memory node 1 has 4GB memory Now, if one wants to allocate 4GB of huge pages they may be tempted to simply, "echo 2048 > nr_hugepages". At first this will go well until node 0 is out of memory. When this happens, alloc_pool_huge_page() will continue to be called. Because of that for_each_node_mask_to_alloc() macro, it will likely attempt to first allocate a page from node 0. It will call direct reclaim and compaction until it fails. Then, it will successfully allocate from node 1. In our distro kernel, I am thinking about making allocations try "less hard" on nodes where we start to see failures. less hard == NORETRY/NORECLAIM. I was going to try something like this on an upstream kernel when I noticed that it seems like direct reclaim may never end/exit. It 'may' exit, but I instrumented __alloc_pages_slowpath() and saw it take well over an hour before I 'tricked' it into exiting. 
[ 5916.248341] hpage_slow_alloc: jiffies 5295742 tries 2 node 0 success [ 5916.249271] reclaim 5295741 compact 1 This is where it stalled after "echo 4096 > nr_hugepages" on a little VM with 8GB total memory. I have not started looking at the direct reclaim code to see exactly where we may be stuck, or trying really hard. My question is, "Is this expected or should direct reclaim be somewhat bounded?" With __alloc_pages_slowpath getting 'stuck' in direct reclaim, the documented behavior for huge page allocation is not going to happen. -- Mike Kravetz ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded? 2019-04-23 4:07 Mike Kravetz @ 2019-04-23 7:19 ` Michal Hocko 2019-04-23 16:39 ` Mike Kravetz 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2019-04-23 7:19 UTC (permalink / raw) To: Mike Kravetz Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman, Vlastimil Babka, Johannes Weiner On Mon 22-04-19 21:07:28, Mike Kravetz wrote: [...] > However, consider the case of a 2 node system where: > node 0 has 2GB memory > node 1 has 4GB memory > > Now, if one wants to allocate 4GB of huge pages they may be tempted to simply, > "echo 2048 > nr_hugepages". At first this will go well until node 0 is out > of memory. When this happens, alloc_pool_huge_page() will continue to be > called. Because of that for_each_node_mask_to_alloc() macro, it will likely > attempt to first allocate a page from node 0. It will call direct reclaim and > compaction until it fails. Then, it will successfully allocate from node 1. Yeah, the even distribution is quite a strong statement. We just try to distribute somehow and it is likely to not work really great on system with nodes that are different in size. I know it sucks but I've been recommending to use the /sys/devices/system/node/node$N/hugepages/hugepages-2048kB/nr_hugepages because that allows the define the actual policy much better. I guess we want to be more specific about this in the documentation at least. > In our distro kernel, I am thinking about making allocations try "less hard" > on nodes where we start to see failures. less hard == NORETRY/NORECLAIM. > I was going to try something like this on an upstream kernel when I noticed > that it seems like direct reclaim may never end/exit. It 'may' exit, but I > instrumented __alloc_pages_slowpath() and saw it take well over an hour > before I 'tricked' it into exiting. 
>
> [ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2 node 0 success
> [ 5916.249271]  reclaim 5295741  compact 1

This is unexpected though. What does tries mean? Number of reclaim
attempts? If yes could you enable tracing to see what takes so long in
the reclaim path?

> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
> with 8GB total memory.
>
> I have not started looking at the direct reclaim code to see exactly where
> we may be stuck, or trying really hard. My question is, "Is this expected
> or should direct reclaim be somewhat bounded?" With __alloc_pages_slowpath
> getting 'stuck' in direct reclaim, the documented behavior for huge page
> allocation is not going to happen.

Well, our "how hard to try for hugetlb pages" is quite arbitrary. We used
to retry as long as at least order worth of pages had been reclaimed, but
that stopped making sense once lumpy reclaim was gone. So the semantics
have changed to reclaim & compact as long as there is some progress. From
what I understand above, it seems that you are not thrashing and calling
reclaim again and again, but rather one reclaim round takes ages.

That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
looks like there is something wrong in the reclaim going on.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded? 2019-04-23 7:19 ` Michal Hocko @ 2019-04-23 16:39 ` Mike Kravetz 2019-04-24 14:35 ` Vlastimil Babka 0 siblings, 1 reply; 17+ messages in thread From: Mike Kravetz @ 2019-04-23 16:39 UTC (permalink / raw) To: Michal Hocko Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman, Vlastimil Babka, Johannes Weiner On 4/23/19 12:19 AM, Michal Hocko wrote: > On Mon 22-04-19 21:07:28, Mike Kravetz wrote: >> In our distro kernel, I am thinking about making allocations try "less hard" >> on nodes where we start to see failures. less hard == NORETRY/NORECLAIM. >> I was going to try something like this on an upstream kernel when I noticed >> that it seems like direct reclaim may never end/exit. It 'may' exit, but I >> instrumented __alloc_pages_slowpath() and saw it take well over an hour >> before I 'tricked' it into exiting. >> >> [ 5916.248341] hpage_slow_alloc: jiffies 5295742 tries 2 node 0 success >> [ 5916.249271] reclaim 5295741 compact 1 > > This is unexpected though. What does tries mean? Number of reclaim > attempts? If yes could you enable tracing to see what takes so long in > the reclaim path? tries is the number of times we pass the 'retry:' label in __alloc_pages_slowpath. In this specific case, I am pretty sure all that time is in one call to __alloc_pages_direct_reclaim. My 'trick' to make this succeed was to "echo 0 > nr_hugepages" in another shell. >> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM >> with 8GB total memory. >> >> I have not started looking at the direct reclaim code to see exactly where >> we may be stuck, or trying really hard. My question is, "Is this expected >> or should direct reclaim be somewhat bounded?" With __alloc_pages_slowpath >> getting 'stuck' in direct reclaim, the documented behavior for huge page >> allocation is not going to happen. > > Well, our "how hard to try for hugetlb pages" is quite arbitrary. 
> We used to retry as long as at least order worth of pages had been
> reclaimed, but that stopped making sense once lumpy reclaim was gone.

Yes, that is what I am seeing in our older distro kernel and I can at
least deal with that.

> So the semantics have changed to reclaim & compact as long as there is
> some progress. From what I understand above, it seems that you are not
> thrashing and calling reclaim again and again, but rather one reclaim
> round takes ages.

Correct.

> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
> looks like there is something wrong in the reclaim going on.

Ok, I will start digging into that. Just wanted to make sure before I
got into it too deep.

BTW - This is very easy to reproduce. Just try to allocate more huge
pages than will fit into memory. I see this 'reclaim taking forever'
behavior on v5.1-rc5-mmotm-2019-04-19-14-53. Looks like it was there in
v5.0 as well.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded?
  2019-04-23 16:39 ` Mike Kravetz
@ 2019-04-24 14:35   ` Vlastimil Babka
  2019-06-28 18:20     ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2019-04-24 14:35 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman, Johannes Weiner

On 4/23/19 6:39 PM, Mike Kravetz wrote:
>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
>> looks like there is something wrong in the reclaim going on.
>
> Ok, I will start digging into that. Just wanted to make sure before I got
> into it too deep.
>
> BTW - This is very easy to reproduce. Just try to allocate more huge pages
> than will fit into memory. I see this 'reclaim taking forever' behavior on
> v5.1-rc5-mmotm-2019-04-19-14-53. Looks like it was there in v5.0 as well.

I'd suspect this in should_continue_reclaim():

        /* Consider stopping depending on scan and reclaim activity */
        if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
                /*
                 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
                 * full LRU list has been scanned and we are still failing
                 * to reclaim pages. This full LRU scan is potentially
                 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
                 */
                if (!nr_reclaimed && !nr_scanned)
                        return false;

And that for some reason, nr_scanned never becomes zero. But it's hard
to figure out through all the layers of functions :/

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded?
  2019-04-24 14:35 ` Vlastimil Babka
@ 2019-06-28 18:20   ` Mike Kravetz
  2019-07-01  8:59     ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-06-28 18:20 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman, Johannes Weiner

On 4/24/19 7:35 AM, Vlastimil Babka wrote:
> On 4/23/19 6:39 PM, Mike Kravetz wrote:
>>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
>>> looks like there is something wrong in the reclaim going on.
>>
>> Ok, I will start digging into that. Just wanted to make sure before I got
>> into it too deep.
>>
>> BTW - This is very easy to reproduce. Just try to allocate more huge pages
>> than will fit into memory. I see this 'reclaim taking forever' behavior on
>> v5.1-rc5-mmotm-2019-04-19-14-53. Looks like it was there in v5.0 as well.
>
> I'd suspect this in should_continue_reclaim():
>
>         /* Consider stopping depending on scan and reclaim activity */
>         if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
>                 /*
>                  * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
>                  * full LRU list has been scanned and we are still failing
>                  * to reclaim pages. This full LRU scan is potentially
>                  * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
>                  */
>                 if (!nr_reclaimed && !nr_scanned)
>                         return false;
>
> And that for some reason, nr_scanned never becomes zero. But it's hard
> to figure out through all the layers of functions :/

I got back to looking into the direct reclaim/compaction stalls when
trying to allocate huge pages. As previously mentioned, the code is
looping for a long time in shrink_node(). The routine
should_continue_reclaim() returns true perhaps more often than it should.

As Vlastimil guessed, my debug code output below shows nr_scanned is
remaining non-zero for quite a while. This was on v5.2-rc6.
To help determine what is happening, I added a counter to determine how many times should_continue_reclaim was returning true within one calling context from shrink_node. Every 1,000,000 calls, I set a 'debug flag' to get a little more information about what is happening before the next call. Here is output from debug code with some comments about what the debug code is showing. [ 1477.583859] ***should_continue_reclaim: setting debug_nr_scanned call 1000000 [ 1477.584698] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... shrink_node calls shrink_node_memcg which calls shrink_list. shrink_list can increment sc->nr_scanned which is used to compute the value of nr_scanned passed to should_continue_reclaim. Here, we see that only one page is isolated for potential reclaim. [ 1477.585465] shrink_page_list: goto activate_locked 4 shrink_page_list determines that the page can not be reclaimed. The following code in shrink_page_list determines this. if (page_has_private(page)) { if (!try_to_release_page(page, sc->gfp_mask)) { /* Obviously, my debug code. */ if (sc->debug_nr_scanned) printk("shrink_page_list: goto activate_locked 4\n"); goto activate_locked; } [ 1477.586044] shrink_inactive_list: nr_reclaimed = 0 The bottom line is that page is not reclaimed. But, it was 'scanned' so .. [ 1477.586627] ***should_continue_reclaim: sc->nr_reclaimed 9 pages_for_compaction 1024 [ 1477.587546] nr_reclaimed 0, nr_scanned 1 should_continue_reclaim is called with nr_reclaimed 0, nr_scanned 1 [ 1477.588124] inactive_lru_pages 1 sc->nr_scanned 1000001 Since sc->nr_scanned is 1000001 for 1000001 calls, it looks like we scanned one page each time. Additional similar output without comments. [ 1511.200515] ***should_continue_reclaim: setting debug_nr_scanned call 2000000 [ 1511.201581] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... 
[ 1511.202615] shrink_page_list: goto activate_locked 4 [ 1511.203561] shrink_inactive_list: nr_reclaimed = 0 [ 1511.204422] ***should_continue_reclaim: sc->nr_reclaimed 10 pages_for_compaction 1024 [ 1511.205785] nr_reclaimed 0, nr_scanned 1 [ 1511.206569] inactive_lru_pages 1 sc->nr_scanned 2000001 [ 1543.982310] ***should_continue_reclaim: setting debug_nr_scanned call 3000000 [ 1543.983692] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... [ 1543.984645] shrink_page_list: goto activate_locked 4 [ 1543.985405] shrink_inactive_list: nr_reclaimed = 0 [ 1543.986386] ***should_continue_reclaim: sc->nr_reclaimed 11 pages_for_compaction 1024 [ 1543.987662] nr_reclaimed 0, nr_scanned 1 [ 1543.988423] inactive_lru_pages 1 sc->nr_scanned 3000001 [ 1577.104923] ***should_continue_reclaim: setting debug_nr_scanned call 4000000 [ 1577.105857] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... [ 1577.106752] shrink_page_list: goto activate_locked 4 [ 1577.107454] shrink_inactive_list: nr_reclaimed = 0 [ 1577.108163] ***should_continue_reclaim: sc->nr_reclaimed 11 pages_for_compaction 1024 [ 1577.109423] nr_reclaimed 0, nr_scanned 1 [ 1577.110205] inactive_lru_pages 1 sc->nr_scanned 4000001 [ 1614.075236] ***should_continue_reclaim: setting debug_nr_scanned call 5000000 [ 1614.076375] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... [ 1614.077410] shrink_page_list: goto activate_locked 4 [ 1614.078210] shrink_inactive_list: nr_reclaimed = 0 [ 1614.078913] ***should_continue_reclaim: sc->nr_reclaimed 13 pages_for_compaction 1024 [ 1614.079907] nr_reclaimed 0, nr_scanned 1 [ 1614.080496] inactive_lru_pages 1 sc->nr_scanned 5000001 [ 1650.360466] ***should_continue_reclaim: setting debug_nr_scanned call 6000000 [ 1650.361342] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... 
[ 1650.362147] shrink_page_list: goto activate_locked 4 [ 1650.362759] shrink_inactive_list: nr_reclaimed = 0 [ 1650.363685] ***should_continue_reclaim: sc->nr_reclaimed 13 pages_for_compaction 1024 [ 1650.364648] nr_reclaimed 0, nr_scanned 1 [ 1650.365222] inactive_lru_pages 0 sc->nr_scanned 6000001 [ 1685.061380] ***should_continue_reclaim: setting debug_nr_scanned call 7000000 [ 1685.062529] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... [ 1685.063615] shrink_page_list: goto activate_locked 4 [ 1685.064439] shrink_inactive_list: nr_reclaimed = 0 [ 1685.065244] ***should_continue_reclaim: sc->nr_reclaimed 14 pages_for_compaction 1024 [ 1685.066536] nr_reclaimed 0, nr_scanned 1 [ 1685.067356] inactive_lru_pages 1 sc->nr_scanned 7000001 [ 1719.103176] ***should_continue_reclaim: setting debug_nr_scanned call 8000000 [ 1719.104206] shrink_inactive_list: nr_taken 1 = isolate_lru_pages(1, ... [ 1719.105117] shrink_page_list: goto activate_locked 4 [ 1719.105781] shrink_inactive_list: nr_reclaimed = 0 [ 1719.106475] ***should_continue_reclaim: sc->nr_reclaimed 15 pages_for_compaction 1024 [ 1719.107499] nr_reclaimed 0, nr_scanned 1 [ 1719.108129] inactive_lru_pages 1 sc->nr_scanned 8000001 [ 1743.911025] ***should_continue_reclaim: false after 8711717 calls !nr_reclaimed && !nr_scanned Note that we do reclaim a 'few' pages along the way. After 8711717 calls should_continue_reclaim finally hits the condition where (!nr_reclaimed && !nr_scanned) and returns false. We were stuck looping in shrink node for about 5 minutes. Any additional insight, suggestions for additional debug, etc. would be appreciated. -- Mike Kravetz ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded? 2019-06-28 18:20 ` Mike Kravetz @ 2019-07-01 8:59 ` Mel Gorman 2019-07-02 3:15 ` Mike Kravetz 0 siblings, 1 reply; 17+ messages in thread From: Mel Gorman @ 2019-07-01 8:59 UTC (permalink / raw) To: Mike Kravetz Cc: Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel, Andrea Arcangeli, Johannes Weiner On Fri, Jun 28, 2019 at 11:20:42AM -0700, Mike Kravetz wrote: > On 4/24/19 7:35 AM, Vlastimil Babka wrote: > > On 4/23/19 6:39 PM, Mike Kravetz wrote: > >>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It > >>> looks like there is something wrong in the reclaim going on. > >> > >> Ok, I will start digging into that. Just wanted to make sure before I got > >> into it too deep. > >> > >> BTW - This is very easy to reproduce. Just try to allocate more huge pages > >> than will fit into memory. I see this 'reclaim taking forever' behavior on > >> v5.1-rc5-mmotm-2019-04-19-14-53. Looks like it was there in v5.0 as well. > > > > I'd suspect this in should_continue_reclaim(): > > > > /* Consider stopping depending on scan and reclaim activity */ > > if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) { > > /* > > * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the > > * full LRU list has been scanned and we are still failing > > * to reclaim pages. This full LRU scan is potentially > > * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed > > */ > > if (!nr_reclaimed && !nr_scanned) > > return false; > > > > And that for some reason, nr_scanned never becomes zero. But it's hard > > to figure out through all the layers of functions :/ > > I got back to looking into the direct reclaim/compaction stalls when > trying to allocate huge pages. As previously mentioned, the code is > looping for a long time in shrink_node(). The routine > should_continue_reclaim() returns true perhaps more often than it should. 
> > As Vlastmil guessed, my debug code output below shows nr_scanned is remaining > non-zero for quite a while. This was on v5.2-rc6. > I think it would be reasonable to have should_continue_reclaim allow an exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is less than SWAP_CLUSTER_MAX and no pages are being reclaimed. -- Mel Gorman SUSE Labs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded? 2019-07-01 8:59 ` Mel Gorman @ 2019-07-02 3:15 ` Mike Kravetz 2019-07-03 9:43 ` Mel Gorman 2019-07-10 18:42 ` Mike Kravetz 0 siblings, 2 replies; 17+ messages in thread From: Mike Kravetz @ 2019-07-02 3:15 UTC (permalink / raw) To: Mel Gorman Cc: Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel, Andrea Arcangeli, Johannes Weiner On 7/1/19 1:59 AM, Mel Gorman wrote: > On Fri, Jun 28, 2019 at 11:20:42AM -0700, Mike Kravetz wrote: >> On 4/24/19 7:35 AM, Vlastimil Babka wrote: >>> On 4/23/19 6:39 PM, Mike Kravetz wrote: >>>>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It >>>>> looks like there is something wrong in the reclaim going on. >>>> >>>> Ok, I will start digging into that. Just wanted to make sure before I got >>>> into it too deep. >>>> >>>> BTW - This is very easy to reproduce. Just try to allocate more huge pages >>>> than will fit into memory. I see this 'reclaim taking forever' behavior on >>>> v5.1-rc5-mmotm-2019-04-19-14-53. Looks like it was there in v5.0 as well. >>> >>> I'd suspect this in should_continue_reclaim(): >>> >>> /* Consider stopping depending on scan and reclaim activity */ >>> if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) { >>> /* >>> * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the >>> * full LRU list has been scanned and we are still failing >>> * to reclaim pages. This full LRU scan is potentially >>> * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed >>> */ >>> if (!nr_reclaimed && !nr_scanned) >>> return false; >>> >>> And that for some reason, nr_scanned never becomes zero. But it's hard >>> to figure out through all the layers of functions :/ >> >> I got back to looking into the direct reclaim/compaction stalls when >> trying to allocate huge pages. As previously mentioned, the code is >> looping for a long time in shrink_node(). The routine >> should_continue_reclaim() returns true perhaps more often than it should. 
>> >> As Vlastmil guessed, my debug code output below shows nr_scanned is remaining >> non-zero for quite a while. This was on v5.2-rc6. >> > > I think it would be reasonable to have should_continue_reclaim allow an > exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is > less than SWAP_CLUSTER_MAX and no pages are being reclaimed. Thanks Mel, I added such a check to should_continue_reclaim. However, it does not address the issue I am seeing. In that do-while loop in shrink_node, the scan priority is not raised (priority--). We can enter the loop with priority == DEF_PRIORITY and continue to loop for minutes as seen in my previous debug output. -- Mike Kravetz ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded? 2019-07-02 3:15 ` Mike Kravetz @ 2019-07-03 9:43 ` Mel Gorman 2019-07-03 23:54 ` Mike Kravetz 2019-07-10 18:42 ` Mike Kravetz 1 sibling, 1 reply; 17+ messages in thread From: Mel Gorman @ 2019-07-03 9:43 UTC (permalink / raw) To: Mike Kravetz Cc: Mel Gorman, Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel, Andrea Arcangeli, Johannes Weiner On Mon, Jul 01, 2019 at 08:15:50PM -0700, Mike Kravetz wrote: > On 7/1/19 1:59 AM, Mel Gorman wrote: > > On Fri, Jun 28, 2019 at 11:20:42AM -0700, Mike Kravetz wrote: > >> On 4/24/19 7:35 AM, Vlastimil Babka wrote: > >>> On 4/23/19 6:39 PM, Mike Kravetz wrote: > >>>>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It > >>>>> looks like there is something wrong in the reclaim going on. > >>>> > >>>> Ok, I will start digging into that. Just wanted to make sure before I got > >>>> into it too deep. > >>>> > >>>> BTW - This is very easy to reproduce. Just try to allocate more huge pages > >>>> than will fit into memory. I see this 'reclaim taking forever' behavior on > >>>> v5.1-rc5-mmotm-2019-04-19-14-53. Looks like it was there in v5.0 as well. > >>> > >>> I'd suspect this in should_continue_reclaim(): > >>> > >>> /* Consider stopping depending on scan and reclaim activity */ > >>> if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) { > >>> /* > >>> * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the > >>> * full LRU list has been scanned and we are still failing > >>> * to reclaim pages. This full LRU scan is potentially > >>> * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed > >>> */ > >>> if (!nr_reclaimed && !nr_scanned) > >>> return false; > >>> > >>> And that for some reason, nr_scanned never becomes zero. But it's hard > >>> to figure out through all the layers of functions :/ > >> > >> I got back to looking into the direct reclaim/compaction stalls when > >> trying to allocate huge pages. 
> >> As previously mentioned, the code is
> >> looping for a long time in shrink_node().  The routine
> >> should_continue_reclaim() returns true perhaps more often than it should.
> >>
> >> As Vlastimil guessed, my debug code output below shows nr_scanned is
> >> remaining non-zero for quite a while.  This was on v5.2-rc6.
> >>
> >
> > I think it would be reasonable to have should_continue_reclaim allow an
> > exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
> > less than SWAP_CLUSTER_MAX and no pages are being reclaimed.
>
> Thanks Mel,
>
> I added such a check to should_continue_reclaim.  However, it does not
> address the issue I am seeing.  In that do-while loop in shrink_node,
> the scan priority is not raised (priority--).  We can enter the loop
> with priority == DEF_PRIORITY and continue to loop for minutes as seen
> in my previous debug output.
>

Indeed. I'm getting knocked offline shortly so I didn't give this the
time it deserves but it appears that part of this problem is
hugetlb-specific when one node is full and can enter into this continual
loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
nr_scanned to be zero.

Have you considered one of the following as an option?

1. Always use the on-stack nodes_allowed in __nr_hugepages_store_common
   and copy nodes_states if necessary.  Add a bool parameter to
   alloc_pool_huge_page that is true when called from set_max_huge_pages.
   If an allocation from alloc_fresh_huge_page fails, clear the failing
   node from the mask so it's not retried; bail if the mask is empty.
   The consequences are that round-robin allocation of huge pages will
   be different if a node failed to allocate for transient reasons.

2. Alter the condition in should_continue_reclaim for
   __GFP_RETRY_MAYFAIL to consider if nr_scanned < SWAP_CLUSTER_MAX.
   Either raise priority (will interfere with kswapd though) or
   bail entirely.  Consequences may be that other __GFP_RETRY_MAYFAIL
   allocations do not want this behaviour.  There are a lot of users.

3. Move where __GFP_RETRY_MAYFAIL is set in a gfp_mask in mm/hugetlb.c.
   Strip the flag if an allocation fails on a node.  Consequences are
   that setting the required number of huge pages is more likely to
   return without all the huge pages set.

--
Mel Gorman
SUSE Labs
* Re: [Question] Should direct reclaim time be bounded? 2019-07-03 9:43 ` Mel Gorman @ 2019-07-03 23:54 ` Mike Kravetz 2019-07-04 11:09 ` Michal Hocko 0 siblings, 1 reply; 17+ messages in thread From: Mike Kravetz @ 2019-07-03 23:54 UTC (permalink / raw) To: Mel Gorman Cc: Mel Gorman, Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel, Andrea Arcangeli, Johannes Weiner On 7/3/19 2:43 AM, Mel Gorman wrote: > Indeed. I'm getting knocked offline shortly so I didn't give this the > time it deserves but it appears that part of this problem is > hugetlb-specific when one node is full and can enter into this continual > loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and > nr_scanned to be zero. Yes, I am not aware of any other large order allocations consistently made with __GFP_RETRY_MAYFAIL. But, I did not look too closely. Michal believes that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL. > Have you considered one of the following as an option? > > 1. Always use the on-stack nodes_allowed in __nr_hugepages_store_common > and copy nodes_states if necessary. Add a bool parameter to > alloc_pool_huge_page that is true when called from set_max_huge_pages. > If an allocation from alloc_fresh_huge_page, clear the failing node > from the mask so it's not retried, bail if the mask is empty. The > consequences are that round-robin allocation of huge pages will be > different if a node failed to allocate for transient reasons. That seems to be a more aggressive form of 3 below. > 2. Alter the condition in should_continue_reclaim for > __GFP_RETRY_MAYFAIL to consider if nr_scanned < SWAP_CLUSTER_MAX. > Either raise priority (will interfere with kswapd though) or > bail entirely. Consequences may be that other __GFP_RETRY_MAYFAIL > allocations do not want this behaviour. There are a lot of users. Due to high number of users, I am avoiding such a change. It would be hard to validate that such a change does not impact other users. > 3. 
> Move where __GFP_RETRY_MAYFAIL is set in a gfp_mask in mm/hugetlb.c.
> Strip the flag if an allocation fails on a node.  Consequences are
> that setting the required number of huge pages is more likely to
> return without all the huge pages set.

We are actually using a form of this in our distro kernel. It works
quite well on the older (4.11-based) distro kernel. My plan was to push
this upstream. However, when I tested this on recent upstream kernels,
I encountered long stalls associated with the first __GFP_RETRY_MAYFAIL
allocation failure. That is what prompted me to ask this question/start
this thread. The distro kernel would see stalls taking tens of seconds;
upstream would see stalls of several minutes. Much has changed since
4.11, so I was trying to figure out what might be causing this change
in behavior.

BTW, here is the patch I was testing. It actually has additional code
to switch between __GFP_RETRY_MAYFAIL and __GFP_NORETRY and back, to
hopefully take transient conditions into account.

From 528c52397301f02acb614c610bd65f0f9a107481 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Wed, 3 Jul 2019 13:36:24 -0700
Subject: [PATCH] hugetlbfs: don't retry when pool page allocations start to fail

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system.  If
nodes are not equal, it is quite possible for one node to fill up
before the others.  When this happens, the code still attempts to
allocate pages from the full node.  This results in calls to direct
reclaim and compaction which slow things down considerably.

When allocating pool pages, note the state of the previous allocation
for each node.  If the previous allocation failed, do not use the
aggressive retry algorithm on successive attempts.  The allocation will
still succeed if there is memory available, but it will not try as hard
to free up memory.
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> --- mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 77 insertions(+), 10 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ede7e7f5d1ab..f3c50344a9b4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1405,12 +1405,27 @@ pgoff_t __basepage_index(struct page *page) } static struct page *alloc_buddy_huge_page(struct hstate *h, - gfp_t gfp_mask, int nid, nodemask_t *nmask) + gfp_t gfp_mask, int nid, nodemask_t *nmask, + nodemask_t *node_alloc_noretry) { int order = huge_page_order(h); struct page *page; + bool alloc_try_hard = true; - gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; + /* + * By default we always try hard to allocate the page with + * __GFP_RETRY_MAYFAIL flag. However, if we are allocating pages in + * a loop (to adjust global huge page counts) and previous allocation + * failed, do not continue to try hard on the same node. Use the + * node_alloc_noretry bitmap to manage this state information. + */ + if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry)) + alloc_try_hard = false; + gfp_mask |= __GFP_COMP|__GFP_NOWARN; + if (alloc_try_hard) + gfp_mask |= __GFP_RETRY_MAYFAIL; + else + gfp_mask |= __GFP_NORETRY; if (nid == NUMA_NO_NODE) nid = numa_mem_id(); page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask); @@ -1419,6 +1434,22 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, else __count_vm_event(HTLB_BUDDY_PGALLOC_FAIL); + /* + * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this + * indicates an overall state change. Clear bit so that we resume + * normal 'try hard' allocations. + */ + if (node_alloc_noretry && page && !alloc_try_hard) + node_clear(nid, *node_alloc_noretry); + + /* + * If we tried hard to get a page but failed, set bit so that + * subsequent attempts will not try as hard until there is an + * overall state change. 
+ */ + if (node_alloc_noretry && !page && alloc_try_hard) + node_set(nid, *node_alloc_noretry); + return page; } @@ -1427,7 +1458,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h, * should use this function to get new hugetlb pages */ static struct page *alloc_fresh_huge_page(struct hstate *h, - gfp_t gfp_mask, int nid, nodemask_t *nmask) + gfp_t gfp_mask, int nid, nodemask_t *nmask, + nodemask_t *node_alloc_noretry) { struct page *page; @@ -1435,7 +1467,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h, page = alloc_gigantic_page(h, gfp_mask, nid, nmask); else page = alloc_buddy_huge_page(h, gfp_mask, - nid, nmask); + nid, nmask, node_alloc_noretry); if (!page) return NULL; @@ -1450,14 +1482,16 @@ static struct page *alloc_fresh_huge_page(struct hstate *h, * Allocates a fresh page to the hugetlb allocator pool in the node interleaved * manner. */ -static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed) +static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed, + nodemask_t *node_alloc_noretry) { struct page *page; int nr_nodes, node; gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE; for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) { - page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed); + page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed, + node_alloc_noretry); if (page) break; } @@ -1601,7 +1635,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask, goto out_unlock; spin_unlock(&hugetlb_lock); - page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask); + page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL); if (!page) return NULL; @@ -1637,7 +1671,7 @@ struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, if (hstate_is_gigantic(h)) return NULL; - page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask); + page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL); if (!page) return NULL; @@ -2207,13 +2241,31 
@@ static void __init gather_bootmem_prealloc(void) static void __init hugetlb_hstate_alloc_pages(struct hstate *h) { unsigned long i; + nodemask_t *node_alloc_noretry; + + if (!hstate_is_gigantic(h)) { + /* + * bit mask controlling how hard we retry per-node + * allocations. + */ + node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry), + GFP_KERNEL | __GFP_NORETRY); + } else { + /* allocations done at boot time */ + node_alloc_noretry = NULL; + } + + /* bit mask controlling how hard we retry per-node allocations */ + if (node_alloc_noretry) + nodes_clear(*node_alloc_noretry); for (i = 0; i < h->max_huge_pages; ++i) { if (hstate_is_gigantic(h)) { if (!alloc_bootmem_huge_page(h)) break; } else if (!alloc_pool_huge_page(h, - &node_states[N_MEMORY])) + &node_states[N_MEMORY], + node_alloc_noretry)) break; cond_resched(); } @@ -2225,6 +2277,9 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h) h->max_huge_pages, buf, i); h->max_huge_pages = i; } + + if (node_alloc_noretry) + kfree(node_alloc_noretry); } static void __init hugetlb_init_hstates(void) @@ -2323,6 +2378,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, nodemask_t *nodes_allowed) { unsigned long min_count, ret; + NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, + GFP_KERNEL | __GFP_NORETRY); + + /* bit mask controlling how hard we retry per-node allocations */ + if (node_alloc_noretry) + nodes_clear(*node_alloc_noretry); spin_lock(&hugetlb_lock); @@ -2356,6 +2417,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) { if (count > persistent_huge_pages(h)) { spin_unlock(&hugetlb_lock); + if (node_alloc_noretry) + NODEMASK_FREE(node_alloc_noretry); return -EINVAL; } /* Fall through to decrease pool */ @@ -2388,7 +2451,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, /* yield cpu to avoid soft lockup */ cond_resched(); - ret = 
alloc_pool_huge_page(h, nodes_allowed); + ret = alloc_pool_huge_page(h, nodes_allowed, + node_alloc_noretry); spin_lock(&hugetlb_lock); if (!ret) goto out; @@ -2429,6 +2493,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid, h->max_huge_pages = persistent_huge_pages(h); spin_unlock(&hugetlb_lock); + if (node_alloc_noretry) + NODEMASK_FREE(node_alloc_noretry); + return 0; } -- 2.20.1 ^ permalink raw reply related [flat|nested] 17+ messages in thread
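The retry-state handling in the patch above can be modeled outside the kernel. The sketch below is a userspace C approximation: fake_alloc() and its per-node capacity counters are stand-ins for __alloc_pages_nodemask and real memory pressure, and alloc_pool_page() mirrors only the node_alloc_noretry bookkeeping from alloc_buddy_huge_page(), not the kernel function itself.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 8

/* Per-node "do not try hard" state, mirroring node_alloc_noretry. */
static bool node_noretry[MAX_NODES];

/* Toy capacity model standing in for the buddy allocator. */
static int node_free[MAX_NODES];

static bool fake_alloc(int nid, bool try_hard)
{
	(void)try_hard;		/* a real allocator would reclaim harder here */
	if (node_free[nid] > 0) {
		node_free[nid]--;
		return true;
	}
	return false;
}

/* Mirrors the retry-state bookkeeping in alloc_buddy_huge_page(). */
static bool alloc_pool_page(int nid)
{
	/* __GFP_RETRY_MAYFAIL-style vs __GFP_NORETRY-style attempt */
	bool try_hard = !node_noretry[nid];
	bool ok = fake_alloc(nid, try_hard);

	/* Success without trying hard indicates an overall state change:
	 * clear the bit so normal 'try hard' allocations resume. */
	if (ok && !try_hard)
		node_noretry[nid] = false;

	/* A hard attempt failed: stop trying hard on this node until
	 * something changes. */
	if (!ok && try_hard)
		node_noretry[nid] = true;

	return ok;
}
```

The point of the state machine is that a full node is only probed cheaply on subsequent passes, and one cheap success is enough to re-enable aggressive retries.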
* Re: [Question] Should direct reclaim time be bounded?
  2019-07-03 23:54 ` Mike Kravetz
@ 2019-07-04 11:09   ` Michal Hocko
  2019-07-04 15:11     ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2019-07-04 11:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mel Gorman, Mel Gorman, Vlastimil Babka, linux-mm, linux-kernel,
	Andrea Arcangeli, Johannes Weiner

On Wed 03-07-19 16:54:35, Mike Kravetz wrote:
> On 7/3/19 2:43 AM, Mel Gorman wrote:
> > Indeed. I'm getting knocked offline shortly so I didn't give this the
> > time it deserves but it appears that part of this problem is
> > hugetlb-specific when one node is full and can enter into this continual
> > loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
> > nr_scanned to be zero.
>
> Yes, I am not aware of any other large order allocations consistently made
> with __GFP_RETRY_MAYFAIL. But, I did not look too closely. Michal believes
> that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL.

Yes. The argument is that this is controllable by an admin and failures
should be prevented as much as possible. I didn't get to understand the
should_continue_reclaim part of the problem, but I have a strong feeling
that the __GFP_RETRY_MAYFAIL handling at that layer is not correct. What
happens if it is simply removed and we rely only on the retry mechanism
from the page allocator instead? Is the success rate reduced
considerably?

--
Michal Hocko
SUSE Labs
* Re: [Question] Should direct reclaim time be bounded?
  2019-07-04 11:09   ` Michal Hocko
@ 2019-07-04 15:11     ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2019-07-04 15:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Mel Gorman, Vlastimil Babka, linux-mm, linux-kernel,
	Andrea Arcangeli, Johannes Weiner

On 7/4/19 4:09 AM, Michal Hocko wrote:
> On Wed 03-07-19 16:54:35, Mike Kravetz wrote:
>> On 7/3/19 2:43 AM, Mel Gorman wrote:
>>> Indeed. I'm getting knocked offline shortly so I didn't give this the
>>> time it deserves but it appears that part of this problem is
>>> hugetlb-specific when one node is full and can enter into this continual
>>> loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
>>> nr_scanned to be zero.
>>
>> Yes, I am not aware of any other large order allocations consistently made
>> with __GFP_RETRY_MAYFAIL. But, I did not look too closely. Michal believes
>> that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL.
>
> Yes. The argument is that this is controllable by an admin and failures
> should be prevented as much as possible. I didn't get to understand the
> should_continue_reclaim part of the problem, but I have a strong feeling
> that the __GFP_RETRY_MAYFAIL handling at that layer is not correct. What
> happens if it is simply removed and we rely only on the retry mechanism
> from the page allocator instead? Is the success rate reduced
> considerably?

It certainly will be reduced. I 'think' it will be hard to predict how
much it will be reduced, as this will depend on the state of memory
usage and fragmentation at the time of the attempt. I can try to
measure this, but it will be a few days due to the U.S. holiday.

--
Mike Kravetz
* Re: [Question] Should direct reclaim time be bounded?
  2019-07-02  3:15 ` Mike Kravetz
  2019-07-03  9:43 ` Mel Gorman
@ 2019-07-10 18:42 ` Mike Kravetz
  2019-07-10 19:44   ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-07-10 18:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Vlastimil Babka, Michal Hocko, Mel Gorman, linux-mm, linux-kernel,
	Johannes Weiner

On 7/7/19 10:19 PM, Hillf Danton wrote:
> On Mon, 01 Jul 2019 20:15:51 -0700 Mike Kravetz wrote:
>> On 7/1/19 1:59 AM, Mel Gorman wrote:
>>>
>>> I think it would be reasonable to have should_continue_reclaim allow an
>>> exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
>>> less than SWAP_CLUSTER_MAX and no pages are being reclaimed.
>>
>> Thanks Mel,
>>
>> I added such a check to should_continue_reclaim.  However, it does not
>> address the issue I am seeing.  In that do-while loop in shrink_node,
>> the scan priority is not raised (priority--).  We can enter the loop
>> with priority == DEF_PRIORITY and continue to loop for minutes as seen
>> in my previous debug output.
>>
> Does it help to raise priority in your case?

Thanks Hillf, and sorry for the delay in responding; I have been AFK.

I am not sure if you wanted to try this somehow in addition to Mel's
suggestion, or alone. Unfortunately, such a change actually causes
worse behavior.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2543,11 +2543,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> 	unsigned long pages_for_compaction;
> 	unsigned long inactive_lru_pages;
> 	int z;
> +	bool costly_fg_reclaim = false;
>
> 	/* If not in reclaim/compaction mode, stop */
> 	if (!in_reclaim_compaction(sc))
> 		return false;
>
> +	/* Let compact determine what to do for high order allocators */
> +	costly_fg_reclaim = sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> +			!current_is_kswapd();
> +	if (costly_fg_reclaim)
> +		goto check_compact;

This goto makes us skip the 'if (!nr_reclaimed && !nr_scanned)' test.

> +
> 	/* Consider stopping depending on scan and reclaim activity */
> 	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
> 		/*
> @@ -2571,6 +2578,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> 		return false;
> 	}
>
> +check_compact:
> 	/*
> 	 * If we have not reclaimed enough pages for compaction and the
> 	 * inactive lists are large enough, continue reclaiming

It is quite easy to hit the condition where:
	nr_reclaimed == 0 && nr_scanned == 0
is true, but we skip the previous test, and the compaction check:
	sc->nr_reclaimed < pages_for_compaction &&
	inactive_lru_pages > pages_for_compaction
is true, so we return true before the below check of costly_fg_reclaim.

> @@ -2583,6 +2591,9 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> 		inactive_lru_pages > pages_for_compaction)
> 		return true;
>
> +	if (costly_fg_reclaim)
> +		return false;
> +
> 	/* If compaction would go ahead or the allocation would succeed, stop */
> 	for (z = 0; z <= sc->reclaim_idx; z++) {
> 		struct zone *zone = &pgdat->node_zones[z];
> --
>

As Michal suggested, I'm going to do some testing to see what impact
dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
will have on the number of pages allocated.

--
Mike Kravetz
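The ordering problem described here can be boiled down to a couple of predicates. The toy model below is illustrative userspace C, not the kernel functions: the struct fields merely echo names from the thread, and the two functions contrast the check order being objected to with a dryrun-aware ordering in the spirit of Hillf's respin.

```c
#include <assert.h>
#include <stdbool.h>

/* Snapshot of one reclaim pass; names echo the thread, values are toy. */
struct scan_state {
	unsigned long nr_scanned;		/* scanned in the last pass */
	unsigned long nr_reclaimed;		/* reclaimed in the last pass */
	unsigned long inactive_lru_pages;	/* still on the inactive lists */
	unsigned long sc_nr_reclaimed;		/* total reclaimed so far */
	unsigned long pages_for_compaction;	/* compact_gap(sc->order) */
};

/* Check order being objected to: the "inactive lists are large enough"
 * test runs first and can return true before the dryrun
 * (nr_scanned == 0 && nr_reclaimed == 0) is ever examined. */
static bool continue_compaction_first(const struct scan_state *s)
{
	if (s->sc_nr_reclaimed < s->pages_for_compaction &&
	    s->inactive_lru_pages > s->pages_for_compaction)
		return true;
	return s->nr_scanned && s->nr_reclaimed;
}

/* Dryrun-aware ordering: a pass that scanned and reclaimed nothing
 * ends reclaim even with plenty of inactive pages. */
static bool continue_dryrun_checked(const struct scan_state *s)
{
	return s->inactive_lru_pages > s->pages_for_compaction &&
	       s->nr_scanned && s->nr_reclaimed;
}
```

With a dryrun pass on a node whose inactive list is still large, the first ordering keeps reclaim looping while the second bails out, which is the behavioral difference being debated.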
* Re: [Question] Should direct reclaim time be bounded? 2019-07-10 18:42 ` Mike Kravetz @ 2019-07-10 19:44 ` Michal Hocko 2019-07-10 23:36 ` Mike Kravetz 0 siblings, 1 reply; 17+ messages in thread From: Michal Hocko @ 2019-07-10 19:44 UTC (permalink / raw) To: Mike Kravetz Cc: Hillf Danton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel, Johannes Weiner On Wed 10-07-19 11:42:40, Mike Kravetz wrote: [...] > As Michal suggested, I'm going to do some testing to see what impact > dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations > will have on the number of pages allocated. Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the allocation request. I meant to drop the special casing of the flag in should_continue_reclaim. I really have hard time to argue for this special casing TBH. The flag is meant to retry harder but that shouldn't be reduced to a single reclaim attempt because that alone doesn't really help much with the high order allocation. It is more about compaction to be retried harder. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [Question] Should direct reclaim time be bounded?
  2019-07-10 19:44   ` Michal Hocko
@ 2019-07-10 23:36     ` Mike Kravetz
  2019-07-11  7:12       ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-07-10 23:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hillf Danton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel,
	Johannes Weiner

On 7/10/19 12:44 PM, Michal Hocko wrote:
> On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
> [...]
>> As Michal suggested, I'm going to do some testing to see what impact
>> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
>> will have on the number of pages allocated.
>
> Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
> allocation request. I meant to drop the special casing of the flag in
> should_continue_reclaim. I really have hard time to argue for this
> special casing TBH. The flag is meant to retry harder but that shouldn't
> be reduced to a single reclaim attempt because that alone doesn't really
> help much with the high order allocation. It is more about compaction to
> be retried harder.

Thanks Michal. That is indeed what you suggested earlier. I remembered
incorrectly. Sorry.

Removing the special casing for __GFP_RETRY_MAYFAIL in should_continue_reclaim
implies that it will return false if nothing was reclaimed (nr_reclaimed == 0)
in the previous pass.

When I make such a modification and test, I see long stalls as a result
of should_compact_retry returning true too often. On a system I am currently
testing, should_compact_retry has returned true 36000000 times. My guess
is that this may stall forever. Vlastimil previously asked about this behavior,
so I am capturing the reason. Like before [1], should_compact_retry is
returning true mostly because compaction_withdrawn() returns COMPACT_DEFERRED.

      Total  36000000
   35437500  COMPACT_DEFERRED
     562500  COMPACT_PARTIAL_SKIPPED

[1] https://lkml.org/lkml/2019/6/5/643

--
Mike Kravetz
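These numbers describe a retry loop with no forward progress. One way to make the retry bounded, in the spirit of the question that opens this thread, is to cap consecutive deferred results. The sketch below is purely illustrative userspace C: should_retry_bounded, MAX_DEFERRED_RETRIES, and the trimmed-down result enum are hypothetical stand-ins, not the kernel's should_compact_retry logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative subset of the compaction result codes seen above. */
enum compact_result {
	COMPACT_SUCCESS,
	COMPACT_DEFERRED,
	COMPACT_PARTIAL_SKIPPED,
};

#define MAX_DEFERRED_RETRIES 16		/* hypothetical bound */

/*
 * Retry while compaction reports something other than success, but give
 * up after a bounded run of COMPACT_DEFERRED answers instead of retrying
 * tens of millions of times.  Any other result resets the run counter.
 */
static bool should_retry_bounded(enum compact_result r, int *deferred_run)
{
	if (r == COMPACT_DEFERRED)
		return ++*deferred_run <= MAX_DEFERRED_RETRIES;
	*deferred_run = 0;
	return r != COMPACT_SUCCESS;
}
```

Under this policy an endless stream of deferred results stops being retried after MAX_DEFERRED_RETRIES attempts, while any sign of progress re-arms the bound.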
* Re: [Question] Should direct reclaim time be bounded?
  2019-07-10 23:36     ` Mike Kravetz
@ 2019-07-11  7:12       ` Michal Hocko
  2019-07-12  9:49         ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2019-07-11 7:12 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel,
	Johannes Weiner

On Wed 10-07-19 16:36:58, Mike Kravetz wrote:
> On 7/10/19 12:44 PM, Michal Hocko wrote:
> > On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
> > [...]
> >> As Michal suggested, I'm going to do some testing to see what impact
> >> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
> >> will have on the number of pages allocated.
> >
> > Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
> > allocation request. I meant to drop the special casing of the flag in
> > should_continue_reclaim. I really have hard time to argue for this
> > special casing TBH. The flag is meant to retry harder but that shouldn't
> > be reduced to a single reclaim attempt because that alone doesn't really
> > help much with the high order allocation. It is more about compaction to
> > be retried harder.
>
> Thanks Michal. That is indeed what you suggested earlier. I remembered
> incorrectly. Sorry.
>
> Removing the special casing for __GFP_RETRY_MAYFAIL in should_continue_reclaim
> implies that it will return false if nothing was reclaimed (nr_reclaimed == 0)
> in the previous pass.
>
> When I make such a modification and test, I see long stalls as a result
> of should_compact_retry returning true too often. On a system I am currently
> testing, should_compact_retry has returned true 36000000 times. My guess
> is that this may stall forever. Vlastimil previously asked about this behavior,
> so I am capturing the reason. Like before [1], should_compact_retry is
> returning true mostly because compaction_withdrawn() returns COMPACT_DEFERRED.

This smells like a problem to me. But somebody more familiar with
compaction should comment.

>
>       Total  36000000
>    35437500  COMPACT_DEFERRED
>      562500  COMPACT_PARTIAL_SKIPPED
>
> [1] https://lkml.org/lkml/2019/6/5/643
> --
> Mike Kravetz

--
Michal Hocko
SUSE Labs
* Re: [Question] Should direct reclaim time be bounded?
  2019-07-11  7:12       ` Michal Hocko
@ 2019-07-12  9:49         ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2019-07-12 9:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Hillf Danton, Vlastimil Babka, linux-mm,
	linux-kernel, Johannes Weiner

On Thu, Jul 11, 2019 at 09:12:45AM +0200, Michal Hocko wrote:
> On Wed 10-07-19 16:36:58, Mike Kravetz wrote:
> > On 7/10/19 12:44 PM, Michal Hocko wrote:
> > > On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
> > > [...]
> > >> As Michal suggested, I'm going to do some testing to see what impact
> > >> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
> > >> will have on the number of pages allocated.
> > >
> > > Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
> > > allocation request. I meant to drop the special casing of the flag in
> > > should_continue_reclaim. I really have hard time to argue for this
> > > special casing TBH. The flag is meant to retry harder but that shouldn't
> > > be reduced to a single reclaim attempt because that alone doesn't really
> > > help much with the high order allocation. It is more about compaction to
> > > be retried harder.
> >
> > Thanks Michal. That is indeed what you suggested earlier. I remembered
> > incorrectly. Sorry.
> >
> > Removing the special casing for __GFP_RETRY_MAYFAIL in should_continue_reclaim
> > implies that it will return false if nothing was reclaimed (nr_reclaimed == 0)
> > in the previous pass.
> >
> > When I make such a modification and test, I see long stalls as a result
> > of should_compact_retry returning true too often. On a system I am currently
> > testing, should_compact_retry has returned true 36000000 times. My guess
> > is that this may stall forever. Vlastimil previously asked about this behavior,
> > so I am capturing the reason. Like before [1], should_compact_retry is
> > returning true mostly because compaction_withdrawn() returns COMPACT_DEFERRED.
>
> This smells like a problem to me. But somebody more familiar with
> compaction should comment.
>

Examine in should_compact_retry whether it is retrying because
compaction_zonelist_suitable is true. Looking at it now, it would not
necessarily do the right thing, because any non-skipped zone would make
it eligible, which is too strong a condition as COMPACT_SKIPPED is not
reliably set. If that is the case, it would be reasonable to remove
"ret = compaction_zonelist_suitable(ac, order, alloc_flags);" and the
implementation of compaction_zonelist_suitable entirely as part of your
fix.

--
Mel Gorman
SUSE Labs