* [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
@ 2015-01-14 11:36 ` Vinayak Menon
  0 siblings, 0 replies; 60+ messages in thread
From: Vinayak Menon @ 2015-01-14 11:36 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: akpm, hannes, vdavydov, mhocko, mgorman, minchan, Vinayak Menon

It is observed that multiple tasks sometimes get blocked for a long time in the
congestion_wait loop below, in shrink_inactive_list. This happens because the
vm_stat values are not synced.

(__schedule) from [<c0a03328>]
(schedule_timeout) from [<c0a04940>]
(io_schedule_timeout) from [<c01d585c>]
(congestion_wait) from [<c01cc9d8>]
(shrink_inactive_list) from [<c01cd034>]
(shrink_zone) from [<c01cdd08>]
(try_to_free_pages) from [<c01c442c>]
(__alloc_pages_nodemask) from [<c01f1884>]
(new_slab) from [<c09fcf60>]
(__slab_alloc) from [<c01f1a6c>]

In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned 14,
zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was set, and
this resulted in too_many_isolated returning true. But one CPU's pageset
vm_stat_diff had NR_ISOLATED_FILE as "-14", so the actual isolated count was
zero. As there were no further updates to NR_ISOLATED_FILE and the
vmstat_update deferred work had not been scheduled yet, 7 tasks were spinning
in the congestion wait loop for around 4 seconds, in the direct reclaim path.

This patch uses zone_page_state_snapshot instead, but restricts its usage to
avoid a performance penalty.
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
---
 mm/vmscan.c | 56 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 37 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8772b..266551f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1392,6 +1392,32 @@ int isolate_lru_page(struct page *page)
 	return ret;
 }
 
+static int __too_many_isolated(struct zone *zone, int file,
+	struct scan_control *sc, int safe)
+{
+	unsigned long inactive, isolated;
+
+	if (safe) {
+		inactive = zone_page_state_snapshot(zone,
+				NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state_snapshot(zone,
+				NR_ISOLATED_ANON + file);
+	} else {
+		inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file);
+		isolated = zone_page_state(zone, NR_ISOLATED_ANON + file);
+	}
+
+	/*
+	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
+	 * won't get blocked by normal direct-reclaimers, forming a circular
+	 * deadlock.
+	 */
+	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+		inactive >>= 3;
+
+	return isolated > inactive;
+}
+
 /*
  * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and
  * then get resheduled. When there are massive number of tasks doing page
@@ -1400,33 +1426,22 @@ int isolate_lru_page(struct page *page)
  * unnecessary swapping, thrashing and OOM.
  */
 static int too_many_isolated(struct zone *zone, int file,
-	struct scan_control *sc)
+	struct scan_control *sc, int safe)
 {
-	unsigned long inactive, isolated;
-
 	if (current_is_kswapd())
 		return 0;
 
 	if (!global_reclaim(sc))
 		return 0;
 
-	if (file) {
-		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
-		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
-	} else {
-		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
-		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+	if (unlikely(__too_many_isolated(zone, file, sc, 0))) {
+		if (safe)
+			return __too_many_isolated(zone, file, sc, safe);
+		else
+			return 1;
 	}
 
-	/*
-	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they
-	 * won't get blocked by normal direct-reclaimers, forming a circular
-	 * deadlock.
-	 */
-	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
-		inactive >>= 3;
-
-	return isolated > inactive;
+	return 0;
 }
 
 static noinline_for_stack void
@@ -1516,15 +1531,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	unsigned long nr_immediate = 0;
 	isolate_mode_t isolate_mode = 0;
 	int file = is_file_lru(lru);
+	int safe = 0;
 	struct zone *zone = lruvec_zone(lruvec);
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
-	while (unlikely(too_many_isolated(zone, file, sc))) {
+	while (unlikely(too_many_isolated(zone, file, sc, safe))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+
+		safe = 1;
 	}
 
 	lru_add_drain();
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-14 11:36 ` Vinayak Menon
@ 2015-01-14 16:50 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-14 16:50 UTC (permalink / raw)
To: Vinayak Menon
Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan

On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
[...]
> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
> returned 92, and GFP_IOFS was set, and this resulted
> in too_many_isolated returning true. But one CPU's
> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the
> actual isolated count was zero. As there were no further
> updates to NR_ISOLATED_FILE and the vmstat_update deferred work
> had not been scheduled yet, 7 tasks were spinning in the
> congestion wait loop for around 4 seconds, in the direct
> reclaim path.

Not syncing for such a long time doesn't sound right. I am not familiar
with the vmstat syncing, but sysctl_stat_interval is HZ, so it should
happen much more often than every 4 seconds.
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-14 16:50 ` Michal Hocko
@ 2015-01-15 17:24 ` Vinayak Menon
  -1 siblings, 0 replies; 60+ messages in thread
From: Vinayak Menon @ 2015-01-15 17:24 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan

On 01/14/2015 10:20 PM, Michal Hocko wrote:
> On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
> [...]
>> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
>> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
>> returned 92, and GFP_IOFS was set, and this resulted
>> in too_many_isolated returning true. But one CPU's
>> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the
>> actual isolated count was zero. As there were no further
>> updates to NR_ISOLATED_FILE and the vmstat_update deferred work
>> had not been scheduled yet, 7 tasks were spinning in the
>> congestion wait loop for around 4 seconds, in the direct
>> reclaim path.
>
> Not syncing for such a long time doesn't sound right. I am not familiar
> with the vmstat syncing, but sysctl_stat_interval is HZ, so it should
> happen much more often than every 4 seconds.
>

Though the interval is HZ, since vmstat_work is declared as a deferrable
work, IIUC the timer trigger can be deferred to the next non-deferrable
timer expiry on a CPU which is idle. This results in the vmstat syncing
on an idle CPU being delayed by seconds. Maybe in most cases this
behavior is fine, except in cases like this.

Even in the usual cases, where the timer triggers in 1-2 secs, is it
fine to let the tasks in the reclaim path wait that long unnecessarily
when there isn't any real congestion?
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-15 17:24 ` Vinayak Menon
@ 2015-01-16 15:49 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-16 15:49 UTC (permalink / raw)
To: Vinayak Menon
Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan, Christoph Lameter

On Thu 15-01-15 22:54:20, Vinayak Menon wrote:
> On 01/14/2015 10:20 PM, Michal Hocko wrote:
> >On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
> >[...]
> >>In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
> >>had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
> >>returned 92, and GFP_IOFS was set, and this resulted
> >>in too_many_isolated returning true. But one CPU's
> >>pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the
> >>actual isolated count was zero. As there were no further
> >>updates to NR_ISOLATED_FILE and the vmstat_update deferred work
> >>had not been scheduled yet, 7 tasks were spinning in the
> >>congestion wait loop for around 4 seconds, in the direct
> >>reclaim path.
> >
> >Not syncing for such a long time doesn't sound right. I am not familiar
> >with the vmstat syncing, but sysctl_stat_interval is HZ, so it should
> >happen much more often than every 4 seconds.
> >
>
> Though the interval is HZ, since vmstat_work is declared as a deferrable
> work, IIUC the timer trigger can be deferred to the next non-deferrable
> timer expiry on a CPU which is idle. This results in the vmstat syncing
> on an idle CPU being delayed by seconds. Maybe in most cases this
> behavior is fine, except in cases like this.

I am not sure I understand the above, because CPU being idle doesn't
seem important AFAICS. Anyway, I have checked the current code, which
changed quite recently in 7cc36bbddde5 ("vmstat: on-demand vmstat
workers V8"). Let's CC Christoph (the thread starts here:
http://thread.gmane.org/gmane.linux.kernel.mm/127229).

__round_jiffies_relative can easily make the timeout 2HZ from 1HZ. Now
we have vmstat_shepherd, which waits to be queued and then waits to run.
When it finally runs, it only queues the per-CPU vmstat_work, which can
also end up being 2HZ for some CPUs. So we can indeed have 4 seconds
spent just on queuing, not even mentioning work item latencies,
especially when workers are overloaded (e.g. by fs work items) and no
additional workers can be created (e.g. due to memory pressure), so the
items are processed only by the workqueue rescuer. Then latencies would
grow a lot. We have seen an issue where the rescuer had to process
thousands of work items because all workers were blocked on memory
allocation, see http://thread.gmane.org/gmane.linux.kernel/1816452,
which is in mainline already as 008847f66c38 ("workqueue: allow rescuer
thread to do more work."). That patch reduces the latency considerably,
but it doesn't remove it completely.

If the time between two syncs is large, then the per-CPU drift might be
really large as well. Isn't this going to hurt other places where we
rely on the stats as well? In this particular case the reclaimers are
throttled because they see too many isolated pages, which was just a
result of the per-CPU drift and a small LRU list. The system seems to be
under serious memory pressure already, because there is basically no
file cache. If the reclaimers weren't throttled, they would probably
fail the reclaim and trigger the OOM killer, which might help to resolve
the situation. Instead, the reclaimers are throttled.

Why can't we simply update the global counters from vmstat_shepherd
directly? Sure, we would access remote CPU counters, but that wouldn't
happen often. That still wouldn't be enough, because there is the
workqueue latency. So why can't we do the whole shepherd thing from a
kernel thread instead? If the NUMA zone->pageset->pcp thing has to be
done locally, then it could be a per-node kthread rather than being part
of the unrelated vmstat code.

I might be missing obvious things, because I still haven't digested the
code completely, but this whole thing seems overly complicated and I do
not see a good reason for that. If the primary motivation was CPU
isolation, then kthreads would sound like a better idea, because you can
play with the explicit affinity from userspace.

> Even in the usual cases, where the timer triggers in 1-2 secs, is it
> fine to let the tasks in the reclaim path wait that long unnecessarily
> when there isn't any real congestion?

But how do we find that out?
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 15:49 ` Michal Hocko
@ 2015-01-16 17:57 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-16 17:57 UTC (permalink / raw)
To: Vinayak Menon
Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan, Christoph Lameter

On Fri 16-01-15 16:49:22, Michal Hocko wrote:
[...]
> Why can't we simply update the global counters from vmstat_shepherd
> directly?

OK, I should have checked the updating paths... This would be racy, so
updating from remote is not an option without additional trickery (like
retries etc.) :/
--
Michal Hocko
SUSE Labs
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 17:57 ` Michal Hocko
@ 2015-01-16 19:17 ` Christoph Lameter
  -1 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2015-01-16 19:17 UTC (permalink / raw)
To: Michal Hocko
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan

On Fri, 16 Jan 2015, Michal Hocko wrote:

> On Fri 16-01-15 16:49:22, Michal Hocko wrote:
> [...]
> > Why can't we simply update the global counters from vmstat_shepherd
> > directly?
>
> OK, I should have checked the updating paths... This would be racy, so
> updating from remote is not an option without additional trickery (like
> retries etc.) :/

You can do that if you have a way to ensure that the other cpu does not
access the counter, f.e. if the other cpu is staying in user space all
the time or is guaranteed to be idle.
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-16 15:49 ` Michal Hocko
@ 2015-01-17 15:18 ` Vinayak Menon
  -1 siblings, 0 replies; 60+ messages in thread
From: Vinayak Menon @ 2015-01-17 15:18 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan, Christoph Lameter

On 01/16/2015 09:19 PM, Michal Hocko wrote:
> On Thu 15-01-15 22:54:20, Vinayak Menon wrote:
>> On 01/14/2015 10:20 PM, Michal Hocko wrote:
>>> On Wed 14-01-15 17:06:59, Vinayak Menon wrote:
>>> [...]
>>>> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE)
>>>> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE)
>>>> returned 92, and GFP_IOFS was set, and this resulted
>>>> in too_many_isolated returning true. But one CPU's
>>>> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the
>>>> actual isolated count was zero. As there were no further
>>>> updates to NR_ISOLATED_FILE and the vmstat_update deferred work
>>>> had not been scheduled yet, 7 tasks were spinning in the
>>>> congestion wait loop for around 4 seconds, in the direct
>>>> reclaim path.
>>>
>>> Not syncing for such a long time doesn't sound right. I am not familiar
>>> with the vmstat syncing, but sysctl_stat_interval is HZ, so it should
>>> happen much more often than every 4 seconds.
>>>
>>
>> Though the interval is HZ, since vmstat_work is declared as a deferrable
>> work, IIUC the timer trigger can be deferred to the next non-deferrable
>> timer expiry on a CPU which is idle. This results in the vmstat syncing
>> on an idle CPU being delayed by seconds. Maybe in most cases this
>> behavior is fine, except in cases like this.
>
> I am not sure I understand the above, because CPU being idle doesn't
> seem important AFAICS. Anyway, I have checked the current code, which
> changed quite recently in 7cc36bbddde5 ("vmstat: on-demand vmstat
> workers V8"). Let's CC Christoph (the thread starts here:
> http://thread.gmane.org/gmane.linux.kernel.mm/127229).
>

I will try to explain the exact observations. All the cases I
encountered had similar symptoms. In one of them, it was CPU3 alone that
had not updated its vm_stat_diff. This CPU had been idle for around 30
secs. When I looked at the tvec base for this CPU, the timer associated
with vmstat_update had an expiry time less than the current jiffies.
This timer had its deferrable flag set and was tied to the next
non-deferrable timer in the list. Since deferrable timers can't wake up
the CPU, the vmstat sync for this CPU was deferred for a long time, i.e.
till the expiry of the next non-deferrable timer.

The issue was caught because one of the tasks in the reclaim path, in
the congestion_wait loop, had an associated watchdog, which resulted in
a panic after 4 secs. So 4 secs is actually the watchdog expiry, and the
time we can be blocked in the congestion loop can be even longer.
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-17 15:18 ` Vinayak Menon
@ 2015-01-17 19:48 ` Christoph Lameter
  -1 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2015-01-17 19:48 UTC (permalink / raw)
To: Vinayak Menon
Cc: Michal Hocko, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Sat, 17 Jan 2015, Vinayak Menon wrote:

> which had not updated the vmstat_diff. This CPU was in idle for around 30
> secs. When I looked at the tvec base for this CPU, the timer associated with
> vmstat_update had its expiry time less than current jiffies. This timer had
> its deferrable flag set, and was tied to the next non-deferrable timer in the

We can remove the deferrable flag now, since the vmstat threads are only
activated as necessary with the recent changes. Looks like this could fix
your issue?

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-17 19:48 ` Christoph Lameter
@ 2015-01-19  4:27 ` Vinayak Menon
  -1 siblings, 0 replies; 60+ messages in thread
From: Vinayak Menon @ 2015-01-19 4:27 UTC (permalink / raw)
To: Christoph Lameter
Cc: Michal Hocko, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On 01/18/2015 01:18 AM, Christoph Lameter wrote:
> On Sat, 17 Jan 2015, Vinayak Menon wrote:
>
>> which had not updated the vmstat_diff. This CPU was in idle for around 30
>> secs. When I looked at the tvec base for this CPU, the timer associated with
>> vmstat_update had its expiry time less than current jiffies. This timer had
>> its deferrable flag set, and was tied to the next non-deferrable timer in the
>
> We can remove the deferrable flag now since the vmstat threads are only
> activated as necessary with the recent changes. Looks like this could fix
> your issue?
>

Yes, this should fix my issue. But I think we may still need the fix in
too_many_isolated, since there can still be a delay of a few seconds (HZ
by default, and even more because of the reasons pointed out by Michal)
which will result in reclaimers unnecessarily entering congestion_wait.
No?

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of the Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-19  4:27 ` Vinayak Menon
@ 2015-01-21 14:39 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-21 14:39 UTC (permalink / raw)
To: Vinayak Menon
Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Mon 19-01-15 09:57:08, Vinayak Menon wrote:
> On 01/18/2015 01:18 AM, Christoph Lameter wrote:
>> On Sat, 17 Jan 2015, Vinayak Menon wrote:
>>
>>> which had not updated the vmstat_diff. This CPU was in idle for around 30
>>> secs. When I looked at the tvec base for this CPU, the timer associated with
>>> vmstat_update had its expiry time less than current jiffies. This timer had
>>> its deferrable flag set, and was tied to the next non-deferrable timer in the
>>
>> We can remove the deferrable flag now since the vmstat threads are only
>> activated as necessary with the recent changes. Looks like this could fix
>> your issue?
>>
>
> Yes, this should fix my issue.

Does it? Because I would prefer not getting into an un-synced state much
more than playing around with one specific place which shows the problem
right now.

> But I think we may need the fix in too_many_isolated, since there can still
> be a delay of a few seconds (HZ by default and even more because of reasons
> pointed out by Michal) which will result in reclaimers unnecessarily
> entering congestion_wait. No?

I think we can solve this as well. We can stick vmstat_shepherd into a
kernel thread with a loop with the configured timeout, then create a mask
of CPUs which need the update and run vmstat_update from IPI context
(smp_call_function_many). We would have to drop cond_resched from
refresh_cpu_vm_stats, of course. The nr_zones x NR_VM_ZONE_STAT_ITEMS
work in the IPI context shouldn't be excessive, but I haven't measured
that, so I might easily be wrong.

Anyway, that should work more reliably than the current scheme and should
help to reduce the pointless wakeups which the original patchset was
addressing. Or am I missing something?
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-21 14:39 ` Michal Hocko
@ 2015-01-22 15:16 ` Vlastimil Babka
  -1 siblings, 0 replies; 60+ messages in thread
From: Vlastimil Babka @ 2015-01-22 15:16 UTC (permalink / raw)
To: Michal Hocko, Vinayak Menon
Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On 01/21/2015 03:39 PM, Michal Hocko wrote:
> On Mon 19-01-15 09:57:08, Vinayak Menon wrote:
>> On 01/18/2015 01:18 AM, Christoph Lameter wrote:
>>> On Sat, 17 Jan 2015, Vinayak Menon wrote:
>>>
>>>> which had not updated the vmstat_diff. This CPU was in idle for around 30
>>>> secs. When I looked at the tvec base for this CPU, the timer associated with
>>>> vmstat_update had its expiry time less than current jiffies. This timer had
>>>> its deferrable flag set, and was tied to the next non-deferrable timer in the
>>>
>>> We can remove the deferrable flag now since the vmstat threads are only
>>> activated as necessary with the recent changes. Looks like this could fix
>>> your issue?
>>>
>>
>> Yes, this should fix my issue.
>
> Does it? Because I would prefer not getting into un-synced state much
> more than playing around one specific place which shows the problems
> right now.
>
>> But I think we may need the fix in too_many_isolated, since there can still
>> be a delay of a few seconds (HZ by default and even more because of reasons
>> pointed out by Michal) which will result in reclaimers unnecessarily
>> entering congestion_wait. No?
>
> I think we can solve this as well. We can stick vmstat_shepherd into a
> kernel thread with a loop with the configured timeout and then create a
> mask of CPUs which need the update and run vmstat_update from
> IPI context (smp_call_function_many). We would have to drop cond_resched
> from refresh_cpu_vm_stats of course. The nr_zones x NR_VM_ZONE_STAT_ITEMS
> in the IPI context shouldn't be excessive but I haven't measured that so
> I might be easily wrong.
>
> Anyway, that should work more reliably than the current scheme and
> should help to reduce pointless wakeups which the original patchset was
> addressing. Or am I missing something?

Maybe to further reduce wakeups, a CPU could check and update its counters
before going idle? (unless that already happens)

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-21 14:39 ` Michal Hocko
@ 2015-01-22 16:11 ` Christoph Lameter
  -1 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2015-01-22 16:11 UTC (permalink / raw)
To: Michal Hocko
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Wed, 21 Jan 2015, Michal Hocko wrote:

> I think we can solve this as well. We can stick vmstat_shepherd into a
> kernel thread with a loop with the configured timeout and then create a
> mask of CPUs which need the update and run vmstat_update from
> IPI context (smp_call_function_many).

Please do not run the vmstat_updates concurrently. They update shared
cachelines and therefore can cause bouncing cachelines if run concurrently
on multiple cpus.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-22 16:11 ` Christoph Lameter
@ 2015-01-26 17:46 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-26 17:46 UTC (permalink / raw)
To: Christoph Lameter
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Thu 22-01-15 10:11:31, Christoph Lameter wrote:
> On Wed, 21 Jan 2015, Michal Hocko wrote:
>
>> I think we can solve this as well. We can stick vmstat_shepherd into a
>> kernel thread with a loop with the configured timeout and then create a
>> mask of CPUs which need the update and run vmstat_update from
>> IPI context (smp_call_function_many).
>
> Please do not run the vmstat_updates concurrently. They update shared
> cachelines and therefore can cause bouncing cachelines if run concurrently
> on multiple cpus.

Would you prefer to call smp_call_function_single on each CPU which needs
an update? That would make vmstat_shepherd slower, but that is not a big
deal, is it?

Anyway, I am wondering whether the cache-line bouncing between
vmstat_update instances is a big deal in real life. Updating shared
counters would bounce with many CPUs, but this is an operation which is
not done often. Also, all the CPUs would have to update the same counters
all the time, and I am not sure this happens that often. Do you have a
load where this would be measurable?
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 17:46 ` Michal Hocko
@ 2015-01-26 18:35 ` Christoph Lameter
  -1 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2015-01-26 18:35 UTC (permalink / raw)
To: Michal Hocko
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Mon, 26 Jan 2015, Michal Hocko wrote:

>> Please do not run the vmstat_updates concurrently. They update shared
>> cachelines and therefore can cause bouncing cachelines if run concurrently
>> on multiple cpus.
>
> Would you prefer to call smp_call_function_single on each CPU
> which needs an update? That would make vmstat_shepherd slower but that
> is not a big deal, is it?

Run it from the timer interrupt as usual, from a work request? Those are
staggered.

> Anyway I am wondering whether the cache line bouncing between
> vmstat_update instances is a big deal in real life. Updating shared
> counters would bounce with many CPUs but this is an operation which is
> not done often. Also all the CPUs would have to update the same counters
> all the time and I am not sure this happens that often. Do you have a
> load where this would be measurable?

Concurrent page faults update lots of counters concurrently. But will
those trigger the smp_call_function?

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-26 18:35 ` Christoph Lameter
@ 2015-01-27 10:52 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-27 10:52 UTC (permalink / raw)
To: Christoph Lameter
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Mon 26-01-15 12:35:00, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Michal Hocko wrote:
>
>>> Please do not run the vmstat_updates concurrently. They update shared
>>> cachelines and therefore can cause bouncing cachelines if run concurrently
>>> on multiple cpus.
>>
>> Would you prefer to call smp_call_function_single on each CPU
>> which needs an update? That would make vmstat_shepherd slower but that
>> is not a big deal, is it?
>
> Run it from the timer interrupt as usual from a work request? Those are
> staggered.

I am not following. The idea was to run vmstat_shepherd in a kernel
thread, waking up at the configured timeout, then check need_update for
each CPU and call smp_call_function_single to refresh the counters,
rather than building a mask and then calling smp_call_function_many, to
reduce parallel contention on the shared counters.

>> Anyway I am wondering whether the cache line bouncing between
>> vmstat_update instances is a big deal in real life. Updating shared
>> counters would bounce with many CPUs but this is an operation which is
>> not done often. Also all the CPUs would have to update the same counters
>> all the time and I am not sure this happens that often. Do you have a
>> load where this would be measurable?
>
> Concurrent page faults update lots of counters concurrently.

True.

> But will those trigger the smp_call_function?

The smp_call_function was meant to be called only from the
vmstat_shepherd context, which happens "rarely". Or am I missing your
point here?
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-27 10:52 ` Michal Hocko
@ 2015-01-27 16:59 ` Christoph Lameter
  -1 siblings, 0 replies; 60+ messages in thread
From: Christoph Lameter @ 2015-01-27 16:59 UTC (permalink / raw)
To: Michal Hocko
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Tue, 27 Jan 2015, Michal Hocko wrote:

> I am not following. The idea was to run vmstat_shepherd in a kernel
> thread, waking up at the configured timeout, then check need_update
> for each CPU and call smp_call_function_single, rather than building a
> mask and then calling smp_call_function_many, to reduce parallel
> contention on the shared counters.

That's ok.

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated
  2015-01-27 16:59 ` Christoph Lameter
@ 2015-01-30 15:28 ` Michal Hocko
  -1 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2015-01-30 15:28 UTC (permalink / raw)
To: Christoph Lameter
Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov,
    mgorman, minchan

On Tue 27-01-15 10:59:59, Christoph Lameter wrote:
> On Tue, 27 Jan 2015, Michal Hocko wrote:
>
>> I am not following. The idea was to run vmstat_shepherd in a kernel
>> thread, waking up at the configured timeout, then check need_update
>> for each CPU and call smp_call_function_single, rather than building a
>> mask and then calling smp_call_function_many, to reduce parallel
>> contention on the shared counters.
>
> That's ok.

OK, I will put that on my todo list and try to find some time to
implement it.
--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-17 19:48 ` Christoph Lameter @ 2015-01-26 17:28 ` Michal Hocko 0 siblings, 0 replies; 60+ messages in thread From: Michal Hocko @ 2015-01-26 17:28 UTC (permalink / raw) To: Christoph Lameter Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan On Sat 17-01-15 13:48:34, Christoph Lameter wrote: > On Sat, 17 Jan 2015, Vinayak Menon wrote: > > > which had not updated the vmstat_diff. This CPU was in idle for around 30 > > secs. When I looked at the tvec base for this CPU, the timer associated with > > vmstat_update had its expiry time less than current jiffies. This timer had > > its deferrable flag set, and was tied to the next non-deferrable timer in the > > We can remove the deferrable flag now since the vmstat threads are only > activated as necessary with the recent changes. Looks like this could fix > your issue? OK, I have checked the history and the deferrable behavior has been introduced by 39bf6270f524 (VM statistics: Make timer deferrable) which hasn't offered any numbers which would justify the change. So I think it would be a good idea to revert this one as it can clearly cause issues. Could you retest with this change? It still wouldn't help with the highly overloaded workqueues but that sounds like a bigger change and this one sounds quite safe to me so it is a good start. --- >From 12d00a8066e336d3e1311600b50fa9b588798448 Mon Sep 17 00:00:00 2001 From: Michal Hocko <mhocko@suse.cz> Date: Mon, 26 Jan 2015 18:07:51 +0100 Subject: [PATCH] vmstat: Do not use deferrable delayed work for vmstat_update Vinayak Menon has reported that an excessive number of tasks were throttled in the direct reclaim inside too_many_isolated because NR_ISOLATED_FILE was relatively high compared to NR_INACTIVE_FILE. However it turned out that the real number of NR_ISOLATED_FILE was 0 and the per-cpu vm_stat_diff wasn't transferred into the global counter. 
vmstat_work which is responsible for the sync is defined as deferrable delayed work which means that the defined timeout doesn't wake up an idle CPU. A CPU might stay in an idle state for a long time and general effort is to keep such a CPU in this state as long as possible which might lead to all sorts of troubles for vmstat consumers as can be seen with the excessive direct reclaim throttling. This patch basically reverts 39bf6270f524 (VM statistics: Make timer deferrable) but it shouldn't cause any problems for idle CPUs because only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5 (vmstat: on-demand vmstat workers v8) and CPUs which are idle for a longer time shouldn't have per-cpu drift. Fixes: 39bf6270f524 (VM statistics: Make timer deferrable) Reported-and-debugged-by: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: Michal Hocko <mhocko@suse.cz> --- mm/vmstat.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/vmstat.c b/mm/vmstat.c index c95d6b39ac91..b9b9deec1d54 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1453,7 +1453,7 @@ static void __init start_shepherd_timer(void) int cpu; for_each_possible_cpu(cpu) - INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), + INIT_DELAYED_WORK(per_cpu_ptr(&vmstat_work, cpu), vmstat_update); if (!alloc_cpumask_var(&cpu_stat_off, GFP_KERNEL)) -- 2.1.4 -- Michal Hocko SUSE Labs ^ permalink raw reply related [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-26 17:28 ` Michal Hocko @ 2015-01-26 18:35 ` Christoph Lameter 0 siblings, 0 replies; 60+ messages in thread From: Christoph Lameter @ 2015-01-26 18:35 UTC (permalink / raw) To: Michal Hocko Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan On Mon, 26 Jan 2015, Michal Hocko wrote: > >From 12d00a8066e336d3e1311600b50fa9b588798448 Mon Sep 17 00:00:00 2001 > From: Michal Hocko <mhocko@suse.cz> > Date: Mon, 26 Jan 2015 18:07:51 +0100 > Subject: [PATCH] vmstat: Do not use deferrable delayed work for vmstat_update Acked-by: Christoph Lameter <cl@linux.com> ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-26 17:28 ` Michal Hocko @ 2015-01-26 22:11 ` Andrew Morton 0 siblings, 0 replies; 60+ messages in thread From: Andrew Morton @ 2015-01-26 22:11 UTC (permalink / raw) To: Michal Hocko Cc: Christoph Lameter, Vinayak Menon, linux-mm, linux-kernel, hannes, vdavydov, mgorman, minchan On Mon, 26 Jan 2015 18:28:32 +0100 Michal Hocko <mhocko@suse.cz> wrote: > Subject: [PATCH] vmstat: Do not use deferrable delayed work for vmstat_update > > Vinayak Menon has reported that an excessive number of tasks were throttled > in the direct reclaim inside too_many_isolated because NR_ISOLATED_FILE > was relatively high compared to NR_INACTIVE_FILE. However it turned out > that the real number of NR_ISOLATED_FILE was 0 and the per-cpu > vm_stat_diff wasn't transferred into the global counter. > > vmstat_work which is responsible for the sync is defined as deferrable > delayed work which means that the defined timeout doesn't wake up an > idle CPU. A CPU might stay in an idle state for a long time and general > effort is to keep such a CPU in this state as long as possible which > might lead to all sorts of troubles for vmstat consumers as can be seen > with the excessive direct reclaim throttling. > > This patch basically reverts 39bf6270f524 (VM statistics: Make timer > deferrable) but it shouldn't cause any problems for idle CPUs because > only CPUs with an active per-cpu drift are woken up since 7cc36bbddde5 > (vmstat: on-demand vmstat workers v8) and CPUs which are idle for a > longer time shouldn't have per-cpu drift. So do we drop mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated.patch? From: Vinayak Menon <vinmenon@codeaurora.org> Subject: mm: vmscan: fix the page state calculation in too_many_isolated It is observed that sometimes multiple tasks get blocked for long in the congestion_wait loop below, in shrink_inactive_list. This is because of vm_stat values not being synced. 
(__schedule) from [<c0a03328>] (schedule_timeout) from [<c0a04940>] (io_schedule_timeout) from [<c01d585c>] (congestion_wait) from [<c01cc9d8>] (shrink_inactive_list) from [<c01cd034>] (shrink_zone) from [<c01cdd08>] (try_to_free_pages) from [<c01c442c>] (__alloc_pages_nodemask) from [<c01f1884>] (new_slab) from [<c09fcf60>] (__slab_alloc) from [<c01f1a6c>] In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned 14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was set, and this resulted in too_many_isolated returning true. But one of the CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the actual isolated count was zero. As there weren't any more updates to NR_ISOLATED_FILE and the vmstat_update deferred work had not been scheduled yet, 7 tasks were spinning in the congestion wait loop for around 4 seconds, in the direct reclaim path. This patch uses zone_page_state_snapshot instead, but restricts its usage to avoid a performance penalty. The vmstat sync interval is HZ (sysctl_stat_interval), but since the vmstat_work is declared as a deferrable work, the timer trigger can be deferred to the next non-deferrable timer expiry on the CPU which is idle. This results in the vmstat syncing on an idle CPU being delayed by seconds. Maybe in most cases this behavior is fine, except in cases like this. 
Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/vmscan.c | 56 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 37 insertions(+), 19 deletions(-) diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated +++ a/mm/vmscan.c @@ -1401,6 +1401,32 @@ int isolate_lru_page(struct page *page) return ret; } +static int __too_many_isolated(struct zone *zone, int file, + struct scan_control *sc, int safe) +{ + unsigned long inactive, isolated; + + if (safe) { + inactive = zone_page_state_snapshot(zone, + NR_INACTIVE_ANON + 2 * file); + isolated = zone_page_state_snapshot(zone, + NR_ISOLATED_ANON + file); + } else { + inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file); + isolated = zone_page_state(zone, NR_ISOLATED_ANON + file); + } + + /* + * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they + * won't get blocked by normal direct-reclaimers, forming a circular + * deadlock. + */ + if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) + inactive >>= 3; + + return isolated > inactive; +} + /* * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and * then get resheduled. When there are massive number of tasks doing page @@ -1409,33 +1435,22 @@ int isolate_lru_page(struct page *page) * unnecessary swapping, thrashing and OOM. 
*/ static int too_many_isolated(struct zone *zone, int file, - struct scan_control *sc) + struct scan_control *sc, int safe) { - unsigned long inactive, isolated; - if (current_is_kswapd()) return 0; if (!global_reclaim(sc)) return 0; - if (file) { - inactive = zone_page_state(zone, NR_INACTIVE_FILE); - isolated = zone_page_state(zone, NR_ISOLATED_FILE); - } else { - inactive = zone_page_state(zone, NR_INACTIVE_ANON); - isolated = zone_page_state(zone, NR_ISOLATED_ANON); + if (unlikely(__too_many_isolated(zone, file, sc, 0))) { + if (safe) + return __too_many_isolated(zone, file, sc, safe); + else + return 1; } - /* - * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they - * won't get blocked by normal direct-reclaimers, forming a circular - * deadlock. - */ - if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) - inactive >>= 3; - - return isolated > inactive; + return 0; } static noinline_for_stack void @@ -1525,15 +1540,18 @@ shrink_inactive_list(unsigned long nr_to unsigned long nr_immediate = 0; isolate_mode_t isolate_mode = 0; int file = is_file_lru(lru); + int safe = 0; struct zone *zone = lruvec_zone(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; - while (unlikely(too_many_isolated(zone, file, sc))) { + while (unlikely(too_many_isolated(zone, file, sc, safe))) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) return SWAP_CLUSTER_MAX; + + safe = 1; } lru_add_drain(); _ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-26 22:11 ` Andrew Morton @ 2015-01-27 10:41 ` Michal Hocko 0 siblings, 0 replies; 60+ messages in thread From: Michal Hocko @ 2015-01-27 10:41 UTC (permalink / raw) To: Andrew Morton Cc: Christoph Lameter, Vinayak Menon, linux-mm, linux-kernel, hannes, vdavydov, mgorman, minchan On Mon 26-01-15 14:11:59, Andrew Morton wrote: [...] > So do we drop > mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated.patch? I would like to see a confirmation from Vinayak that this really helped in his test case first and only then drop the above patch. I really think that we shouldn't spread hacks around the code just to work around inefficiency in the vmstat code. We already have two places which need special treatment and who knows how many more will show up later. Even with this patch applied we have other issues related to the overloaded workqueues as mentioned earlier but those should be fixed by separate fix(es). -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-26 17:28 ` Michal Hocko @ 2015-01-27 10:33 ` Vinayak Menon 0 siblings, 0 replies; 60+ messages in thread From: Vinayak Menon @ 2015-01-27 10:33 UTC (permalink / raw) To: Michal Hocko Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan On 01/26/2015 10:58 PM, Michal Hocko wrote: > On Sat 17-01-15 13:48:34, Christoph Lameter wrote: >> On Sat, 17 Jan 2015, Vinayak Menon wrote: >> >>> which had not updated the vmstat_diff. This CPU was in idle for around 30 >>> secs. When I looked at the tvec base for this CPU, the timer associated with >>> vmstat_update had its expiry time less than current jiffies. This timer had >>> its deferrable flag set, and was tied to the next non-deferrable timer in the >> >> We can remove the deferrable flag now since the vmstat threads are only >> activated as necessary with the recent changes. Looks like this could fix >> your issue? > > OK, I have checked the history and the deferrable behavior has been > introduced by 39bf6270f524 (VM statistics: Make timer deferrable) which > hasn't offered any numbers which would justify the change. So I think it > would be a good idea to revert this one as it can clearly cause issues. > > Could you retest with this change? It still wouldn't help with the > highly overloaded workqueues but that sounds like a bigger change and > this one sounds quite safe to me so it is a good start. Sure, I can retest. Even without highly overloaded workqueues, there can be a delay of HZ in updating the counters. This means the reclaim path can be blocked for a second or more, when there aren't really any isolated pages. So we need the fix in too_many_isolated also, right? -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-27 10:33 ` Vinayak Menon @ 2015-01-27 10:45 ` Michal Hocko 0 siblings, 0 replies; 60+ messages in thread From: Michal Hocko @ 2015-01-27 10:45 UTC (permalink / raw) To: Vinayak Menon Cc: Christoph Lameter, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan On Tue 27-01-15 16:03:57, Vinayak Menon wrote: [...] > Sure, I can retest. Thanks! > Even without highly overloaded workqueues, there can be a delay of HZ in > updating the counters. This means the reclaim path can be blocked for a second > or more, when there aren't really any isolated pages. So we need the fix in > too_many_isolated also, right? Is this a big deal though? What you are hitting is certainly a corner case. I assume your system is thrashing heavily already with so few pages on the file LRU list. Anyway, as mentioned in the other email, I would rather see vmstat data made more reliable than spread hacks around the code where we see immediate issues. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-16 15:49 ` Michal Hocko @ 2015-01-29 17:32 ` Christoph Lameter 0 siblings, 0 replies; 60+ messages in thread From: Christoph Lameter @ 2015-01-29 17:32 UTC (permalink / raw) To: Michal Hocko Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan On Fri, 16 Jan 2015, Michal Hocko wrote: > __round_jiffies_relative can easily make timeout 2HZ from 1HZ. Now we > have vmstat_shepherd which waits to be queued and then waits to run. When > it runs finally it only queues per-cpu vmstat_work which can also end > up being 2HZ for some CPUs. So we can indeed have 4 seconds spent just > for queuing. Not even mentioning work item latencies. Especially when > workers are overloaded e.g. by fs work items and additional workers > cannot be created e.g. due to memory pressure so they are processed only > by the workqueue rescuer. And latencies would grow a lot. Here is a small fix to ensure that the 4 seconds interval does not happen: Subject: vmstat: Reduce time interval to stat update on idle cpu It was noted that the vm stat shepherd runs every 2 seconds and that the vmstat update is then scheduled 2 seconds in the future. This yields an interval of double the intended time, which is not desired. Change the shepherd so that it does not delay the vmstat update on the other cpu. We still have to use schedule_delayed_work since we are using a delayed_work_struct, but we can set the delay to 0. 
Signed-off-by: Christoph Lameter <cl@linux.com> Index: linux/mm/vmstat.c =================================================================== --- linux.orig/mm/vmstat.c +++ linux/mm/vmstat.c @@ -1435,8 +1435,8 @@ static void vmstat_shepherd(struct work_ if (need_update(cpu) && cpumask_test_and_clear_cpu(cpu, cpu_stat_off)) - schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu), - __round_jiffies_relative(sysctl_stat_interval, cpu)); + schedule_delayed_work_on(cpu, + &per_cpu(vmstat_work, cpu), 0); put_online_cpus(); ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-29 17:32 ` Christoph Lameter @ 2015-01-30 15:27 ` Michal Hocko -1 siblings, 0 replies; 60+ messages in thread From: Michal Hocko @ 2015-01-30 15:27 UTC (permalink / raw) To: Christoph Lameter Cc: Vinayak Menon, linux-mm, linux-kernel, akpm, hannes, vdavydov, mgorman, minchan On Thu 29-01-15 11:32:43, Christoph Lameter wrote: [...] > Subject: vmstat: Reduce time interval to stat update on idle cpu > > It was noted that the vm stat shepherd runs every 2 seconds and > that the vmstat update is then scheduled 2 seconds in the future. > > This yields an interval of double the time interval which is not > desired. > > Change the shepherd so that it does not delay the vmstat update > on the other cpu. We stil have to use schedule_delayed_work since > we are using a delayed_work_struct but we can set the delay to 0. > > > Signed-off-by: Christoph Lameter <cl@linux.com> Acked-by: Michal Hocko <mhocko@suse.cz> > > Index: linux/mm/vmstat.c > =================================================================== > --- linux.orig/mm/vmstat.c > +++ linux/mm/vmstat.c > @@ -1435,8 +1435,8 @@ static void vmstat_shepherd(struct work_ > if (need_update(cpu) && > cpumask_test_and_clear_cpu(cpu, cpu_stat_off)) > > - schedule_delayed_work_on(cpu, &per_cpu(vmstat_work, cpu), > - __round_jiffies_relative(sysctl_stat_interval, cpu)); > + schedule_delayed_work_on(cpu, > + &per_cpu(vmstat_work, cpu), 0); > > put_online_cpus(); > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-14 11:36 ` Vinayak Menon @ 2015-01-16 1:17 ` Andrew Morton -1 siblings, 0 replies; 60+ messages in thread From: Andrew Morton @ 2015-01-16 1:17 UTC (permalink / raw) To: Vinayak Menon Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan On Wed, 14 Jan 2015 17:06:59 +0530 Vinayak Menon <vinmenon@codeaurora.org> wrote: > It is observed that sometimes multiple tasks get blocked for long > in the congestion_wait loop below, in shrink_inactive_list. This > is because of vm_stat values not being synced. > > (__schedule) from [<c0a03328>] > (schedule_timeout) from [<c0a04940>] > (io_schedule_timeout) from [<c01d585c>] > (congestion_wait) from [<c01cc9d8>] > (shrink_inactive_list) from [<c01cd034>] > (shrink_zone) from [<c01cdd08>] > (try_to_free_pages) from [<c01c442c>] > (__alloc_pages_nodemask) from [<c01f1884>] > (new_slab) from [<c09fcf60>] > (__slab_alloc) from [<c01f1a6c>] > > In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) > had returned 14, zone_page_state(zone, NR_INACTIVE_FILE) > returned 92, and GFP_IOFS was set, and this resulted > in too_many_isolated returning true. But one of the CPU's > pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the > actual isolated count was zero. As there weren't any more > updates to NR_ISOLATED_FILE and vmstat_update deffered work > had not been scheduled yet, 7 tasks were spinning in the > congestion wait loop for around 4 seconds, in the direct > reclaim path. > > This patch uses zone_page_state_snapshot instead, but restricts > its usage to avoid performance penalty. Seems reasonable. > > ... 
> > @@ -1516,15 +1531,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, > unsigned long nr_immediate = 0; > isolate_mode_t isolate_mode = 0; > int file = is_file_lru(lru); > + int safe = 0; > struct zone *zone = lruvec_zone(lruvec); > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > - while (unlikely(too_many_isolated(zone, file, sc))) { > + while (unlikely(too_many_isolated(zone, file, sc, safe))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) > return SWAP_CLUSTER_MAX; > + > + safe = 1; > } But here and under the circumstances you describe, we'll call congestion_wait() a single time. That shouldn't have occurred. So how about we put the fallback logic into too_many_isolated() itself? From: Andrew Morton <akpm@linux-foundation.org> Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix Move the zone_page_state_snapshot() fallback logic into too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call congestion_wait(). 
Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan@kernel.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/vmscan.c | 23 +++++++++++------------ 1 file changed, 11 insertions(+), 12 deletions(-) diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix +++ a/mm/vmscan.c @@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page) } static int __too_many_isolated(struct zone *zone, int file, - struct scan_control *sc, int safe) + struct scan_control *sc, int safe) { unsigned long inactive, isolated; @@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo * unnecessary swapping, thrashing and OOM. */ static int too_many_isolated(struct zone *zone, int file, - struct scan_control *sc, int safe) + struct scan_control *sc) { if (current_is_kswapd()) return 0; @@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone if (!global_reclaim(sc)) return 0; - if (unlikely(__too_many_isolated(zone, file, sc, 0))) { - if (safe) - return __too_many_isolated(zone, file, sc, safe); - else - return 1; - } + /* + * __too_many_isolated(safe=0) is fast but inaccurate, because it + * doesn't account for the vm_stat_diff[] counters. So if it looks + * like too_many_isolated() is about to return true, fall back to the + * slower, more accurate zone_page_state_snapshot(). 
+ */ + if (unlikely(__too_many_isolated(zone, file, sc, 0))) + return __too_many_isolated(zone, file, sc, safe); return 0; } @@ -1540,18 +1542,15 @@ shrink_inactive_list(unsigned long nr_to unsigned long nr_immediate = 0; isolate_mode_t isolate_mode = 0; int file = is_file_lru(lru); - int safe = 0; struct zone *zone = lruvec_zone(lruvec); struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; - while (unlikely(too_many_isolated(zone, file, sc, safe))) { + while (unlikely(too_many_isolated(zone, file, sc))) { congestion_wait(BLK_RW_ASYNC, HZ/10); /* We are about to die and free our memory. Return now. */ if (fatal_signal_pending(current)) return SWAP_CLUSTER_MAX; - - safe = 1; } lru_add_drain(); _ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-16 1:17 ` Andrew Morton @ 2015-01-16 5:10 ` Vinayak Menon -1 siblings, 0 replies; 60+ messages in thread From: Vinayak Menon @ 2015-01-16 5:10 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan On 01/16/2015 06:47 AM, Andrew Morton wrote: > On Wed, 14 Jan 2015 17:06:59 +0530 Vinayak Menon <vinmenon@codeaurora.org> wrote: > >> It is observed that sometimes multiple tasks get blocked for long >> in the congestion_wait loop below, in shrink_inactive_list. This >> is because of vm_stat values not being synced. >> >> (__schedule) from [<c0a03328>] >> (schedule_timeout) from [<c0a04940>] >> (io_schedule_timeout) from [<c01d585c>] >> (congestion_wait) from [<c01cc9d8>] >> (shrink_inactive_list) from [<c01cd034>] >> (shrink_zone) from [<c01cdd08>] >> (try_to_free_pages) from [<c01c442c>] >> (__alloc_pages_nodemask) from [<c01f1884>] >> (new_slab) from [<c09fcf60>] >> (__slab_alloc) from [<c01f1a6c>] >> >> In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) >> had returned 14, zone_page_state(zone, NR_INACTIVE_FILE) >> returned 92, and GFP_IOFS was set, and this resulted >> in too_many_isolated returning true. But one of the CPU's >> pageset vm_stat_diff had NR_ISOLATED_FILE as "-14". So the >> actual isolated count was zero. As there weren't any more >> updates to NR_ISOLATED_FILE and vmstat_update deffered work >> had not been scheduled yet, 7 tasks were spinning in the >> congestion wait loop for around 4 seconds, in the direct >> reclaim path. >> >> This patch uses zone_page_state_snapshot instead, but restricts >> its usage to avoid performance penalty. > > Seems reasonable. > >> >> ... 
>> >> @@ -1516,15 +1531,18 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, >> unsigned long nr_immediate = 0; >> isolate_mode_t isolate_mode = 0; >> int file = is_file_lru(lru); >> + int safe = 0; >> struct zone *zone = lruvec_zone(lruvec); >> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; >> >> - while (unlikely(too_many_isolated(zone, file, sc))) { >> + while (unlikely(too_many_isolated(zone, file, sc, safe))) { >> congestion_wait(BLK_RW_ASYNC, HZ/10); >> >> /* We are about to die and free our memory. Return now. */ >> if (fatal_signal_pending(current)) >> return SWAP_CLUSTER_MAX; >> + >> + safe = 1; >> } > > But here and under the circumstances you describe, we'll call > congestion_wait() a single time. That shouldn't have occurred. > > So how about we put the fallback logic into too_many_isolated() itself? > > congestion_wait was allowed to run once as an optimization, considering that __too_many_isolated (unsafe and faster) can be correct in returning true most of the time. So we avoid calling the safe version, in most of the cases. But I agree that we should not call congestion_wait unnecessarily even in those rare cases. So this looks correct to me. > > From: Andrew Morton <akpm@linux-foundation.org> > Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix > > Move the zone_page_state_snapshot() fallback logic into > too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call > congestion_wait(). 
> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Vinayak Menon <vinmenon@codeaurora.org> > Cc: Vladimir Davydov <vdavydov@parallels.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/vmscan.c | 23 +++++++++++------------ > 1 file changed, 11 insertions(+), 12 deletions(-) > > diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c > --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix > +++ a/mm/vmscan.c > @@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page) > } > > static int __too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc, int safe) > + struct scan_control *sc, int safe) > { > unsigned long inactive, isolated; > > @@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo > * unnecessary swapping, thrashing and OOM. > */ > static int too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc, int safe) > + struct scan_control *sc) > { > if (current_is_kswapd()) > return 0; > @@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone > if (!global_reclaim(sc)) > return 0; > > - if (unlikely(__too_many_isolated(zone, file, sc, 0))) { > - if (safe) > - return __too_many_isolated(zone, file, sc, safe); > - else > - return 1; > - } > + /* > + * __too_many_isolated(safe=0) is fast but inaccurate, because it > + * doesn't account for the vm_stat_diff[] counters. So if it looks > + * like too_many_isolated() is about to return true, fall back to the > + * slower, more accurate zone_page_state_snapshot(). 
> + */ > + if (unlikely(__too_many_isolated(zone, file, sc, 0))) > + return __too_many_isolated(zone, file, sc, safe); > > return 0; > } > @@ -1540,18 +1542,15 @@ shrink_inactive_list(unsigned long nr_to > unsigned long nr_immediate = 0; > isolate_mode_t isolate_mode = 0; > int file = is_file_lru(lru); > - int safe = 0; > struct zone *zone = lruvec_zone(lruvec); > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat; > > - while (unlikely(too_many_isolated(zone, file, sc, safe))) { > + while (unlikely(too_many_isolated(zone, file, sc))) { > congestion_wait(BLK_RW_ASYNC, HZ/10); > > /* We are about to die and free our memory. Return now. */ > if (fatal_signal_pending(current)) > return SWAP_CLUSTER_MAX; > - > - safe = 1; > } > > lru_add_drain(); > _ > -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-16 1:17 ` Andrew Morton @ 2015-01-17 16:29 ` Vinayak Menon -1 siblings, 0 replies; 60+ messages in thread From: Vinayak Menon @ 2015-01-17 16:29 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan On 01/16/2015 06:47 AM, Andrew Morton wrote: > From: Andrew Morton <akpm@linux-foundation.org> > Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix > > Move the zone_page_state_snapshot() fallback logic into > too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call > congestion_wait(). > > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Vinayak Menon <vinmenon@codeaurora.org> > Cc: Vladimir Davydov <vdavydov@parallels.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/vmscan.c | 23 +++++++++++------------ > 1 file changed, 11 insertions(+), 12 deletions(-) > > diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c > --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix > +++ a/mm/vmscan.c > @@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page) > } > > static int __too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc, int safe) > + struct scan_control *sc, int safe) > { > unsigned long inactive, isolated; > > @@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo > * unnecessary swapping, thrashing and OOM. 
> */ > static int too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc, int safe) > + struct scan_control *sc) > { > if (current_is_kswapd()) > return 0; > @@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone > if (!global_reclaim(sc)) > return 0; > > - if (unlikely(__too_many_isolated(zone, file, sc, 0))) { > - if (safe) > - return __too_many_isolated(zone, file, sc, safe); > - else > - return 1; > - } > + /* > + * __too_many_isolated(safe=0) is fast but inaccurate, because it > + * doesn't account for the vm_stat_diff[] counters. So if it looks > + * like too_many_isolated() is about to return true, fall back to the > + * slower, more accurate zone_page_state_snapshot(). > + */ > + if (unlikely(__too_many_isolated(zone, file, sc, 0))) > + return __too_many_isolated(zone, file, sc, safe); Just noticed now that, in the above statement it should be "1", instead of "safe". "safe" is not declared. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated @ 2015-01-17 16:29 ` Vinayak Menon 0 siblings, 0 replies; 60+ messages in thread From: Vinayak Menon @ 2015-01-17 16:29 UTC (permalink / raw) To: Andrew Morton Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan On 01/16/2015 06:47 AM, Andrew Morton wrote: > From: Andrew Morton <akpm@linux-foundation.org> > Subject: mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix > > Move the zone_page_state_snapshot() fallback logic into > too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call > congestion_wait(). > > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Michal Hocko <mhocko@suse.cz> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Vinayak Menon <vinmenon@codeaurora.org> > Cc: Vladimir Davydov <vdavydov@parallels.com> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/vmscan.c | 23 +++++++++++------------ > 1 file changed, 11 insertions(+), 12 deletions(-) > > diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix mm/vmscan.c > --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated-fix > +++ a/mm/vmscan.c > @@ -1402,7 +1402,7 @@ int isolate_lru_page(struct page *page) > } > > static int __too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc, int safe) > + struct scan_control *sc, int safe) > { > unsigned long inactive, isolated; > > @@ -1435,7 +1435,7 @@ static int __too_many_isolated(struct zo > * unnecessary swapping, thrashing and OOM. 
> */ > static int too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc, int safe) > + struct scan_control *sc) > { > if (current_is_kswapd()) > return 0; > @@ -1443,12 +1443,14 @@ static int too_many_isolated(struct zone > if (!global_reclaim(sc)) > return 0; > > - if (unlikely(__too_many_isolated(zone, file, sc, 0))) { > - if (safe) > - return __too_many_isolated(zone, file, sc, safe); > - else > - return 1; > - } > + /* > + * __too_many_isolated(safe=0) is fast but inaccurate, because it > + * doesn't account for the vm_stat_diff[] counters. So if it looks > + * like too_many_isolated() is about to return true, fall back to the > + * slower, more accurate zone_page_state_snapshot(). > + */ > + if (unlikely(__too_many_isolated(zone, file, sc, 0))) > + return __too_many_isolated(zone, file, sc, safe); Just noticed that in the statement above it should be "1" instead of "safe"; "safe" is no longer declared. -- QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-01-17 16:29 ` Vinayak Menon @ 2015-02-11 22:14 ` Andrew Morton 0 siblings, 0 replies; 60+ messages in thread From: Andrew Morton @ 2015-02-11 22:14 UTC (permalink / raw) To: Vinayak Menon Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan Did we end up deciding to merge this, or is http://ozlabs.org/~akpm/mmots/broken-out/vmstat-do-not-use-deferrable-delayed-work-for-vmstat_update.patch a sufficient fix? From: Vinayak Menon <vinmenon@codeaurora.org> Subject: mm: vmscan: fix the page state calculation in too_many_isolated It is observed that sometimes multiple tasks get blocked for a long time in the congestion_wait loop below, in shrink_inactive_list. This is because of vm_stat values not being synced. (__schedule) from [<c0a03328>] (schedule_timeout) from [<c0a04940>] (io_schedule_timeout) from [<c01d585c>] (congestion_wait) from [<c01cc9d8>] (shrink_inactive_list) from [<c01cd034>] (shrink_zone) from [<c01cdd08>] (try_to_free_pages) from [<c01c442c>] (__alloc_pages_nodemask) from [<c01f1884>] (new_slab) from [<c09fcf60>] (__slab_alloc) from [<c01f1a6c>] In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned 14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was set, and this resulted in too_many_isolated returning true. But one CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the actual isolated count was zero. As there weren't any more updates to NR_ISOLATED_FILE and the vmstat_update deferred work had not been scheduled yet, 7 tasks were spinning in the congestion wait loop for around 4 seconds, in the direct reclaim path. This patch uses zone_page_state_snapshot instead, but restricts its usage to avoid a performance penalty. 
The vmstat sync interval is HZ (sysctl_stat_interval), but since the vmstat_work is declared as a deferrable work, the timer trigger can be deferred to the next non-deferrable timer expiry on the CPU while it is idle. This results in the vmstat syncing on an idle CPU being delayed by seconds. In most cases this behavior may be fine, except in cases like this. [akpm@linux-foundation.org: move zone_page_state_snapshot() fallback logic into too_many_isolated()] Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov@parallels.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/vmscan.c | 51 +++++++++++++++++++++++++++++++++----------------- 1 file changed, 34 insertions(+), 17 deletions(-) diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated +++ a/mm/vmscan.c @@ -1363,6 +1363,32 @@ int isolate_lru_page(struct page *page) return ret; } +static int __too_many_isolated(struct zone *zone, int file, + struct scan_control *sc, int safe) +{ + unsigned long inactive, isolated; + + if (safe) { + inactive = zone_page_state_snapshot(zone, + NR_INACTIVE_ANON + 2 * file); + isolated = zone_page_state_snapshot(zone, + NR_ISOLATED_ANON + file); + } else { + inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file); + isolated = zone_page_state(zone, NR_ISOLATED_ANON + file); + } + + /* + * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they + * won't get blocked by normal direct-reclaimers, forming a circular + * deadlock. + */ + if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) + inactive >>= 3; + + return isolated > inactive; +} + /* * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and * then get rescheduled. 
When there is a massive number of tasks doing page @@ -1371,33 +1397,24 @@ int isolate_lru_page(struct page *page) * unnecessary swapping, thrashing and OOM. */ static int too_many_isolated(struct zone *zone, int file, - struct scan_control *sc) + struct scan_control *sc) { - unsigned long inactive, isolated; - if (current_is_kswapd()) return 0; if (!global_reclaim(sc)) return 0; - if (file) { - inactive = zone_page_state(zone, NR_INACTIVE_FILE); - isolated = zone_page_state(zone, NR_ISOLATED_FILE); - } else { - inactive = zone_page_state(zone, NR_INACTIVE_ANON); - isolated = zone_page_state(zone, NR_ISOLATED_ANON); - } - /* - * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they - * won't get blocked by normal direct-reclaimers, forming a circular - * deadlock. + * __too_many_isolated(safe=0) is fast but inaccurate, because it + * doesn't account for the vm_stat_diff[] counters. So if it looks + * like too_many_isolated() is about to return true, fall back to the + * slower, more accurate zone_page_state_snapshot(). */ - if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) - inactive >>= 3; + if (unlikely(__too_many_isolated(zone, file, sc, 0))) + return __too_many_isolated(zone, file, sc, 1); - return isolated > inactive; + return 0; } static noinline_for_stack void _ ^ permalink raw reply [flat|nested] 60+ messages in thread
* Re: [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated 2015-02-11 22:14 ` Andrew Morton @ 2015-02-12 16:19 ` Vlastimil Babka 0 siblings, 0 replies; 60+ messages in thread From: Vlastimil Babka @ 2015-02-12 16:19 UTC (permalink / raw) To: Andrew Morton, Vinayak Menon Cc: linux-mm, linux-kernel, hannes, vdavydov, mhocko, mgorman, minchan On 02/11/2015 11:14 PM, Andrew Morton wrote: > > Did we end up deciding to merge this, or is > http://ozlabs.org/~akpm/mmots/broken-out/vmstat-do-not-use-deferrable-delayed-work-for-vmstat_update.patch > a sufficient fix? I think Michal wanted the general vmstat worker fix from elsewhere in the thread tested, to see whether it solves the problem by itself, without this patch. > From: Vinayak Menon <vinmenon@codeaurora.org> > Subject: mm: vmscan: fix the page state calculation in too_many_isolated > > It is observed that sometimes multiple tasks get blocked for a long time in the > congestion_wait loop below, in shrink_inactive_list. This is because of > vm_stat values not being synced. > > (__schedule) from [<c0a03328>] > (schedule_timeout) from [<c0a04940>] > (io_schedule_timeout) from [<c01d585c>] > (congestion_wait) from [<c01cc9d8>] > (shrink_inactive_list) from [<c01cd034>] > (shrink_zone) from [<c01cdd08>] > (try_to_free_pages) from [<c01c442c>] > (__alloc_pages_nodemask) from [<c01f1884>] > (new_slab) from [<c09fcf60>] > (__slab_alloc) from [<c01f1a6c>] > > In one such instance, zone_page_state(zone, NR_ISOLATED_FILE) had returned > 14, zone_page_state(zone, NR_INACTIVE_FILE) returned 92, and GFP_IOFS was > set, and this resulted in too_many_isolated returning true. But one > CPU's pageset vm_stat_diff had NR_ISOLATED_FILE as "-14", so the > actual isolated count was zero. As there weren't any more updates to > NR_ISOLATED_FILE and the vmstat_update deferred work had not been scheduled > yet, 7 tasks were spinning in the congestion wait loop for around 4 > seconds, in the direct reclaim path. 
> > This patch uses zone_page_state_snapshot instead, but restricts its usage > to avoid a performance penalty. > > > The vmstat sync interval is HZ (sysctl_stat_interval), but since the > vmstat_work is declared as a deferrable work, the timer trigger can be > deferred to the next non-deferrable timer expiry on the CPU while it is > idle. This results in the vmstat syncing on an idle CPU being delayed by > seconds. In most cases this behavior may be fine, except in cases like > this. > > [akpm@linux-foundation.org: move zone_page_state_snapshot() fallback logic into too_many_isolated()] > Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org> > Cc: Johannes Weiner <hannes@cmpxchg.org> > Cc: Vladimir Davydov <vdavydov@parallels.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Minchan Kim <minchan@kernel.org> > Cc: Michal Hocko <mhocko@suse.cz> > Signed-off-by: Andrew Morton <akpm@linux-foundation.org> > --- > > mm/vmscan.c | 51 +++++++++++++++++++++++++++++++++----------------- > 1 file changed, 34 insertions(+), 17 deletions(-) > > diff -puN mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated mm/vmscan.c > --- a/mm/vmscan.c~mm-vmscan-fix-the-page-state-calculation-in-too_many_isolated > +++ a/mm/vmscan.c > @@ -1363,6 +1363,32 @@ int isolate_lru_page(struct page *page) > return ret; > } > > +static int __too_many_isolated(struct zone *zone, int file, > + struct scan_control *sc, int safe) > +{ > + unsigned long inactive, isolated; > + > + if (safe) { > + inactive = zone_page_state_snapshot(zone, > + NR_INACTIVE_ANON + 2 * file); > + isolated = zone_page_state_snapshot(zone, > + NR_ISOLATED_ANON + file); > + } else { > + inactive = zone_page_state(zone, NR_INACTIVE_ANON + 2 * file); > + isolated = zone_page_state(zone, NR_ISOLATED_ANON + file); > + } > + > + /* > + * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they > + * won't get blocked by normal direct-reclaimers, forming a circular > + * deadlock. 
> + */ > + if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) > + inactive >>= 3; > + > + return isolated > inactive; > +} > + > /* > * A direct reclaimer may isolate SWAP_CLUSTER_MAX pages from the LRU list and > * then get rescheduled. When there is a massive number of tasks doing page > @@ -1371,33 +1397,24 @@ int isolate_lru_page(struct page *page) > * unnecessary swapping, thrashing and OOM. > */ > static int too_many_isolated(struct zone *zone, int file, > - struct scan_control *sc) > + struct scan_control *sc) > { > - unsigned long inactive, isolated; > - > if (current_is_kswapd()) > return 0; > > if (!global_reclaim(sc)) > return 0; > > - if (file) { > - inactive = zone_page_state(zone, NR_INACTIVE_FILE); > - isolated = zone_page_state(zone, NR_ISOLATED_FILE); > - } else { > - inactive = zone_page_state(zone, NR_INACTIVE_ANON); > - isolated = zone_page_state(zone, NR_ISOLATED_ANON); > - } > - > /* > - * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so they > - * won't get blocked by normal direct-reclaimers, forming a circular > - * deadlock. > + * __too_many_isolated(safe=0) is fast but inaccurate, because it > + * doesn't account for the vm_stat_diff[] counters. So if it looks > + * like too_many_isolated() is about to return true, fall back to the > + * slower, more accurate zone_page_state_snapshot(). > */ > - if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS) > - inactive >>= 3; > + if (unlikely(__too_many_isolated(zone, file, sc, 0))) > + return __too_many_isolated(zone, file, sc, 1); > > - return isolated > inactive; > + return 0; > } > > static noinline_for_stack void > _ ^ permalink raw reply [flat|nested] 60+ messages in thread
end of thread, other threads:[~2015-02-12 16:19 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
2015-01-14 11:36 [PATCH v2] mm: vmscan: fix the page state calculation in too_many_isolated Vinayak Menon
2015-01-14 16:50 ` Michal Hocko
2015-01-15 17:24 ` Vinayak Menon
2015-01-16 15:49 ` Michal Hocko
2015-01-16 17:57 ` Michal Hocko
2015-01-16 19:17 ` Christoph Lameter
2015-01-17 15:18 ` Vinayak Menon
2015-01-17 19:48 ` Christoph Lameter
2015-01-19  4:27 ` Vinayak Menon
2015-01-21 14:39 ` Michal Hocko
2015-01-22 15:16 ` Vlastimil Babka
2015-01-22 16:11 ` Christoph Lameter
2015-01-26 17:46 ` Michal Hocko
2015-01-26 18:35 ` Christoph Lameter
2015-01-27 10:52 ` Michal Hocko
2015-01-27 16:59 ` Christoph Lameter
2015-01-30 15:28 ` Michal Hocko
2015-01-26 17:28 ` Michal Hocko
2015-01-26 18:35 ` Christoph Lameter
2015-01-26 22:11 ` Andrew Morton
2015-01-27 10:41 ` Michal Hocko
2015-01-27 10:33 ` Vinayak Menon
2015-01-27 10:45 ` Michal Hocko
2015-01-29 17:32 ` Christoph Lameter
2015-01-30 15:27 ` Michal Hocko
2015-01-16  1:17 ` Andrew Morton
2015-01-16  5:10 ` Vinayak Menon
2015-01-17 16:29 ` Vinayak Menon
2015-02-11 22:14 ` Andrew Morton
2015-02-12 16:19 ` Vlastimil Babka