* [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Tetsuo Handa @ 2015-10-21 12:26 UTC
  To: linux-mm, linux-kernel
  Cc: torvalds, mhocko, rientjes, oleg, kwalker, cl, akpm, hannes,
	vdavydov, skozina, mgorman, riel

From 0c50792dfa6396453c89c71351a7458b94d3e881 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Wed, 21 Oct 2015 21:15:30 +0900
Subject: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

Since "struct zone"->vm_stat[] is array of atomic_long_t, an attempt
to reduce frequency of updating values in vm_stat[] is achieved by
using per cpu variables "struct per_cpu_pageset"->vm_stat_diff[].
Values in vm_stat_diff[] are merged into vm_stat[] periodically
(configured via /proc/sys/vm/stat_interval) using vmstat_update
workqueue (struct delayed_work vmstat_work).
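
For reference, counter updates go through the per cpu diff first and
only spill over into the global atomic counter once a per cpu threshold
is exceeded; roughly, from mm/vmstat.c:

void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
				int delta)
{
	struct per_cpu_pageset __percpu *pcp = zone->pageset;
	s8 __percpu *p = pcp->vm_stat_diff + item;
	long x;
	long t;

	x = delta + __this_cpu_read(*p);
	t = __this_cpu_read(pcp->stat_threshold);

	/*
	 * Only when the per cpu delta exceeds the per cpu threshold does
	 * the change spill over into the global vm_stat[] counter;
	 * otherwise it stays in vm_stat_diff[] until it is folded in.
	 */
	if (unlikely(x > t || x < -t)) {
		zone_page_state_add(x, zone, item);
		x = 0;
	}
	__this_cpu_write(*p, x);
}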

When a task attempts to allocate memory and reaches the direct
reclaim path, shrink_zones() checks whether there are reclaimable
pages by calling zone_reclaimable(). zone_reclaimable() makes its
decision based on the values in vm_stat[] by calling
zone_page_state(). This is usually fine because values in
vm_stat_diff[] are expected to be merged into vm_stat[] shortly.
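
For reference, zone_page_state() reads only the global counter, while
zone_page_state_snapshot() also folds in the pending per cpu deltas;
roughly, from include/linux/vmstat.h:

static inline unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
{
	long x = atomic_long_read(&zone->vm_stat[item]);

#ifdef CONFIG_SMP
	int cpu;

	/* this is the part zone_page_state() does not do */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

	if (x < 0)
		x = 0;
#endif
	return x;
}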

However, if a work item which is processed before the vmstat_update
work item gets stuck inside a memory allocation request, the values
in vm_stat_diff[] cannot be merged into vm_stat[]. As a result,
zone_reclaimable() keeps using outdated vm_stat[] values and the task
in the direct reclaim path keeps believing that there are reclaimable
pages and therefore keeps looping. The consequence is a silent
livelock (a hang without any kernel messages) because the OOM killer
is never invoked.
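
A greatly simplified sketch of the resulting loop (not the literal
kernel code): as long as some zone still looks reclaimable based on
the stale counters, direct reclaim reports progress and the allocator
retries instead of invoking the OOM killer.

	for (;;) {
		/* ends up in shrink_zones() -> zone_reclaimable() */
		did_some_progress = try_to_free_pages(...);
		if (did_some_progress)	/* true while stale vm_stat[]
					   says pages remain */
			continue;	/* retry forever: silent livelock */
		out_of_memory(...);	/* never reached */
	}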

We can hit such a livelock when, for example, the disk_events_workfn
work item does a memory allocation from bio_copy_kern().

[  255.054205] kworker/3:1     R  running task        0    45      2 0x00000008
[  255.056063] Workqueue: events_freezable_power_ disk_events_workfn
[  255.057715]  ffff88007f805680 ffff88007c55f6d0 ffffffff8116463d ffff88007c55f758
[  255.059705]  ffff88007f82b870 ffff88007c55f6e0 ffffffff811646be ffff88007c55f710
[  255.061694]  ffffffff811bdaf0 ffff88007f82b870 0000000000000400 0000000000000000
[  255.063690] Call Trace:
[  255.064664]  [<ffffffff8116463d>] ? __list_lru_count_one.isra.4+0x1d/0x80
[  255.066428]  [<ffffffff811646be>] ? list_lru_count_one+0x1e/0x20
[  255.068063]  [<ffffffff811bdaf0>] ? super_cache_count+0x50/0xd0
[  255.069666]  [<ffffffff8114ecf6>] ? shrink_slab.part.38+0xf6/0x2a0
[  255.071313]  [<ffffffff81151f78>] ? shrink_zone+0x2c8/0x2e0
[  255.072845]  [<ffffffff81152316>] ? do_try_to_free_pages+0x156/0x6d0
[  255.074527]  [<ffffffff810bc6b6>] ? mark_held_locks+0x66/0x90
[  255.076085]  [<ffffffff816ca797>] ? _raw_spin_unlock_irq+0x27/0x40
[  255.077727]  [<ffffffff810bc7d9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  255.079451]  [<ffffffff81152924>] ? try_to_free_pages+0x94/0xc0
[  255.081045]  [<ffffffff81145b4a>] ? __alloc_pages_nodemask+0x72a/0xdb0
[  255.082761]  [<ffffffff8118cd06>] ? alloc_pages_current+0x96/0x1b0
[  255.084407]  [<ffffffff8133985d>] ? bio_alloc_bioset+0x20d/0x2d0
[  255.086032]  [<ffffffff8133aba4>] ? bio_copy_kern+0xc4/0x180
[  255.087584]  [<ffffffff81344f20>] ? blk_rq_map_kern+0x70/0x130
[  255.089161]  [<ffffffff814a334d>] ? scsi_execute+0x12d/0x160
[  255.090696]  [<ffffffff814a3474>] ? scsi_execute_req_flags+0x84/0xf0
[  255.092466]  [<ffffffff814b55f2>] ? sr_check_events+0xb2/0x2a0
[  255.094042]  [<ffffffff814c3223>] ? cdrom_check_events+0x13/0x30
[  255.095634]  [<ffffffff814b5a35>] ? sr_block_check_events+0x25/0x30
[  255.097278]  [<ffffffff813501fb>] ? disk_check_events+0x5b/0x150
[  255.098865]  [<ffffffff81350307>] ? disk_events_workfn+0x17/0x20
[  255.100451]  [<ffffffff810890b5>] ? process_one_work+0x1a5/0x420
[  255.102046]  [<ffffffff81089051>] ? process_one_work+0x141/0x420
[  255.103625]  [<ffffffff8108944b>] ? worker_thread+0x11b/0x490
[  255.105159]  [<ffffffff816c4e95>] ? __schedule+0x315/0xac0
[  255.106643]  [<ffffffff81089330>] ? process_one_work+0x420/0x420
[  255.108217]  [<ffffffff8108f4e9>] ? kthread+0xf9/0x110
[  255.109634]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
[  255.111307]  [<ffffffff816cb35f>] ? ret_from_fork+0x3f/0x70
[  255.112785]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230

[  273.930846] Showing busy workqueues and worker pools:
[  273.932299] workqueue events: flags=0x0
[  273.933465]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  273.935120]     pending: vmpressure_work_fn, vmstat_shepherd, vmstat_update, vmw_fb_dirty_flush [vmwgfx]
[  273.937489] workqueue events_freezable: flags=0x4
[  273.938795]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.940446]     pending: vmballoon_work [vmw_balloon]
[  273.941973] workqueue events_power_efficient: flags=0x80
[  273.943491]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.945167]     pending: check_lifetime
[  273.946422] workqueue events_freezable_power_: flags=0x84
[  273.947890]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.949579]     in-flight: 45:disk_events_workfn
[  273.951103] workqueue ipv6_addrconf: flags=0x8
[  273.952447]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/1
[  273.954121]     pending: addrconf_verify_work
[  273.955541] workqueue xfs-reclaim/sda1: flags=0x4
[  273.957036]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.958847]     pending: xfs_reclaim_worker
[  273.960392] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=3 idle: 186 26

This patch changes zone_reclaimable() to use zone_page_state_snapshot()
in order to make sure that the values in vm_stat_diff[] are taken into
account when making the decision.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/vmscan.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af4f4c0..2e4ef60 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -196,19 +196,19 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	unsigned long nr;
 
-	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
-	     zone_page_state(zone, NR_INACTIVE_FILE);
+	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
+	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE);
 
 	if (get_nr_swap_pages() > 0)
-		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
-		      zone_page_state(zone, NR_INACTIVE_ANON);
+		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
+		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON);
 
 	return nr;
 }
 
 bool zone_reclaimable(struct zone *zone)
 {
-	return zone_page_state(zone, NR_PAGES_SCANNED) <
+	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
 		zone_reclaimable_pages(zone) * 6;
 }
 
-- 
1.8.3.1

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Michal Hocko @ 2015-10-21 13:03 UTC
  To: Tetsuo Handa
  Cc: linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker, cl,
	akpm, hannes, vdavydov, skozina, mgorman, riel

On Wed 21-10-15 21:26:19, Tetsuo Handa wrote:
> From 0c50792dfa6396453c89c71351a7458b94d3e881 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Wed, 21 Oct 2015 21:15:30 +0900
> Subject: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
> 
> Since "struct zone"->vm_stat[] is array of atomic_long_t, an attempt
> to reduce frequency of updating values in vm_stat[] is achieved by
> using per cpu variables "struct per_cpu_pageset"->vm_stat_diff[].
> Values in vm_stat_diff[] are merged into vm_stat[] periodically
> (configured via /proc/sys/vm/stat_interval) using vmstat_update
> workqueue (struct delayed_work vmstat_work).
> 
> When a task attempts to allocate memory and reaches the direct
> reclaim path, shrink_zones() checks whether there are reclaimable
> pages by calling zone_reclaimable(). zone_reclaimable() makes its
> decision based on the values in vm_stat[] by calling
> zone_page_state(). This is usually fine because values in
> vm_stat_diff[] are expected to be merged into vm_stat[] shortly.
> 
> However, if a work item which is processed before the vmstat_update
> work item gets stuck inside a memory allocation request, the values
> in vm_stat_diff[] cannot be merged into vm_stat[]. As a result,
> zone_reclaimable() keeps using outdated vm_stat[] values and the task
> in the direct reclaim path keeps believing that there are reclaimable
> pages and therefore keeps looping. The consequence is a silent
> livelock (a hang without any kernel messages) because the OOM killer
> is never invoked.
> 
> We can hit such a livelock when, for example, the disk_events_workfn
> work item does a memory allocation from bio_copy_kern().
> 
> [  255.054205] kworker/3:1     R  running task        0    45      2 0x00000008
> [  255.056063] Workqueue: events_freezable_power_ disk_events_workfn
> [  255.057715]  ffff88007f805680 ffff88007c55f6d0 ffffffff8116463d ffff88007c55f758
> [  255.059705]  ffff88007f82b870 ffff88007c55f6e0 ffffffff811646be ffff88007c55f710
> [  255.061694]  ffffffff811bdaf0 ffff88007f82b870 0000000000000400 0000000000000000
> [  255.063690] Call Trace:
> [  255.064664]  [<ffffffff8116463d>] ? __list_lru_count_one.isra.4+0x1d/0x80
> [  255.066428]  [<ffffffff811646be>] ? list_lru_count_one+0x1e/0x20
> [  255.068063]  [<ffffffff811bdaf0>] ? super_cache_count+0x50/0xd0
> [  255.069666]  [<ffffffff8114ecf6>] ? shrink_slab.part.38+0xf6/0x2a0
> [  255.071313]  [<ffffffff81151f78>] ? shrink_zone+0x2c8/0x2e0
> [  255.072845]  [<ffffffff81152316>] ? do_try_to_free_pages+0x156/0x6d0
> [  255.074527]  [<ffffffff810bc6b6>] ? mark_held_locks+0x66/0x90
> [  255.076085]  [<ffffffff816ca797>] ? _raw_spin_unlock_irq+0x27/0x40
> [  255.077727]  [<ffffffff810bc7d9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
> [  255.079451]  [<ffffffff81152924>] ? try_to_free_pages+0x94/0xc0
> [  255.081045]  [<ffffffff81145b4a>] ? __alloc_pages_nodemask+0x72a/0xdb0
> [  255.082761]  [<ffffffff8118cd06>] ? alloc_pages_current+0x96/0x1b0
> [  255.084407]  [<ffffffff8133985d>] ? bio_alloc_bioset+0x20d/0x2d0
> [  255.086032]  [<ffffffff8133aba4>] ? bio_copy_kern+0xc4/0x180
> [  255.087584]  [<ffffffff81344f20>] ? blk_rq_map_kern+0x70/0x130
> [  255.089161]  [<ffffffff814a334d>] ? scsi_execute+0x12d/0x160
> [  255.090696]  [<ffffffff814a3474>] ? scsi_execute_req_flags+0x84/0xf0
> [  255.092466]  [<ffffffff814b55f2>] ? sr_check_events+0xb2/0x2a0
> [  255.094042]  [<ffffffff814c3223>] ? cdrom_check_events+0x13/0x30
> [  255.095634]  [<ffffffff814b5a35>] ? sr_block_check_events+0x25/0x30
> [  255.097278]  [<ffffffff813501fb>] ? disk_check_events+0x5b/0x150
> [  255.098865]  [<ffffffff81350307>] ? disk_events_workfn+0x17/0x20
> [  255.100451]  [<ffffffff810890b5>] ? process_one_work+0x1a5/0x420
> [  255.102046]  [<ffffffff81089051>] ? process_one_work+0x141/0x420
> [  255.103625]  [<ffffffff8108944b>] ? worker_thread+0x11b/0x490
> [  255.105159]  [<ffffffff816c4e95>] ? __schedule+0x315/0xac0
> [  255.106643]  [<ffffffff81089330>] ? process_one_work+0x420/0x420
> [  255.108217]  [<ffffffff8108f4e9>] ? kthread+0xf9/0x110
> [  255.109634]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
> [  255.111307]  [<ffffffff816cb35f>] ? ret_from_fork+0x3f/0x70
> [  255.112785]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
> 
> [  273.930846] Showing busy workqueues and worker pools:
> [  273.932299] workqueue events: flags=0x0
> [  273.933465]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
> [  273.935120]     pending: vmpressure_work_fn, vmstat_shepherd, vmstat_update, vmw_fb_dirty_flush [vmwgfx]
> [  273.937489] workqueue events_freezable: flags=0x4
> [  273.938795]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
> [  273.940446]     pending: vmballoon_work [vmw_balloon]
> [  273.941973] workqueue events_power_efficient: flags=0x80
> [  273.943491]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
> [  273.945167]     pending: check_lifetime
> [  273.946422] workqueue events_freezable_power_: flags=0x84
> [  273.947890]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
> [  273.949579]     in-flight: 45:disk_events_workfn
> [  273.951103] workqueue ipv6_addrconf: flags=0x8
> [  273.952447]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/1
> [  273.954121]     pending: addrconf_verify_work
> [  273.955541] workqueue xfs-reclaim/sda1: flags=0x4
> [  273.957036]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
> [  273.958847]     pending: xfs_reclaim_worker
> [  273.960392] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=3 idle: 186 26
> 
> This patch changes zone_reclaimable() to use zone_page_state_snapshot()
> in order to make sure that the values in vm_stat_diff[] are taken into
> account when making the decision.

Longterm we definitely want to get rid of zone_reclaimable() for OOM
detection. I hope I can post a proposal for that shortly, but this patch
is simple enough and easy to backport to older kernels.

I would even consider it a stable candidate; it should go back years.
Delayed vmstat updates go way back, and there were other changes in the
area, but it seems that at least since d1908362ae0b9 ("vmscan: check
all_unreclaimable in direct reclaim path") we have been relying on
zone_reclaimable(), and vmstat was already depending on workqueues at
that time.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/vmscan.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af4f4c0..2e4ef60 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -196,19 +196,19 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>  	unsigned long nr;
>  
> -	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
> -	     zone_page_state(zone, NR_INACTIVE_FILE);
> +	nr = zone_page_state_snapshot(zone, NR_ACTIVE_FILE) +
> +	     zone_page_state_snapshot(zone, NR_INACTIVE_FILE);
>  
>  	if (get_nr_swap_pages() > 0)
> -		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
> -		      zone_page_state(zone, NR_INACTIVE_ANON);
> +		nr += zone_page_state_snapshot(zone, NR_ACTIVE_ANON) +
> +		      zone_page_state_snapshot(zone, NR_INACTIVE_ANON);
>  
>  	return nr;
>  }
>  
>  bool zone_reclaimable(struct zone *zone)
>  {
> -	return zone_page_state(zone, NR_PAGES_SCANNED) <
> +	return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
>  		zone_reclaimable_pages(zone) * 6;
>  }
>  
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Christoph Lameter @ 2015-10-21 14:22 UTC
  To: Tetsuo Handa
  Cc: linux-mm, linux-kernel, torvalds, mhocko, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Wed, 21 Oct 2015, Tetsuo Handa wrote:

> However, if a work item which is processed before the vmstat_update
> work item gets stuck inside a memory allocation request, the values
> in vm_stat_diff[] cannot be merged into vm_stat[]. As a result,
> zone_reclaimable() keeps using outdated vm_stat[] values and the task
> in the direct reclaim path keeps believing that there are reclaimable
> pages and therefore keeps looping. The consequence is a silent
> livelock (a hang without any kernel messages) because the OOM killer
> is never invoked.

The diffs will be merged anyway once they reach a certain threshold. You
can decrease that threshold; see calculate_pressure_threshold().
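
For reference, that function in mm/vmstat.c computes the reduced
threshold roughly as follows:

int calculate_pressure_threshold(struct zone *zone)
{
	int threshold;
	int watermark_distance;

	/*
	 * Reduce the threshold so that even the maximum amount of per cpu
	 * drift cannot accidentally breach the min watermark.
	 */
	watermark_distance = low_wmark_pages(zone) - min_wmark_pages(zone);
	threshold = max(1, (int)(watermark_distance / num_online_cpus()));

	/* Maximum threshold is 125 */
	threshold = min(125, threshold);

	return threshold;
}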

Why does the merging not occur if a process gets stuck? Work requests are
not blocked by a process being stuck doing memory allocation or reclaim.


* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Michal Hocko @ 2015-10-21 14:33 UTC
  To: Christoph Lameter
  Cc: Tetsuo Handa, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Wed 21-10-15 09:22:40, Christoph Lameter wrote:
> On Wed, 21 Oct 2015, Tetsuo Handa wrote:
> 
> > However, if a work item which is processed before the vmstat_update
> > work item gets stuck inside a memory allocation request, the values
> > in vm_stat_diff[] cannot be merged into vm_stat[]. As a result,
> > zone_reclaimable() keeps using outdated vm_stat[] values and the task
> > in the direct reclaim path keeps believing that there are reclaimable
> > pages and therefore keeps looping. The consequence is a silent
> > livelock (a hang without any kernel messages) because the OOM killer
> > is never invoked.
> 
> The diffs will be merged anyway once they reach a certain threshold. You
> can decrease that threshold; see calculate_pressure_threshold().

The thing is that they will not reach the threshold. The LRUs in this
particular case are empty, so there is nothing to scan and
NR_PAGES_SCANNED doesn't increase.

> Why does the merging not occur if a process gets stuck? Work requests are
> not blocked by a process being stuck doing memory allocation or reclaim.

Because all the WQ workers are stuck somewhere, maybe in the memory
allocation which cannot make any progress and the vmstat update work is
queued behind them.

At least this is my current understanding.
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Christoph Lameter @ 2015-10-21 14:49 UTC
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Wed, 21 Oct 2015, Michal Hocko wrote:

> Because all the WQ workers are stuck somewhere, maybe in the memory
> allocation which cannot make any progress and the vmstat update work is
> queued behind them.
>
> At least this is my current understanding.

Eww. Maybe need a queue that does not do such evil things as memory
allocation?


* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Michal Hocko @ 2015-10-21 14:55 UTC
  To: Christoph Lameter
  Cc: Tetsuo Handa, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Wed 21-10-15 09:49:07, Christoph Lameter wrote:
> On Wed, 21 Oct 2015, Michal Hocko wrote:
> 
> > Because all the WQ workers are stuck somewhere, maybe in the memory
> > allocation which cannot make any progress and the vmstat update work is
> > queued behind them.
> >
> > At least this is my current understanding.
> 
> Eww. Maybe need a queue that does not do such evil things as memory
> allocation?

I am not sure how to achieve that. Requiring a non-sleeping worker would
work out, but do we have enough users to add such an API?

I would rather see vmstat using dedicated kernel thread(s) for this
purpose. We have discussed that in the past but it hasn't led anywhere.
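
A minimal, hypothetical sketch of that idea (the thread name and setup
are made up, and a real implementation would need one thread per CPU,
or cross-CPU folding, because vm_stat_diff[] is a per cpu structure):

static int vmstat_kthread(void *unused)
{
	while (!kthread_should_stop()) {
		/* fold pending vm_stat_diff[] deltas into vm_stat[] */
		refresh_cpu_vm_stats();
		schedule_timeout_interruptible(
			round_jiffies_relative(sysctl_stat_interval));
	}
	return 0;
}

	/* started once at boot, e.g. from setup_vmstat(): */
	kthread_run(vmstat_kthread, NULL, "vmstatd");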

Anyway, the workaround for this issue seems pretty trivial and shouldn't
much affect users outside of direct reclaim, so it sounds good enough to
me. Longterm we should really get rid of zone_reclaimable() from the
direct reclaim path altogether.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Tetsuo Handa @ 2015-10-21 15:39 UTC
  To: mhocko, cl
  Cc: linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker, akpm,
	hannes, vdavydov, skozina, mgorman, riel

Michal Hocko wrote:
> On Wed 21-10-15 09:49:07, Christoph Lameter wrote:
> > On Wed, 21 Oct 2015, Michal Hocko wrote:
> > 
> > > Because all the WQ workers are stuck somewhere, maybe in the memory
> > > allocation which cannot make any progress and the vmstat update work is
> > > queued behind them.

After the OOM killer is invoked, we can easily observe that vmstat_update
cannot be processed because the memory allocation done by
disk_events_workfn stalls.
http://lkml.kernel.org/r/201509120019.BJI48986.OOSVMJtOLFQHFF@I-love.SAKURA.ne.jp

I worried that blocking forever inside a work item amounts to exclusive
occupation of the workqueue. In fact, changing the allocation to
GFP_ATOMIC avoids this problem.
http://lkml.kernel.org/r/201503012017.EAD00571.HOOJVOStMFLFQF@I-love.SAKURA.ne.jp

Now we have realized that we are hitting this problem before the OOM
killer is even invoked. The situation is similar to the case after the
OOM killer is invoked: there are no reclaimable pages, but vmstat_update
cannot be processed. We are caught by a small difference in the vmstat
counter values.

> > >
> > > At least this is my current understanding.
> > 
> > Eww. Maybe need a queue that does not do such evil things as memory
> > allocation?
> 
> I am not sure how to achieve that. Requiring a non-sleeping worker would
> work out, but do we have enough users to add such an API?

If a queue does not need to sleep, can't that queue be processed from
timer context (e.g. via mod_timer())?
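
A minimal, hypothetical sketch of that idea, assuming the per cpu fold
can be made non-sleeping (all names are made up):

static struct timer_list vmstat_timer;

static void vmstat_timer_fn(unsigned long data)
{
	refresh_cpu_vm_stats();	/* would have to be non-sleeping here */
	mod_timer(&vmstat_timer, jiffies +
		  round_jiffies_relative(sysctl_stat_interval));
}

	/* at init time: */
	setup_timer(&vmstat_timer, vmstat_timer_fn, 0);
	mod_timer(&vmstat_timer, jiffies + sysctl_stat_interval);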

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Christoph Lameter @ 2015-10-21 17:16 UTC
  To: Michal Hocko
  Cc: Tetsuo Handa, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Wed, 21 Oct 2015, Michal Hocko wrote:

> I am not sure how to achieve that. Requiring a non-sleeping worker would
> work out, but do we have enough users to add such an API?
>
> I would rather see vmstat using dedicated kernel thread(s) for this
> purpose. We have discussed that in the past but it hasn't led anywhere.

How about this one? I really would like to have the vm statistics work as
designed and apparently they no longer work right with the existing
workqueue mechanism.


From: Christoph Lameter <cl@linux.com>
Subject: vmstat: Create our own workqueue

It seems that vmstat needs its own workqueue now, since the general
workqueue mechanism has been *enhanced* in a way that means the
vmstat_update work items can no longer run reliably: they are blocked
by work requests doing memory allocation, which leaves vmstat unable
to keep the counters up to date.

Bad. Fix this by creating our own workqueue.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1357,6 +1357,8 @@ static const struct file_operations proc
 #endif /* CONFIG_PROC_FS */

 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
+
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1369,7 +1371,7 @@ static void vmstat_update(struct work_st
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
 			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	} else {
@@ -1438,7 +1440,7 @@ static void vmstat_shepherd(struct work_
 		if (need_update(cpu) &&
 			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))

-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
 				&per_cpu(vmstat_work, cpu), 0);

 	put_online_cpus();
@@ -1534,6 +1536,7 @@ static int __init setup_vmstat(void)
 	proc_create("vmstat", S_IRUGO, NULL, &proc_vmstat_file_operations);
 	proc_create("zoneinfo", S_IRUGO, NULL, &proc_zoneinfo_file_operations);
 #endif
+	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
 	return 0;
 }
 module_init(setup_vmstat)

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
From: Tetsuo Handa @ 2015-10-22 11:37 UTC
  To: cl, mhocko
  Cc: linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker, akpm,
	hannes, vdavydov, skozina, mgorman, riel

Christoph Lameter wrote:
> On Wed, 21 Oct 2015, Michal Hocko wrote:
> 
> > I am not sure how to achieve that. Requiring a non-sleeping worker would
> > work out, but do we have enough users to add such an API?
> >
> > I would rather see vmstat using dedicated kernel thread(s) for this
> > purpose. We have discussed that in the past but it hasn't led anywhere.
> 
> How about this one? I really would like to have the vm statistics work as
> designed and apparently they no longer work right with the existing
> workqueue mechanism.

No, it won't help. Adding a dedicated workqueue for the vmstat_update job
merely moves that job from the "events" workqueue to the "vmstat"
workqueue. The "vmstat" workqueue still ends up in the list of busy
workqueues.

The problem is that all of these work items are queued on the same CPU
(cpus=2 in the example below, tested on a 4-CPU VM) and therefore only
one job is in flight at a time. All the other jobs wait in the pending
list for the in-flight job to complete, while the in-flight job is
blocked in a memory allocation.

The underlying problem is that a "struct task_struct" to execute the
vmstat_update job does not exist, and one cannot be created on demand
because we are stuck in a __GFP_WAIT allocation. Therefore adding a
dedicated kernel thread for the vmstat_update job would work. But ...

------------------------------------------------------------
[  133.132322] Showing busy workqueues and worker pools:
[  133.133878] workqueue events: flags=0x0
[  133.135215]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  133.137076]     pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx]
[  133.139075] workqueue events_freezable_power_: flags=0x84
[  133.140745]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  133.142638]     in-flight: 20:disk_events_workfn
[  133.144199]     pending: disk_events_workfn
[  133.145699] workqueue vmstat: flags=0xc
[  133.147055]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  133.148910]     pending: vmstat_update
[  133.150354] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=4 idle: 43 189 183
[  133.174523] DMA32 zone_reclaimable: reclaim:2(30186,30162,2) free:11163(25154,-20) min:11163 pages_scanned:0(30158,0) prio:12
[  133.177264] DMA32 zone_reclaimable: reclaim:2(30189,30165,2) free:11163(25157,-20) min:11163 pages_scanned:0(30161,0) prio:11
[  133.180139] DMA32 zone_reclaimable: reclaim:2(30191,30167,2) free:11163(25159,-20) min:11163 pages_scanned:0(30163,0) prio:10
[  133.182847] DMA32 zone_reclaimable: reclaim:2(30194,30170,2) free:11163(25162,-20) min:11163 pages_scanned:0(30166,0) prio:9
[  133.207048] DMA32 zone_reclaimable: reclaim:2(30219,30195,2) free:11163(25187,-20) min:11163 pages_scanned:0(30191,0) prio:8
[  133.209770] DMA32 zone_reclaimable: reclaim:2(30221,30197,2) free:11163(25189,-20) min:11163 pages_scanned:0(30193,0) prio:7
[  133.212470] DMA32 zone_reclaimable: reclaim:2(30224,30200,2) free:11163(25192,-20) min:11163 pages_scanned:0(30196,0) prio:6
[  133.215149] DMA32 zone_reclaimable: reclaim:2(30227,30203,2) free:11163(25195,-20) min:11163 pages_scanned:0(30199,0) prio:5
[  133.239013] DMA32 zone_reclaimable: reclaim:2(30251,30227,2) free:11163(25219,-20) min:11163 pages_scanned:0(30223,0) prio:4
[  133.241688] DMA32 zone_reclaimable: reclaim:2(30253,30229,2) free:11163(25221,-20) min:11163 pages_scanned:0(30225,0) prio:3
[  133.244332] DMA32 zone_reclaimable: reclaim:2(30256,30232,2) free:11163(25224,-20) min:11163 pages_scanned:0(30228,0) prio:2
[  133.246919] DMA32 zone_reclaimable: reclaim:2(30258,30234,2) free:11163(25226,-20) min:11163 pages_scanned:0(30230,0) prio:1
[  133.270967] DMA32 zone_reclaimable: reclaim:2(30283,30259,2) free:11163(25251,-20) min:11163 pages_scanned:0(30255,0) prio:0
[  133.273587] DMA32 zone_reclaimable: reclaim:2(30285,30261,2) free:11163(25253,-20) min:11163 pages_scanned:0(30257,0) prio:12
[  133.276224] DMA32 zone_reclaimable: reclaim:2(30287,30263,2) free:11163(25255,-20) min:11163 pages_scanned:0(30259,0) prio:11
[  133.278852] DMA32 zone_reclaimable: reclaim:2(30290,30266,2) free:11163(25258,-20) min:11163 pages_scanned:0(30262,0) prio:10
[  133.302964] DMA32 zone_reclaimable: reclaim:2(30315,30291,2) free:11163(25283,-20) min:11163 pages_scanned:0(30287,0) prio:9
[  133.305518] DMA32 zone_reclaimable: reclaim:2(30317,30293,2) free:11163(25285,-20) min:11163 pages_scanned:0(30289,0) prio:8
[  133.308095] DMA32 zone_reclaimable: reclaim:2(30319,30295,2) free:11163(25287,-20) min:11163 pages_scanned:0(30291,0) prio:7
[  133.310683] DMA32 zone_reclaimable: reclaim:2(30322,30298,2) free:11163(25290,-20) min:11163 pages_scanned:0(30294,0) prio:6
[  133.334904] DMA32 zone_reclaimable: reclaim:2(30347,30323,2) free:11163(25315,-20) min:11163 pages_scanned:0(30319,0) prio:5
[  133.337590] DMA32 zone_reclaimable: reclaim:2(30349,30325,2) free:11163(25317,-20) min:11163 pages_scanned:0(30321,0) prio:4
[  133.340147] DMA32 zone_reclaimable: reclaim:2(30351,30327,2) free:11163(25319,-20) min:11163 pages_scanned:0(30323,0) prio:3
[  133.343436] DMA32 zone_reclaimable: reclaim:2(30355,30331,2) free:11163(25323,-20) min:11163 pages_scanned:0(30327,0) prio:2
[  133.367531] DMA32 zone_reclaimable: reclaim:2(30379,30355,2) free:11163(25347,-20) min:11163 pages_scanned:0(30351,0) prio:1
[  133.370261] DMA32 zone_reclaimable: reclaim:2(30382,30358,2) free:11163(25350,-20) min:11163 pages_scanned:0(30354,0) prio:0
[  133.372786] did_some_progress=1 at line 3380
[  143.153205] MemAlloc-Info: 10 stalling task, 0 dying task, 0 victim task.
[  143.154981] MemAlloc: a.out(11052) gfp=0x24280ca order=0 delay=40104
[  143.156698] MemAlloc: abrt-watch-log(1708) gfp=0x242014a order=0 delay=39655
[  143.158527] MemAlloc: kworker/2:0(20) gfp=0x2400000 order=0 delay=39627
[  143.160247] MemAlloc: tuned(2076) gfp=0x242014a order=0 delay=39625
[  143.161920] MemAlloc: rngd(1703) gfp=0x242014a order=0 delay=38870
[  143.163618] MemAlloc: systemd-journal(471) gfp=0x242014a order=0 delay=38135
[  143.165435] MemAlloc: crond(1720) gfp=0x242014a order=0 delay=36330
[  143.167095] MemAlloc: vmtoolsd(1900) gfp=0x242014a order=0 delay=36136
[  143.168936] MemAlloc: irqbalance(1702) gfp=0x242014a order=0 delay=31584
[  143.170656] MemAlloc: nmbd(4791) gfp=0x242014a order=0 delay=30483
[  143.213896] Showing busy workqueues and worker pools:
[  143.215429] workqueue events: flags=0x0
[  143.216763]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=3/256
[  143.218542]     pending: vmpressure_work_fn, vmw_fb_dirty_flush [vmwgfx], console_callback
[  143.220785] workqueue events_freezable_power_: flags=0x84
[  143.222438]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=2/256
[  143.224239]     in-flight: 20:disk_events_workfn
[  143.225644]     pending: disk_events_workfn
[  143.227032] workqueue vmstat: flags=0xc
[  143.228303]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[  143.229996]     pending: vmstat_update
[  143.231396] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=4 idle: 43 189 183
[  144.023799] sysrq: SysRq : Kill All Tasks
------------------------------------------------------------

Do we need to use a dedicated kernel thread for the vmstat_update job?
It seems to me that refresh_cpu_vm_stats() will not sleep if we remove
cond_resched(). If the vmstat_update job does not need to sleep, why can't
we do that job from timer interrupts? We have add_timer_on(), which the
workqueue is also using.
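
A kernel-style sketch of that idea (untested and purely illustrative,
not a proposed patch; it assumes refresh_cpu_vm_stats() is made
non-sleeping, because timer callbacks run in softirq context):

static DEFINE_PER_CPU(struct timer_list, vmstat_timer);

static void vmstat_timer_fn(unsigned long data)
{
	/* Fold this CPU's vm_stat_diff[] into vm_stat[]; must not sleep. */
	refresh_cpu_vm_stats();
	/* Re-arm on this CPU; a timer is not pending inside its own handler. */
	this_cpu_ptr(&vmstat_timer)->expires =
		round_jiffies(jiffies + sysctl_stat_interval);
	add_timer_on(this_cpu_ptr(&vmstat_timer), smp_processor_id());
}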

Moreover, do we need to use atomic_long_t counters in the first place?
Can't we do something like the following?

 (1) Each CPU updates its own per-CPU counter ("struct per_cpu_pageset"
     ->vm_stat_diff[] ?).
 (2) Only one thread periodically reads a snapshot of all per-CPU
     counters, adds the diff between the latest and the previous snapshot
     to the global counters ("struct zone"->vm_stat[] ?), and saves the
     latest snapshot as the previous one.
 (3) Anyone can read the global counters at any time.

This would use no atomic operations at all, because only one task updates
the global counters while each CPU keeps using its per-CPU counters; a
minimal user-space sketch follows.
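
The sketch below is illustrative only (the names, NR_CPUS, and the
iteration count are invented for the demo, not taken from kernel code):
each "CPU" thread writes only its own slot, and a single folder adds the
delta since its previous snapshot into a plain, non-atomic global total.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NR_CPUS 4
#define ITERS   1000000L

static volatile long pcp_count[NR_CPUS]; /* one writer per slot, no atomics */
static long snapshot[NR_CPUS];           /* the folder's previous snapshot */
static long global_count;                /* written only by the folder */

static void *cpu_worker(void *arg)       /* step (1): per-CPU updates */
{
	long cpu = (long)arg;

	for (long i = 0; i < ITERS; i++)
		pcp_count[cpu]++;
	return NULL;
}

static void fold_diffs(void)             /* step (2): the single folder */
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		/* A racy read may be stale; that only delays the fold. */
		long cur = pcp_count[cpu];

		global_count += cur - snapshot[cpu];
		snapshot[cpu] = cur;
	}
}

int main(void)
{
	pthread_t tid[NR_CPUS];

	for (long cpu = 0; cpu < NR_CPUS; cpu++)
		pthread_create(&tid[cpu], NULL, cpu_worker, (void *)cpu);

	for (int i = 0; i < 5; i++) {    /* the "periodic" folding */
		usleep(100000);
		fold_diffs();
	}
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		pthread_join(tid[cpu], NULL);
	fold_diffs();                    /* final fold: exact total */

	printf("global_count=%ld (expected %ld)\n",     /* step (3) */
	       global_count, NR_CPUS * ITERS);
	return 0;
}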



Linus Torvalds wrote (off-list due to mobile post):
> Side note: it would probably be interesting to see exactly *what*
> allocation ends up screwing up using just a regular workqueue. I bet
> there are lots of other workqueue users where timeliness can be a
> big deal - they continue to work, but perhaps they cause bad
> performance if there are allocators in other workqueues that end up
> delaying them.

We might want to favor kernel threads and dying threads over normal threads.
It would help reduce TIF_MEMDIE stalls if the dependency is limited to tasks
sharing the same memory.
http://lkml.kernel.org/r/201509102318.GHG18789.OHMSLFJOQFOtFV@I-love.SAKURA.ne.jp

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 11:37             ` Tetsuo Handa
@ 2015-10-22 13:39               ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-22 13:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, 22 Oct 2015, Tetsuo Handa wrote:

> The problem would be that the "struct task_struct" needed to execute the
> vmstat_update job does not exist, and we will not be able to create one on
> demand because we are stuck in a __GFP_WAIT allocation. Therefore, adding
> a dedicated kernel thread for the vmstat_update job would work. But ...

Yuck. Can someone please get this major screwup out of the work queue
subsystem? Tejun?


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 13:39               ` Christoph Lameter
@ 2015-10-22 14:09                 ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 14:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> 
> > The problem would be that the "struct task_struct" needed to execute the
> > vmstat_update job does not exist, and we will not be able to create one on
> > demand because we are stuck in a __GFP_WAIT allocation. Therefore, adding
> > a dedicated kernel thread for the vmstat_update job would work. But ...
> 
> Yuck. Can someone please get this major screwup out of the work queue
> subsystem? Tejun?

Hmmm?  Just use a dedicated workqueue with WQ_MEM_RECLAIM.  If
concurrency management is a problem and there's something live-locking
for that work item (really?), WQ_CPU_INTENSIVE escapes it.  If this is
such a common occurrence that it makes sense to give vmstat higher
priority, set WQ_HIGHPRI.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:09                 ` Tejun Heo
@ 2015-10-22 14:21                   ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 14:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, Oct 22, 2015 at 11:09:44PM +0900, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> > On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> > 
> > > The problem would be that the "struct task_struct" needed to execute the
> > > vmstat_update job does not exist, and we will not be able to create one on
> > > demand because we are stuck in a __GFP_WAIT allocation. Therefore, adding
> > > a dedicated kernel thread for the vmstat_update job would work. But ...
> > 
> > Yuck. Can someone please get this major screwup out of the work queue
> > subsystem? Tejun?
> 
> Hmmm?  Just use a dedicated workqueue with WQ_MEM_RECLAIM.  If
> concurrency management is a problem and there's something live-locking
> for that work item (really?), WQ_CPU_INTENSIVE escapes it.  If this is
> such a common occurrence that it makes sense to give vmstat higher
> priority, set WQ_HIGHPRI.

Oooh, the HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost
while converting HIGHPRI to a separate pool, but guaranteeing immediate
scheduling for CPU_INTENSIVE is trivial.  If vmstat requires that,
please let me know.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:09                 ` Tejun Heo
@ 2015-10-22 14:22                   ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-22 14:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, 22 Oct 2015, Tejun Heo wrote:

> > Yuck. Can someone please get this major screwup out of the work queue
> > subsystem? Tejun?
>
> Hmmm?  Just use a dedicated workqueue with WQ_MEM_RECLAIM.  If
> concurrency management is a problem and there's something live-locking
> for that work item (really?), WQ_CPU_INTENSIVE escapes it.  If this is
> such a common occurrence that it makes sense to give vmstat higher
> priority, set WQ_HIGHPRI.

I did. Check the thread. The result was that other tasks were still
blocking the thread. OK, I did not use HIGHPRI; here is a newer version:


From: Christoph Lameter <cl@linux.com>
Subject: vmstat: Create our own workqueue V2

V1->V2:
   - Add a couple of workqueue flags that may fix things.

It seems that vmstat needs its own workqueue now, since the general
workqueue mechanism has been *enhanced*, which means that the
vmstat_update work items cannot run reliably but are blocked by
work requests doing memory allocation. This leaves vmstat
unable to keep the counters up to date.

Bad. Fix this by creating our own workqueue.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1382,6 +1382,8 @@ static const struct file_operations proc
 #endif /* CONFIG_PROC_FS */

 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
+
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1394,7 +1396,7 @@ static void vmstat_update(struct work_st
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
 			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	} else {
@@ -1463,7 +1465,7 @@ static void vmstat_shepherd(struct work_
 		if (need_update(cpu) &&
 			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))

-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
 				&per_cpu(vmstat_work, cpu), 0);

 	put_online_cpus();
@@ -1552,6 +1554,12 @@ static int __init setup_vmstat(void)

 	start_shepherd_timer();
 	cpu_notifier_register_done();
+	vmstat_wq = alloc_workqueue("vmstat",
+		WQ_FREEZABLE|
+		WQ_SYSFS|
+		WQ_MEM_RECLAIM|
+		WQ_HIGHPRI|
+		WQ_CPU_INTENSIVE, 0);
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:21                   ` Tejun Heo
@ 2015-10-22 14:23                     ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-22 14:23 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, 22 Oct 2015, Tejun Heo wrote:

> > Hmmm?  Just use a dedicated workqueue with WQ_MEM_RECLAIM.  If
> > concurrency management is a problem and there's something live-locking
> > for that work item (really?), WQ_CPU_INTENSIVE escapes it.  If this is
> > such a common occurrence that it makes sense to give vmstat higher
> > priority, set WQ_HIGHPRI.
>
> > Oooh, the HIGHPRI + CPU_INTENSIVE immediate scheduling guarantee got lost
> > while converting HIGHPRI to a separate pool, but guaranteeing immediate
> scheduling for CPU_INTENSIVE is trivial.  If vmstat requires that,
> please let me know.

I guess we need that; otherwise vm statistics are not updated while worker
threads are blocking on memory reclaim.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:23                     ` Christoph Lameter
@ 2015-10-22 14:24                       ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 14:24 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> I guess we need that; otherwise vm statistics are not updated while worker
> threads are blocking on memory reclaim.

And the blocking one is just constantly running?

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:24                       ` Tejun Heo
@ 2015-10-22 14:25                         ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-22 14:25 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, 22 Oct 2015, Tejun Heo wrote:

> On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> > I guess we need that; otherwise vm statistics are not updated while worker
> > threads are blocking on memory reclaim.
>
> And the blocking one is just constantly running?

I was told that there is just one task_struct, so additional workqueue
items cannot be processed while it is waiting?


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:25                         ` Christoph Lameter
@ 2015-10-22 14:33                           ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 14:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, Oct 22, 2015 at 09:25:49AM -0500, Christoph Lameter wrote:
> On Thu, 22 Oct 2015, Tejun Heo wrote:
> 
> > On Thu, Oct 22, 2015 at 09:23:54AM -0500, Christoph Lameter wrote:
> > > I guess we need that; otherwise vm statistics are not updated while worker
> > > threads are blocking on memory reclaim.
> >
> > And the blocking one is just constantly running?
> 
> I was told that there is just one task_struct, so additional workqueue
> items cannot be processed while it is waiting?

lol, no. What it tries to do is keep the number of RUNNING workers at a
minimum, so that the minimum number of workers is used and work items
are executed back-to-back on the same workers.  The moment a work item
blocks, the next worker kicks in and starts executing the next work item
in line.

The only way to hang the execution of a work item w/ WQ_MEM_RECLAIM
is to create a cyclic dependency on another work item and have that
work item busy-wait.  Workqueue thinks that the work item is making
progress as it's running and doesn't schedule the next one.

(I was misremembering here.) HIGHPRI was originally implemented as
head-queueing on the same pool followed by immediate execution, so it
could get around cases where this could happen, but that got lost
while converting it to a separate pool.  I can introduce another flag
to bypass concurrency management if necessary (it's kinda trivial), but
a busy-waiting cyclic dependency is a pretty unusual thing.

If this is actually a legit busy-waiting cyclic dependency, just let
me know.
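
To make the failure mode concrete, here is a user-space model of such a
busy-waiting cyclic dependency (illustrative only; no kernel APIs, and
all names are invented). One "worker" runs queued jobs back-to-back in
order; the first job busy-waits on a flag that only the second,
still-pending job would set, so the worker looks RUNNING forever and the
queue never advances:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile int stats_updated;   /* what the second job would set */
static volatile int second_job_ran;

/* Models the allocating work item: loops until the stats are updated. */
static void job_alloc_retry(void)
{
	while (!stats_updated)
		;                    /* busy-wait: stays "RUNNING" */
}

/* Models vmstat_update: would break the loop above, but never runs. */
static void job_vmstat_update(void)
{
	second_job_ran = 1;
	stats_updated = 1;
}

static void (*queue[2])(void) = { job_alloc_retry, job_vmstat_update };

/* One concurrency-managed "worker": executes items strictly in order. */
static void *worker(void *unused)
{
	for (int i = 0; i < 2; i++)
		queue[i]();
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, NULL);
	sleep(2);                    /* give the "pool" plenty of time */
	printf("vmstat_update ran: %s => livelock\n",
	       second_job_ran ? "yes" : "no");
	return 0;                    /* exiting kills the spinning worker */
}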

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:33                           ` Tejun Heo
@ 2015-10-22 14:41                             ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-22 14:41 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu, 22 Oct 2015, Tejun Heo wrote:

> The only way to hang the execution of a work item w/ WQ_MEM_RECLAIM
> is to create a cyclic dependency on another work item and have that
> work item busy-wait.  Workqueue thinks that the work item is making
> progress as it's running and doesn't schedule the next one.
>
> (I was misremembering here.) HIGHPRI was originally implemented as
> head-queueing on the same pool followed by immediate execution, so it
> could get around cases where this could happen, but that got lost
> while converting it to a separate pool.  I can introduce another flag
> to bypass concurrency management if necessary (it's kinda trivial), but
> a busy-waiting cyclic dependency is a pretty unusual thing.
>
> If this is actually a legit busy-waiting cyclic dependency, just let
> me know.

There is no dependency of the vmstat updater on anything; it can run at
any time. If there is a dependency, then it's created by the kworker
subsystem itself.



^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:09                 ` Tejun Heo
@ 2015-10-22 15:06                   ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-22 15:06 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Thu 22-10-15 23:09:44, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 08:39:11AM -0500, Christoph Lameter wrote:
> > On Thu, 22 Oct 2015, Tetsuo Handa wrote:
> > 
> > > The problem would be that the "struct task_struct" needed to execute the
> > > vmstat_update job does not exist, and we will not be able to create one on
> > > demand because we are stuck in a __GFP_WAIT allocation. Therefore, adding
> > > a dedicated kernel thread for the vmstat_update job would work. But ...
> > 
> > Yuck. Can someone please get this major screwup out of the work queue
> > subsystem? Tejun?
> 
> Hmmm?  Just use a dedicated workqueue with WQ_MEM_RECLAIM.

Do I get it right that if vmstat_update has its own workqueue with
WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always
be able to process vmstat_update work from the requested CPU?

That should be sufficient because vmstat_update doesn't sleep on
allocation. I agree that this would be a more appropriate fix.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 14:41                             ` Christoph Lameter
@ 2015-10-22 15:14                               ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 15:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

Hello,

On Thu, Oct 22, 2015 at 09:41:11AM -0500, Christoph Lameter wrote:
> > If this is actually a legit busy-waiting cyclic dependency, just let
> > me know.
> 
> There is no dependency of the vmstat updater on anything; it can run at
> any time. If there is a dependency, then it's created by the kworker
> subsystem itself.

Sure, the other direction is from workqueue concurrency detection.  I
was asking whether a work item can busy-wait on the vmstat_update work
item, cuz that's what confuses the workqueue.  Looking at the original
dump, the pool has idle workers, indicating that the workqueue
wasn't short of execution resources, and it really looks like that work
item was live-locking the pool.  I'll go ahead and add WQ_IMMEDIATE.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:06                   ` Michal Hocko
@ 2015-10-22 15:15                     ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 15:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Thu, Oct 22, 2015 at 05:06:23PM +0200, Michal Hocko wrote:
> Do I get it right that if vmstat_update has its own workqueue with
> WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always
> be able to process vmstat_update work from the requested CPU?

Yeah.

> That should be sufficient because vmstat_update doesn't sleep on
> allocation. I agree that this would be a more appropriate fix.

The problem seems to be the reclaim path busy-looping waiting for
vmstat_update, and the workqueue thinking that the work item must be making
forward progress and thus not starting the next work item.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:15                     ` Tejun Heo
@ 2015-10-22 15:33                       ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-22 15:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Hocko, Tetsuo Handa, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

OK, that also makes me rethink commit
ba4877b9ca51f80b5d30f304a46762f0509e1635, which seems to be a similar fix,
this time related to idle mode not updating the counters.

Could we fix that by folding the counters before going into idle mode?

That fix now seems to create two separate application interruptions, because
the vmstat update is no longer deferred to occur together with other events.


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:15                     ` Tejun Heo
@ 2015-10-22 15:35                       ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-22 15:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri 23-10-15 00:15:28, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 05:06:23PM +0200, Michal Hocko wrote:
> > Do I get it right that if vmstat_update has its own workqueue with
> > WQ_MEM_RECLAIM then there is a _guarantee_ that the rescuer will always
> > be able to process vmstat_update work from the requested CPU?
> 
> Yeah.

Thanks for the confirmation.

> > That should be sufficient because vmstat_update doesn't sleep on
> > allocation. I agree that this would be a more appropriate fix.
> 
> The problem seems to be the reclaim path busy-looping waiting for
> vmstat_update, and the workqueue thinking that the work item must be making
> forward progress and thus not starting the next work item.

But that shouldn't happen because the allocation path does cond_resched
even when nothing is really reclaimable (e.g. wait_iff_congested from
__alloc_pages_slowpath).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:35                       ` Michal Hocko
@ 2015-10-22 15:37                         ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 15:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Thu, Oct 22, 2015 at 05:35:59PM +0200, Michal Hocko wrote:
> But that shouldn't happen because the allocation path does cond_resched
> even when nothing is really reclaimable (e.g. wait_iff_congested from
> __alloc_pages_slowpath).

cond_resched() isn't enough.  The work item should go !RUNNING, not
just yielding.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:37                         ` Tejun Heo
@ 2015-10-22 15:49                           ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-22 15:49 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri 23-10-15 00:37:03, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 05:35:59PM +0200, Michal Hocko wrote:
> > But that shouldn't happen because the allocation path does cond_resched
> > even when nothing is really reclaimable (e.g. wait_iff_congested from
> > __alloc_pages_slowpath).
> 
> cond_resched() isn't enough.  The work item should go !RUNNING, not
> just yielding.

I am confused. What makes the rescuer not run? Nothing seems to be
hogging CPUs, we are just out of workers which are looping in the
allocator but that is preemptible context.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:49                           ` Michal Hocko
@ 2015-10-22 18:42                             ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 18:42 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> I am confused. What makes the rescuer not run? Nothing seems to be
> hogging CPUs, we are just out of workers which are looping in the
> allocator but that is preemptible context.

It's concurrency management.  Workqueue thinks that the pool is making
positive forward progress and doesn't schedule anything else for
execution while that work item is burning cpu cycles.
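
Conceptually it boils down to this check (a simplified sketch that folds
together the real need_more_worker()/__need_more_worker() helpers):

	/* don't wake another worker while one is counted as RUNNING */
	static bool need_more_worker(struct worker_pool *pool)
	{
		return !list_empty(&pool->worklist) &&
		       !atomic_read(&pool->nr_running);
	}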

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 18:42                             ` Tejun Heo
@ 2015-10-22 21:42                               ` Tetsuo Handa
  -1 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-10-22 21:42 UTC (permalink / raw)
  To: htejun, mhocko
  Cc: cl, linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker,
	akpm, hannes, vdavydov, skozina, mgorman, riel

Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > I am confused. What makes the rescuer not run? Nothing seems to be
> > hogging CPUs, we are just out of workers which are looping in the
> > allocator but that is preemptible context.
> 
> It's concurrency management.  Workqueue thinks that the pool is making
> positive forward progress and doesn't schedule anything else for
> execution while that work item is burning cpu cycles.

Then, isn't the change below easier to backport? It would also avoid
needlessly burning CPU cycles.

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3385,6 +3385,7 @@ retry:
 	((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+		schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 21:42                               ` Tetsuo Handa
@ 2015-10-22 22:47                                 ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-22 22:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

Hello,

On Fri, Oct 23, 2015 at 06:42:43AM +0900, Tetsuo Handa wrote:
> Then, isn't the change below easier to backport? It would also avoid
> needlessly burning CPU cycles.
> 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3385,6 +3385,7 @@ retry:
>  	((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
>  		/* Wait for some write requests to complete then retry */
>  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}

Yeah, that works too.  It should still be put on a dedicated wq with
WQ_MEM_RECLAIM, though.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:14                               ` Tejun Heo
@ 2015-10-23  4:26                                 ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-23  4:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

Hello,

So, something like the following.  It's just compile-tested, but it is
essentially a partial revert of 3270476a6c0c ("workqueue: reimplement
WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.
If, for some reason, simply adding the one-jiffy sleep gets decided
against, please let me know.  I'll verify the operation and post a
proper patch.  That said, given that this probably needs a -stable
backport and vmstat is likely to be the only user (busy loops are
really rare in the kernel, after all), I think the better approach
would be reinstating the short sleep.

Thanks.

---
 include/linux/workqueue.h |    7 ++---
 kernel/workqueue.c        |   63 +++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 63 insertions(+), 7 deletions(-)

--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -278,9 +278,10 @@ enum {
 	WQ_UNBOUND		= 1 << 1, /* not bound to any cpu */
 	WQ_FREEZABLE		= 1 << 2, /* freeze during suspend */
 	WQ_MEM_RECLAIM		= 1 << 3, /* may be used for memory reclaim */
-	WQ_HIGHPRI		= 1 << 4, /* high priority */
-	WQ_CPU_INTENSIVE	= 1 << 5, /* cpu intensive workqueue */
-	WQ_SYSFS		= 1 << 6, /* visible in sysfs, see wq_sysfs_register() */
+	WQ_IMMEDIATE		= 1 << 4, /* bypass concurrency management */
+	WQ_HIGHPRI		= 1 << 5, /* high priority */
+	WQ_CPU_INTENSIVE	= 1 << 6, /* cpu intensive workqueue */
+	WQ_SYSFS		= 1 << 7, /* visible in sysfs, see wq_sysfs_register() */
 
 	/*
 	 * Per-cpu workqueues are generally preferred because they tend to
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -68,6 +68,7 @@ enum {
 	 * attach_mutex to avoid changing binding state while
 	 * worker_attach_to_pool() is in progress.
 	 */
+	POOL_IMMEDIATE_PENDING	= 1 << 0,	/* WQ_IMMEDIATE items on queue */
 	POOL_DISASSOCIATED	= 1 << 2,	/* cpu can't serve workers */
 
 	/* worker flags */
@@ -731,7 +732,8 @@ static bool work_is_canceling(struct wor
 
 static bool __need_more_worker(struct worker_pool *pool)
 {
-	return !atomic_read(&pool->nr_running);
+	return !atomic_read(&pool->nr_running) ||
+		(pool->flags & POOL_IMMEDIATE_PENDING);
 }
 
 /*
@@ -757,7 +759,8 @@ static bool may_start_working(struct wor
 static bool keep_working(struct worker_pool *pool)
 {
 	return !list_empty(&pool->worklist) &&
-		atomic_read(&pool->nr_running) <= 1;
+		(atomic_read(&pool->nr_running) <= 1 ||
+		 (pool->flags & POOL_IMMEDIATE_PENDING));
 }
 
 /* Do we need a new worker?  Called from manager. */
@@ -1021,6 +1024,42 @@ static void move_linked_works(struct wor
 }
 
 /**
+ * pwq_determine_ins_pos - find insertion position
+ * @pwq: pwq a work is being queued for
+ *
+ * A work for @pwq is about to be queued on @pwq->pool, determine insertion
+ * position for the work.  If @pwq is for IMMEDIATE wq, the work item is
+ * queued at the head of the queue but in FIFO order with respect to other
+ * IMMEDIATE work items; otherwise, at the end of the queue.  This function
+ * also sets POOL_IMMEDIATE_PENDING flag to hint @pool that there are
+ * IMMEDIATE works pending.
+ *
+ * CONTEXT:
+ * spin_lock_irq(gcwq->lock).
+ *
+ * RETURNS:
+ * Pointer to insertion position.
+ */
+static struct list_head *pwq_determine_ins_pos(struct pool_workqueue *pwq)
+{
+	struct worker_pool *pool = pwq->pool;
+	struct work_struct *twork;
+
+	if (likely(!(pwq->wq->flags & WQ_IMMEDIATE)))
+		return &pool->worklist;
+
+	list_for_each_entry(twork, &pool->worklist, entry) {
+		struct pool_workqueue *tpwq = get_work_pwq(twork);
+
+		if (!(tpwq->wq->flags & WQ_IMMEDIATE))
+			break;
+	}
+
+	pool->flags |= POOL_IMMEDIATE_PENDING;
+	return &twork->entry;
+}
+
+/**
  * get_pwq - get an extra reference on the specified pool_workqueue
  * @pwq: pool_workqueue to get
  *
@@ -1081,9 +1120,10 @@ static void put_pwq_unlocked(struct pool
 static void pwq_activate_delayed_work(struct work_struct *work)
 {
 	struct pool_workqueue *pwq = get_work_pwq(work);
+	struct list_head *pos = pwq_determine_ins_pos(pwq);
 
 	trace_workqueue_activate_work(work);
-	move_linked_works(work, &pwq->pool->worklist, NULL);
+	move_linked_works(work, pos, NULL);
 	__clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work));
 	pwq->nr_active++;
 }
@@ -1384,7 +1424,7 @@ retry:
 	if (likely(pwq->nr_active < pwq->max_active)) {
 		trace_workqueue_activate_work(work);
 		pwq->nr_active++;
-		worklist = &pwq->pool->worklist;
+		worklist = pwq_determine_ins_pos(pwq);
 	} else {
 		work_flags |= WORK_STRUCT_DELAYED;
 		worklist = &pwq->delayed_works;
@@ -1996,6 +2036,21 @@ __acquires(&pool->lock)
 	list_del_init(&work->entry);
 
 	/*
+	 * If IMMEDIATE_PENDING, check the next work, and, if IMMEDIATE,
+	 * wake up another worker; otherwise, clear IMMEDIATE_PENDING.
+	 */
+	if (unlikely(pool->flags & POOL_IMMEDIATE_PENDING)) {
+		struct work_struct *nwork = list_first_entry(&pool->worklist,
+						struct work_struct, entry);
+
+		if (!list_empty(&pool->worklist) &&
+		    get_work_pwq(nwork)->wq->flags & WQ_IMMEDIATE)
+			wake_up_worker(pool);
+		else
+			pool->flags &= ~POOL_IMMEDIATE_PENDING;
+	}
+
+	/*
 	 * CPU intensive works don't participate in concurrency management.
 	 * They're the scheduler's responsibility.  This takes @worker out
 	 * of concurrency management and the next code block will chain
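
If this route is taken, the vmstat side would then presumably be a
one-liner along these lines (an assumption, not part of this patch):

	vmstat_wq = alloc_workqueue("vmstat", WQ_MEM_RECLAIM | WQ_IMMEDIATE, 0);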

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 18:42                             ` Tejun Heo
@ 2015-10-23  8:33                               ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-23  8:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri 23-10-15 03:42:26, Tejun Heo wrote:
> On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > I am confused. What makes the rescuer not run? Nothing seems to be
> > hogging CPUs, we are just out of workers which are looping in the
> > allocator but that is preemptible context.
> 
> It's concurrency management.  Workqueue thinks that the pool is making
> positive forward progress and doesn't schedule anything else for
> execution while that work item is burning cpu cycles.

Ohh, OK, I can see wq_worker_sleeping now. I've missed your point in
the other email, sorry about that. But now I am wondering whether this
is an intended behavior. The documentation says:
  WQ_MEM_RECLAIM

        All wq which might be used in the memory reclaim paths _MUST_
        have this flag set.  The wq is guaranteed to have at least one
        execution context regardless of memory pressure.

Which doesn't seem to be true currently, right? Now I can see your patch
to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users
could do without WQ_IMMEDIATE? I mean all current workers might be
looping in the page allocator and it seems possible that WQ_MEM_RECLAIM
work items might be waiting behind them so they cannot help to relieve
the memory pressure. This doesn't sound right to me. Or I am completely
confused and still fail to understand what WQ_MEM_RECLAIM is supposed to
be used for.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 21:42                               ` Tetsuo Handa
@ 2015-10-23  8:36                                 ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-23  8:36 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: htejun, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Fri 23-10-15 06:42:43, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > > I am confused. What makes the rescuer not run? Nothing seems to be
> > > hogging CPUs, we are just out of workers which are looping in the
> > > allocator but that is preemptible context.
> > 
> > It's concurrency management.  Workqueue thinks that the pool is making
> > positive forward progress and doesn't schedule anything else for
> > execution while that work item is burning cpu cycles.
> 
> Then, isn't the change below easier to backport? It would also avoid
> needlessly burning CPU cycles.

This is quite obscure. If the vmstat_update fix needs workqueue tweaks
as well, then I would vote for your original patch, which is clear,
straightforward, and easy to backport.

If WQ_MEM_RECLAIM can really guarantee one worker as described in the
documentation, then I agree that fixing vmstat is a better fix. But that
doesn't seem to be the case currently.
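
For reference, the vmstat fix in question would be roughly the following
(a sketch, assuming a dedicated workqueue is acceptable; not a tested
patch):

	static struct workqueue_struct *vmstat_wq;

	/* at init time: own workqueue, so a rescuer is guaranteed */
	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE | WQ_MEM_RECLAIM, 0);

and then queueing vmstat_work on vmstat_wq instead of the system one.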
 
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3385,6 +3385,7 @@ retry:
>  	((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
>  		/* Wait for some write requests to complete then retry */
>  		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
> +		schedule_timeout_uninterruptible(1);
>  		goto retry;
>  	}
>  

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-22 15:33                       ` Christoph Lameter
@ 2015-10-23  8:37                         ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-23  8:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tejun Heo, Tetsuo Handa, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Thu 22-10-15 10:33:20, Christoph Lameter wrote:
> Ok that also makes me rethink commit
> ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix
> this time related to idle mode not updating the counters.
> 
> Could we fix that by folding the counters before going to idle mode?

This would work as well.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23  8:33                               ` Michal Hocko
@ 2015-10-23 10:36                                 ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-23 10:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

Hello, Michal.

On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote:
> Ohh, OK, I can see wq_worker_sleeping now. I've missed your point in
> the other email, sorry about that. But now I am wondering whether this
> is an intended behavior. The documentation says:

This is.

>   WQ_MEM_RECLAIM
> 
>         All wq which might be used in the memory reclaim paths _MUST_
>         have this flag set.  The wq is guaranteed to have at least one
>         execution context regardless of memory pressure.
> 
> Which doesn't seem to be true currently, right? Now I can see your patch

It is true.

> to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users
> could do without WQ_IMMEDIATE? I mean all current workers might be
> looping in the page allocator and it seems possible that WQ_MEM_RECLAIM
> work items might be waiting behind them so they cannot help to relieve
> the memory pressure. This doesn't sound right to me. Or I am completely
> confused and still fail to understand what WQ_MEM_RECLAIM is supposed to
> be used for.

It guarantees that there always is enough execution resource to
execute a work item from that workqueue.  The problem here is not lack
of execution resource but concurrency management misunderstanding the
situation.  This also can be fixed by teaching concurrency management
to be a bit smarter - e.g. if a work item is burning a lot of CPU
cycles continuously or pool hasn't finished a work item over a certain
amount of time, automatically ignore the in-flight work item for the
purpose of concurrency management; however, this sort of inter-work-item
busy wait is so extremely rare and undesirable that I'm not
sure the added complexity would be worthwhile.
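
Such a heuristic might look something like this (purely hypothetical:
WORKER_CM_TIMEOUT does not exist and the hook placement is hand-waved):

	/* stop counting a long-running item for concurrency management */
	if (worker->current_work &&
	    time_after(jiffies, worker->last_active + WORKER_CM_TIMEOUT))
		worker->flags |= WORKER_CPU_INTENSIVE;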

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23  8:36                                 ` Michal Hocko
@ 2015-10-23 10:37                                   ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-23 10:37 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, cl, linux-mm, linux-kernel, torvalds, rientjes,
	oleg, kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Fri, Oct 23, 2015 at 10:36:12AM +0200, Michal Hocko wrote:
> If WQ_MEM_RECLAIM can really guarantee one worker as described in the
> documentation, then I agree that fixing vmstat is a better fix. But that
> doesn't seem to be the case currently.

It does.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23 10:36                                 ` Tejun Heo
@ 2015-10-23 11:11                                   ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-23 11:11 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri 23-10-15 19:36:30, Tejun Heo wrote:
> Hello, Michal.
> 
> On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote:
> > Ohh, OK, I can see wq_worker_sleeping now. I've missed your point in
> > the other email, sorry about that. But now I am wondering whether this
> > is an intended behavior. The documentation says:
> 
> This is.
> 
> >   WQ_MEM_RECLAIM
> > 
> >         All wq which might be used in the memory reclaim paths _MUST_
> >         have this flag set.  The wq is guaranteed to have at least one
> >         execution context regardless of memory pressure.
> > 
> > Which doesn't seem to be true currently, right? Now I can see your patch
> 
> It is true.
> 
> > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users
> > could do without WQ_IMMEDIATE? I mean all current workers might be
> > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM
> > work items might be waiting behind them so they cannot help to relieve
> > the memory pressure. This doesn't sound right to me. Or I am completely
> > confused and still fail to understand what WQ_MEM_RECLAIM is supposed to
> > be used for.
> 
> It guarantees that there always is enough execution resource to
> execute a work item from that workqueue. 

OK, strictly speaking the rescuer is there, but it is kind of pointless
if it doesn't fire up and do any work.

> The problem here is not lack
> of execution resource but concurrency management misunderstanding the
> situation. 

And this sounds like a bug to me.

> This also can be fixed by teaching concurrency management
> to be a bit smarter - e.g. if a work item is burning a lot of CPU
> cycles continuously or pool hasn't finished a work item over a certain
> amount of time, automatically ignore the in-flight work item for the
> purpose of concurrency management; however, this sort of inter-work-item
> busy wait is so extremely rare and undesirable that I'm not
> sure the added complexity would be worthwhile.

Don't we have some IO-related paths which would suffer from the same
problem? I haven't checked all the WQ_MEM_RECLAIM users, but from the
name I would expect they _do_ participate in the reclaim, and so they
should be able to make progress. Now if your new IMMEDIATE flag will
guarantee that, then I would argue that it should be implicit for
WQ_MEM_RECLAIM; otherwise we always risk a similar situation. What would
be a counterargument for doing that?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
  2015-10-23  8:37                         ` Michal Hocko
@ 2015-10-23 11:43                           ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-23 11:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Tetsuo Handa, linux-mm, linux-kernel, torvalds,
	David Rientjes, oleg, kwalker, akpm, hannes, vdavydov, skozina,
	mgorman, riel

On Fri, 23 Oct 2015, Michal Hocko wrote:

> On Thu 22-10-15 10:33:20, Christoph Lameter wrote:
> > Ok that also makes me rethink commit
> > ba4877b9ca51f80b5d30f304a46762f0509e1635 which seems to be a similar fix
> > this time related to idle mode not updating the counters.
> >
> > Could we fix that by folding the counters before going to idle mode?
>
> This would work as well.

Is this ok?


Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
interruptions of the applications because the vmstat updater may run at
different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that
shuts down the vmstat updater when we go idle by folding the differentials.
Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);

-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
Index: linux/kernel/sched/loadavg.c
===================================================================
--- linux.orig/kernel/sched/loadavg.c
+++ linux/kernel/sched/loadavg.c
@@ -191,6 +191,8 @@ void calc_load_enter_idle(void)

 		atomic_long_add(delta, &calc_load_idle[idx]);
 	}
+	/* Fold the current vmstat counters and disable vmstat updater */
+	quiet_vmstat();
 }

 void calc_load_exit_idle(void)

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
  2015-10-23 11:43                           ` Christoph Lameter
@ 2015-10-23 12:07                             ` Sergey Senozhatsky
  -1 siblings, 0 replies; 122+ messages in thread
From: Sergey Senozhatsky @ 2015-10-23 12:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Michal Hocko, Tejun Heo, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel, Sergey Senozhatsky

On (10/23/15 06:43), Christoph Lameter wrote:
> Is this ok?

kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’:
kernel/sched/loadavg.c:195:2: error: implicit declaration of function ‘quiet_vmstat’ [-Werror=implicit-function-declaration]
  quiet_vmstat();
    ^
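
Presumably kernel/sched/loadavg.c is just missing the declaration (an
assumption based on the error above):

	#include <linux/vmstat.h>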

> Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle
> 
> Currently the vmstat updater is not deferrable as a result of commit
> ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
> interruptions of the applications because the vmstat updater may run at
> different times than tick processing. No good.
> 
> Make vmstate_update deferrable again and provide a function that
> shuts down the vmstat updater when we go idle by folding the differentials.
> Shut it down from the load average calculation logic introduced by nohz.
> 
> Note that the shepherd thread will continue scanning the differentials
> from another processor and will reenable the vmstat workers if it
> detects any changes.
> 
> Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/vmstat.c
> ===================================================================
> --- linux.orig/mm/vmstat.c
> +++ linux/mm/vmstat.c
> @@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
>  }
> 
>  /*
> + * Switch off vmstat processing and then fold all the remaining differentials
> + * until the diffs stay at zero. The function is used by NOHZ and can only be
> + * invoked when tick processing is not active.
> + */
> +void quiet_vmstat(void)
> +{
> +	do {
> +		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> +			cancel_delayed_work(this_cpu_ptr(&vmstat_work));

shouldn't preemption be disabled for smp_processor_id() here?
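
One way to make that explicit (a sketch of the same loop body with
preemption disabled across the check):

	int cpu = get_cpu();	/* disables preemption */

	if (!cpumask_test_and_set_cpu(cpu, cpu_stat_off))
		cancel_delayed_work(this_cpu_ptr(&vmstat_work));
	put_cpu();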

	-ss

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23 11:11                                   ` Michal Hocko
@ 2015-10-23 12:25                                     ` Tetsuo Handa
  -1 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-10-23 12:25 UTC (permalink / raw)
  To: mhocko, htejun
  Cc: cl, linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker,
	akpm, hannes, vdavydov, skozina, mgorman, riel

Michal Hocko wrote:
> On Fri 23-10-15 19:36:30, Tejun Heo wrote:
> > Hello, Michal.
> > 
> > On Fri, Oct 23, 2015 at 10:33:16AM +0200, Michal Hocko wrote:
> > > Ohh, OK I can see wq_worker_sleeping now. I've missed your point in
> > > other email, sorry about that. But now I am wondering whether this
> > > is an intended behavior. The documentation says:
> > 
> > This is.
> > 
> > >   WQ_MEM_RECLAIM
> > > 
> > >         All wq which might be used in the memory reclaim paths _MUST_
> > >         have this flag set.  The wq is guaranteed to have at least one
> > >         execution context regardless of memory pressure.
> > > 
> > > Which doesn't seem to be true currently, right? Now I can see your patch
> > 
> > It is true.
> > 
> > > to introduce WQ_IMMEDIATE but I am wondering which WQ_MEM_RECLAIM users
> > > could do without WQ_IMMEDIATE? I mean all current workers might be
> > > looping in the page allocator and it seems possible that WQ_MEM_RECLAIM
> > > work items might be waiting behind them so they cannot help to relieve
> > > the memory pressure. This doesn't sound right to me. Or I am completely
> > > confused and still fail to understand what is WQ_MEM_RECLAIM supposed to
> > > be used for.
> > 
> > It guarantees that there always is enough execution resource to
> > execute a work item from that workqueue. 
> 
> OK, strictly speaking the rescuer is there but it is kind of pointless
> if it doesn't fire up and do any work.
> 
> > The problem here is not lack
> > of execution resource but concurrency management misunderstanding the
> > situation. 
> 
> And this sounds like a bug to me.
> 
> > This also can be fixed by teaching concurrency management
> > to be a bit smarter - e.g. if a work item is burning a lot of CPU
> > cycles continuously or pool hasn't finished a work item over a certain
> > amount of time, automatically ignore the in-flight work item for the
> > purpose of concurrency management; however, this sort of inter-work
> > item busy waits are so extremely rare and undesirable that I'm not
> > sure the added complexity would be worthwhile.
> 
> Don't we have some IO related paths which would suffer from the same
> problem? I haven't checked all the WQ_MEM_RECLAIM users but from the
> name I would expect they _do_ participate in the reclaim and so they
> should be able to make a progress. Now if your new IMMEDIATE flag will
> guarantee that then I would argue that it should be implicit for
> WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> be a counter argument for doing that?

WQ_MEM_RECLAIM only guarantees that a "struct task_struct" (the rescuer)
is preallocated, in order to avoid failing an on-demand GFP_KERNEL
allocation? Is this correct?

WQ_CPU_INTENSIVE only guarantees that work items don't participate in
concurrency management in order to avoid failing to wake up a "struct
task_struct" which will process the work items? Is this correct?

Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without
WQ_CPU_INTENSIVE"? In other words, must any "struct task_struct" which
calls rescuer_thread() imply WQ_CPU_INTENSIVE, in order to avoid failing
to wake up due to participating in concurrency management?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
  2015-10-23 12:07                             ` Sergey Senozhatsky
@ 2015-10-23 14:12                               ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-23 14:12 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Michal Hocko, Tejun Heo, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri, 23 Oct 2015, Sergey Senozhatsky wrote:

> On (10/23/15 06:43), Christoph Lameter wrote:
> > Is this ok?
>
> kernel/sched/loadavg.c: In function ‘calc_load_enter_idle’:
> kernel/sched/loadavg.c:195:2: error: implicit declaration of function ‘quiet_vmstat’ [-Werror=implicit-function-declaration]
>   quiet_vmstat();
>     ^

Oww... Not good to do that in the scheduler. OK, a new patch follows that
does the call from tick_nohz_stop_sched_tick(). Hopefully that is the right
location to call quiet_vmstat().

> > +		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> > +			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
>
> > shouldn't preemption be disabled for smp_processor_id() here?

Preemption is disabled when quiet_vmstat() is called.
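
As a minimal sketch of why that matters (a hypothetical variant, not the
posted patch; it reuses cpu_stat_off and vmstat_work from the patch and
omits the folding loop): smp_processor_id() is only stable while
preemption is off, so a call site that did not already disable preemption
would have to pin the CPU explicitly, e.g. with get_cpu()/put_cpu():

	static void quiet_vmstat_pinned(void)
	{
		int cpu = get_cpu();	/* disables preemption */

		if (!cpumask_test_and_set_cpu(cpu, cpu_stat_off))
			cancel_delayed_work(this_cpu_ptr(&vmstat_work));

		put_cpu();		/* re-enables preemption */
	}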



Subject: Fix vmstat: make vmstat_updater deferrable again and shut down on idle V2

V1->V2
 - Call quiet_vmstat() from tick_nohz_stop_sched_tick() instead.

Currently the vmstat updater is not deferrable as a result of commit
ba4877b9ca51f80b5d30f304a46762f0509e1635. This in turn can cause multiple
interruptions of the applications because the vmstat updater may run at
different times than tick processing. No good.

Make vmstat_update deferrable again and provide a function that
shuts down the vmstat updater when we go idle by folding the differentials.
Shut it down from the load average calculation logic introduced by nohz.

Note that the shepherd thread will continue scanning the differentials
from another processor and will reenable the vmstat workers if it
detects any changes.

Fixes: ba4877b9ca51f80b5d30f304a46762f0509e1635 (do not use deferrable delay)
Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/vmstat.c
===================================================================
--- linux.orig/mm/vmstat.c
+++ linux/mm/vmstat.c
@@ -1395,6 +1395,20 @@ static void vmstat_update(struct work_st
 }

 /*
+ * Switch off vmstat processing and then fold all the remaining differentials
+ * until the diffs stay at zero. The function is used by NOHZ and can only be
+ * invoked when tick processing is not active.
+ */
+void quiet_vmstat(void)
+{
+	do {
+		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
+			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
+
+	} while (refresh_cpu_vm_stats());
+}
+
+/*
  * Check if the diffs for a certain cpu indicate that
  * an update is needed.
  */
@@ -1426,7 +1440,7 @@ static bool need_update(int cpu)
  */
 static void vmstat_shepherd(struct work_struct *w);

-static DECLARE_DELAYED_WORK(shepherd, vmstat_shepherd);
+static DECLARE_DEFERRABLE_WORK(shepherd, vmstat_shepherd);

 static void vmstat_shepherd(struct work_struct *w)
 {
Index: linux/include/linux/vmstat.h
===================================================================
--- linux.orig/include/linux/vmstat.h
+++ linux/include/linux/vmstat.h
@@ -211,6 +211,7 @@ extern void __inc_zone_state(struct zone
 extern void dec_zone_state(struct zone *, enum zone_stat_item);
 extern void __dec_zone_state(struct zone *, enum zone_stat_item);

+void quiet_vmstat(void);
 void cpu_vm_stats_fold(int cpu);
 void refresh_zone_stat_thresholds(void);

@@ -272,6 +273,7 @@ static inline void __dec_zone_page_state
 static inline void refresh_cpu_vm_stats(int cpu) { }
 static inline void refresh_zone_stat_thresholds(void) { }
 static inline void cpu_vm_stats_fold(int cpu) { }
+static inline void quiet_vmstat(void) { }

 static inline void drain_zonestat(struct zone *zone,
 			struct per_cpu_pageset *pset) { }
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -667,6 +667,7 @@ static ktime_t tick_nohz_stop_sched_tick
 	 */
 	if (!ts->tick_stopped) {
 		nohz_balance_enter_idle(cpu);
+		quiet_vmstat();
 		calc_load_enter_idle();

 		ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
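
For reference, a minimal sketch of the behavioral difference the
DECLARE_DEFERRABLE_WORK change above relies on (stats_fn and both work
item names are made up for illustration): a deferrable delayed work uses
a deferrable timer, so an idle CPU is not woken just to run it; the work
runs when the CPU next becomes non-idle anyway.

	#include <linux/workqueue.h>

	static void stats_fn(struct work_struct *w)
	{
		/* fold per-cpu counters, possibly reschedule, ... */
	}

	static DECLARE_DELAYED_WORK(eager_work, stats_fn);	/* normal timer */
	static DECLARE_DEFERRABLE_WORK(lazy_work, stats_fn);	/* deferrable timer */

	static void arm_both(void)
	{
		/* fires after ~1s even if this CPU is idle */
		schedule_delayed_work(&eager_work, HZ);
		/* fires no earlier than 1s from now, on the next wakeup */
		schedule_delayed_work(&lazy_work, HZ);
	}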

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
  2015-10-23 14:12                               ` Christoph Lameter
@ 2015-10-23 14:49                                 ` Sergey Senozhatsky
  -1 siblings, 0 replies; 122+ messages in thread
From: Sergey Senozhatsky @ 2015-10-23 14:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Sergey Senozhatsky, Michal Hocko, Tejun Heo, Tetsuo Handa,
	linux-mm, linux-kernel, torvalds, David Rientjes, oleg, kwalker,
	akpm, hannes, vdavydov, skozina, mgorman, riel

On (10/23/15 09:12), Christoph Lameter wrote:
[..]
> > > +		if (!cpumask_test_and_set_cpu(smp_processor_id(), cpu_stat_off))
> > > +			cancel_delayed_work(this_cpu_ptr(&vmstat_work));
> >
> > shouldn't preemption be disabled for smp_processor_id() here?
> 
> Preemption is disabled when quiet_vmstat() is called.
> 

cond_resched()

[   29.607725] BUG: sleeping function called from invalid context at mm/vmstat.c:487
[   29.607729] in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/7
[   29.607731] no locks held by swapper/7/0.
[   29.607732] irq event stamp: 48932
[   29.607733] hardirqs last  enabled at (48931): [<ffffffff813b246a>] _raw_spin_unlock_irq+0x2c/0x37
[   29.607739] hardirqs last disabled at (48932): [<ffffffff810a3fec>] tick_nohz_idle_enter+0x3c/0x5f
[   29.607743] softirqs last  enabled at (48924): [<ffffffff81041fd8>] __do_softirq+0x2bb/0x3a9
[   29.607747] softirqs last disabled at (48893): [<ffffffff810422a7>] irq_exit+0x41/0x95
[   29.607752] CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.3.0-rc6-next-20151022-dbg-00003-g01184ff-dirty #261
[   29.607754]  0000000000000000 ffff88041dae7da0 ffffffff811dd4f3 ffff88041dacd100
[   29.607756]  ffff88041dae7dc8 ffffffff8105f144 ffffffff8169f800 0000000000000000
[   29.607759]  0000000000000007 ffff88041dae7e70 ffffffff811040b1 0000000000000002
[   29.607761] Call Trace:
[   29.607767]  [<ffffffff811dd4f3>] dump_stack+0x4b/0x63
[   29.607770]  [<ffffffff8105f144>] ___might_sleep+0x1e7/0x1ee
[   29.607773]  [<ffffffff811040b1>] refresh_cpu_vm_stats+0x8b/0xb5
[   29.607776]  [<ffffffff81104f4c>] quiet_vmstat+0x3a/0x41
[   29.607778]  [<ffffffff810a3ccf>] __tick_nohz_idle_enter+0x292/0x410
[   29.607781]  [<ffffffff810a4007>] tick_nohz_idle_enter+0x57/0x5f
[   29.607784]  [<ffffffff81076d8b>] cpu_startup_entry+0x36/0x330
[   29.607788]  [<ffffffff81028821>] start_secondary+0xf3/0xf6



by the way, tick_nohz_stop_sched_tick() receives cpu from __tick_nohz_idle_enter().
do you want to pass it to quiet_vmstat()?

	if (!ts->tick_stopped) {
		nohz_balance_enter_idle(cpu);
-		quiet_vmstat();
+		quiet_vmstat(cpu);
		calc_load_enter_idle();

	-ss

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: Make vmstat deferrable again (was Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks)
  2015-10-23 14:49                                 ` Sergey Senozhatsky
@ 2015-10-23 16:10                                   ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-10-23 16:10 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Michal Hocko, Tejun Heo, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri, 23 Oct 2015, Sergey Senozhatsky wrote:

> by the way, tick_nohz_stop_sched_tick() receives cpu from __tick_nohz_idle_enter().
> do you want to pass it to quiet_vmstat()?

No, this is quite wrong at this point. quiet_vmstat() needs to be called
from the cpu that is going into the idle state.

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23 11:11                                   ` Michal Hocko
@ 2015-10-23 18:21                                     ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-23 18:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

Hello,

On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > The problem here is not lack
> > of execution resource but concurrency management misunderstanding the
> > situation. 
> 
> And this sounds like a bug to me.

I don't know.  It can be argued either way, the other direction being
that a kernel thread staying RUNNING non-stop is itself buggy.  Given how
this has been a complete non-issue for all these years, I'm not sure how
useful plugging this is.

> Don't we have some IO related paths which would suffer from the same
> problem? I haven't checked all the WQ_MEM_RECLAIM users but from the
> name I would expect they _do_ participate in the reclaim and so they
> should be able to make a progress. Now if your new IMMEDIATE flag will

Seriously, nobody goes full-on RUNNING.

> guarantee that then I would argue that it should be implicit for
> WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> be a counter argument for doing that?

Not serving any actual purpose and degrading execution behavior.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23 12:25                                     ` Tetsuo Handa
@ 2015-10-23 18:23                                       ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-23 18:23 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

Hello, Tetsuo.

On Fri, Oct 23, 2015 at 09:25:11PM +0900, Tetsuo Handa wrote:
> WQ_MEM_RECLAIM only guarantees that a "struct task_struct" is preallocated
> in order to avoid failing to allocate it on demand due to a GFP_KERNEL
> allocation? Is this correct?
> 
> WQ_CPU_INTENSIVE only guarantees that work items don't participate in
> concurrency management in order to avoid failing to wake up a "struct
> task_struct" which will process the work items? Is this correct?

CPU_INTENSIVE avoids the tail end of concurrency management.  The
previous HIGHPRI or the posted IMMEDIATE avoids the head end.
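
For concreteness, a minimal sketch of how a caller combines the two
existing flags discussed here (the queue name and init function are made
up; the proposed WQ_IMMEDIATE flag is not used since it is not merged):

	static struct workqueue_struct *my_reclaim_wq;

	static int __init my_module_init(void)
	{
		/* WQ_MEM_RECLAIM preallocates a rescuer thread;
		 * WQ_CPU_INTENSIVE exempts the work items from
		 * concurrency management. */
		my_reclaim_wq = alloc_workqueue("my_reclaim_wq",
				WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE, 0);
		return my_reclaim_wq ? 0 : -ENOMEM;
	}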

> Is Michal's question "does it make sense to use WQ_MEM_RECLAIM without
> WQ_CPU_INTENSIVE"? In other words, any "struct task_struct" which calls
> rescuer_thread() must imply WQ_CPU_INTENSIVE in order to avoid failing to
> wake up due to being participated in concurrency management?

If this is an actual problem, a better approach would be something
which detects the stall condition and kicks off the next work item, but
even if we do that I think I'd still trigger a warning there.  I don't
know.  Don't go busy waiting in the kernel.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23 18:23                                       ` Tejun Heo
@ 2015-10-25 10:52                                         ` Tetsuo Handa
  -1 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-10-25 10:52 UTC (permalink / raw)
  To: mhocko, cl, htejun
  Cc: linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker, akpm,
	hannes, vdavydov, skozina, mgorman, riel

Tejun Heo wrote:
> If this is an actual problem, a better approach would be something
> which detects the stall condition and kicks off the next work item but
> if we do that I think I'd still trigger a warning there.  I don't
> know.  Don't go busy waiting in kernel.

Busy waiting in the kernel covers several cases.

  (1) Wait for something with interrupts disabled.

  (2) Wait for something with interrupts enabled but
      without calling cond_resched() etc.

  (3) Wait for something with interrupts enabled and
      with calling cond_resched() etc.

  (4) Wait for something with interrupts enabled and
      with calling schedule_timeout() etc.

Kernel code tries to minimize (1). Kernel code does (2) if it is not
allowed to sleep. But kernel code is allowed to do (3) if it is allowed
to sleep, as long as cond_resched() is sometimes called. And currently
the page allocator does (3). But kernel code invoked via a workqueue is
expected to do (4) rather than (3).
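
A minimal sketch of the difference between (3) and (4) ("done" is just a
placeholder condition): cond_resched() may yield the CPU but the task
stays in TASK_RUNNING, while schedule_timeout_uninterruptible() moves it
to TASK_UNINTERRUPTIBLE, which is what lets the workqueue dispatch the
next item:

	/* (3): yields, but the task never leaves TASK_RUNNING */
	while (!done)
		cond_resched();

	/* (4): sleeps in TASK_UNINTERRUPTIBLE for one tick per iteration */
	while (!done)
		schedule_timeout_uninterruptible(1);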

This means that any kernel code which invokes a __GFP_WAIT allocation
might fail to do (4) when invoked via a workqueue, regardless of the
flags passed to alloc_workqueue()?

Michal Hocko wrote:
> On Fri 23-10-15 06:42:43, Tetsuo Handa wrote:
> > Tejun Heo wrote:
> > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > > > I am confused. What makes rescuer to not run? Nothing seems to be
> > > > hogging CPUs, we are just out of workers which are loopin in the
> > > > allocator but that is preemptible context.
> > > 
> > > It's concurrency management.  Workqueue thinks that the pool is making
> > > positive forward progress and doesn't schedule anything else for
> > > execution while that work item is burning cpu cycles.
> > 
> > Then, isn't below change easier to backport which will also alleviate
> > needlessly burning CPU cycles?
> 
> This is quite obscure. If the vmstat_update fix needs workqueue tweaks
> as well then I would vote for your original patch which is clear,
> straightforward and easy to backport.

I think that inserting a short sleep into the page allocator is better,
because the vmstat_update fix will not require workqueue tweaks if we
sleep inside the page allocator. It also protects the page allocator
from going unresponsive when hundreds of tasks start busy-waiting at
__alloc_pages_slowpath(): we can observe that the XXX value in the
"MemAlloc-Info: XXX stalling task," line grows when we are unable to
make forward progress.

----------------------------------------
>From a2f34850c26b5bb124d44983f5a2020b51249d53 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 25 Oct 2015 19:42:15 +0900
Subject: [PATCH] mm,page_alloc: Insert an uninterruptible sleep before
 retrying.

Since "struct zone"->vm_stat[] is array of atomic_long_t, an effort
to reduce frequency of updating values in vm_stat[] is made by using
per cpu variables "struct per_cpu_pageset"->vm_stat_diff[].
Values in vm_stat_diff[] are merged into vm_stat[] periodically
using vmstat_update workqueue item (struct delayed_work vmstat_work).

When a task attempts to allocate memory and reaches the direct reclaim
path, shrink_zones() checks whether there are reclaimable pages by
calling zone_reclaimable(). zone_reclaimable() makes its decision based
on values in vm_stat[] by calling zone_page_state(). This is usually
fine because values in vm_stat_diff[] are expected to be merged into
vm_stat[] shortly.

But workqueue and page allocator have different assumptions.

  (A) The workqueue defers processing of other items unless the
      currently in-flight item enters the !TASK_RUNNING state.

  (B) The page allocator never enters the !TASK_RUNNING state if there
      is nothing to reclaim. (The page allocator calls cond_resched()
      via wait_iff_congested(), but cond_resched() does not make the
      task enter the !TASK_RUNNING state.)

Therefore, if a workqueue item which is processed before the
vmstat_update item gets stuck inside a memory allocation request, values
in vm_stat_diff[] cannot be merged into vm_stat[].

As a result, zone_reclaimable() continues using outdated vm_stat[]
values, and the task in the direct reclaim path thinks that there are
still reclaimable pages and therefore continues looping.

The consequence is a silent livelock (a hang without any kernel
messages) because the OOM killer will not be invoked. We can hit such a
livelock via e.g. the disk_events_workfn workqueue item doing memory
allocation from bio_copy_kern().

----------------------------------------
[  255.054205] kworker/3:1     R  running task        0    45      2 0x00000008
[  255.056063] Workqueue: events_freezable_power_ disk_events_workfn
[  255.057715]  ffff88007f805680 ffff88007c55f6d0 ffffffff8116463d ffff88007c55f758
[  255.059705]  ffff88007f82b870 ffff88007c55f6e0 ffffffff811646be ffff88007c55f710
[  255.061694]  ffffffff811bdaf0 ffff88007f82b870 0000000000000400 0000000000000000
[  255.063690] Call Trace:
[  255.064664]  [<ffffffff8116463d>] ? __list_lru_count_one.isra.4+0x1d/0x80
[  255.066428]  [<ffffffff811646be>] ? list_lru_count_one+0x1e/0x20
[  255.068063]  [<ffffffff811bdaf0>] ? super_cache_count+0x50/0xd0
[  255.069666]  [<ffffffff8114ecf6>] ? shrink_slab.part.38+0xf6/0x2a0
[  255.071313]  [<ffffffff81151f78>] ? shrink_zone+0x2c8/0x2e0
[  255.072845]  [<ffffffff81152316>] ? do_try_to_free_pages+0x156/0x6d0
[  255.074527]  [<ffffffff810bc6b6>] ? mark_held_locks+0x66/0x90
[  255.076085]  [<ffffffff816ca797>] ? _raw_spin_unlock_irq+0x27/0x40
[  255.077727]  [<ffffffff810bc7d9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  255.079451]  [<ffffffff81152924>] ? try_to_free_pages+0x94/0xc0
[  255.081045]  [<ffffffff81145b4a>] ? __alloc_pages_nodemask+0x72a/0xdb0
[  255.082761]  [<ffffffff8118cd06>] ? alloc_pages_current+0x96/0x1b0
[  255.084407]  [<ffffffff8133985d>] ? bio_alloc_bioset+0x20d/0x2d0
[  255.086032]  [<ffffffff8133aba4>] ? bio_copy_kern+0xc4/0x180
[  255.087584]  [<ffffffff81344f20>] ? blk_rq_map_kern+0x70/0x130
[  255.089161]  [<ffffffff814a334d>] ? scsi_execute+0x12d/0x160
[  255.090696]  [<ffffffff814a3474>] ? scsi_execute_req_flags+0x84/0xf0
[  255.092466]  [<ffffffff814b55f2>] ? sr_check_events+0xb2/0x2a0
[  255.094042]  [<ffffffff814c3223>] ? cdrom_check_events+0x13/0x30
[  255.095634]  [<ffffffff814b5a35>] ? sr_block_check_events+0x25/0x30
[  255.097278]  [<ffffffff813501fb>] ? disk_check_events+0x5b/0x150
[  255.098865]  [<ffffffff81350307>] ? disk_events_workfn+0x17/0x20
[  255.100451]  [<ffffffff810890b5>] ? process_one_work+0x1a5/0x420
[  255.102046]  [<ffffffff81089051>] ? process_one_work+0x141/0x420
[  255.103625]  [<ffffffff8108944b>] ? worker_thread+0x11b/0x490
[  255.105159]  [<ffffffff816c4e95>] ? __schedule+0x315/0xac0
[  255.106643]  [<ffffffff81089330>] ? process_one_work+0x420/0x420
[  255.108217]  [<ffffffff8108f4e9>] ? kthread+0xf9/0x110
[  255.109634]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
[  255.111307]  [<ffffffff816cb35f>] ? ret_from_fork+0x3f/0x70
[  255.112785]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  273.930846] Showing busy workqueues and worker pools:
[  273.932299] workqueue events: flags=0x0
[  273.933465]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  273.935120]     pending: vmpressure_work_fn, vmstat_shepherd, vmstat_update, vmw_fb_dirty_flush [vmwgfx]
[  273.937489] workqueue events_freezable: flags=0x4
[  273.938795]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.940446]     pending: vmballoon_work [vmw_balloon]
[  273.941973] workqueue events_power_efficient: flags=0x80
[  273.943491]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.945167]     pending: check_lifetime
[  273.946422] workqueue events_freezable_power_: flags=0x84
[  273.947890]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.949579]     in-flight: 45:disk_events_workfn
[  273.951103] workqueue ipv6_addrconf: flags=0x8
[  273.952447]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/1
[  273.954121]     pending: addrconf_verify_work
[  273.955541] workqueue xfs-reclaim/sda1: flags=0x4
[  273.957036]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.958847]     pending: xfs_reclaim_worker
[  273.960392] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=3 idle: 186 26
----------------------------------------

Three approaches are proposed for fixing this silent livelock problem.

 (1) Use zone_page_state_snapshot() instead of zone_page_state()
     when doing zone_reclaimable() checks (a simplified sketch of this
     helper follows this list). This approach is clear, straightforward
     and easy to backport. So far I cannot reproduce this livelock using
     this change. But there might be more locations which should use
     zone_page_state_snapshot().

 (2) Use a dedicated workqueue for the vmstat_update item which is
     guaranteed to be processed immediately. So far I cannot reproduce
     this livelock using a dedicated workqueue created with
     WQ_MEM_RECLAIM|WQ_HIGHPRI (patch proposed by Christoph Lameter).
     But according to Tejun Heo, if we want to guarantee that nobody can
     reproduce this livelock, we need to modify the workqueue API,
     because commit 3270476a6c0c ("workqueue: reimplement WQ_HIGHPRI
     using a separate worker_pool"), which went into Linux 3.6, lost
     that guarantee.

 (3) Use a !TASK_RUNNING sleep on the page allocator side. This approach
     is easy to backport. So far I cannot reproduce this livelock using
     this approach. And I think that nobody can reproduce this livelock,
     because this changes the page allocator to obey the workqueue's
     expectations. Even leaving this livelock problem aside, not
     entering the !TASK_RUNNING state for too long exclusively occupies
     the workqueue and makes other items in the workqueue needlessly
     deferred. We don't need to defer other items which do not invoke a
     __GFP_WAIT allocation.
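
As a reference for approach (1), a simplified sketch of the
zone_page_state_snapshot() helper (SMP case only; the !SMP stub and
surrounding #ifdefs are omitted), which folds the pending per-cpu
differentials into the value it returns:

	static unsigned long zone_page_state_snapshot(struct zone *zone,
					enum zone_stat_item item)
	{
		long x = atomic_long_read(&zone->vm_stat[item]);
		int cpu;

		for_each_online_cpu(cpu)
			x += per_cpu_ptr(zone->pageset, cpu)->vm_stat_diff[item];

		if (x < 0)
			x = 0;
		return x;
	}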

This patch implements approach (3) by inserting an uninterruptible sleep
on the page allocator side before retrying, in order to make sure that
other workqueue items (especially the vmstat_update item) are given a
chance to be processed.

Although it is a different problem, approach (3) also alleviates
needlessly burning CPU cycles when we hit the OOM-killer livelock problem
(a hang after the OOM-killer messages are printed, because the OOM victim
cannot terminate due to a dependency).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c   |  8 +-------
 mm/page_alloc.c | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d13a339..877b5a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -722,15 +722,9 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (p && p != (void *)-1UL) {
+	if (p && p != (void *)-1UL)
 		oom_kill_process(oc, p, points, totalpages, NULL,
 				 "Out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
-	}
 	return true;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3687f4c..047ebda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2726,7 +2726,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -3385,6 +3384,15 @@ retry:
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+		/*
+		 * Give other workqueue items (especially vmstat_update item)
+		 * a chance to be processed. There is no need to wait if I was
+		 * chosen by the OOM killer, for I will leave this function
+		 * using ALLOC_NO_WATERMARKS. But I need to wait even if I have
+		 * SIGKILL pending, for I can't leave this function.
+		 */
+		if (!test_thread_flag(TIF_MEMDIE))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 
@@ -3394,8 +3402,15 @@ retry:
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		/*
+		 * Give the OOM victim a chance to leave this function
+		 * before trying to allocate memory again.
+		 */
+		if (!test_thread_flag(TIF_MEMDIE))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
+	}
 
 noretry:
 	/*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
@ 2015-10-25 10:52                                         ` Tetsuo Handa
  0 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-10-25 10:52 UTC (permalink / raw)
  To: mhocko, cl, htejun
  Cc: linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker, akpm,
	hannes, vdavydov, skozina, mgorman, riel

Tejun Heo wrote:
> If this is an actual problem, a better approach would be something
> which detects the stall condition and kicks off the next work item but
> if we do that I think I'd still trigger a warning there.  I don't
> know.  Don't go busy waiting in kernel.

Busy waiting in kernel refers several cases.

  (1) Wait for something with interrupts disabled.

  (2) Wait for something with interrupts enabled but
      without calling cond_resched() etc.

  (3) Wait for something with interrupts enabled and
      with calling cond_resched() etc.

  (4) Wait for something with interrupts enabled and
      with calling schedule_timeout() etc.

Kernel code tries to minimize (1). Kernel code does (2) if they are
not allowed to sleep. But kernel code is allowed to do (3) if they
are allowed to sleep, as long as cond_resched() is sometimes called.
And currently page allocator does (3). But kernel code invoked via
workqueue is expected to do (4) than (3).

This means that any kernel code which invokes a __GFP_WAIT allocation
might fail to do (4) when invoked via workqueue, regardless of flags
passed to alloc_workqueue()?

Michal Hocko wrote:
> On Fri 23-10-15 06:42:43, Tetsuo Handa wrote:
> > Tejun Heo wrote:
> > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > > > I am confused. What makes rescuer to not run? Nothing seems to be
> > > > hogging CPUs, we are just out of workers which are loopin in the
> > > > allocator but that is preemptible context.
> > > 
> > > It's concurrency management.  Workqueue thinks that the pool is making
> > > positive forward progress and doesn't schedule anything else for
> > > execution while that work item is burning cpu cycles.
> > 
> > Then, isn't below change easier to backport which will also alleviate
> > needlessly burning CPU cycles?
> 
> This is quite obscure. If the vmstat_update fix needs workqueue tweaks
> as well then I would vote for your original patch which is clear,
> straightforward and easy to backport.

I think that inserting a short sleep into page allocator is better
because the vmstat_update fix will not require workqueue tweaks if
we sleep inside page allocator. Also, from the point of view of
protecting page allocator from going unresponsive when hundreds of tasks
started busy-waiting at __alloc_pages_slowpath() because we can observe
that XXX value in the "MemAlloc-Info: XXX stalling task," line grows
when we are unable to make forward progress.

----------------------------------------
>From a2f34850c26b5bb124d44983f5a2020b51249d53 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Sun, 25 Oct 2015 19:42:15 +0900
Subject: [PATCH] mm,page_alloc: Insert an uninterruptible sleep before
 retrying.

Since "struct zone"->vm_stat[] is array of atomic_long_t, an effort
to reduce frequency of updating values in vm_stat[] is made by using
per cpu variables "struct per_cpu_pageset"->vm_stat_diff[].
Values in vm_stat_diff[] are merged into vm_stat[] periodically
using vmstat_update workqueue item (struct delayed_work vmstat_work).

When a task attempted to allocate memory and reached direct reclaim
path, shrink_zones() checks whether there are reclaimable pages by
calling zone_reclaimable(). zone_reclaimable() makes decision based
on values in vm_stat[] by calling zone_page_state(). This is usually
fine because values in vm_stat_diff[] are expected to be merged into
vm_stat[] shortly.

But workqueue and page allocator have different assumptions.

  (A) The workqueue defers processing of other items unless currently
      in-flight item enters into !TASK_RUNNING state.

  (B) The page allocator never enters into !TASK_RUNNING state if there
      is nothing to reclaim. (The page allocator calls cond_resched()
      via wait_iff_congested(), but cond_resched() does not make the
      task enter into !TASK_RUNNING state.)

Therefore, if a workqueue item which is processed before vmstat_update
item is processed got stuck inside memory allocation request, values in
vm_stat_diff[] cannot be merged into vm_stat[].

As a result, zone_reclaimable() continues using outdated vm_stat[] values
and the task which is doing direct reclaim path thinks that there are
still reclaimable pages and therefore continues looping.

The consequence is a silent livelock (hang up without any kernel messages)
because the OOM killer will not be invoked. We can hit such livelock by
e.g. disk_events_workfn workqueue item doing memory allocation from
bio_copy_kern().

----------------------------------------
[  255.054205] kworker/3:1     R  running task        0    45      2 0x00000008
[  255.056063] Workqueue: events_freezable_power_ disk_events_workfn
[  255.057715]  ffff88007f805680 ffff88007c55f6d0 ffffffff8116463d ffff88007c55f758
[  255.059705]  ffff88007f82b870 ffff88007c55f6e0 ffffffff811646be ffff88007c55f710
[  255.061694]  ffffffff811bdaf0 ffff88007f82b870 0000000000000400 0000000000000000
[  255.063690] Call Trace:
[  255.064664]  [<ffffffff8116463d>] ? __list_lru_count_one.isra.4+0x1d/0x80
[  255.066428]  [<ffffffff811646be>] ? list_lru_count_one+0x1e/0x20
[  255.068063]  [<ffffffff811bdaf0>] ? super_cache_count+0x50/0xd0
[  255.069666]  [<ffffffff8114ecf6>] ? shrink_slab.part.38+0xf6/0x2a0
[  255.071313]  [<ffffffff81151f78>] ? shrink_zone+0x2c8/0x2e0
[  255.072845]  [<ffffffff81152316>] ? do_try_to_free_pages+0x156/0x6d0
[  255.074527]  [<ffffffff810bc6b6>] ? mark_held_locks+0x66/0x90
[  255.076085]  [<ffffffff816ca797>] ? _raw_spin_unlock_irq+0x27/0x40
[  255.077727]  [<ffffffff810bc7d9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[  255.079451]  [<ffffffff81152924>] ? try_to_free_pages+0x94/0xc0
[  255.081045]  [<ffffffff81145b4a>] ? __alloc_pages_nodemask+0x72a/0xdb0
[  255.082761]  [<ffffffff8118cd06>] ? alloc_pages_current+0x96/0x1b0
[  255.084407]  [<ffffffff8133985d>] ? bio_alloc_bioset+0x20d/0x2d0
[  255.086032]  [<ffffffff8133aba4>] ? bio_copy_kern+0xc4/0x180
[  255.087584]  [<ffffffff81344f20>] ? blk_rq_map_kern+0x70/0x130
[  255.089161]  [<ffffffff814a334d>] ? scsi_execute+0x12d/0x160
[  255.090696]  [<ffffffff814a3474>] ? scsi_execute_req_flags+0x84/0xf0
[  255.092466]  [<ffffffff814b55f2>] ? sr_check_events+0xb2/0x2a0
[  255.094042]  [<ffffffff814c3223>] ? cdrom_check_events+0x13/0x30
[  255.095634]  [<ffffffff814b5a35>] ? sr_block_check_events+0x25/0x30
[  255.097278]  [<ffffffff813501fb>] ? disk_check_events+0x5b/0x150
[  255.098865]  [<ffffffff81350307>] ? disk_events_workfn+0x17/0x20
[  255.100451]  [<ffffffff810890b5>] ? process_one_work+0x1a5/0x420
[  255.102046]  [<ffffffff81089051>] ? process_one_work+0x141/0x420
[  255.103625]  [<ffffffff8108944b>] ? worker_thread+0x11b/0x490
[  255.105159]  [<ffffffff816c4e95>] ? __schedule+0x315/0xac0
[  255.106643]  [<ffffffff81089330>] ? process_one_work+0x420/0x420
[  255.108217]  [<ffffffff8108f4e9>] ? kthread+0xf9/0x110
[  255.109634]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
[  255.111307]  [<ffffffff816cb35f>] ? ret_from_fork+0x3f/0x70
[  255.112785]  [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[  273.930846] Showing busy workqueues and worker pools:
[  273.932299] workqueue events: flags=0x0
[  273.933465]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  273.935120]     pending: vmpressure_work_fn, vmstat_shepherd, vmstat_update, vmw_fb_dirty_flush [vmwgfx]
[  273.937489] workqueue events_freezable: flags=0x4
[  273.938795]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.940446]     pending: vmballoon_work [vmw_balloon]
[  273.941973] workqueue events_power_efficient: flags=0x80
[  273.943491]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.945167]     pending: check_lifetime
[  273.946422] workqueue events_freezable_power_: flags=0x84
[  273.947890]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.949579]     in-flight: 45:disk_events_workfn
[  273.951103] workqueue ipv6_addrconf: flags=0x8
[  273.952447]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/1
[  273.954121]     pending: addrconf_verify_work
[  273.955541] workqueue xfs-reclaim/sda1: flags=0x4
[  273.957036]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  273.958847]     pending: xfs_reclaim_worker
[  273.960392] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=3 idle: 186 26
----------------------------------------

Three approaches are proposed for fixing this silent livelock problem.

 (1) Use zone_page_state_snapshot() instead of zone_page_state()
     when doing zone_reclaimable() checks. This approach is clear,
     straightforward and easy to backport. So far I cannot reproduce
     this livelock using this change. But there might be more locations
     which should use zone_page_state_snapshot().

 (2) Use a dedicated workqueue for vmstat_update item which is guaranteed
     to be processed immediately. So far I cannot reproduce this livelock
     using a dedicated workqueue created with WQ_MEM_RECLAIM|WQ_HIGHPRI
     (patch proposed by Christoph Lameter). But according to Tejun Heo,
     if we want to guarantee that nobody can reproduce this livelock, we
     need to modify workqueue API because commit 3270476a6c0c ("workqueue:
     reimplement WQ_HIGHPRI using a separate worker_pool") which went to
     Linux 3.6 lost the guarantee.

 (3) Use a !TASK_RUNNING sleep inside page allocator side. This approach
     is easy to backport. So far I cannot reproduce this livelock using
     this approach. And I think that nobody can reproduce this livelock
     because this changes the page allocator to obey the workqueue's
     expectations. Even if we leave this livelock problem aside, not
     entering into !TASK_RUNNING state for too long is an exclusive
     occupation of workqueue which will make other items in the workqueue
     needlessly deferred. We don't need to defer other items which do not
     invoke a __GFP_WAIT allocation.

This patch does approach (3), by inserting an uninterruptible sleep into
page allocator side before retrying, in order to make sure that other
workqueue items (especially vmstat_update item) are given a chance to be
processed.

Although it is a different problem, approach (3) also alleviates
needless burning of CPU cycles when we hit the OOM-killer livelock
problem (a hang after the OOM-killer messages are printed, because the
OOM victim cannot terminate due to a dependency).

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 mm/oom_kill.c   |  8 +-------
 mm/page_alloc.c | 19 +++++++++++++++++--
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d13a339..877b5a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -722,15 +722,9 @@ bool out_of_memory(struct oom_control *oc)
 		dump_header(oc, NULL, NULL);
 		panic("Out of memory and no killable processes...\n");
 	}
-	if (p && p != (void *)-1UL) {
+	if (p && p != (void *)-1UL)
 		oom_kill_process(oc, p, points, totalpages, NULL,
 				 "Out of memory");
-		/*
-		 * Give the killed process a good chance to exit before trying
-		 * to allocate memory again.
-		 */
-		schedule_timeout_killable(1);
-	}
 	return true;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3687f4c..047ebda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2726,7 +2726,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (!mutex_trylock(&oom_lock)) {
 		*did_some_progress = 1;
-		schedule_timeout_uninterruptible(1);
 		return NULL;
 	}
 
@@ -3385,6 +3384,15 @@ retry:
 	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
 		/* Wait for some write requests to complete then retry */
 		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+		/*
+		 * Give other workqueue items (especially vmstat_update item)
+		 * a chance to be processed. There is no need to wait if I was
+		 * chosen by the OOM killer, for I will leave this function
+		 * using ALLOC_NO_WATERMARKS. But I need to wait even if I have
+		 * SIGKILL pending, for I can't leave this function.
+		 */
+		if (!test_thread_flag(TIF_MEMDIE))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
 	}
 
@@ -3394,8 +3402,15 @@ retry:
 		goto got_pg;
 
 	/* Retry as long as the OOM killer is making progress */
-	if (did_some_progress)
+	if (did_some_progress) {
+		/*
+		 * Give the OOM victim a chance to leave this function
+		 * before trying to allocate memory again.
+		 */
+		if (!test_thread_flag(TIF_MEMDIE))
+			schedule_timeout_uninterruptible(1);
 		goto retry;
+	}
 
 noretry:
 	/*
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-25 10:52                                         ` Tetsuo Handa
@ 2015-10-25 22:47                                           ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-25 22:47 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

Hello,

On Sun, Oct 25, 2015 at 07:52:59PM +0900, Tetsuo Handa wrote:
...
> This means that any kernel code which invokes a __GFP_WAIT allocation
> might fail to do (4) when invoked via workqueue, regardless of flags
> passed to alloc_workqueue()?

Sounds that way and yeah (3) should technically be okay and that's why
HIGHPRI was implemented the way it was at the beginning; however, in
practice, this is the first time it's become noticeable in all these
years.  I think it comes down to the fact that there just aren't many
places which need such looping behavior and even in those places it's
often very undesirable to busy-loop while not making forward progress
(and if forward progress is being made, it won't be indefinite).

> I think that inserting a short sleep into the page allocator is better
> because the vmstat_update fix will not require workqueue tweaks if
> we sleep inside the page allocator. It also protects the page
> allocator from going unresponsive when hundreds of tasks start
> busy-waiting at __alloc_pages_slowpath(); we can observe that the XXX
> value in the "MemAlloc-Info: XXX stalling task," line grows when we
> are unable to make forward progress.

This looks good to me too; however, it still needs a dedicated
workqueue with WQ_MEM_RECLAIM set.  That deadlock is probably very
unlikely, as the side effect of vmstat failing to execute due to worker
exhaustion is more memory reclaim, but it still is theoretically
possible and it could just be that it happens at low enough frequency
that it hasn't been reported yet.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23 18:21                                     ` Tejun Heo
@ 2015-10-27  9:16                                       ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-27  9:16 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Sat 24-10-15 03:21:09, Tejun Heo wrote:
> Hello,
> 
> On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > > The problem here is not lack
> > > of execution resource but concurrency management misunderstanding the
> > > situation. 
> > 
> > And this sounds like a bug to me.
> 
> I don't know.  It can be argued either way, the other direction being
> that a kernel thread going RUNNING non-stop is buggy.  Given how this
> has been a complete non-issue for all these years, I'm not sure how
> useful plugging this is.

Well, I guess we haven't noticed because this is a pathological case. It
also triggers OOM livelocks which were not reported in the past either.
You do not reach this state normally unless you really _want_ to kill
your machine.

And vmstat is not the only instance. E.g. the sysrq oom trigger is
known to stay behind in similar cases. It should be changed to a
dedicated WQ_MEM_RECLAIM wq and it would require a runnable-item
guarantee as well.

> > Don't we have some IO related paths which would suffer from the same
> > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> > name I would expect they _do_ participate in the reclaim and so they
> > should be able to make a progress. Now if your new IMMEDIATE flag will
> 
> Seriously, nobody goes full-on RUNNING.

Looping with cond_resched seems like a general pattern in the kernel
when there is no clear source to wait for. We have io_schedule when we
know we should wait for IO (in case of congestion) but this is not
necessarily the case - as you can see here. What should we wait for? A
short nap without actually waiting on anything sounds like a dirty
workaround to me.

> > guarantee that then I would argue that it should be implicit for
> > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> > be a counter argument for doing that?
> 
> Not serving any actual purpose and degrading execution behavior.

I dunno, I am not familiar with WQ internals to see the risks but to me
it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety
wrt. memory pressure and as demonstrated it doesn't do that. Even if you
consider the cond_resched behavior of the page allocator a bug, we
should be able to handle this gracefully.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-25 10:52                                         ` Tetsuo Handa
@ 2015-10-27  9:22                                           ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-27  9:22 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: cl, htejun, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Sun 25-10-15 19:52:59, Tetsuo Handa wrote:
[...]
> Three approaches are proposed for fixing this silent livelock problem.
> 
>  (1) Use zone_page_state_snapshot() instead of zone_page_state()
>      when doing zone_reclaimable() checks. This approach is clear,
>      straightforward and easy to backport. So far I cannot reproduce
>      this livelock using this change. But there might be more locations
>      which should use zone_page_state_snapshot().
> 
>  (2) Use a dedicated workqueue for the vmstat_update item which is
>      guaranteed to be processed immediately. So far I cannot reproduce
>      this livelock using a dedicated workqueue created with
>      WQ_MEM_RECLAIM|WQ_HIGHPRI (patch proposed by Christoph Lameter).
>      But according to Tejun Heo, if we want to guarantee that nobody can
>      reproduce this livelock, we need to modify the workqueue API,
>      because commit 3270476a6c0c ("workqueue: reimplement WQ_HIGHPRI
>      using a separate worker_pool"), which went into Linux 3.6, lost
>      that guarantee.
> 
>  (3) Use a !TASK_RUNNING sleep on the page allocator side. This approach
>      is easy to backport. So far I cannot reproduce this livelock using
>      this approach. And I think that nobody can reproduce this livelock,
>      because this changes the page allocator to obey the workqueue's
>      expectations. Even leaving this livelock problem aside, not
>      entering !TASK_RUNNING state for too long means exclusively
>      occupying a workqueue, which needlessly defers the other items on
>      it. We don't need to defer other items which do not invoke a
>      __GFP_WAIT allocation.
> 
> This patch does approach (3), by inserting an uninterruptible sleep on
> the page allocator side before retrying, in order to make sure that
> other workqueue items (especially the vmstat_update item) are given a
> chance to be processed.
> 
> Although it is a different problem, approach (3) also alleviates
> needless burning of CPU cycles when we hit the OOM-killer livelock
> problem (a hang after the OOM-killer messages are printed, because the
> OOM victim cannot terminate due to a dependency).

I really dislike this approach. Waiting without having an event to
wait for is just too ugly. I think 1) is easiest to backport to
stable kernels without causing any other regressions. 2) is the way
to move forward for next kernels and we should really think whether
WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
general consensus that there are legitimate WQ_MEM_RECLAIM users which
can do without the other flag then I am perfectly OK to use it for
vmstat and oom sysrq dedicated workqueues.

> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-27  9:16                                       ` Michal Hocko
@ 2015-10-27 10:52                                         ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-27 10:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

Hello, Michal.

On Tue, Oct 27, 2015 at 10:16:03AM +0100, Michal Hocko wrote:
> > Seriously, nobody goes full-on RUNNING.
> 
> Looping with cond_resched seems like a general pattern in the kernel
> when there is no clear source to wait for. We have io_schedule when we
> know we should wait for IO (in case of congestion) but this is not
> necessarily the case - as you can see here. What should we wait for? A
> short nap without actually waiting on anything sounds like a dirty
> workaround to me.

It's one thing to do cond_resched() in long loops to avoid long
priority inversions and another to indefinitely loop without making
any difference.

> > > guarantee that then I would argue that it should be implicit for
> > > WQ_MEM_RECLAIM otherwise we always risk a similar situation. What would
> > > be a counter argument for doing that?
> > 
> > Not serving any actual purpose and degrading execution behavior.
> 
> I dunno, I am not familiar with WQ internals to see the risks but to me
> it sounds like WQ_MEM_RECLAIM gives an incorrect impression of safety
> wrt. memory pressure and as demonstrated it doesn't do that. Even if you

It generally does.  This is an extremely rare corner case where an
infinite loop w/o forward progress is introduced w/o the user being
outright buggy.

> consider the cond_resched behavior of the page allocator a bug, we
> should be able to handle this gracefully.

We can argue this back and forth forever but we'll either need to
special case it (be it a short sleep or a special flag) or implement
rather complex detection logic whose practical usefulness is dubious.
It's a trade-off and given the circumstances adding a short sleep looks
like a reasonable one to me.  If this turns out to be more common, we
definitely wanna go for automatic detection.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-27  9:22                                           ` Michal Hocko
@ 2015-10-27 10:55                                             ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-27 10:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, cl, linux-mm, linux-kernel, torvalds, rientjes,
	oleg, kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote:
...
> stable kernels without causing any other regressions. 2) is the way
> to move forward for next kernels and we should really think whether
> WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
> general consensus that there are legitimate WQ_MEM_RECLAIM users which
> can do without the other flag then I am perfectly OK to use it for
> vmstat and oom sysrq dedicated workqueues.

I don't think flagging these things is a good approach.  These are too
easy to miss.  If this is a problem which needs to be solved, which
I'm not convinced it is at this point, the right thing to do would be
to do stall detection and kick the next work item automatically.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable()checks
  2015-10-27  9:16                                       ` Michal Hocko
@ 2015-10-27 11:07                                         ` Tetsuo Handa
  -1 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-10-27 11:07 UTC (permalink / raw)
  To: mhocko, htejun
  Cc: cl, linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker,
	akpm, hannes, vdavydov, skozina, mgorman, riel

Michal Hocko wrote:
> > On Fri, Oct 23, 2015 at 01:11:45PM +0200, Michal Hocko wrote:
> > > > The problem here is not lack
> > > > of execution resource but concurrency management misunderstanding the
> > > > situation. 
> > > 
> > > And this sounds like a bug to me.
> > 
> > I don't know.  It can be argued either way, the other direction being
> > that a kernel thread going RUNNING non-stop is buggy.  Given how this
> > has been a complete non-issue for all these years, I'm not sure how
> > useful plugging this is.
> 
> Well, I guess we haven't noticed because this is a pathological case. It
> also triggers OOM livelocks which were not reported in the past either.
> You do not reach this state normally unless you really _want_ to kill
> your machine.

I don't think we can say this is a pathological case. Customers' servers
might have hit this state. We have no code that warns about this state.

> 
> And vmstat is not the only instance. E.g. the sysrq oom trigger is
> known to stay behind in similar cases. It should be changed to a
> dedicated WQ_MEM_RECLAIM wq and it would require a runnable-item
> guarantee as well.
> 

Well, this seems to be the cause of SysRq-f being unresponsive...
http://lkml.kernel.org/r/201411231349.CAG78628.VFQFOtOSFJMOLH@I-love.SAKURA.ne.jp

Picking up from http://lkml.kernel.org/r/201506112212.JAG26531.FLSVFMOQJOtOHF@I-love.SAKURA.ne.jp
----------
[  515.536393] Showing busy workqueues and worker pools:
[  515.538185] workqueue events: flags=0x0
[  515.539758]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[  515.541872]     pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[  515.546684] workqueue events_power_efficient: flags=0x80
[  515.548589]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[  515.550829]     pending: neigh_periodic_work, check_lifetime
[  515.552884] workqueue events_freezable_power_: flags=0x84
[  515.554742]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[  515.556846]     in-flight: 3837:disk_events_workfn
[  515.558665] workqueue writeback: flags=0x4e
[  515.560291]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[  515.562271]     in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[  515.564544] workqueue xfs-data/sda1: flags=0xc
[  515.566265]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[  515.568359]     in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[  515.571018]   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[  515.573113]     in-flight: 179:xfs_end_io
[  515.574782] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[  515.577230] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[  515.579488] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
----------
Do we want an immediate execution guarantee not only for vmstat_update
and moom_callback but also for vmstat_shepherd and console_callback?

> > > Don't we have some IO related paths which would suffer from the same
> > > problem. I haven't checked all the WQ_MEM_RECLAIM users but from the
> > > name I would expect they _do_ participate in the reclaim and so they
> > > should be able to make a progress. Now if your new IMMEDIATE flag will
> > 
> > Seriously, nobody goes full-on RUNNING.
> 
> Looping with cond_resched seems like a general pattern in the kernel
> when there is no clear source to wait for. We have io_schedule when we
> know we should wait for IO (in case of congestion) but this is not
> necessarily the case - as you can see here. What should we wait for? A
> short nap without actually waiting on anything sounds like a dirty
> workaround to me.

Can't we have a waitqueue like
http://lkml.kernel.org/r/201510142121.IDE86954.SOVOFFQOFMJHtL@I-love.SAKURA.ne.jp ?

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable()checks
  2015-10-27 11:07                                         ` Tetsuo Handa
@ 2015-10-27 11:30                                           ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-10-27 11:30 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Tue, Oct 27, 2015 at 08:07:38PM +0900, Tetsuo Handa wrote:
> Can't we have a waitqueue like
> http://lkml.kernel.org/r/201510142121.IDE86954.SOVOFFQOFMJHtL@I-love.SAKURA.ne.jp ?

There's no reason to complicate it.  It wouldn't buy anything
meaningful.  Can we please stop trying to solve a non-existent
problem?

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-27 10:55                                             ` Tejun Heo
@ 2015-10-27 12:07                                               ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-10-27 12:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Tetsuo Handa, cl, linux-mm, linux-kernel, torvalds, rientjes,
	oleg, kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Tue 27-10-15 19:55:06, Tejun Heo wrote:
> On Tue, Oct 27, 2015 at 10:22:31AM +0100, Michal Hocko wrote:
> ...
> > stable kernels without causing any other regressions. 2) is the way
> > to move forward for next kernels and we should really think whether
> > WQ_MEM_RECLAIM should imply also WQ_HIGHPRI by default. If there is a
> > general consensus that there are legitimate WQ_MEM_RECLAIM users which
> > can do without the other flag then I am perfectly OK to use it for
> > vmstat and oom sysrq dedicated workqueues.
> 
> I don't think flagging these things is a good approach.  These are too
> easy to miss.  If this is a problem which needs to be solved, which
> I'm not convinced it is at this point, the right thing to do would be
> to do stall detection and kick the next work item automatically.

To be honest, I do not really care whether this gets "fixed" in the
stall detection code or by making WQ_MEM_RECLAIM imply a special
behavior implicitly. All I would like to see is a guarantee that such
workqueues do not stay behind just because all current workers are in
the allocator. Adding artificial schedule_timeouts in the allocator is
a fragile way to work around the issue.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-10-23  4:26                                 ` Tejun Heo
@ 2015-11-02 15:01                                   ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-11-02 15:01 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Fri 23-10-15 13:26:49, Tejun Heo wrote:
> Hello,
> 
> So, something like the following.  Just compile tested but this is
> essentially partial revert of 3270476a6c0c ("workqueue: reimplement
> WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
> WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.
> If for some reason, it gets decided against simply adding one jiffy
> sleep, please let me know.  I'll verify the operation and post a
> proper patch.  That said, given that this prolly needs -stable
> backport and vmstat is likely to be the only user (busy loops are
> really rare in the kernel after all), I think the better approach
> would be reinstating the short sleep.

As already pointed out I really detest a short sleep and would prefer
a way to tell the WQ what we really need. vmstat is not the only user.
OOM sysrq will need this special treatment as well. While
zone_reclaimable can be fixed by an easy patch
(http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp),
which is perfectly suited for a stable backport, OOM sysrq, or indeed
any sysrq which runs from WQ context, should be as robust as possible
and shouldn't rely on all the code running from WQ context to issue a
sleep to get unstuck. So I definitely support something like this patch.

I am still not sure whether other WQ_MEM_RECLAIM users need this flag
as well, because I am not familiar with their implementations, but at
least vmstat and sysrq should use it, and it should be safe to do so
without risk of breaking anything AFAICS.
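
For the sysrq case that would mean something like the sketch below (the
moom_wq name is made up here, and WQ_IMMEDIATE so far exists only in
your compile-tested patch):
----------
static struct workqueue_struct *moom_wq;

static int __init moom_wq_init(void)
{
	/* WQ_IMMEDIATE only exists in the compile-tested patch above. */
	moom_wq = alloc_workqueue("moom",
				  WQ_MEM_RECLAIM | WQ_IMMEDIATE, 0);
	return moom_wq ? 0 : -ENOMEM;
}
core_initcall(moom_wq_init);

static void sysrq_handle_moom(int key)
{
	queue_work(moom_wq, &moom_work);	/* was schedule_work() */
}
----------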

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-02 15:01                                   ` Michal Hocko
@ 2015-11-02 19:20                                     ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-11-02 19:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, David Rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Mon, Nov 02, 2015 at 04:01:37PM +0100, Michal Hocko wrote:
...
> which is perfectly suited for a stable backport, OOM sysrq, or indeed
> any sysrq which runs from WQ context, should be as robust as possible
> and shouldn't rely on all the code running from WQ context to issue a
> sleep to get unstuck. So I definitely support something like this patch.

Well, sysrq wouldn't run successfully either on a cpu which is busy
looping with preemption off.  I don't think this calls for a new flag
to modify workqueue behavior, especially given that missing such a flag
would lead to the same kind of lockup.  It's a shitty solution.  If
the possibility of sysrq getting stuck behind concurrency management
is an issue, queueing them on an unbound or highpri workqueue should
be good enough.
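
(Concretely, that would be a one-line change in drivers/tty/sysrq.c,
e.g.

	queue_work(system_unbound_wq, &moom_work);

instead of schedule_work(&moom_work); unbound workqueues bypass the
per-cpu concurrency management entirely.)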

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-02 19:20                                     ` Tejun Heo
@ 2015-11-03  2:32                                       ` Tetsuo Handa
  -1 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-11-03  2:32 UTC (permalink / raw)
  To: mhocko
  Cc: htejun, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

Tejun Heo wrote:
>                                                                  If
> the possibility of sysrq getting stuck behind concurrency management
> is an issue, queueing them on an unbound or highpri workqueue should
> be good enough.

Regarding SysRq-f, we could do something like the patch below. Though I
think that converting the OOM killer into a dedicated kernel thread
would allow us to do more things (e.g. Oleg's memory zapping code, my
timeout-based next victim selection).

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 5381a72..46b951aa 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/of.h>
 #include <linux/rcupdate.h>
+#include <linux/kthread.h>
 
 #include <asm/ptrace.h>
 #include <asm/irq_regs.h>
@@ -351,27 +352,35 @@ static struct sysrq_key_op sysrq_term_op = {
 	.enable_mask	= SYSRQ_ENABLE_SIGNAL,
 };
 
-static void moom_callback(struct work_struct *ignored)
+static DECLARE_WAIT_QUEUE_HEAD(moom_wait);
+
+static int moom_callback(void *unused)
 {
 	const gfp_t gfp_mask = GFP_KERNEL;
-	struct oom_control oc = {
-		.zonelist = node_zonelist(first_memory_node, gfp_mask),
-		.nodemask = NULL,
-		.gfp_mask = gfp_mask,
-		.order = -1,
-	};
-
-	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
-		pr_info("OOM request ignored because killer is disabled\n");
-	mutex_unlock(&oom_lock);
+	DEFINE_WAIT(wait);
+
+	while (1) {
+		struct oom_control oc = {
+			.zonelist = node_zonelist(first_memory_node, gfp_mask),
+			.nodemask = NULL,
+			.gfp_mask = gfp_mask,
+			.order = -1,
+		};
+
+		prepare_to_wait(&moom_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&moom_wait, &wait);
+		mutex_lock(&oom_lock);
+		if (!out_of_memory(&oc))
+			pr_info("OOM request ignored because killer is disabled\n");
+		mutex_unlock(&oom_lock);
+	}
+	return 0;
 }
 
-static DECLARE_WORK(moom_work, moom_callback);
-
 static void sysrq_handle_moom(int key)
 {
-	schedule_work(&moom_work);
+	wake_up(&moom_wait);
 }
 static struct sysrq_key_op sysrq_moom_op = {
 	.handler	= sysrq_handle_moom,
@@ -1116,6 +1125,9 @@ static inline void sysrq_init_procfs(void)
 
 static int __init sysrq_init(void)
 {
+	struct task_struct *task = kthread_run(moom_callback, NULL,
+					       "manual_oom");
+	BUG_ON(IS_ERR(task));
 	sysrq_init_procfs();
 
 	if (sysrq_on())

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-03  2:32                                       ` Tetsuo Handa
@ 2015-11-03 19:43                                         ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-11-03 19:43 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, cl, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

Hello, Tetsuo.

On Tue, Nov 03, 2015 at 11:32:06AM +0900, Tetsuo Handa wrote:
> Tejun Heo wrote:
> >                                                                  If
> > the possibility of sysrq getting stuck behind concurrency management
> > is an issue, queueing them on an unbound or highpri workqueue should
> > be good enough.
> 
> Regarding SysRq-f, we could do like below. Though I think that converting
> the OOM killer into a dedicated kernel thread would allow more things to do
> (e.g. Oleg's memory zapping code, my timeout based next victim selection).

I'm not sure doing anything to sysrq-f is warranted.  If workqueue
can't make forward progress due to memory exhaustion, OOM will be
triggered anyway.  Getting stuck behind concurrency management isn't
that different a failure mode from getting stuck behind a busy loop
with preemption off.  We should just plug them at the source.  If
necessary, what we can do is add a stall watchdog (which can prolly
be combined with the usual watchdog) so that it can better point out
the culprit.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-02 15:01                                   ` Michal Hocko
@ 2015-11-05 14:59                                     ` Tetsuo Handa
  -1 siblings, 0 replies; 122+ messages in thread
From: Tetsuo Handa @ 2015-11-05 14:59 UTC (permalink / raw)
  To: mhocko, htejun
  Cc: cl, linux-mm, linux-kernel, torvalds, rientjes, oleg, kwalker,
	akpm, hannes, vdavydov, skozina, mgorman, riel

Michal Hocko wrote:
> As already pointed out I really detest a short sleep and would prefer
> a way to tell WQ what we really need. vmstat is not the only user. OOM
> sysrq will need this special treatment as well. While the
> zone_reclaimable can be fixed in an easy patch
> (http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp)
> which is perfectly suited for the stable backport, the OOM sysrq, or any
> sysrq which runs from WQ context, should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

I still prefer a short sleep from a different perspective.

I tested the above patch with the patch below applied

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0499ff..54bedd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2992,6 +2992,53 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+static atomic_t stall_tasks;
+
+static int kmallocwd(void *unused)
+{
+	struct task_struct *g, *p;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+
+ not_stalling: /* Healthy case. */
+	schedule_timeout_interruptible(HZ);
+	if (likely(!atomic_read(&stall_tasks)))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* Count stalling tasks, dying and victim tasks. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = atomic_read(&stall_tasks);
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	show_workqueue_state();
+	schedule_timeout_interruptible(10 * HZ);
+	if (atomic_read(&stall_tasks))
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0; /* To suppress "no return statement" compiler warning. */
+}
+
+static int __init start_kmallocwd(void)
+{
+	struct task_struct *task = kthread_run(kmallocwd, NULL,
+					       "kmallocwd");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_kmallocwd);
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
@@ -3004,6 +3051,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long start = jiffies;
+	bool stall_counted = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3095,6 +3144,11 @@ retry:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
+	if (!stall_counted && time_after(jiffies, start + 10 * HZ)) {
+		atomic_inc(&stall_tasks);
+		stall_counted = true;
+	}
+
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
@@ -3188,6 +3242,8 @@ noretry:
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
+	if (stall_counted)
+		atomic_dec(&stall_tasks);
 	return page;
 }
 
----------------------------------------

using a crazy stress program. (Not a TIF_MEMDIE stall.)

----------------------------------------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <signal.h>
#include <fcntl.h>

static void child(void)
{
	char *buf = NULL;
	unsigned long size = 0;
	const int fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	if (argc > 1) {
		int i;
		char buffer[4096];
		for (i = 0; i < 1000; i++) {
			if (fork() == 0) {
				sleep(20);
				memset(buffer, 0, sizeof(buffer));
				_exit(0);
			}
		}
		child();
		return 0;
	}
	signal(SIGCLD, SIG_IGN);
	while (1) {
		switch (fork()) {
		case 0:
			execl("/proc/self/exe", argv[0], "1", NULL);
			_exit(0);
		case -1:
			sleep(1);
		}
	}
	return 0;
}
----------------------------------------
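
(What the program does: the top-level process forks new instances of
itself in a tight loop, sleeping only when fork() fails. Each instance
forks 1000 children which sleep 20 seconds and then dirty a page of
stack, while the instance itself grows a realloc() buffer as large as
possible and then reads /dev/zero into it, faulting in the overcommitted
memory and forcing an OOM condition.)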

Note the intervals between OOM killer invocations.
(Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151105.txt.xz .)
----------------------------------------
[   74.260621] exe invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[   75.069510] exe invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
[   79.062507] exe invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[   80.464618] MemAlloc-Info: 459 stalling task, 0 dying task, 0 victim task.
[   90.482731] MemAlloc-Info: 699 stalling task, 0 dying task, 0 victim task.
[  100.503633] MemAlloc-Info: 3972 stalling task, 0 dying task, 0 victim task.
[  110.534937] MemAlloc-Info: 4097 stalling task, 0 dying task, 0 victim task.
[  120.535740] MemAlloc-Info: 4098 stalling task, 0 dying task, 0 victim task.
[  130.563961] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  140.593108] MemAlloc-Info: 4096 stalling task, 0 dying task, 0 victim task.
[  150.617960] MemAlloc-Info: 4096 stalling task, 0 dying task, 0 victim task.
[  160.639131] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  170.659915] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  172.597736] exe invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[  180.680650] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  190.705534] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  200.724567] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  210.745397] MemAlloc-Info: 4065 stalling task, 0 dying task, 0 victim task.
[  220.769501] MemAlloc-Info: 4092 stalling task, 0 dying task, 0 victim task.
[  230.791530] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  240.816711] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  250.836724] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  260.860257] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  270.883573] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  280.910072] MemAlloc-Info: 4088 stalling task, 0 dying task, 0 victim task.
[  290.931988] MemAlloc-Info: 4092 stalling task, 0 dying task, 0 victim task.
[  300.955543] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[  308.212307] exe invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
[  310.977057] MemAlloc-Info: 3988 stalling task, 0 dying task, 0 victim task.
[  320.999353] MemAlloc-Info: 4096 stalling task, 0 dying task, 0 victim task.
----------------------------------------

See? The memory allocation requests cannot invoke the OOM killer at a
steady rate, because the CPU cycles burned in the sleep-less retry loop
come close to mutually blocking the other tasks once the number of tasks
doing memory allocation requests exceeds the number of available CPUs.
We should be careful not to defer invocation of the OOM killer too much.

If the short sleep patch
( http://lkml.kernel.org/r/201510251952.CEF04109.OSOtLFHFVFJMQO@I-love.SAKURA.ne.jp )
is applied in addition to the above patches, the memory allocation requests
can invoke the OOM killer at a steady rate.
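
(The short sleep in question is essentially a one-jiffy sleep in the
allocator's retry path. A rough sketch of the idea only -- the actual
patch is in the message linked above:)

----------------------------------------
	/*
	 * Give up the CPU for one timer tick before retrying, so that
	 * pending workqueue items such as vmstat_update get a chance
	 * to run.
	 */
	schedule_timeout_uninterruptible(1);
	goto retry;
----------------------------------------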

By using a short sleep, some tasks might be able to do some useful
computation that does not involve a __GFP_WAIT memory allocation.

We do not need to defer workqueue items which do not involve a __GFP_WAIT
memory allocation. By allowing workqueue items to be processed (via the
short sleep), some task might release memory when its workqueue item is
processed.

Therefore, not only to keep the vmstat counters up to date but also to
avoid wasting CPU cycles, I prefer a short sleep.

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-05 14:59                                     ` Tetsuo Handa
@ 2015-11-05 17:45                                       ` Christoph Lameter
  -1 siblings, 0 replies; 122+ messages in thread
From: Christoph Lameter @ 2015-11-05 17:45 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mhocko, htejun, linux-mm, linux-kernel, torvalds, rientjes, oleg,
	kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

On Thu, 5 Nov 2015, Tetsuo Handa wrote:

> memory allocation. By allowing workqueue items to be processed (via the
> short sleep), some task might release memory when its workqueue item is
> processed.
>
> Therefore, not only to keep the vmstat counters up to date but also to
> avoid wasting CPU cycles, I prefer a short sleep.

Sorry, but we need workqueue processing for the vmstat counters that is
independent of other submitted requests that may block. Adding sleep /
schedule points everywhere to achieve this is not the right approach.
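
(One way to make vmstat processing independent of other work items --
essentially what Michal proposes later in this thread -- is to give
vmstat its own WQ_MEM_RECLAIM workqueue, which guarantees it a rescuer
thread:)

----------------------------------------
/* Dedicated workqueue for vmstat; work queued here can no longer be
 * blocked behind unrelated work items that are stuck in allocation. */
vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
----------------------------------------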


^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-05 17:45                                       ` Christoph Lameter
@ 2015-11-06  0:16                                         ` Tejun Heo
  -1 siblings, 0 replies; 122+ messages in thread
From: Tejun Heo @ 2015-11-06  0:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Tetsuo Handa, mhocko, linux-mm, linux-kernel, torvalds, rientjes,
	oleg, kwalker, akpm, hannes, vdavydov, skozina, mgorman, riel

Hello,

On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> Sorry, but we need workqueue processing for the vmstat counters that is

I made this analogy before but this is similar to looping with
preemption off.  If anything on a workqueue stays RUNNING w/o making
forward progress, it's buggy.  I'd venture to say any code which busy
loops without making forward progress on a time scale noticeable to
human beings is borderline buggy too.  If things need to be retried on
that time scale, putting a short sleep between trials is a sensible
thing to do.  There's no point in occupying the CPU and burning cycles
without making forward progress.

These things actually matter.  The freezer used to burn cycles this way
and was really good at burning off the last remaining battery reserve
during emergency hibernation if freezing took some amount of time.

It is true that, as it currently stands, this is error-prone because
workqueue can't detect these conditions and warn about them.  The same
goes for workqueues which sit in the memory reclaim path but forget
WQ_MEM_RECLAIM.  I'm going to add lockup detection, similar to the
softlockup detector, but that's a different issue, so please update the
code.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-06  0:16                                         ` Tejun Heo
@ 2015-11-11 15:44                                           ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-11-11 15:44 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

On Thu 05-11-15 19:16:48, Tejun Heo wrote:
> Hello,
> 
> On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> > Sorry, but we need workqueue processing for the vmstat counters that is
> 
> I made this analogy before but this is similar to looping with
> preemption off.  If anything on a workqueue stays RUNNING w/o making
> forward progress, it's buggy.  I'd venture to say any code which busy
> loops without making forward progress on a time scale noticeable to
> human beings is borderline buggy too.

Well, the caller asked for memory but the request cannot succeed. Due
to the memory allocator semantics we cannot fail the request, so we have
to loop. If we had an event to wait for, we would do so, of course.

Now wrt. a short sleep: we used to do that and called
congestion_wait(HZ/50) before retrying. This proved to cause stalls
during high memory pressure; see 0e093d99763e ("writeback: do not sleep on
the congestion queue if there are no congested BDIs or if significant
congestion is not being encountered in the current zone"). I do not
really remember what CONFIG_HZ was in those reports but it is quite
possible it was 250. So there is a risk of (partially) re-introducing
those stalls with the patch from Tetsuo
(http://lkml.kernel.org/r/201510251952.CEF04109.OSOtLFHFVFJMQO@I-love.SAKURA.ne.jp)
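
(For concreteness: congestion_wait(HZ/50) sleeps HZ/50 jiffies, i.e.
roughly 20ms independent of CONFIG_HZ, while a one-jiffy sleep is 10ms
at HZ=100, 4ms at HZ=250 and 1ms at HZ=1000, so the sleep proposed
below is considerably shorter than the one that caused those stalls.)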

If we really have to do a short sleep, though, then I would suggest
sticking it into wait_iff_congested rather than spreading it into more
places, and limiting it to worker threads only. This should be much
safer. Thoughts?
---
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout_uninterruptible(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 122+ messages in thread

* Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks
  2015-11-11 15:44                                           ` Michal Hocko
@ 2015-11-11 16:03                                             ` Michal Hocko
  -1 siblings, 0 replies; 122+ messages in thread
From: Michal Hocko @ 2015-11-11 16:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Christoph Lameter, Tetsuo Handa, linux-mm, linux-kernel,
	torvalds, rientjes, oleg, kwalker, akpm, hannes, vdavydov,
	skozina, mgorman, riel

With the full changelog and the vmstat update, for reference.
---
>From 9492966a552751e6d7a63e9aafb87e35992b840a Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 11 Nov 2015 16:45:53 +0100
Subject: [PATCH] mm, vmstat: Allow WQ concurrency to discover memory reclaim
 doesn't make any progress

Tetsuo Handa has reported that the system might basically livelock in an
OOM condition without triggering the OOM killer. The issue is caused by
an internal dependency of direct reclaim on vmstat counter updates (via
zone_reclaimable) which are performed from workqueue context.
If all the current workers get assigned to an allocation request,
though, they will be looping inside the allocator trying to reclaim
memory, but zone_reclaimable can see stalled numbers, so it will consider
a zone reclaimable even though it has been scanned way too much. The WQ
concurrency logic will not consider this situation a congested workqueue,
because it relies on the worker sleeping in such a situation.
This also means that it doesn't try to spawn new workers or invoke
the rescuer thread if one is already assigned to the queue.

In order to fix this issue we need to do two things. First, we have to
let the wq concurrency code know that we are in trouble, so we have to do
a short sleep. In order to prevent the issues handled by 0e093d99763e
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep to worker threads, which are
the ones of interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM so that it participates in memory reclaim and gets
a spare (rescuer) worker thread.

Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/backing-dev.c | 19 ++++++++++++++++---
 mm/vmstat.c      |  6 ++++--
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout_uninterruptible(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 45dcbcb5c594..0975da8e3432 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1381,6 +1381,7 @@ static const struct file_operations proc_vmstat_file_operations = {
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1393,7 +1394,7 @@ static void vmstat_update(struct work_struct *w)
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
 			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	} else {
@@ -1462,7 +1463,7 @@ static void vmstat_shepherd(struct work_struct *w)
 		if (need_update(cpu) &&
 			cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
 
-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
 				&per_cpu(vmstat_work, cpu), 0);
 
 	put_online_cpus();
@@ -1551,6 +1552,7 @@ static int __init setup_vmstat(void)
 
+	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
 	start_shepherd_timer();
 	cpu_notifier_register_done();
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
-- 
2.6.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 122+ messages in thread
