* 4.3-rc1 dirty page count underflow (cgroup-related?)
@ 2015-09-18  4:29 ` Dave Hansen
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2015-09-18  4:29 UTC (permalink / raw)
  To: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list, Greg Thelen

[-- Attachment #1: Type: text/plain, Size: 4536 bytes --]

I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
14.04.3 system.  The system will run fine for a few hours, but suddenly
start becoming horribly I/O bound.  A compile of perf for instance takes
20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
only seeing tens or hundreds of KB/s of writes.

Looking at some writeback tracepoints shows it hitting
balance_dirty_pages() pretty hard with a pretty large number of dirty
pages. :)

>               ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
> cgroup=/user/1000.user/c2.session

So something is underflowing dirty.
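For reference, that trace value is just a small negative count wrapped to unsigned 64-bit; a quick decode (an illustrative sketch, not part of the original report) shows how far below zero the counter went:

```python
# dirty= from the balance_dirty_pages tracepoint is printed as an
# unsigned 64-bit value; reinterpret it as signed two's complement.
dirty = 18446744073709513951  # value from the trace above

signed = dirty - (1 << 64) if dirty >= (1 << 63) else dirty
print(signed)  # -> -37665: the counter underflowed 37665 pages past zero
```

So the "huge" dirty count is really a counter that was decremented more times than it was incremented.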

I added the attached patch and got a warning pretty quickly, so this
looks pretty reproducible for me.

I'm not 100% sure this is from the 4.3 merge window.  I was running the
4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
automatically creating some cgroups, so they're definitely in play here.

Any ideas what is going on?

> [   12.415472] ------------[ cut here ]------------
> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
> [   12.415542]  ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
> [   12.415544]  ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
> [   12.415547]  ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
> [   12.415550] Call Trace:
> [   12.415555]  [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
> [   12.415560]  [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
> [   12.415563]  [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
> [   12.415566]  [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
> [   12.415568]  [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
> [   12.415571]  [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
> [   12.415575]  [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
> [   12.415578]  [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
> [   12.415582]  [<ffffffff81153765>] try_to_release_page+0x35/0x50
> [   12.415585]  [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
> [   12.415587]  [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
> [   12.415590]  [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
> [   12.415593]  [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
> [   12.415595]  [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
> [   12.415598]  [<ffffffff8109c494>] ? __wake_up+0x44/0x50
> [   12.415600]  [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
> [   12.415603]  [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
> [   12.415607]  [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
> [   12.415609]  [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
> [   12.415612]  [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
> [   12.415615]  [<ffffffff811e139c>] evict+0xbc/0x190
> [   12.415617]  [<ffffffff811e1d2d>] iput+0x17d/0x1e0
> [   12.415620]  [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
> [   12.415622]  [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
> [   12.415626]  [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
> [   12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
> [   12.418211] ------------[ cut here ]------------

[-- Attachment #2: dirtywarn.patch --]
[-- Type: text/x-patch, Size: 3944 bytes --]

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba..f130d02 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3757,6 +3757,9 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
 	unsigned long file_pages;
 
 	*pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+	WARN(*pdirty > (1<<30), "%s() huge dirty: %ld\n", __func__, *pdirty);
+	if (*pdirty > (1<<30))
+		*pdirty = 0;
 
 	/* this should eventually include NR_UNSTABLE_NFS */
 	*pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
@@ -4607,6 +4610,8 @@ static int mem_cgroup_move_account(struct page *page,
 				       nr_pages);
 			__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
 				       nr_pages);
+			WARN_ONCE(__this_cpu_read(from->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
+			WARN_ONCE(__this_cpu_read(  to->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
 		}
 	}
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0a931cd..9dfd8fa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1546,6 +1546,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 					global_page_state(NR_UNSTABLE_NFS);
 		gdtc->avail = global_dirtyable_memory();
 		gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+		WARN(gdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, gdtc->dirty);
 
 		domain_dirty_limits(gdtc);
 
@@ -1570,6 +1571,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 			 */
 			mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
 					    &writeback);
+			WARN(mdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, mdtc->dirty);
 			mdtc_cap_avail(mdtc);
 			mdtc->dirty += writeback;
 
@@ -1884,6 +1886,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	gdtc->avail = global_dirtyable_memory();
 	gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
 		      global_page_state(NR_UNSTABLE_NFS);
+	WARN(gdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, gdtc->dirty);
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
@@ -1896,6 +1899,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
 		unsigned long writeback;
 
 		mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
+		WARN(mdtc->dirty > (1<<30), "%s() huge dirty: %ld\n", __func__, mdtc->dirty);
 		mdtc_cap_avail(mdtc);
 		domain_dirty_limits(mdtc);	/* ditto, ignore writeback */
 
@@ -2404,7 +2408,9 @@ void account_page_dirtied(struct page *page, struct address_space *mapping,
 		inode_attach_wb(inode, page);
 		wb = inode_to_wb(inode);
 
+		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
 		mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_wb_stat(wb, WB_RECLAIMABLE);
@@ -2426,6 +2432,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
 {
 	if (mapping_cap_account_dirty(mapping)) {
 		mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_wb_stat(wb, WB_RECLAIMABLE);
 		task_io_account_cancelled_write(PAGE_CACHE_SIZE);
@@ -2686,6 +2693,7 @@ int clear_page_dirty_for_io(struct page *page)
 		wb = unlocked_inode_to_wb_begin(inode, &locked);
 		if (TestClearPageDirty(page)) {
 			mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+			WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_wb_stat(wb, WB_RECLAIMABLE);
 			ret = 1;

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
  2015-09-18  4:29 ` Dave Hansen
@ 2015-09-18  6:09   ` Greg Thelen
  -1 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2015-09-18  6:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list


Dave Hansen wrote:

> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
> 14.04.3 system.  The system will run fine for a few hours, but suddenly
> start becoming horribly I/O bound.  A compile of perf for instance takes
> 20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
> only seeing tens or hundreds of KB/s of writes.
>
> Looking at some writeback tracepoints shows it hitting
> balance_dirty_pages() pretty hard with a pretty large number of dirty
> pages. :)
>
>>               ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>> cgroup=/user/1000.user/c2.session
>
> So something is underflowing dirty.
>
> I added the attached patch and got a warning pretty quickly, so this
> looks pretty reproducible for me.
>
> I'm not 100% sure this is from the 4.3 merge window.  I was running the
> 4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
> automatically creating some cgroups, so they're definitely in play here.
>
> Any ideas what is going on?
>
>> [   12.415472] ------------[ cut here ]------------
>> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
>> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>> [   12.415542]  ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>> [   12.415544]  ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>> [   12.415547]  ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>> [   12.415550] Call Trace:
>> [   12.415555]  [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>> [   12.415560]  [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>> [   12.415563]  [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>> [   12.415566]  [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>> [   12.415568]  [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>> [   12.415571]  [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>> [   12.415575]  [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>> [   12.415578]  [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>> [   12.415582]  [<ffffffff81153765>] try_to_release_page+0x35/0x50
>> [   12.415585]  [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>> [   12.415587]  [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>> [   12.415590]  [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>> [   12.415593]  [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>> [   12.415595]  [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>> [   12.415598]  [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>> [   12.415600]  [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>> [   12.415603]  [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>> [   12.415607]  [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>> [   12.415609]  [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>> [   12.415612]  [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>> [   12.415615]  [<ffffffff811e139c>] evict+0xbc/0x190
>> [   12.415617]  [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>> [   12.415620]  [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>> [   12.415622]  [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>> [   12.415626]  [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>> [   12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>> [   12.418211] ------------[ cut here ]------------

I'm not denying the issue, but the WARNING splat isn't necessarily
catching a problem.  The corresponding code comes from your debug patch:
+		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");

This only checks a single cpu's counter, which can be negative.  The sum
of all counters is what matters.
Imagine:
cpu1) dirty page: inc
cpu2) clean page: dec
The sum is properly zero, but cpu2's counter is -1, which compares as a
huge unsigned value against (1UL<<30) and triggers the WARN.

I'll look at the code and also see if I can reproduce the failure using
mem_cgroup_read_stat() for all of the new WARNs.

Did you notice if the global /proc/meminfo:Dirty count also underflowed?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
  2015-09-18  6:09   ` Greg Thelen
  (?)
@ 2015-09-18  8:31     ` Greg Thelen
  -1 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2015-09-18  8:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list


Greg Thelen wrote:

> Dave Hansen wrote:
>
>> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
>> 14.04.3 system.  The system will run fine for a few hours, but suddenly
>> start becoming horribly I/O bound.  A compile of perf for instance takes
>> 20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
>> only seeing tens or hundreds of KB/s of writes.
>>
>> Looking at some writeback tracepoints shows it hitting
>> balance_dirty_pages() pretty hard with a pretty large number of dirty
>> pages. :)
>>
>>>               ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>>> cgroup=/user/1000.user/c2.session
>>
>> So something is underflowing dirty.
>>
>> I added the attached patch and got a warning pretty quickly, so this
>> looks pretty reproducible for me.
>>
>> I'm not 100% sure this is from the 4.3 merge window.  I was running the
>> 4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
>> automatically creating some cgroups, so they're definitely in play here.
>>
>> Any ideas what is going on?
>>
>>> [   12.415472] ------------[ cut here ]------------
>>> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>>> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
>>> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>>> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>>> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>>> [   12.415542]  ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>>> [   12.415544]  ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>>> [   12.415547]  ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>>> [   12.415550] Call Trace:
>>> [   12.415555]  [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>>> [   12.415560]  [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>>> [   12.415563]  [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>>> [   12.415566]  [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>>> [   12.415568]  [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>>> [   12.415571]  [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>>> [   12.415575]  [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>>> [   12.415578]  [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>>> [   12.415582]  [<ffffffff81153765>] try_to_release_page+0x35/0x50
>>> [   12.415585]  [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>>> [   12.415587]  [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>>> [   12.415590]  [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>>> [   12.415593]  [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>>> [   12.415595]  [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>>> [   12.415598]  [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>>> [   12.415600]  [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>>> [   12.415603]  [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>>> [   12.415607]  [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>>> [   12.415609]  [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>>> [   12.415612]  [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>>> [   12.415615]  [<ffffffff811e139c>] evict+0xbc/0x190
>>> [   12.415617]  [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>>> [   12.415620]  [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>>> [   12.415622]  [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>>> [   12.415626]  [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>> [   12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>>> [   12.418211] ------------[ cut here ]------------
>
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem.  The corresponding code comes from your debug patch:
> +		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative.  The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
>
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?

Looks to be miscommunication between memcg dirty page accounting and
page migration.

From d6b62de5475f85ed4dabe7b6c6f0991339b9c0c7 Mon Sep 17 00:00:00 2001
From: Greg Thelen <gthelen@google.com>
Date: Fri, 18 Sep 2015 00:38:14 -0700
Subject: [PATCH] possible fix

Assume that newpage is dirty and charged to memcg.  So the memcg's dirty
page count was incremented for the page.  Migration's move_to_new_page()
indirectly calls migrate_page_copy(newpage, page), which decrements
page->mem_cgroup's dirty page count, but because newpage->mem_cgroup is
not yet set, __set_page_dirty_nobuffers() cannot reincrement the memcg
dirty page count.  Later move_to_new_page() calls mem_cgroup_migrate()
which finally sets newpage->mem_cgroup, but the dirty page count is
never reincremented, despite PG_dirty=1.  Time passes and
account_page_cleaned(newpage) is called which decrements a
non-incremented page->mem_cgroup dirty page count, causing underflow.

Because we're exchanging one dirty page for another, we might be able to
avoid decrementing and immediately reincrementing the memcg's dirty
count.

This is an alternative (ugly) fix which survives light testing.  It
temporarily sets up newpage's memcg so that __set_page_dirty_nobuffers()
can increment memcg's dirty page count.  I need to study this approach
more carefully before claiming it's a good fix.

Not-Yet-Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/migrate.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index c3cb566af3e2..6268eb8b6fd5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -734,6 +734,11 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	if (!trylock_page(newpage))
 		BUG();
 
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON(newpage->mem_cgroup);
+	newpage->mem_cgroup = page->mem_cgroup;
+#endif
+
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
@@ -755,6 +760,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	else
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
+#ifdef CONFIG_MEMCG
+	newpage->mem_cgroup = NULL;
+#endif
+
 	if (rc != MIGRATEPAGE_SUCCESS) {
 		newpage->mapping = NULL;
 	} else {

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
@ 2015-09-18  8:31     ` Greg Thelen
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2015-09-18  8:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list


Greg Thelen wrote:

> Dave Hansen wrote:
>
>> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
>> 14.04.3 system.  The system will run fine for a few hours, but suddenly
>> start becoming horribly I/O bound.  A compile of perf for instance takes
>> 20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
>> only seeing tens or hundreds of KB/s of writes.
>>
>> Looking at some writeback tracepoints shows it hitting
>> balance_dirty_pages() pretty hard with a pretty large number of dirty
>> pages. :)
>>
>>>               ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>>> cgroup=/user/1000.user/c2.session
>>
>> So something is underflowing dirty.
>>
>> I added the attached patch and got a warning pretty quickly, so this
>> looks pretty reproducible for me.
>>
>> I'm not 100% sure this is from the 4.3 merge window.  I was running the
>> 4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
>> automatically creating some cgroups, so they're definitely in play here.
>>
>> Any ideas what is going on?
>>
>>> [   12.415472] ------------[ cut here ]------------
>>> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>>> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
>>> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>>> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>>> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>>> [   12.415542]  ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>>> [   12.415544]  ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>>> [   12.415547]  ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>>> [   12.415550] Call Trace:
>>> [   12.415555]  [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>>> [   12.415560]  [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>>> [   12.415563]  [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>>> [   12.415566]  [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>>> [   12.415568]  [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>>> [   12.415571]  [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>>> [   12.415575]  [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>>> [   12.415578]  [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>>> [   12.415582]  [<ffffffff81153765>] try_to_release_page+0x35/0x50
>>> [   12.415585]  [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>>> [   12.415587]  [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>>> [   12.415590]  [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>>> [   12.415593]  [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>>> [   12.415595]  [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>>> [   12.415598]  [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>>> [   12.415600]  [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>>> [   12.415603]  [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>>> [   12.415607]  [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>>> [   12.415609]  [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>>> [   12.415612]  [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>>> [   12.415615]  [<ffffffff811e139c>] evict+0xbc/0x190
>>> [   12.415617]  [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>>> [   12.415620]  [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>>> [   12.415622]  [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>>> [   12.415626]  [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>> [   12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>>> [   12.418211] ------------[ cut here ]------------
>
> I'm not denying the issue, bug the WARNING splat isn't necessarily
> catching a problem.  The corresponding code comes from your debug patch:
> +		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative.  The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
>
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?

Looks to be miscommunication between memcg dirty page accounting and
page migration.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
@ 2015-09-18  8:31     ` Greg Thelen
  0 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2015-09-18  8:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list


Greg Thelen wrote:

> Dave Hansen wrote:
>
>> I've been seeing some strange behavior with 4.3-rc1 kernels on my Ubuntu
>> 14.04.3 system.  The system will run fine for a few hours, but suddenly
>> start becoming horribly I/O bound.  A compile of perf for instance takes
>> 20-30 minutes and the compile seems entirely I/O bound.  But, the SSD is
>> only seeing tens or hundreds of KB/s of writes.
>>
>> Looking at some writeback tracepoints shows it hitting
>> balance_dirty_pages() pretty hard with a pretty large number of dirty
>> pages. :)
>>
>>>               ld-27008 [000] ...1 88895.190770: balance_dirty_pages: bdi
>>> 8:0: limit=234545 setpoint=204851 dirty=18446744073709513951
>>> bdi_setpoint=184364 bdi_dirty=33 dirty_ratelimit=24 task_ratelimit=0
>>> dirtied=1 dirtied_pause=0 paused=0 pause=136 period=136 think=0
>>> cgroup=/user/1000.user/c2.session
>>
>> So something is underflowing dirty.
>>
>> I added the attached patch and got a warning pretty quickly, so this
>> looks pretty reproducible for me.
>>
>> I'm not 100% sure this is from the 4.3 merge window.  I was running the
>> 4.2-rcs, but they seemed to have their own issues.  Ubuntu seems to be
>> automatically creating some cgroups, so they're definitely in play here.
>>
>> Any ideas what is going on?
>>
>>> [   12.415472] ------------[ cut here ]------------
>>> [   12.415481] WARNING: CPU: 1 PID: 1684 at mm/page-writeback.c:2435 account_page_cleaned+0x101/0x110()
>>> [   12.415483] MEM_CGROUP_STAT_DIRTY bogus
>>> [   12.415484] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc iptable_filter ip_tables ebtable_nat ebtables x_tables dm_crypt cmac rfcomm bnep arc4 iwldvm mac80211 btusb snd_hda_codec_hdmi iwlwifi btrtl btbcm btintel snd_hda_codec_realtek snd_hda_codec_generic bluetooth snd_hda_intel snd_hda_codec cfg80211 snd_hwdep snd_hda_core intel_rapl iosf_mbi hid_logitech_hidpp x86_pkg_temp_thermal snd_pcm coretemp ghash_clmulni_intel thinkpad_acpi joydev snd_seq_midi snd_seq_midi_event snd_rawmidi nvram snd_seq snd_timer snd_seq_device wmi snd soundcore mac_hid lpc_ich aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk
 _helper cryptd kvm_intel kvm hid_logitech_dj sdhci_pci sdhci usbhid hid
>>> [   12.415538] CPU: 1 PID: 1684 Comm: indicator-keybo Not tainted 4.3.0-rc1-dirty #25
>>> [   12.415540] Hardware name: LENOVO 2325AR2/2325AR2, BIOS G2ETA4WW (2.64 ) 04/09/2015
>>> [   12.415542]  ffffffff81aa8172 ffff8800c926ba30 ffffffff8132e3e2 ffff8800c926ba78
>>> [   12.415544]  ffff8800c926ba68 ffffffff8105e386 ffffea000fca4700 ffff880409b96420
>>> [   12.415547]  ffff8803ef979490 ffff880403ff4800 0000000000000000 ffff8800c926bac8
>>> [   12.415550] Call Trace:
>>> [   12.415555]  [<ffffffff8132e3e2>] dump_stack+0x4b/0x69
>>> [   12.415560]  [<ffffffff8105e386>] warn_slowpath_common+0x86/0xc0
>>> [   12.415563]  [<ffffffff8105e40c>] warn_slowpath_fmt+0x4c/0x50
>>> [   12.415566]  [<ffffffff8115f951>] account_page_cleaned+0x101/0x110
>>> [   12.415568]  [<ffffffff8115fa1d>] cancel_dirty_page+0xbd/0xf0
>>> [   12.415571]  [<ffffffff811fc044>] try_to_free_buffers+0x94/0xb0
>>> [   12.415575]  [<ffffffff81296740>] jbd2_journal_try_to_free_buffers+0x100/0x130
>>> [   12.415578]  [<ffffffff812498b2>] ext4_releasepage+0x52/0xa0
>>> [   12.415582]  [<ffffffff81153765>] try_to_release_page+0x35/0x50
>>> [   12.415585]  [<ffffffff811fcc13>] block_invalidatepage+0x113/0x130
>>> [   12.415587]  [<ffffffff81249d5e>] ext4_invalidatepage+0x5e/0xb0
>>> [   12.415590]  [<ffffffff8124a6f0>] ext4_da_invalidatepage+0x40/0x310
>>> [   12.415593]  [<ffffffff81162b03>] truncate_inode_page+0x83/0x90
>>> [   12.415595]  [<ffffffff81162ce9>] truncate_inode_pages_range+0x199/0x730
>>> [   12.415598]  [<ffffffff8109c494>] ? __wake_up+0x44/0x50
>>> [   12.415600]  [<ffffffff812962ca>] ? jbd2_journal_stop+0x1ba/0x3b0
>>> [   12.415603]  [<ffffffff8125a754>] ? ext4_unlink+0x2f4/0x330
>>> [   12.415607]  [<ffffffff811eed3d>] ? __inode_wait_for_writeback+0x6d/0xc0
>>> [   12.415609]  [<ffffffff811632ec>] truncate_inode_pages_final+0x4c/0x60
>>> [   12.415612]  [<ffffffff81250736>] ext4_evict_inode+0x116/0x4c0
>>> [   12.415615]  [<ffffffff811e139c>] evict+0xbc/0x190
>>> [   12.415617]  [<ffffffff811e1d2d>] iput+0x17d/0x1e0
>>> [   12.415620]  [<ffffffff811d623b>] do_unlinkat+0x1ab/0x2b0
>>> [   12.415622]  [<ffffffff811d6cb6>] SyS_unlink+0x16/0x20
>>> [   12.415626]  [<ffffffff817d7f97>] entry_SYSCALL_64_fastpath+0x12/0x6a
>>> [   12.415628] ---[ end trace 6fba1ddd3d240e13 ]---
>>> [   12.418211] ------------[ cut here ]------------
>
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem.  The corresponding code comes from your debug patch:
> +		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>
> This only checks a single cpu's counter, which can be negative.  The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.
>
> Did you notice if the global /proc/meminfo:Dirty count also underflowed?

Looks to be miscommunication between memcg dirty page accounting and
page migration.

From d6b62de5475f85ed4dabe7b6c6f0991339b9c0c7 Mon Sep 17 00:00:00 2001
From: Greg Thelen <gthelen@google.com>
Date: Fri, 18 Sep 2015 00:38:14 -0700
Subject: [PATCH] possible fix

Assume that the page being migrated ("page") is dirty and charged to a
memcg, so the memcg's dirty page count was incremented for it.
Migration's move_to_new_page()
indirectly calls migrate_page_copy(newpage, page), which decrements
page->mem_cgroup's dirty page count, but because newpage->mem_cgroup is
not yet set, __set_page_dirty_nobuffers() cannot reincrement the memcg
dirty page count.  Later move_to_new_page() calls mem_cgroup_migrate()
which finally sets newpage->mem_cgroup, but the dirty page count is
never reincremented, despite PG_dirty=1.  Time passes and
account_page_cleaned(newpage) is called which decrements a
non-incremented page->mem_cgroup dirty page count, causing underflow.

Because we're exchanging one dirty page for another, we might be able to
avoid decrementing and immediately reincrementing the memcg's dirty
count.

This is an alternative (ugly) fix which survives light testing.  It
temporarily sets up newpage's memcg so that __set_page_dirty_nobuffers()
can increment memcg's dirty page count.  I need to study this approach
more carefully before claiming it's a good fix.

Not-Yet-Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/migrate.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index c3cb566af3e2..6268eb8b6fd5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -734,6 +734,11 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	if (!trylock_page(newpage))
 		BUG();
 
+#ifdef CONFIG_MEMCG
+	VM_BUG_ON(newpage->mem_cgroup);
+	newpage->mem_cgroup = page->mem_cgroup;
+#endif
+
 	/* Prepare mapping for the new page.*/
 	newpage->index = page->index;
 	newpage->mapping = page->mapping;
@@ -755,6 +760,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	else
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
+#ifdef CONFIG_MEMCG
+	newpage->mem_cgroup = NULL;
+#endif
+
 	if (rc != MIGRATEPAGE_SUCCESS) {
 		newpage->mapping = NULL;
 	} else {

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
  2015-09-18  6:09   ` Greg Thelen
@ 2015-09-18 14:50     ` Dave Hansen
  -1 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2015-09-18 14:50 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list

On 09/17/2015 11:09 PM, Greg Thelen wrote:
> I'm not denying the issue, but the WARNING splat isn't necessarily
> catching a problem.  The corresponding code comes from your debug patch:
> +		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
> 
> This only checks a single cpu's counter, which can be negative.  The sum
> of all counters is what matters.
> Imagine:
> cpu1) dirty page: inc
> cpu2) clean page: dec
> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
> 
> I'll look at the code and also see if I can reproduce the failure using
> mem_cgroup_read_stat() for all of the new WARNs.

D'oh.  I'll replace those with the proper mem_cgroup_read_stat() and
test with your patch to see if anything still triggers.

> Did you notice if the global /proc/meminfo:Dirty count also underflowed?

It did not underflow.  It was one of the first things I looked at and it
looked fine, went down near 0 at 'sync', etc...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
  2015-09-18 14:50     ` Dave Hansen
  (?)
@ 2015-09-21  8:06       ` Greg Thelen
  -1 siblings, 0 replies; 16+ messages in thread
From: Greg Thelen @ 2015-09-21  8:06 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Johannes Weiner, Michal Hocko, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list


Dave Hansen wrote:

> On 09/17/2015 11:09 PM, Greg Thelen wrote:
>> I'm not denying the issue, but the WARNING splat isn't necessarily
>> catching a problem.  The corresponding code comes from your debug patch:
>> +		WARN_ONCE(__this_cpu_read(memcg->stat->count[MEM_CGROUP_STAT_DIRTY]) > (1UL<<30), "MEM_CGROUP_STAT_DIRTY bogus");
>> 
>> This only checks a single cpu's counter, which can be negative.  The sum
>> of all counters is what matters.
>> Imagine:
>> cpu1) dirty page: inc
>> cpu2) clean page: dec
>> The sum is properly zero, but cpu2 is -1, which will trigger the WARN.
>> 
>> I'll look at the code and also see if I can reproduce the failure using
>> mem_cgroup_read_stat() for all of the new WARNs.
>
> D'oh.  I'll replace those with the proper mem_cgroup_read_stat() and
> test with your patch to see if anything still triggers.

Thanks Dave.

Here's what I think we should use to fix the issue.  I tagged this for
v4.2 stable given the way that unpatched performance falls apart without
warning or workaround (besides deleting and recreating affected memcg).
Feedback welcome.

From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001
From: Greg Thelen <gthelen@google.com>
Date: Sat, 19 Sep 2015 16:21:18 -0700
Subject: [PATCH] memcg: fix dirty page migration

The problem starts with a file backed dirty page which is charged to a
memcg.  Then page migration is used to move oldpage to newpage.
Migration:
- copies the oldpage's data to newpage
- clears oldpage.PG_dirty
- sets newpage.PG_dirty
- uncharges oldpage from memcg
- charges newpage to memcg

Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
count.  However, because newpage is not yet charged, setting
newpage.PG_dirty does not increment the memcg's dirty page count.  After
migration completes newpage.PG_dirty is eventually cleared, often in
account_page_cleaned().  At this time newpage is charged to a memcg so
the memcg's dirty page count is decremented which causes underflow
because the count was not previously incremented by migration.  This
underflow causes balance_dirty_pages() to see a very large unsigned
number of dirty memcg pages which leads to aggressive throttling of
buffered writes by processes in non root memcg.

This issue:
- can harm performance of non root memcg buffered writes.
- can report too small (even negative) values in
  memory.stat[(total_)dirty] counters of all memcg, including the root.

To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
page_memcg() and set_page_memcg() helpers.

Test:
    0) setup and enter limited memcg
    mkdir /sys/fs/cgroup/test
    echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
    echo $$ > /sys/fs/cgroup/test/cgroup.procs

    1) buffered writes baseline
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    2) buffered writes with compaction antagonist to induce migration
    yes 1 > /proc/sys/vm/compact_memory &
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    kill %
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

    3) buffered writes without antagonist, should match baseline
    rm -rf /data/tmp/foo
    dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
    sync
    grep ^dirty /sys/fs/cgroup/test/memory.stat

                       (speed, dirty residue)
             unpatched                       patched
    1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
    2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
    3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages

    Notice that unpatched baseline performance (1) fell after
    migration (3): 841 -> 114 MB/s.  In the patched kernel, post
    migration performance matches baseline.

Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
Cc: <stable@vger.kernel.org> # 4.2+
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
 include/linux/mm.h | 21 +++++++++++++++++++++
 mm/migrate.c       | 12 +++++++++++-
 2 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 91c08f6f0dc9..80001de019ba 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
 #endif
 }
 
+#ifdef CONFIG_MEMCG
+static inline struct mem_cgroup *page_memcg(struct page *page)
+{
+	return page->mem_cgroup;
+}
+
+static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
+{
+	page->mem_cgroup = memcg;
+}
+#else
+static inline struct mem_cgroup *page_memcg(struct page *page)
+{
+	return NULL;
+}
+
+static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
+{
+}
+#endif
+
 /*
  * Some inline functions in vmstat.h depend on page_zone()
  */
diff --git a/mm/migrate.c b/mm/migrate.c
index c3cb566af3e2..6116b8f64d27 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -740,6 +740,15 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 	if (PageSwapBacked(page))
 		SetPageSwapBacked(newpage);
 
+	/*
+	 * Indirectly called below, migrate_page_copy() copies PG_dirty and thus
+	 * needs newpage's memcg set to transfer memcg dirty page accounting.
+	 * So perform memcg migration in two steps:
+	 * 1. set newpage->mem_cgroup (here)
+	 * 2. clear page->mem_cgroup (below)
+	 */
+	set_page_memcg(newpage, page_memcg(page));
+
 	mapping = page_mapping(page);
 	if (!mapping)
 		rc = migrate_page(mapping, newpage, page, mode);
@@ -756,9 +765,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = fallback_migrate_page(mapping, newpage, page, mode);
 
 	if (rc != MIGRATEPAGE_SUCCESS) {
+		set_page_memcg(newpage, NULL);
 		newpage->mapping = NULL;
 	} else {
-		mem_cgroup_migrate(page, newpage, false);
+		set_page_memcg(page, NULL);
 		if (page_was_mapped)
 			remove_migration_ptes(page, newpage);
 		page->mapping = NULL;
-- 
2.6.0.rc0.131.gf624c3d


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
  2015-09-21  8:06       ` Greg Thelen
@ 2015-09-22  9:03         ` Michal Hocko
  -1 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2015-09-22  9:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Dave Hansen, Johannes Weiner, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list

[I am sorry I didn't get to this earlier because I was at an internal
conference last week]

On Mon 21-09-15 01:06:58, Greg Thelen wrote:
[...]
> From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001
> From: Greg Thelen <gthelen@google.com>
> Date: Sat, 19 Sep 2015 16:21:18 -0700
> Subject: [PATCH] memcg: fix dirty page migration
> 
> The problem starts with a file backed dirty page which is charged to a
> memcg.  Then page migration is used to move oldpage to newpage.
> Migration:
> - copies the oldpage's data to newpage
> - clears oldpage.PG_dirty
> - sets newpage.PG_dirty
> - uncharges oldpage from memcg
> - charges newpage to memcg
> 
> Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
> count.  However, because newpage is not yet charged, setting
> newpage.PG_dirty does not increment the memcg's dirty page count.  After
> migration completes newpage.PG_dirty is eventually cleared, often in
> account_page_cleaned().  At this time newpage is charged to a memcg so
> the memcg's dirty page count is decremented which causes underflow
> because the count was not previously incremented by migration.  This
> underflow causes balance_dirty_pages() to see a very large unsigned
> number of dirty memcg pages which leads to aggressive throttling of
> buffered writes by processes in non root memcg.

Very well spotted!

> This issue:
> - can harm performance of non root memcg buffered writes.
> - can report too small (even negative) values in
>   memory.stat[(total_)dirty] counters of all memcg, including the root.
> 
> To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
> page_memcg() and set_page_memcg() helpers.
> 
> Test:
>     0) setup and enter limited memcg
>     mkdir /sys/fs/cgroup/test
>     echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
>     echo $$ > /sys/fs/cgroup/test/cgroup.procs
> 
>     1) buffered writes baseline
>     dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
>     sync
>     grep ^dirty /sys/fs/cgroup/test/memory.stat
> 
>     2) buffered writes with compaction antagonist to induce migration
>     yes 1 > /proc/sys/vm/compact_memory &
>     rm -rf /data/tmp/foo
>     dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
>     kill %
>     sync
>     grep ^dirty /sys/fs/cgroup/test/memory.stat
> 
>     3) buffered writes without antagonist, should match baseline
>     rm -rf /data/tmp/foo
>     dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
>     sync
>     grep ^dirty /sys/fs/cgroup/test/memory.stat
> 
>                        (speed, dirty residue)
>              unpatched                       patched
>     1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
>     2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
>     3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
> 
>     Notice that unpatched baseline performance (1) fell after
>     migration (3): 841 -> 114 MB/s.  In the patched kernel, post
>     migration performance matches baseline.
> 
> Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
> Cc: <stable@vger.kernel.org> # 4.2+
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/mm.h | 21 +++++++++++++++++++++
>  mm/migrate.c       | 12 +++++++++++-
>  2 files changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 91c08f6f0dc9..80001de019ba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
>  #endif
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> +	return page->mem_cgroup;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> +	page->mem_cgroup = memcg;
> +}
> +#else
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> +	return NULL;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
>  /*
>   * Some inline functions in vmstat.h depend on page_zone()
>   */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c3cb566af3e2..6116b8f64d27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -740,6 +740,15 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  	if (PageSwapBacked(page))
>  		SetPageSwapBacked(newpage);
>  
> +	/*
> +	 * Indirectly called below, migrate_page_copy() copies PG_dirty and thus
> +	 * needs newpage's memcg set to transfer memcg dirty page accounting.
> +	 * So perform memcg migration in two steps:
> +	 * 1. set newpage->mem_cgroup (here)
> +	 * 2. clear page->mem_cgroup (below)
> +	 */
> +	set_page_memcg(newpage, page_memcg(page));
> +
>  	mapping = page_mapping(page);
>  	if (!mapping)
>  		rc = migrate_page(mapping, newpage, page, mode);
> @@ -756,9 +765,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		rc = fallback_migrate_page(mapping, newpage, page, mode);
>  
>  	if (rc != MIGRATEPAGE_SUCCESS) {
> +		set_page_memcg(newpage, NULL);
>  		newpage->mapping = NULL;
>  	} else {
> -		mem_cgroup_migrate(page, newpage, false);
> +		set_page_memcg(page, NULL);
>  		if (page_was_mapped)
>  			remove_migration_ptes(page, newpage);
>  		page->mapping = NULL;
> -- 
> 2.6.0.rc0.131.gf624c3d

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 4.3-rc1 dirty page count underflow (cgroup-related?)
@ 2015-09-22  9:03         ` Michal Hocko
  0 siblings, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2015-09-22  9:03 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Dave Hansen, Johannes Weiner, Tejun Heo, Jens Axboe,
	Andrew Morton, Jan Kara,
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG),
	open list

[I am sorry I didn't get to this earlier because I was at an internal
conference last week]

On Mon 21-09-15 01:06:58, Greg Thelen wrote:
[...]
> >From f5c39c2e8471c10fe0464ca7b6e6f743ce6920a6 Mon Sep 17 00:00:00 2001
> From: Greg Thelen <gthelen@google.com>
> Date: Sat, 19 Sep 2015 16:21:18 -0700
> Subject: [PATCH] memcg: fix dirty page migration
> 
> The problem starts with a file backed dirty page which is charged to a
> memcg.  Then page migration is used to move oldpage to newpage.
> Migration:
> - copies the oldpage's data to newpage
> - clears oldpage.PG_dirty
> - sets newpage.PG_dirty
> - uncharges oldpage from memcg
> - charges newpage to memcg
> 
> Clearing oldpage.PG_dirty decrements the charged memcg's dirty page
> count.  However, because newpage is not yet charged, setting
> newpage.PG_dirty does not increment the memcg's dirty page count.  After
> migration completes newpage.PG_dirty is eventually cleared, often in
> account_page_cleaned().  At this time newpage is charged to a memcg so
> the memcg's dirty page count is decremented which causes underflow
> because the count was not previously incremented by migration.  This
> underflow causes balance_dirty_pages() to see a very large unsigned
> number of dirty memcg pages which leads to aggressive throttling of
> buffered writes by processes in non root memcg.

Very well spotted!

> This issue:
> - can harm performance of non root memcg buffered writes.
> - can report too small (even negative) values in
>   memory.stat[(total_)dirty] counters of all memcg, including the root.
> 
> To avoid polluting migrate.c with #ifdef CONFIG_MEMCG checks, introduce
> page_memcg() and set_page_memcg() helpers.
> 
> Test:
>     0) setup and enter limited memcg
>     mkdir /sys/fs/cgroup/test
>     echo 1G > /sys/fs/cgroup/test/memory.limit_in_bytes
>     echo $$ > /sys/fs/cgroup/test/cgroup.procs
> 
>     1) buffered writes baseline
>     dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
>     sync
>     grep ^dirty /sys/fs/cgroup/test/memory.stat
> 
>     2) buffered writes with compaction antagonist to induce migration
>     yes 1 > /proc/sys/vm/compact_memory &
>     rm -rf /data/tmp/foo
>     dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
>     kill %
>     sync
>     grep ^dirty /sys/fs/cgroup/test/memory.stat
> 
>     3) buffered writes without antagonist, should match baseline
>     rm -rf /data/tmp/foo
>     dd if=/dev/zero of=/data/tmp/foo bs=1M count=1k
>     sync
>     grep ^dirty /sys/fs/cgroup/test/memory.stat
> 
>                        (speed, dirty residue)
>              unpatched                       patched
>     1) 841 MB/s 0 dirty pages          886 MB/s 0 dirty pages
>     2) 611 MB/s -33427456 dirty pages  793 MB/s 0 dirty pages
>     3) 114 MB/s -33427456 dirty pages  891 MB/s 0 dirty pages
> 
>     Notice that unpatched baseline performance (1) fell after
>     migration (3): 841 -> 114 MB/s.  In the patched kernel, post
>     migration performance matches baseline.
> 
> Fixes: c4843a7593a9 ("memcg: add per cgroup dirty page accounting")
> Cc: <stable@vger.kernel.org> # 4.2+
> Reported-by: Dave Hansen <dave.hansen@intel.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/mm.h | 21 +++++++++++++++++++++
>  mm/migrate.c       | 12 +++++++++++-
>  2 files changed, 32 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 91c08f6f0dc9..80001de019ba 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -905,6 +905,27 @@ static inline void set_page_links(struct page *page, enum zone_type zone,
>  #endif
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> +	return page->mem_cgroup;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> +	page->mem_cgroup = memcg;
> +}
> +#else
> +static inline struct mem_cgroup *page_memcg(struct page *page)
> +{
> +	return NULL;
> +}
> +
> +static inline void set_page_memcg(struct page *page, struct mem_cgroup *memcg)
> +{
> +}
> +#endif
> +
>  /*
>   * Some inline functions in vmstat.h depend on page_zone()
>   */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c3cb566af3e2..6116b8f64d27 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -740,6 +740,15 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  	if (PageSwapBacked(page))
>  		SetPageSwapBacked(newpage);
>  
> +	/*
> +	 * Indirectly called below, migrate_page_copy() copies PG_dirty and thus
> +	 * needs newpage's memcg set to transfer memcg dirty page accounting.
> +	 * So perform memcg migration in two steps:
> +	 * 1. set newpage->mem_cgroup (here)
> +	 * 2. clear page->mem_cgroup (below)
> +	 */
> +	set_page_memcg(newpage, page_memcg(page));
> +
>  	mapping = page_mapping(page);
>  	if (!mapping)
>  		rc = migrate_page(mapping, newpage, page, mode);
> @@ -756,9 +765,10 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>  		rc = fallback_migrate_page(mapping, newpage, page, mode);
>  
>  	if (rc != MIGRATEPAGE_SUCCESS) {
> +		set_page_memcg(newpage, NULL);
>  		newpage->mapping = NULL;
>  	} else {
> -		mem_cgroup_migrate(page, newpage, false);
> +		set_page_memcg(page, NULL);
>  		if (page_was_mapped)
>  			remove_migration_ptes(page, newpage);
>  		page->mapping = NULL;
> -- 
> 2.6.0.rc0.131.gf624c3d

-- 
Michal Hocko
SUSE Labs

