linux-kernel.vger.kernel.org archive mirror
* shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
       [not found]                 ` <574BEA84.3010206@profihost.ag>
@ 2016-05-30 22:36                   ` Dave Chinner
  2016-05-31  1:07                     ` Minchan Kim
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-05-30 22:36 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG; +Cc: Brian Foster, xfs, linux-mm, linux-kernel

[adding lkml and linux-mm to the cc list]

On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
> Hi Dave,
>   Hi Brian,
> 
> below are the results with a vanilla 4.4.11 kernel.

Thanks for persisting with the testing, Stefan.

....

> i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
> fresh reboot it has happened again on the root FS for a debian apt file:
> 
> XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
> xfs_vm_releasepage+0x10f/0x140()
> Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
> iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
> sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
> ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
> raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
> ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
> i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
> CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
> Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
>  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
>  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
>  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
> Call Trace:
>  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
>  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
>  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
>  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
>  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
>  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
>  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
>  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
>  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
>  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
>  [<ffffffffa2168539>] kswapd+0x4f9/0x970
>  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
>  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
>  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
>  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
>  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> ---[ end trace c9d679f8ed4d7610 ]---
> XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
> 0x12b990
> XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
.....

Ok, I suspect this may be a VM bug. I've been looking at the 4.6
code (so please try to reproduce on that kernel!) but it looks to me
like the only way we can get from shrink_active_list() direct to
try_to_release_page() is if we are over the maximum bufferhead
threshold (i.e buffer_heads_over_limit = true) and we are trying to
reclaim pages direct from the active list.

Because we are called from kswapd()->balance_pgdat(), we have:

        struct scan_control sc = {
                .gfp_mask = GFP_KERNEL,
                .order = order,
                .priority = DEF_PRIORITY,
                .may_writepage = !laptop_mode,
                .may_unmap = 1,
                .may_swap = 1,
        };

The key point here is that reclaim is being run with .may_writepage =
true on default-configuration kernels. When we get to
shrink_active_list():

	if (!sc->may_writepage)
		isolate_mode |= ISOLATE_CLEAN;

But sc->may_writepage = true and this allows isolate_lru_pages() to
isolate dirty pages from the active list. Normally this isn't a
problem, because the isolated active list pages are rotated to the
inactive list, and nothing else happens to them. *Except when
buffer_heads_over_limit = true*. This special condition would
explain why I have never seen apt/dpkg cause this problem on any of
my (many) Debian systems that all use XFS....

In that case, shrink_active_list() runs:

	if (unlikely(buffer_heads_over_limit)) {
		if (page_has_private(page) && trylock_page(page)) {
			if (page_has_private(page))
				try_to_release_page(page, 0);
			unlock_page(page);
		}
	}

i.e. it locks the page, and if it has buffer heads it tries to get
the bufferheads freed from the page.

But this is a dirty page, which means it may have delalloc or
unwritten state on its buffers, both of which indicate that there
is dirty data in the page that hasn't been written. XFS issues a
warning on this because neither shrink_active_list() nor
try_to_release_page() checks whether the page is dirty or not.

Hence it seems to me that shrink_active_list() is calling
try_to_release_page() inappropriately, and XFS is just the
messenger. If you turn laptop mode on, it is likely the problem will
go away as kswapd will run with .may_writepage = false, but that
will also cause other behavioural changes relating to writeback and
memory reclaim. It might be worth trying as a workaround for now.
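
(For reference: laptop mode can be toggled at runtime via the
vm.laptop_mode sysctl, e.g. "sysctl -w vm.laptop_mode=5"; zero disables
it, any non-zero value enables it.)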

MM-folk - is this analysis correct? If so, why is
shrink_active_list() calling try_to_release_page() on dirty pages?
Is this just an oversight or is there some problem that this is
trying to work around? It seems trivial to fix to me (add a
!PageDirty check), but I don't know why the check is there in the
first place...
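
For illustration only, the trivial fix I have in mind would slot into
the hunk quoted above roughly like this (an untested sketch, locking
caveats aside):

	if (unlikely(buffer_heads_over_limit)) {
		if (page_has_private(page) && trylock_page(page)) {
			/* hypothetical extra check: leave dirty pages alone */
			if (!PageDirty(page) && page_has_private(page))
				try_to_release_page(page, 0);
			unlock_page(page);
		}
	}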

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-30 22:36                   ` shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage) Dave Chinner
@ 2016-05-31  1:07                     ` Minchan Kim
  2016-05-31  2:55                       ` Dave Chinner
  2016-05-31  9:50                       ` Jan Kara
  0 siblings, 2 replies; 14+ messages in thread
From: Minchan Kim @ 2016-05-31  1:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Stefan Priebe - Profihost AG, Brian Foster, xfs, linux-mm, linux-kernel

On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
> [adding lkml and linux-mm to the cc list]
> 
> On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
> > Hi Dave,
> >   Hi Brian,
> > 
> > below are the results with a vanilla 4.4.11 kernel.
> 
> Thanks for persisting with the testing, Stefan.
> 
> ....
> 
> > i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
> > fresh reboot it has happened again on the root FS for a debian apt file:
> > 
> > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
> > xfs_vm_releasepage+0x10f/0x140()
> > Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
> > iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
> > sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
> > ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
> > raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
> > ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
> > i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
> > CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
> > Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
> >  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
> >  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
> >  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
> > Call Trace:
> >  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
> >  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
> >  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
> >  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
> >  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
> >  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
> >  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
> >  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
> >  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
> >  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
> >  [<ffffffffa2168539>] kswapd+0x4f9/0x970
> >  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
> >  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
> >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> >  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
> >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > ---[ end trace c9d679f8ed4d7610 ]---
> > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
> > 0x12b990
> > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
> .....
> 
> Ok, I suspect this may be a VM bug. I've been looking at the 4.6
> code (so please try to reproduce on that kernel!) but it looks to me
> like the only way we can get from shrink_active_list() direct to
> try_to_release_page() is if we are over the maximum bufferhead
> threshold (i.e buffer_heads_over_limit = true) and we are trying to
> reclaim pages direct from the active list.
> 
> Because we are called from kswapd()->balance_pgdat(), we have:
> 
>         struct scan_control sc = {
>                 .gfp_mask = GFP_KERNEL,
>                 .order = order,
>                 .priority = DEF_PRIORITY,
>                 .may_writepage = !laptop_mode,
>                 .may_unmap = 1,
>                 .may_swap = 1,
>         };
> 
> The key point here is reclaim is being run with .may_writepage =
> true for default configuration kernels. when we get to
> shrink_active_list():
> 
> 	if (!sc->may_writepage)
> 		isolate_mode |= ISOLATE_CLEAN;
> 
> But sc->may_writepage = true and this allows isolate_lru_pages() to
> isolate dirty pages from the active list. Normally this isn't a
> problem, because the isolated active list pages are rotated to the
> inactive list, and nothing else happens to them. *Except when
> buffer_heads_over_limit = true*. This special condition would
> explain why I have never seen apt/dpkg cause this problem on any of
> my (many) Debian systems that all use XFS....
> 
> In that case, shrink_active_list() runs:
> 
> 	if (unlikely(buffer_heads_over_limit)) {
> 		if (page_has_private(page) && trylock_page(page)) {
> 			if (page_has_private(page))
> 				try_to_release_page(page, 0);
> 			unlock_page(page);
> 		}
> 	}
> 
> i.e. it locks the page, and if it has buffer heads it trys to get
> the bufferheads freed from the page.
> 
> But this is a dirty page, which means it may have delalloc or
> unwritten state on it's buffers, both of which indicate that there
> is dirty data in teh page that hasn't been written. XFS issues a
> warning on this because neither shrink_active_list nor
> try_to_release_page() check for whether the page is dirty or not.
> 
> Hence it seems to me that shrink_active_list() is calling
> try_to_release_page() inappropriately, and XFS is just the
> messenger. If you turn laptop mode on, it is likely the problem will
> go away as kswapd will run with .may_writepage = false, but that
> will also cause other behavioural changes relating to writeback and
> memory reclaim. It might be worth trying as a workaround for now.
> 
> MM-folk - is this analysis correct? If so, why is
> shrink_active_list() calling try_to_release_page() on dirty pages?
> Is this just an oversight or is there some problem that this is
> trying to work around? It seems trivial to fix to me (add a
> !PageDirty check), but I don't know why the check is there in the
> first place...

It seems to be the latter.
The commit below seems to be related.
[ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]

At that time, even shrink_page_list() worked like this:

shrink_page_list
        while (!list_empty(page_list)) {
                ..
                ..
                if (PageDirty(page)) {
                        ..
                }

                /*
                 * If the page has buffers, try to free the buffer mappings
                 * associated with this page. If we succeed we try to free
                 * the page as well.
                 *
                 * We do this even if the page is PageDirty().
                 * try_to_release_page() does not perform I/O, but it is
                 * possible for a page to have PageDirty set, but it is actually
                 * clean (all its buffers are clean).  This happens if the
                 * buffers were written out directly, with submit_bh(). ext3
                 * will do this, as well as the blockdev mapping. 
                 * try_to_release_page() will discover that cleanness and will
                 * drop the buffers and mark the page clean - it can be freed.
                 * ..
                 */
                if (PagePrivate(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask))
                                goto activate_locked;
                        if (!mapping && page_count(page) == 1)
                                goto free_it;
                }
                ..
        }

I wonder whether it's still valid or not on ext4.


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  1:07                     ` Minchan Kim
@ 2016-05-31  2:55                       ` Dave Chinner
  2016-05-31  3:59                         ` Minchan Kim
  2016-05-31  9:50                       ` Jan Kara
  1 sibling, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-05-31  2:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Stefan Priebe - Profihost AG, Brian Foster, xfs, linux-mm, linux-kernel

On Tue, May 31, 2016 at 10:07:24AM +0900, Minchan Kim wrote:
> On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
> > [adding lkml and linux-mm to the cc list]
> > 
> > On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
> > > Hi Dave,
> > >   Hi Brian,
> > > 
> > > below are the results with a vanilla 4.4.11 kernel.
> > 
> > Thanks for persisting with the testing, Stefan.
> > 
> > ....
> > 
> > > i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
> > > fresh reboot it has happened again on the root FS for a debian apt file:
> > > 
> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
> > > xfs_vm_releasepage+0x10f/0x140()
> > > Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
> > > iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
> > > sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
> > > ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
> > > raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
> > > ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
> > > i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
> > > CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
> > > Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
> > >  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
> > >  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
> > >  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
> > > Call Trace:
> > >  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
> > >  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
> > >  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
> > >  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
> > >  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
> > >  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
> > >  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
> > >  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
> > >  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
> > >  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
> > >  [<ffffffffa2168539>] kswapd+0x4f9/0x970
> > >  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
> > >  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
> > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > >  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
> > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > > ---[ end trace c9d679f8ed4d7610 ]---
> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
> > > 0x12b990
> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
> > .....
> > 
> > Ok, I suspect this may be a VM bug. I've been looking at the 4.6
> > code (so please try to reproduce on that kernel!) but it looks to me
> > like the only way we can get from shrink_active_list() direct to
> > try_to_release_page() is if we are over the maximum bufferhead
> > threshold (i.e buffer_heads_over_limit = true) and we are trying to
> > reclaim pages direct from the active list.
> > 
> > Because we are called from kswapd()->balance_pgdat(), we have:
> > 
> >         struct scan_control sc = {
> >                 .gfp_mask = GFP_KERNEL,
> >                 .order = order,
> >                 .priority = DEF_PRIORITY,
> >                 .may_writepage = !laptop_mode,
> >                 .may_unmap = 1,
> >                 .may_swap = 1,
> >         };
> > 
> > The key point here is reclaim is being run with .may_writepage =
> > true for default configuration kernels. when we get to
> > shrink_active_list():
> > 
> > 	if (!sc->may_writepage)
> > 		isolate_mode |= ISOLATE_CLEAN;
> > 
> > But sc->may_writepage = true and this allows isolate_lru_pages() to
> > isolate dirty pages from the active list. Normally this isn't a
> > problem, because the isolated active list pages are rotated to the
> > inactive list, and nothing else happens to them. *Except when
> > buffer_heads_over_limit = true*. This special condition would
> > explain why I have never seen apt/dpkg cause this problem on any of
> > my (many) Debian systems that all use XFS....
> > 
> > In that case, shrink_active_list() runs:
> > 
> > 	if (unlikely(buffer_heads_over_limit)) {
> > 		if (page_has_private(page) && trylock_page(page)) {
> > 			if (page_has_private(page))
> > 				try_to_release_page(page, 0);
> > 			unlock_page(page);
> > 		}
> > 	}
> > 
> > i.e. it locks the page, and if it has buffer heads it trys to get
> > the bufferheads freed from the page.
> > 
> > But this is a dirty page, which means it may have delalloc or
> > unwritten state on it's buffers, both of which indicate that there
> > is dirty data in teh page that hasn't been written. XFS issues a
> > warning on this because neither shrink_active_list nor
> > try_to_release_page() check for whether the page is dirty or not.
> > 
> > Hence it seems to me that shrink_active_list() is calling
> > try_to_release_page() inappropriately, and XFS is just the
> > messenger. If you turn laptop mode on, it is likely the problem will
> > go away as kswapd will run with .may_writepage = false, but that
> > will also cause other behavioural changes relating to writeback and
> > memory reclaim. It might be worth trying as a workaround for now.
> > 
> > MM-folk - is this analysis correct? If so, why is
> > shrink_active_list() calling try_to_release_page() on dirty pages?
> > Is this just an oversight or is there some problem that this is
> > trying to work around? It seems trivial to fix to me (add a
> > !PageDirty check), but I don't know why the check is there in the
> > first place...
> 
> It seems to be latter.
> Below commit seems to be related.
> [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]

Okay, that's been there a long, long time (2007), and it covers a
case where the filesystem cleans pages without the VM knowing about
it (i.e. it marks bufferheads clean without clearing the PageDirty
state).

That does not explain the code in shrink_active_list().

> At that time, even shrink_page_list works like this.

The current code in shrink_page_list still works this way - the
PageDirty code will *jump over the PagePrivate case* if the page is
to remain dirty or pageout() fails to make it clean.  Hence it never
gets to try_to_release_page() on a dirty page.
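
Roughly paraphrased (simplified, not the verbatim kernel code), the
flow is:

	if (PageDirty(page)) {
		if (!sc->may_writepage)
			goto keep_locked;	/* page stays dirty */
		switch (pageout(page, mapping, sc)) {
		case PAGE_KEEP:
			goto keep_locked;	/* still dirty */
		case PAGE_ACTIVATE:
			goto activate_locked;	/* still dirty */
		case PAGE_SUCCESS:
			break;			/* I/O issued; only a now-clean page continues */
		}
	}
	/* a page that is still dirty never falls through to this point */
	if (page_has_private(page)) {
		if (!try_to_release_page(page, sc->gfp_mask))
			goto activate_locked;
		..
	}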

Seems like this really needs a dirty check in shrink_active_list()
and to leave the stripping of bufferheads from dirty pages in the
ext3 corner case to shrink_inactive_list() once the dirty pages have
been rotated off the active list...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  2:55                       ` Dave Chinner
@ 2016-05-31  3:59                         ` Minchan Kim
  2016-05-31  6:07                           ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Minchan Kim @ 2016-05-31  3:59 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Stefan Priebe - Profihost AG, Brian Foster, xfs, linux-mm, linux-kernel

On Tue, May 31, 2016 at 12:55:09PM +1000, Dave Chinner wrote:
> On Tue, May 31, 2016 at 10:07:24AM +0900, Minchan Kim wrote:
> > On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
> > > [adding lkml and linux-mm to the cc list]
> > > 
> > > On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
> > > > Hi Dave,
> > > >   Hi Brian,
> > > > 
> > > > below are the results with a vanilla 4.4.11 kernel.
> > > 
> > > Thanks for persisting with the testing, Stefan.
> > > 
> > > ....
> > > 
> > > > i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
> > > > fresh reboot it has happened again on the root FS for a debian apt file:
> > > > 
> > > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
> > > > xfs_vm_releasepage+0x10f/0x140()
> > > > Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
> > > > iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
> > > > sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
> > > > ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
> > > > raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
> > > > ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
> > > > i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
> > > > CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
> > > > Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
> > > >  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
> > > >  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
> > > >  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
> > > > Call Trace:
> > > >  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
> > > >  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
> > > >  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
> > > >  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
> > > >  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
> > > >  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
> > > >  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
> > > >  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
> > > >  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
> > > >  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
> > > >  [<ffffffffa2168539>] kswapd+0x4f9/0x970
> > > >  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
> > > >  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
> > > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > > >  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
> > > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > > > ---[ end trace c9d679f8ed4d7610 ]---
> > > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
> > > > 0x12b990
> > > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
> > > .....
> > > 
> > > Ok, I suspect this may be a VM bug. I've been looking at the 4.6
> > > code (so please try to reproduce on that kernel!) but it looks to me
> > > like the only way we can get from shrink_active_list() direct to
> > > try_to_release_page() is if we are over the maximum bufferhead
> > > threshold (i.e buffer_heads_over_limit = true) and we are trying to
> > > reclaim pages direct from the active list.
> > > 
> > > Because we are called from kswapd()->balance_pgdat(), we have:
> > > 
> > >         struct scan_control sc = {
> > >                 .gfp_mask = GFP_KERNEL,
> > >                 .order = order,
> > >                 .priority = DEF_PRIORITY,
> > >                 .may_writepage = !laptop_mode,
> > >                 .may_unmap = 1,
> > >                 .may_swap = 1,
> > >         };
> > > 
> > > The key point here is reclaim is being run with .may_writepage =
> > > true for default configuration kernels. when we get to
> > > shrink_active_list():
> > > 
> > > 	if (!sc->may_writepage)
> > > 		isolate_mode |= ISOLATE_CLEAN;
> > > 
> > > But sc->may_writepage = true and this allows isolate_lru_pages() to
> > > isolate dirty pages from the active list. Normally this isn't a
> > > problem, because the isolated active list pages are rotated to the
> > > inactive list, and nothing else happens to them. *Except when
> > > buffer_heads_over_limit = true*. This special condition would
> > > explain why I have never seen apt/dpkg cause this problem on any of
> > > my (many) Debian systems that all use XFS....
> > > 
> > > In that case, shrink_active_list() runs:
> > > 
> > > 	if (unlikely(buffer_heads_over_limit)) {
> > > 		if (page_has_private(page) && trylock_page(page)) {
> > > 			if (page_has_private(page))
> > > 				try_to_release_page(page, 0);
> > > 			unlock_page(page);
> > > 		}
> > > 	}
> > > 
> > > i.e. it locks the page, and if it has buffer heads it trys to get
> > > the bufferheads freed from the page.
> > > 
> > > But this is a dirty page, which means it may have delalloc or
> > > unwritten state on it's buffers, both of which indicate that there
> > > is dirty data in teh page that hasn't been written. XFS issues a
> > > warning on this because neither shrink_active_list nor
> > > try_to_release_page() check for whether the page is dirty or not.
> > > 
> > > Hence it seems to me that shrink_active_list() is calling
> > > try_to_release_page() inappropriately, and XFS is just the
> > > messenger. If you turn laptop mode on, it is likely the problem will
> > > go away as kswapd will run with .may_writepage = false, but that
> > > will also cause other behavioural changes relating to writeback and
> > > memory reclaim. It might be worth trying as a workaround for now.
> > > 
> > > MM-folk - is this analysis correct? If so, why is
> > > shrink_active_list() calling try_to_release_page() on dirty pages?
> > > Is this just an oversight or is there some problem that this is
> > > trying to work around? It seems trivial to fix to me (add a
> > > !PageDirty check), but I don't know why the check is there in the
> > > first place...
> > 
> > It seems to be latter.
> > Below commit seems to be related.
> > [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]
> 
> Okay, that's been there a long, long time (2007), and it covers a
> case where the filesystem cleans pages without the VM knowing about
> it (i.e. it marks bufferheads clean without clearing the PageDirty
> state).
> 
> That does not explain the code in shrink_active_list().

Yep, my point was that the patch removed the PageDirty check in
try_to_free_buffers().

If I read the description correctly, at that time we wanted to check
PageDirty in try_to_free_buffers() but couldn't, because of the ext3
corner case described above.

diff --git a/fs/buffer.c b/fs/buffer.c
index 3b116078b4c3..460f1c43238e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2834,7 +2834,7 @@ int try_to_free_buffers(struct page *page)
        int ret = 0;
 
        BUG_ON(!PageLocked(page));
-       if (PageDirty(page) || PageWriteback(page))
+       if (PageWriteback(page))
                return 0;

And I found a culprit.
e182d61263b7d5, [PATCH] buffer_head takedown for bighighmem machines

It introduced pagevec_strip(), which calls try_to_release_page() without
a PageDirty check, in refill_inactive_zone() (which is shrink_active_list()
now).

Quoting from the commit message:
"
    In refill_inactive(): if the number of buffer_heads is excessive then
    strip buffers from pages as they move onto the inactive list.  This
    change is useful for all filesystems.  This approach is good because
    pages which are being repeatedly overwritten will remain on the active
    list and will retain their buffers, whereas pages which are not being
    overwritten will be stripped.
"

> 
> > At that time, even shrink_page_list works like this.
> 
> The current code in shrink_page_list still works this way - the
> PageDirty code will *jump over the PagePrivate case* if the page is
> to remain dirty or pageout() fails to make it clean.  Hence it never
> gets to try_to_release_page() on a dirty page.
> 
> Seems like this really needs a dirty check in shrink_active_list()
> and to leave the stripping of bufferheads from dirty pages in the
> ext3 corner case to shrink_inactive_list() once the dirty pages have
> been rotated off the active list...

Another topic:
I don't know filesystems well at all, so I might be missing something.

IMHO, if we should prohibit passing dirty pages to fs->releasepage, isn't
it better to move the PageDirty warning check into try_to_release_page()
and clean up all filesystems' releasepage implementations?

diff --git a/mm/filemap.c b/mm/filemap.c
index 00ae878b2a38..7c8b375c3475 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2821,8 +2821,10 @@ int try_to_release_page(struct page *page, gfp_t gfp_mask)
        if (PageWriteback(page))
                return 0;
 
-       if (mapping && mapping->a_ops->releasepage)
+       if (mapping && mapping->a_ops->releasepage) {
+               WARN_ON(PageDirty(page));
                return mapping->a_ops->releasepage(page, gfp_mask);
+       }
        return try_to_free_buffers(page);
 }

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 9a8bbc1fb1fa..89b432a90f59 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -1795,10 +1795,6 @@ void f2fs_invalidate_page(struct page *page, unsigned int offset,
 
 int f2fs_release_page(struct page *page, gfp_t wait)
 {
-       /* If this is dirty page, keep PagePrivate */
-       if (PageDirty(page))
-               return 0;
-
        /* This is atomic written page, keep Private */
        if (IS_ATOMIC_WRITTEN_PAGE(page))
                return 0;

Otherwise, we can simply return 0 in try_to_release_page() if it finds a dirty page.
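
i.e. something along these lines (untested, just to illustrate the idea):

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ ... @@ int try_to_release_page(struct page *page, gfp_t gfp_mask)
        if (PageWriteback(page))
                return 0;
 
+       /* refuse to strip buffers from a page that is still dirty */
+       if (PageDirty(page))
+               return 0;
+
        if (mapping && mapping->a_ops->releasepage)
                return mapping->a_ops->releasepage(page, gfp_mask);
        return try_to_free_buffers(page);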


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  3:59                         ` Minchan Kim
@ 2016-05-31  6:07                           ` Dave Chinner
  2016-05-31  6:11                             ` Stefan Priebe - Profihost AG
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2016-05-31  6:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Stefan Priebe - Profihost AG, Brian Foster, xfs, linux-mm, linux-kernel

On Tue, May 31, 2016 at 12:59:04PM +0900, Minchan Kim wrote:
> On Tue, May 31, 2016 at 12:55:09PM +1000, Dave Chinner wrote:
> > On Tue, May 31, 2016 at 10:07:24AM +0900, Minchan Kim wrote:
> > > On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
> > > > But this is a dirty page, which means it may have delalloc or
> > > > unwritten state on it's buffers, both of which indicate that there
> > > > is dirty data in teh page that hasn't been written. XFS issues a
> > > > warning on this because neither shrink_active_list nor
> > > > try_to_release_page() check for whether the page is dirty or not.
> > > > 
> > > > Hence it seems to me that shrink_active_list() is calling
> > > > try_to_release_page() inappropriately, and XFS is just the
> > > > messenger. If you turn laptop mode on, it is likely the problem will
> > > > go away as kswapd will run with .may_writepage = false, but that
> > > > will also cause other behavioural changes relating to writeback and
> > > > memory reclaim. It might be worth trying as a workaround for now.
> > > > 
> > > > MM-folk - is this analysis correct? If so, why is
> > > > shrink_active_list() calling try_to_release_page() on dirty pages?
> > > > Is this just an oversight or is there some problem that this is
> > > > trying to work around? It seems trivial to fix to me (add a
> > > > !PageDirty check), but I don't know why the check is there in the
> > > > first place...
> > > 
> > > It seems to be latter.
> > > Below commit seems to be related.
> > > [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]
> > 
> > Okay, that's been there a long, long time (2007), and it covers a
> > case where the filesystem cleans pages without the VM knowing about
> > it (i.e. it marks bufferheads clean without clearing the PageDirty
> > state).
> > 
> > That does not explain the code in shrink_active_list().
> 
> Yeb, My point was the patch removed the PageDirty check in
> try_to_free_buffers.

*nod*

[...]

> And I found a culprit.
> e182d61263b7d5, [PATCH] buffer_head takedown for bighighmem machines

Heh. You have the combined historic tree sitting around for code
archeology, just like I do :)

> It introduced pagevec_strip wich calls try_to_release_page without
> PageDirty check in refill_inactive_zone which is shrink_active_list
> now.

<sigh>

It was merged 2 days before XFS was merged. Merging XFS made the
code Andrew wrote incorrect:

> Quote from
> "
>     In refill_inactive(): if the number of buffer_heads is excessive then
>     strip buffers from pages as they move onto the inactive list.  This
>     change is useful for all filesystems. [....]

Except for those that carry state necessary for writeback to be done
correctly on the dirty page bufferheads.  At the time, nobody working
on the mm/writeback code cared about delayed allocation. So we've
carried this behaviour for 14 years without realising that it's
probably the source of all the unexplainable warnings we've got from
XFS over all that time.

I'm half tempted at this point to mostly ignore this mm/ behaviour
because we are moving down the path of removing buffer heads from
XFS. That will require us to do different things in ->releasepage
and so just skipping dirty pages in the XFS code is the best thing
to do....
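
For illustration, that would amount to little more than an early
bail-out in xfs_vm_releasepage() (hypothetical, untested):

	/*
	 * Hypothetical: bail out of ->releasepage for dirty pages instead
	 * of warning about their delalloc/unwritten buffer state.
	 */
	if (PageDirty(page))
		return 0;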

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  6:07                           ` Dave Chinner
@ 2016-05-31  6:11                             ` Stefan Priebe - Profihost AG
  2016-05-31  7:31                               ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-05-31  6:11 UTC (permalink / raw)
  To: Dave Chinner, Minchan Kim; +Cc: Brian Foster, xfs, linux-mm, linux-kernel

Hi Dave,

On 31.05.2016 at 08:07, Dave Chinner wrote:
> On Tue, May 31, 2016 at 12:59:04PM +0900, Minchan Kim wrote:
>> On Tue, May 31, 2016 at 12:55:09PM +1000, Dave Chinner wrote:
>>> On Tue, May 31, 2016 at 10:07:24AM +0900, Minchan Kim wrote:
>>>> On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
>>>>> But this is a dirty page, which means it may have delalloc or
>>>>> unwritten state on it's buffers, both of which indicate that there
>>>>> is dirty data in teh page that hasn't been written. XFS issues a
>>>>> warning on this because neither shrink_active_list nor
>>>>> try_to_release_page() check for whether the page is dirty or not.
>>>>>
>>>>> Hence it seems to me that shrink_active_list() is calling
>>>>> try_to_release_page() inappropriately, and XFS is just the
>>>>> messenger. If you turn laptop mode on, it is likely the problem will
>>>>> go away as kswapd will run with .may_writepage = false, but that
>>>>> will also cause other behavioural changes relating to writeback and
>>>>> memory reclaim. It might be worth trying as a workaround for now.
>>>>>
>>>>> MM-folk - is this analysis correct? If so, why is
>>>>> shrink_active_list() calling try_to_release_page() on dirty pages?
>>>>> Is this just an oversight or is there some problem that this is
>>>>> trying to work around? It seems trivial to fix to me (add a
>>>>> !PageDirty check), but I don't know why the check is there in the
>>>>> first place...
>>>>
>>>> It seems to be latter.
>>>> Below commit seems to be related.
>>>> [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]
>>>
>>> Okay, that's been there a long, long time (2007), and it covers a
>>> case where the filesystem cleans pages without the VM knowing about
>>> it (i.e. it marks bufferheads clean without clearing the PageDirty
>>> state).
>>>
>>> That does not explain the code in shrink_active_list().
>>
>> Yeb, My point was the patch removed the PageDirty check in
>> try_to_free_buffers.
> 
> *nod*
> 
> [...]
> 
>> And I found a culprit.
>> e182d61263b7d5, [PATCH] buffer_head takedown for bighighmem machines
> 
> Heh. You have the combined historic tree sitting around for code
> archeology, just like I do :)
> 
>> It introduced pagevec_strip wich calls try_to_release_page without
>> PageDirty check in refill_inactive_zone which is shrink_active_list
>> now.
> 
> <sigh>
> 
> It was merged 2 days before XFS was merged. Merging XFS made the
> code Andrew wrote incorrect:
> 
>> Quote from
>> "
>>     In refill_inactive(): if the number of buffer_heads is excessive then
>>     strip buffers from pages as they move onto the inactive list.  This
>>     change is useful for all filesystems. [....]
> 
> Except for those that carry state necessary for writeback to be done
> correctly on the dirty page bufferheads.  At the time, nobody doing
> work the mm/writeback code cared about delayed allocation. So we've
> carried this behaviour for 14 years without realising that it's
> probably the source of all the unexplainable warnings we've got from
> XFS over all that time.
> 
> I'm half tempted at this point to mostly ignore this mm/ behavour
> because we are moving down the path of removing buffer heads from
> XFS. That will require us to do different things in ->releasepage
> and so just skipping dirty pages in the XFS code is the best thing
> to do....

Does this change anything I should test? Or is 4.6 still the way to go?

Greets,
Stefan


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  6:11                             ` Stefan Priebe - Profihost AG
@ 2016-05-31  7:31                               ` Dave Chinner
  2016-05-31  8:03                                 ` Stefan Priebe - Profihost AG
  2016-06-02 12:13                                 ` Stefan Priebe - Profihost AG
  0 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2016-05-31  7:31 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG
  Cc: Minchan Kim, Brian Foster, xfs, linux-mm, linux-kernel

On Tue, May 31, 2016 at 08:11:42AM +0200, Stefan Priebe - Profihost AG wrote:
> > I'm half tempted at this point to mostly ignore this mm/ behavour
> > because we are moving down the path of removing buffer heads from
> > XFS. That will require us to do different things in ->releasepage
> > and so just skipping dirty pages in the XFS code is the best thing
> > to do....
> 
> does this change anything i should test? Or is 4.6 still the way to go?

Doesn't matter now - the warning will still be there on 4.6. I think
you can simply ignore it as the XFS code appears to be handling the
dirty page that is being passed to it correctly. We'll work out what
needs to be done to get rid of the warning for this case, whether it
be a mm/ change or an XFS change.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  7:31                               ` Dave Chinner
@ 2016-05-31  8:03                                 ` Stefan Priebe - Profihost AG
  2016-06-02 12:13                                 ` Stefan Priebe - Profihost AG
  1 sibling, 0 replies; 14+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-05-31  8:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Minchan Kim, Brian Foster, xfs, linux-mm, linux-kernel

On 31.05.2016 at 09:31, Dave Chinner wrote:
> On Tue, May 31, 2016 at 08:11:42AM +0200, Stefan Priebe - Profihost AG wrote:
>>> I'm half tempted at this point to mostly ignore this mm/ behavour
>>> because we are moving down the path of removing buffer heads from
>>> XFS. That will require us to do different things in ->releasepage
>>> and so just skipping dirty pages in the XFS code is the best thing
>>> to do....
>>
>> does this change anything i should test? Or is 4.6 still the way to go?
> 
> Doesn't matter now - the warning will still be there on 4.6. I think
> you can simply ignore it as the XFS code appears to be handling the
> dirty page that is being passed to it correctly. We'll work out what
> needs to be done to get rid of the warning for this case, wether it
> be a mm/ change or an XFS change.

So is it OK to remove the WARN_ONCE in the kernel code, so that I don't
get alarms from our monitoring systems for the trace?

Stefan

> 
> Cheers,
> 
> Dave.
> 


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  1:07                     ` Minchan Kim
  2016-05-31  2:55                       ` Dave Chinner
@ 2016-05-31  9:50                       ` Jan Kara
  2016-06-01  1:38                         ` Minchan Kim
  2016-08-17 15:37                         ` Andreas Grünbacher
  1 sibling, 2 replies; 14+ messages in thread
From: Jan Kara @ 2016-05-31  9:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, linux-mm, Brian Foster, xfs, linux-kernel,
	Stefan Priebe - Profihost AG

On Tue 31-05-16 10:07:24, Minchan Kim wrote:
> On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
> > [adding lkml and linux-mm to the cc list]
> > 
> > On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
> > > Hi Dave,
> > >   Hi Brian,
> > > 
> > > below are the results with a vanilla 4.4.11 kernel.
> > 
> > Thanks for persisting with the testing, Stefan.
> > 
> > ....
> > 
> > > i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
> > > fresh reboot it has happened again on the root FS for a debian apt file:
> > > 
> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
> > > xfs_vm_releasepage+0x10f/0x140()
> > > Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
> > > iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
> > > sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
> > > ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
> > > raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
> > > ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
> > > i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
> > > CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
> > > Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
> > >  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
> > >  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
> > >  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
> > > Call Trace:
> > >  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
> > >  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
> > >  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
> > >  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
> > >  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
> > >  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
> > >  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
> > >  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
> > >  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
> > >  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
> > >  [<ffffffffa2168539>] kswapd+0x4f9/0x970
> > >  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
> > >  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
> > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > >  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
> > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > > ---[ end trace c9d679f8ed4d7610 ]---
> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
> > > 0x12b990
> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
> > .....
> > 
> > Ok, I suspect this may be a VM bug. I've been looking at the 4.6
> > code (so please try to reproduce on that kernel!) but it looks to me
> > like the only way we can get from shrink_active_list() direct to
> > try_to_release_page() is if we are over the maximum bufferhead
> > threshold (i.e buffer_heads_over_limit = true) and we are trying to
> > reclaim pages direct from the active list.
> > 
> > Because we are called from kswapd()->balance_pgdat(), we have:
> > 
> >         struct scan_control sc = {
> >                 .gfp_mask = GFP_KERNEL,
> >                 .order = order,
> >                 .priority = DEF_PRIORITY,
> >                 .may_writepage = !laptop_mode,
> >                 .may_unmap = 1,
> >                 .may_swap = 1,
> >         };
> > 
> > The key point here is reclaim is being run with .may_writepage =
> > true for default configuration kernels. when we get to
> > shrink_active_list():
> > 
> > 	if (!sc->may_writepage)
> > 		isolate_mode |= ISOLATE_CLEAN;
> > 
> > But sc->may_writepage = true and this allows isolate_lru_pages() to
> > isolate dirty pages from the active list. Normally this isn't a
> > problem, because the isolated active list pages are rotated to the
> > inactive list, and nothing else happens to them. *Except when
> > buffer_heads_over_limit = true*. This special condition would
> > explain why I have never seen apt/dpkg cause this problem on any of
> > my (many) Debian systems that all use XFS....
> > 
> > In that case, shrink_active_list() runs:
> > 
> > 	if (unlikely(buffer_heads_over_limit)) {
> > 		if (page_has_private(page) && trylock_page(page)) {
> > 			if (page_has_private(page))
> > 				try_to_release_page(page, 0);
> > 			unlock_page(page);
> > 		}
> > 	}
> > 
> > i.e. it locks the page, and if it has buffer heads it trys to get
> > the bufferheads freed from the page.
> > 
> > But this is a dirty page, which means it may have delalloc or
> > unwritten state on it's buffers, both of which indicate that there
> > is dirty data in teh page that hasn't been written. XFS issues a
> > warning on this because neither shrink_active_list nor
> > try_to_release_page() check for whether the page is dirty or not.
> > 
> > Hence it seems to me that shrink_active_list() is calling
> > try_to_release_page() inappropriately, and XFS is just the
> > messenger. If you turn laptop mode on, it is likely the problem will
> > go away as kswapd will run with .may_writepage = false, but that
> > will also cause other behavioural changes relating to writeback and
> > memory reclaim. It might be worth trying as a workaround for now.
> > 
> > MM-folk - is this analysis correct? If so, why is
> > shrink_active_list() calling try_to_release_page() on dirty pages?
> > Is this just an oversight or is there some problem that this is
> > trying to work around? It seems trivial to fix to me (add a
> > !PageDirty check), but I don't know why the check is there in the
> > first place...
> 
> It seems to be latter.
> Below commit seems to be related.
> [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]
> 
> At that time, even shrink_page_list works like this.
> 
> shrink_page_list
>         while (!list_empty(page_list)) {
>                 ..
>                 ..
>                 if (PageDirty(page)) {
>                         ..
>                 }
> 
>                 /*
>                  * If the page has buffers, try to free the buffer mappings
>                  * associated with this page. If we succeed we try to free
>                  * the page as well.
>                  *
>                  * We do this even if the page is PageDirty().
>                  * try_to_release_page() does not perform I/O, but it is
>                  * possible for a page to have PageDirty set, but it is actually
>                  * clean (all its buffers are clean).  This happens if the
>                  * buffers were written out directly, with submit_bh(). ext3
>                  * will do this, as well as the blockdev mapping. 
>                  * try_to_release_page() will discover that cleanness and will
>                  * drop the buffers and mark the page clean - it can be freed.
>                  * ..
>                  */
>                 if (PagePrivate(page)) {
>                         if (!try_to_release_page(page, sc->gfp_mask))
>                                 goto activate_locked;
>                         if (!mapping && page_count(page) == 1)
>                                 goto free_it;
>                 }
>                 ..
>         }
> 
> I wonder whether it's valid or not with on ext4.

Actually, we already discussed this about a year ago:
http://oss.sgi.com/archives/xfs/2015-06/msg00119.html

And it was the final straw that made me remove ext3 from the tree. ext4 can
also clean dirty buffers while keeping pages dirty but it is limited only
to metadata (and data in data=journal mode) so the scope of the problem is
much smaller. So just avoiding calling ->releasepage for dirty pages may
work fine these days.

Also it is possible to change ext4 checkpointing code to completely avoid
doing this but I never got to rewriting that code. Probably I should give
it higher priority on my todo list...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  9:50                       ` Jan Kara
@ 2016-06-01  1:38                         ` Minchan Kim
  2016-08-17 15:37                         ` Andreas Grünbacher
  1 sibling, 0 replies; 14+ messages in thread
From: Minchan Kim @ 2016-06-01  1:38 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, linux-mm, Brian Foster, xfs, linux-kernel,
	Stefan Priebe - Profihost AG

On Tue, May 31, 2016 at 11:50:31AM +0200, Jan Kara wrote:
> On Tue 31-05-16 10:07:24, Minchan Kim wrote:
> > On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
> > > [adding lkml and linux-mm to the cc list]
> > > 
> > > On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
> > > > Hi Dave,
> > > >   Hi Brian,
> > > > 
> > > > below are the results with a vanilla 4.4.11 kernel.
> > > 
> > > Thanks for persisting with the testing, Stefan.
> > > 
> > > ....
> > > 
> > > > i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
> > > > fresh reboot it has happened again on the root FS for a debian apt file:
> > > > 
> > > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
> > > > xfs_vm_releasepage+0x10f/0x140()
> > > > Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
> > > > iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
> > > > sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
> > > > ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
> > > > raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
> > > > ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
> > > > i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
> > > > CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
> > > > Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
> > > >  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
> > > >  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
> > > >  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
> > > > Call Trace:
> > > >  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
> > > >  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
> > > >  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
> > > >  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
> > > >  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
> > > >  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
> > > >  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
> > > >  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
> > > >  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
> > > >  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
> > > >  [<ffffffffa2168539>] kswapd+0x4f9/0x970
> > > >  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
> > > >  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
> > > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > > >  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
> > > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
> > > > ---[ end trace c9d679f8ed4d7610 ]---
> > > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
> > > > 0x12b990
> > > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
> > > .....
> > > 
> > > Ok, I suspect this may be a VM bug. I've been looking at the 4.6
> > > code (so please try to reproduce on that kernel!) but it looks to me
> > > like the only way we can get from shrink_active_list() direct to
> > > try_to_release_page() is if we are over the maximum bufferhead
> > > threshold (i.e buffer_heads_over_limit = true) and we are trying to
> > > reclaim pages direct from the active list.
> > > 
> > > Because we are called from kswapd()->balance_pgdat(), we have:
> > > 
> > >         struct scan_control sc = {
> > >                 .gfp_mask = GFP_KERNEL,
> > >                 .order = order,
> > >                 .priority = DEF_PRIORITY,
> > >                 .may_writepage = !laptop_mode,
> > >                 .may_unmap = 1,
> > >                 .may_swap = 1,
> > >         };
> > > 
> > > The key point here is reclaim is being run with .may_writepage =
> > > true for default configuration kernels. when we get to
> > > shrink_active_list():
> > > 
> > > 	if (!sc->may_writepage)
> > > 		isolate_mode |= ISOLATE_CLEAN;
> > > 
> > > But sc->may_writepage = true and this allows isolate_lru_pages() to
> > > isolate dirty pages from the active list. Normally this isn't a
> > > problem, because the isolated active list pages are rotated to the
> > > inactive list, and nothing else happens to them. *Except when
> > > buffer_heads_over_limit = true*. This special condition would
> > > explain why I have never seen apt/dpkg cause this problem on any of
> > > my (many) Debian systems that all use XFS....
> > > 
> > > In that case, shrink_active_list() runs:
> > > 
> > > 	if (unlikely(buffer_heads_over_limit)) {
> > > 		if (page_has_private(page) && trylock_page(page)) {
> > > 			if (page_has_private(page))
> > > 				try_to_release_page(page, 0);
> > > 			unlock_page(page);
> > > 		}
> > > 	}
> > > 
> > > i.e. it locks the page, and if it has buffer heads it trys to get
> > > the bufferheads freed from the page.
> > > 
> > > But this is a dirty page, which means it may have delalloc or
> > > unwritten state on its buffers, both of which indicate that there
> > > is dirty data in the page that hasn't been written. XFS issues a
> > > warning on this because neither shrink_active_list nor
> > > try_to_release_page() checks for whether the page is dirty or not.
> > > 
> > > Hence it seems to me that shrink_active_list() is calling
> > > try_to_release_page() inappropriately, and XFS is just the
> > > messenger. If you turn laptop mode on, it is likely the problem will
> > > go away as kswapd will run with .may_writepage = false, but that
> > > will also cause other behavioural changes relating to writeback and
> > > memory reclaim. It might be worth trying as a workaround for now.
> > > 
> > > MM-folk - is this analysis correct? If so, why is
> > > shrink_active_list() calling try_to_release_page() on dirty pages?
> > > Is this just an oversight or is there some problem that this is
> > > trying to work around? It seems trivial to fix to me (add a
> > > !PageDirty check), but I don't know why the check is there in the
> > > first place...
> > 
> > It seems to be the latter.
> > Below commit seems to be related.
> > [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]
> > 
> > At that time, even shrink_page_list worked like this.
> > 
> > shrink_page_list
> >         while (!list_empty(page_list)) {
> >                 ..
> >                 ..
> >                 if (PageDirty(page)) {
> >                         ..
> >                 }
> > 
> >                 /*
> >                  * If the page has buffers, try to free the buffer mappings
> >                  * associated with this page. If we succeed we try to free
> >                  * the page as well.
> >                  *
> >                  * We do this even if the page is PageDirty().
> >                  * try_to_release_page() does not perform I/O, but it is
> >                  * possible for a page to have PageDirty set, but it is actually
> >                  * clean (all its buffers are clean).  This happens if the
> >                  * buffers were written out directly, with submit_bh(). ext3
> >                  * will do this, as well as the blockdev mapping. 
> >                  * try_to_release_page() will discover that cleanness and will
> >                  * drop the buffers and mark the page clean - it can be freed.
> >                  * ..
> >                  */
> >                 if (PagePrivate(page)) {
> >                         if (!try_to_release_page(page, sc->gfp_mask))
> >                                 goto activate_locked;
> >                         if (!mapping && page_count(page) == 1)
> >                                 goto free_it;
> >                 }
> >                 ..
> >         }
> > 
> > I wonder whether it's valid or not on ext4.
> 
> Actually, we've already discussed this about a year ago:
> http://oss.sgi.com/archives/xfs/2015-06/msg00119.html
> 
> And it was the last straw that made me remove ext3 from the tree. ext4 can
> also clean dirty buffers while keeping pages dirty but it is limited only
> to metadata (and data in data=journal mode) so the scope of the problem is
> much smaller. So just avoiding calling ->releasepage for dirty pages may
> work fine these days.
> 
> Also it is possible to change ext4 checkpointing code to completely avoid
> doing this but I never got to rewriting that code. Probably I should give
> it higher priority on my todo list...

Hah, you already noticed. Thanks for the information.

At first glance it seems we could fix it in mm/ by checking PageDirty,
but without fully understanding the filesystem internals that might be
risky for out-of-tree filesystems: block_invalidatepage users can end up
with clean buffers on a page that is still marked dirty, even though
there is no such user in mainline right now. So I will leave the fix to
the FS guys.
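
(Just for illustration, the mm/ side change being discussed would look
roughly like the below against the 4.4-era shrink_active_list() -- an
untested sketch, not a patch I am proposing:)

	if (unlikely(buffer_heads_over_limit)) {
		if (page_has_private(page) && trylock_page(page)) {
			/*
			 * Only strip buffer heads from pages that are not
			 * dirty; a dirty page may still carry delalloc or
			 * unwritten buffers that have to be written back
			 * before they can be released.
			 */
			if (page_has_private(page) && !PageDirty(page))
				try_to_release_page(page, 0);
			unlock_page(page);
		}
	}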

Thanks.

> 
> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  7:31                               ` Dave Chinner
  2016-05-31  8:03                                 ` Stefan Priebe - Profihost AG
@ 2016-06-02 12:13                                 ` Stefan Priebe - Profihost AG
  2016-06-02 12:44                                   ` Holger Hoffstätte
  1 sibling, 1 reply; 14+ messages in thread
From: Stefan Priebe - Profihost AG @ 2016-06-02 12:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Minchan Kim, Brian Foster, xfs, linux-mm, linux-kernel


On 31.05.2016 at 09:31, Dave Chinner wrote:
> On Tue, May 31, 2016 at 08:11:42AM +0200, Stefan Priebe - Profihost AG wrote:
>>> I'm half tempted at this point to mostly ignore this mm/ behaviour
>>> because we are moving down the path of removing buffer heads from
>>> XFS. That will require us to do different things in ->releasepage
>>> and so just skipping dirty pages in the XFS code is the best thing
>>> to do....
>>
>> does this change anything I should test? Or is 4.6 still the way to go?
> 
> Doesn't matter now - the warning will still be there on 4.6. I think
> you can simply ignore it as the XFS code appears to be handling the
> dirty page that is being passed to it correctly. We'll work out what
>> needs to be done to get rid of the warning for this case, whether it
> be a mm/ change or an XFS change.

Any idea what I could do with 4.4.X? Can I safely remove the WARN_ONCE
statement?

Stefan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-06-02 12:13                                 ` Stefan Priebe - Profihost AG
@ 2016-06-02 12:44                                   ` Holger Hoffstätte
  2016-06-02 23:08                                     ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Holger Hoffstätte @ 2016-06-02 12:44 UTC (permalink / raw)
  To: Stefan Priebe - Profihost AG, Dave Chinner
  Cc: linux-mm, Minchan Kim, Brian Foster, linux-kernel, xfs

On 06/02/16 14:13, Stefan Priebe - Profihost AG wrote:
> 
> On 31.05.2016 at 09:31, Dave Chinner wrote:
>> On Tue, May 31, 2016 at 08:11:42AM +0200, Stefan Priebe - Profihost AG wrote:
>>>> I'm half tempted at this point to mostly ignore this mm/ behaviour
>>>> because we are moving down the path of removing buffer heads from
>>>> XFS. That will require us to do different things in ->releasepage
>>>> and so just skipping dirty pages in the XFS code is the best thing
>>>> to do....
>>>
>>> does this change anything I should test? Or is 4.6 still the way to go?
>>
>> Doesn't matter now - the warning will still be there on 4.6. I think
>> you can simply ignore it as the XFS code appears to be handling the
>> dirty page that is being passed to it correctly. We'll work out what
>> needs to be done to get rid of the warning for this case, whether it
>> be a mm/ change or an XFS change.
> 
> Any idea what i could do with 4.4.X? Can i safely remove the WARN_ONCE
> statement?

By definition it won't break anything since it's just a heads-up message,
so yes, it should be "safe". However, if my understanding of the situation
is correct, mainline commit f0281a00fe "mm: workingset: only do workingset
activations on reads" (+ friends) in 4.7 should effectively prevent this
from happening. Can someone confirm or deny this?
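
(For reference, the warning Stefan is asking about comes from the
WARN_ON_ONCE() checks in xfs_vm_releasepage() in fs/xfs/xfs_aops.c. From
memory, the 4.4 code looks roughly like this -- treat it as a sketch
rather than the literal source:)

	STATIC int
	xfs_vm_releasepage(
		struct page		*page,
		gfp_t			gfp_mask)
	{
		int			delalloc, unwritten;

		trace_xfs_releasepage(page->mapping->host, page, 0, 0);

		xfs_count_page_state(page, &delalloc, &unwritten);

		/* These are the checks that fire the stack trace above. */
		if (WARN_ON_ONCE(delalloc))
			return 0;
		if (WARN_ON_ONCE(unwritten))
			return 0;

		return try_to_free_buffers(page);
	}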

-h

PS: Stefan: I backported that commit (and friends) to my 4.4.x patch queue,
so if you want to try that for today's 4.4.12 the warning should be gone.
No guarantees though :)

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-06-02 12:44                                   ` Holger Hoffstätte
@ 2016-06-02 23:08                                     ` Dave Chinner
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Chinner @ 2016-06-02 23:08 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Stefan Priebe - Profihost AG, linux-mm, Minchan Kim,
	Brian Foster, linux-kernel, xfs

On Thu, Jun 02, 2016 at 02:44:30PM +0200, Holger Hoffstätte wrote:
> On 06/02/16 14:13, Stefan Priebe - Profihost AG wrote:
> > 
> > On 31.05.2016 at 09:31, Dave Chinner wrote:
> >> On Tue, May 31, 2016 at 08:11:42AM +0200, Stefan Priebe - Profihost AG wrote:
> >>>> I'm half tempted at this point to mostly ignore this mm/ behaviour
> >>>> because we are moving down the path of removing buffer heads from
> >>>> XFS. That will require us to do different things in ->releasepage
> >>>> and so just skipping dirty pages in the XFS code is the best thing
> >>>> to do....
> >>>
> >>> does this change anything I should test? Or is 4.6 still the way to go?
> >>
> >> Doesn't matter now - the warning will still be there on 4.6. I think
> >> you can simply ignore it as the XFS code appears to be handling the
> >> dirty page that is being passed to it correctly. We'll work out what
> >> needs to be done to get rid of the warning for this case, whether it
> >> be a mm/ change or an XFS change.
> > 
> > Any idea what I could do with 4.4.X? Can I safely remove the WARN_ONCE
> > statement?
> 
> By definition it won't break anything since it's just a heads-up message,
> so yes, it should be "safe". However, if my understanding of the situation
> is correct, mainline commit f0281a00fe "mm: workingset: only do workingset
> activations on reads" (+ friends) in 4.7 should effectively prevent this
> from happening. Can someone confirm or deny this?

I don't think it will.  The above commits will avoid putting
/write-only/ dirty pages on the active list from the write() syscall
vector, but it won't prevent pages that are read first then dirtied
from ending up on the active list. e.g. a mmap write will first read
the page from disk to populate the page (hence it ends up on the
active list), then the page gets dirtied and ->page_mkwrite is
called to tell the filesystem....
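
(A trivial userspace sketch of that read-then-dirty pattern, just to make
the scenario concrete -- a hypothetical example, not a reproducer anyone
has run here; error checking omitted for brevity:)

	#include <fcntl.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("testfile", O_RDWR);
		struct stat st;
		fstat(fd, &st);

		volatile char *p = mmap(NULL, st.st_size,
					PROT_READ | PROT_WRITE,
					MAP_SHARED, fd, 0);

		/* Read fault: the page is read from disk and put on the LRU. */
		char c = p[0];

		/* Write fault: the page is dirtied and ->page_mkwrite runs. */
		p[0] = c + 1;

		munmap((void *)p, st.st_size);
		close(fd);
		return 0;
	}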

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage)
  2016-05-31  9:50                       ` Jan Kara
  2016-06-01  1:38                         ` Minchan Kim
@ 2016-08-17 15:37                         ` Andreas Grünbacher
  1 sibling, 0 replies; 14+ messages in thread
From: Andreas Grünbacher @ 2016-08-17 15:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Minchan Kim, Stefan Priebe - Profihost AG, Brian Foster,
	Linux Kernel Mailing List, xfs, linux-mm, Lukas Czerner,
	Steven Whitehouse

Hi Jan,

2016-05-31 11:50 GMT+02:00 Jan Kara <jack@suse.cz>:
> On Tue 31-05-16 10:07:24, Minchan Kim wrote:
>> On Tue, May 31, 2016 at 08:36:57AM +1000, Dave Chinner wrote:
>> > [adding lkml and linux-mm to the cc list]
>> >
>> > On Mon, May 30, 2016 at 09:23:48AM +0200, Stefan Priebe - Profihost AG wrote:
>> > > Hi Dave,
>> > >   Hi Brian,
>> > >
>> > > below are the results with a vanilla 4.4.11 kernel.
>> >
>> > Thanks for persisting with the testing, Stefan.
>> >
>> > ....
>> >
>> > > i've now used a vanilla 4.4.11 Kernel and the issue remains. After a
>> > > fresh reboot it has happened again on the root FS for a debian apt file:
>> > >
>> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x0 size 0x12b990
>> > > ------------[ cut here ]------------
>> > > WARNING: CPU: 1 PID: 111 at fs/xfs/xfs_aops.c:1239
>> > > xfs_vm_releasepage+0x10f/0x140()
>> > > Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport
>> > > iptable_filter ip_tables x_tables bonding coretemp 8021q garp fuse
>> > > sb_edac edac_core i2c_i801 i40e(O) xhci_pci xhci_hcd shpchp vxlan
>> > > ip6_udp_tunnel udp_tunnel ipmi_si ipmi_msghandler button btrfs xor
>> > > raid6_pq dm_mod raid1 md_mod usbhid usb_storage ohci_hcd sg sd_mod
>> > > ehci_pci ehci_hcd usbcore usb_common igb ahci i2c_algo_bit libahci
>> > > i2c_core mpt3sas ptp pps_core raid_class scsi_transport_sas
>> > > CPU: 1 PID: 111 Comm: kswapd0 Tainted: G           O    4.4.11 #1
>> > > Hardware name: Supermicro Super Server/X10SRH-CF, BIOS 1.0b 05/18/2015
>> > >  0000000000000000 ffff880c4dacfa88 ffffffffa23c5b8f 0000000000000000
>> > >  ffffffffa2a51ab4 ffff880c4dacfac8 ffffffffa20837a7 ffff880c4dacfae8
>> > >  0000000000000001 ffffea00010c3640 ffff8802176b49d0 ffffea00010c3660
>> > > Call Trace:
>> > >  [<ffffffffa23c5b8f>] dump_stack+0x63/0x84
>> > >  [<ffffffffa20837a7>] warn_slowpath_common+0x97/0xe0
>> > >  [<ffffffffa208380a>] warn_slowpath_null+0x1a/0x20
>> > >  [<ffffffffa2326caf>] xfs_vm_releasepage+0x10f/0x140
>> > >  [<ffffffffa218c680>] ? page_mkclean_one+0xd0/0xd0
>> > >  [<ffffffffa218d3a0>] ? anon_vma_prepare+0x150/0x150
>> > >  [<ffffffffa21521c2>] try_to_release_page+0x32/0x50
>> > >  [<ffffffffa2166b2e>] shrink_active_list+0x3ce/0x3e0
>> > >  [<ffffffffa21671c7>] shrink_lruvec+0x687/0x7d0
>> > >  [<ffffffffa21673ec>] shrink_zone+0xdc/0x2c0
>> > >  [<ffffffffa2168539>] kswapd+0x4f9/0x970
>> > >  [<ffffffffa2168040>] ? mem_cgroup_shrink_node_zone+0x1a0/0x1a0
>> > >  [<ffffffffa20a0d99>] kthread+0xc9/0xe0
>> > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
>> > >  [<ffffffffa26b404f>] ret_from_fork+0x3f/0x70
>> > >  [<ffffffffa20a0cd0>] ? kthread_stop+0x100/0x100
>> > > ---[ end trace c9d679f8ed4d7610 ]---
>> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x1000 size
>> > > 0x12b990
>> > > XFS (md127p3): ino 0x41221d1 delalloc 1 unwritten 0 pgoff 0x2000 size
>> > .....
>> >
>> > Ok, I suspect this may be a VM bug. I've been looking at the 4.6
>> > code (so please try to reproduce on that kernel!) but it looks to me
>> > like the only way we can get from shrink_active_list() direct to
>> > try_to_release_page() is if we are over the maximum bufferhead
>> > threshold (i.e. buffer_heads_over_limit = true) and we are trying to
>> > reclaim pages direct from the active list.
>> >
>> > Because we are called from kswapd()->balance_pgdat(), we have:
>> >
>> >         struct scan_control sc = {
>> >                 .gfp_mask = GFP_KERNEL,
>> >                 .order = order,
>> >                 .priority = DEF_PRIORITY,
>> >                 .may_writepage = !laptop_mode,
>> >                 .may_unmap = 1,
>> >                 .may_swap = 1,
>> >         };
>> >
>> > The key point here is reclaim is being run with .may_writepage =
>> > true for default configuration kernels. When we get to
>> > shrink_active_list():
>> >
>> >     if (!sc->may_writepage)
>> >             isolate_mode |= ISOLATE_CLEAN;
>> >
>> > But sc->may_writepage = true and this allows isolate_lru_pages() to
>> > isolate dirty pages from the active list. Normally this isn't a
>> > problem, because the isolated active list pages are rotated to the
>> > inactive list, and nothing else happens to them. *Except when
>> > buffer_heads_over_limit = true*. This special condition would
>> > explain why I have never seen apt/dpkg cause this problem on any of
>> > my (many) Debian systems that all use XFS....
>> >
>> > In that case, shrink_active_list() runs:
>> >
>> >     if (unlikely(buffer_heads_over_limit)) {
>> >             if (page_has_private(page) && trylock_page(page)) {
>> >                     if (page_has_private(page))
>> >                             try_to_release_page(page, 0);
>> >                     unlock_page(page);
>> >             }
>> >     }
>> >
>> > i.e. it locks the page, and if it has buffer heads it tries to get
>> > the bufferheads freed from the page.
>> >
>> > But this is a dirty page, which means it may have delalloc or
>> > unwritten state on its buffers, both of which indicate that there
>> > is dirty data in the page that hasn't been written. XFS issues a
>> > warning on this because neither shrink_active_list nor
>> > try_to_release_page() checks for whether the page is dirty or not.
>> >
>> > Hence it seems to me that shrink_active_list() is calling
>> > try_to_release_page() inappropriately, and XFS is just the
>> > messenger. If you turn laptop mode on, it is likely the problem will
>> > go away as kswapd will run with .may_writepage = false, but that
>> > will also cause other behavioural changes relating to writeback and
>> > memory reclaim. It might be worth trying as a workaround for now.
>> >
>> > MM-folk - is this analysis correct? If so, why is
>> > shrink_active_list() calling try_to_release_page() on dirty pages?
>> > Is this just an oversight or is there some problem that this is
>> > trying to work around? It seems trivial to fix to me (add a
>> > !PageDirty check), but I don't know why the check is there in the
>> > first place...
>>
>> It seems to be the latter.
>> Below commit seems to be related.
>> [ecdfc9787fe527, Resurrect 'try_to_free_buffers()' VM hackery.]
>>
>> At that time, even shrink_page_list worked like this.
>>
>> shrink_page_list
>>         while (!list_empty(page_list)) {
>>                 ..
>>                 ..
>>                 if (PageDirty(page)) {
>>                         ..
>>                 }
>>
>>                 /*
>>                  * If the page has buffers, try to free the buffer mappings
>>                  * associated with this page. If we succeed we try to free
>>                  * the page as well.
>>                  *
>>                  * We do this even if the page is PageDirty().
>>                  * try_to_release_page() does not perform I/O, but it is
>>                  * possible for a page to have PageDirty set, but it is actually
>>                  * clean (all its buffers are clean).  This happens if the
>>                  * buffers were written out directly, with submit_bh(). ext3
>>                  * will do this, as well as the blockdev mapping.
>>                  * try_to_release_page() will discover that cleanness and will
>>                  * drop the buffers and mark the page clean - it can be freed.
>>                  * ..
>>                  */
>>                 if (PagePrivate(page)) {
>>                         if (!try_to_release_page(page, sc->gfp_mask))
>>                                 goto activate_locked;
>>                         if (!mapping && page_count(page) == 1)
>>                                 goto free_it;
>>                 }
>>                 ..
>>         }
>>
>> I wonder whether it's valid or not on ext4.
>
> Actually, we've already discussed this about a year ago:
> http://oss.sgi.com/archives/xfs/2015-06/msg00119.html
>
> And it was the last straw that made me remove ext3 from the tree. ext4 can
> also clean dirty buffers while keeping pages dirty but it is limited only
> to metadata (and data in data=journal mode) so the scope of the problem is
> much smaller. So just avoiding calling ->releasepage for dirty pages may
> work fine these days.
>
> Also it is possible to change ext4 checkpointing code to completely avoid
> doing this but I never got to rewriting that code. Probably I should give
> it higher priority on my todo list...

We're seeing the same (releasepage being called for dirty pages) on
GFS2 as well. Right now, GFS2 warns about this case, but we'll remove
that warning and wait for ext4 and releasepage to be fixed so that we
can re-add the warning. Maybe this will help as an argument for fixing
ext4 soon :)

Thanks,
Andreas

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2016-08-17 15:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20160511133417.GA42410@bfoster.bfoster>
     [not found] ` <57333BA4.4040402@profihost.ag>
     [not found]   ` <20160511155951.GF42410@bfoster.bfoster>
     [not found]     ` <5738576B.4010208@profihost.ag>
     [not found]       ` <20160515115017.GA6433@laptop.bfoster>
     [not found]         ` <57386E84.3090606@profihost.ag>
     [not found]           ` <20160516010602.GA24980@bfoster.bfoster>
     [not found]             ` <57420A47.2000700@profihost.ag>
     [not found]               ` <20160522213850.GE26977@dastard>
     [not found]                 ` <574BEA84.3010206@profihost.ag>
2016-05-30 22:36                   ` shrink_active_list/try_to_release_page bug? (was Re: xfs trace in 4.4.2 / also in 4.3.3 WARNING fs/xfs/xfs_aops.c:1232 xfs_vm_releasepage) Dave Chinner
2016-05-31  1:07                     ` Minchan Kim
2016-05-31  2:55                       ` Dave Chinner
2016-05-31  3:59                         ` Minchan Kim
2016-05-31  6:07                           ` Dave Chinner
2016-05-31  6:11                             ` Stefan Priebe - Profihost AG
2016-05-31  7:31                               ` Dave Chinner
2016-05-31  8:03                                 ` Stefan Priebe - Profihost AG
2016-06-02 12:13                                 ` Stefan Priebe - Profihost AG
2016-06-02 12:44                                   ` Holger Hoffstätte
2016-06-02 23:08                                     ` Dave Chinner
2016-05-31  9:50                       ` Jan Kara
2016-06-01  1:38                         ` Minchan Kim
2016-08-17 15:37                         ` Andreas Grünbacher

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).