* [PATCH 0/8 v3] scope GFP_NOFS api
From: Michal Hocko @ 2017-01-06 14:10 UTC
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Brian Foster, Michal Hocko, Peter Zijlstra (Intel)

Hi,
I have posted the previous version here [1]. Since then I've added some
Reviewed-by tags and fixed some minor issues. I've dropped patch 2 [2] based
on Dave's request [3]. I agree that this can be done later, all at
once. I still think that __GFP_NOLOCKDEP should be added by this
series to make further development easier.

There didn't seem to be any real objections and so I think we should go
and merge this and build further on top. I would like to get rid of all
explicit GFP_NOFS usage in ext4 code. I have something half baked already
and will send it later on. I also hope we can get further with xfs.

I haven't heard anything from the btrfs or other filesystem guys, which is a
bit unfortunate, but I do not want to wait for them much longer; they
can join the effort later on.

The patchset is based on next-20170106

Diffstat says
 fs/ext4/acl.c             |  6 +++---
 fs/ext4/extents.c         |  8 ++++----
 fs/ext4/resize.c          |  4 ++--
 fs/ext4/xattr.c           |  4 ++--
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 10 +++++-----
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           |  8 +++++---
 mm/vmscan.c               |  6 +++---
 19 files changed, 109 insertions(+), 45 deletions(-)

Shortlog:
Michal Hocko (8):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe
      Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
      Revert "ext4: fix wrong gfp type under transaction"

[1] http://lkml.kernel.org/r/20161215140715.12732-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20161215140715.12732-3-mhocko@kernel.org
[3] http://lkml.kernel.org/r/20161219212413.GN4326@dastard
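
For readers new to the series, the scope idea behind
memalloc_nofs_{save,restore} can be modeled in plain userspace C. This is
only a sketch under stated assumptions, not the code from the patches:
every model_* name and MODEL_* value is hypothetical, a single static word
stands in for the per-task flags that PF_MEMALLOC_NOFS lives in, and the
save call is assumed to return the previous state so that scopes can nest.

```c
#include <assert.h>

/* Hypothetical bit values for this userspace model; the kernel's differ. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_MEMALLOC_NOFS	0x1u

/* Stand-in for the current task's flags word. */
static unsigned int model_task_flags;

/* Enter a NOFS scope and return the previous state of the scope bit. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_MEMALLOC_NOFS;

	model_task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope by putting the scope bit back to its saved state. */
static void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* Mask an allocator would act on: the FS bit is dropped inside a scope. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Usage: a "transaction" whose callees keep passing the full mask. */
static void model_transaction(void)
{
	unsigned int saved = model_nofs_save();

	/* A callee asks for MODEL_GFP_KERNEL but is implicitly limited. */
	assert(model_effective_gfp(MODEL_GFP_KERNEL) == MODEL_GFP_IO);

	model_nofs_restore(saved);
}
```

The point of the model is that call sites inside the scope do not need an
explicit NOFS mask; the restriction follows the execution context and
disappears again once the outermost scope is restored.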



* [PATCH 1/8] lockdep: allow to disable reclaim lockup detection
From: Michal Hocko @ 2017-01-06 14:11 UTC
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives. These do happen in practice and usually lead to the
code being tweaked to silence lockdep by using GFP_NOFS even though the
context can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lockup detection.
IMHO it is really impossible to annotate such a special use case unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide an "I know what I am doing" flag and mark
the problematic places with it. This would prevent abuse of the GFP_NOFS
flag, which has a runtime effect even on configurations which have
lockdep disabled.

Introduce the __GFP_NOLOCKDEP flag, which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.
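
The bit arithmetic behind the radix tree remark can be checked with a small
userspace model. It is a sketch, not kernel code: the two flag values and
the shift of 26 come from the gfp.h definitions this patch touches, while
the model_* names, the count of 3 radix tree tags, and the layout "root tag
N lives at bit N + __GFP_BITS_SHIFT of the same word" are assumptions made
for illustration.

```c
#include <assert.h>

/* Bit values copied from the gfp.h definitions this patch touches. */
#define MODEL_GFP_KSWAPD_RECLAIM	0x2000000u
#define MODEL_GFP_NOLOCKDEP		0x4000000u

/* Assumption: the radix tree keeps 3 tag bits above the gfp bits. */
#define MODEL_RADIX_TREE_MAX_TAGS	3u

/* __GFP_BITS_SHIFT as the patch computes it: 26, plus 1 with lockdep. */
static unsigned int model_bits_shift(int lockdep_enabled)
{
	return 26u + (lockdep_enabled ? 1u : 0u);
}

/* Counterpart of __GFP_BITS_MASK: every bit below the shift. */
static unsigned int model_bits_mask(int lockdep_enabled)
{
	return (1u << model_bits_shift(lockdep_enabled)) - 1u;
}

/* Bit a root tag would occupy in the shared word (assumed layout). */
static unsigned int model_root_tag_bit(unsigned int tag, unsigned int shift)
{
	return 1u << (tag + shift);
}

/* Runtime analogue of the BUILD_BUG_ON added to radix_tree_init(). */
static int model_tags_fit(int lockdep_enabled)
{
	return MODEL_RADIX_TREE_MAX_TAGS + model_bits_shift(lockdep_enabled) <= 32u;
}

/* Usage: what the shift bump buys on a lockdep-enabled build. */
static void model_check_layout(void)
{
	/* The highest pre-existing flag fits below the old shift. */
	assert(model_bits_mask(0) & MODEL_GFP_KSWAPD_RECLAIM);
	/* With the old shift of 26, tag 0 would land on the new flag. */
	assert(model_root_tag_bit(0, 26) == MODEL_GFP_NOLOCKDEP);
	/* With the bumped shift the flag is inside the mask ... */
	assert(model_bits_mask(1) & MODEL_GFP_NOLOCKDEP);
	/* ... and tag 0 moves one bit up, clear of it. */
	assert(model_root_tag_bit(0, model_bits_shift(1)) == 0x8000000u);
	assert(model_tags_fit(1));
}
```

Under these assumptions, bumping the shift together with the new flag is
what keeps the two users of the word apart, and the size check guards
against a later flag pushing the tags past 32 bits.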

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..1a934383cc20 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,11 @@ struct vm_area_struct;
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 #define ___GFP_KSWAPD_RECLAIM	0x2000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -186,8 +191,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 844cd04bb453..59e94ce8a0cf 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 5ccf00277233..c63b719346ae 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2198,6 +2198,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0
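
The early return added to __lockdep_trace_alloc() can be illustrated with a
userspace model. This is a hedged sketch rather than the kernel function:
model_trace_alloc(), model_marks and the MODEL_* names are hypothetical, the
FS bit value is a stand-in, and the "only FS-capable requests are tracked"
filter is assumed to precede the new check.

```c
#include <assert.h>
#include <stdbool.h>

/* Value from the gfp.h hunk; a build without lockdep defines it as 0. */
#define MODEL_GFP_NOLOCKDEP	0x4000000u
/* Hypothetical stand-in for __GFP_FS. */
#define MODEL_GFP_FS		0x80u

/* Number of requests the model tracker has recorded. */
static unsigned int model_marks;

/* Returns true when the request was recorded for reclaim tracking. */
static bool model_trace_alloc(unsigned int gfp_mask)
{
	/* Assumed pre-existing filter: only FS-capable requests matter. */
	if (!(gfp_mask & MODEL_GFP_FS))
		return false;

	/* The check this patch adds: skip when explicitly requested. */
	if (gfp_mask & MODEL_GFP_NOLOCKDEP)
		return false;

	model_marks++;
	return true;
}

/* Usage: silence one known false positive without giving up FS reclaim. */
static void model_annotated_alloc(void)
{
	unsigned int gfp = MODEL_GFP_FS | MODEL_GFP_NOLOCKDEP;
	bool tracked = model_trace_alloc(gfp);

	assert(!tracked);		/* the tracker skipped this request */
	assert(gfp & MODEL_GFP_FS);	/* the FS bit itself is untouched */
}
```

The contrast with the GFP_NOFS workaround is visible in the last assertion:
the annotated request still carries the FS bit, so only the lockdep
bookkeeping changes and the allocation's reclaim behaviour does not.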


 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lockup detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide an "I know what I am doing" flag and use it
to mark the problematic places. This would prevent abuse of the
GFP_NOFS flag, which has a runtime effect even on configurations which
have lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.
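
To illustrate the intended behavior, here is a small userspace model of
the check. It is only a sketch: lockdep_trace_alloc_model(),
mark_held_locks_model() and reclaim_fs_marks are made-up stand-ins, the
bit values are for illustration, and the __GFP_FS test is an assumption
about what the surrounding function already filters out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values; the authoritative ones are in include/linux/gfp.h. */
typedef unsigned int gfp_t;
#define __GFP_FS		0x80u
#define __GFP_NOLOCKDEP		0x4000000u

/* Hypothetical stand-in for lockdep state: how often held locks were marked. */
unsigned int reclaim_fs_marks;

void mark_held_locks_model(void)
{
	reclaim_fs_marks++;
}

/* Returns true when the allocation request was recorded by the model. */
bool lockdep_trace_alloc_model(gfp_t gfp_mask)
{
	/* Assumed existing filter: only FS-capable requests are interesting. */
	if (!(gfp_mask & __GFP_FS))
		return false;

	/* New check: the caller explicitly opted out of the tracking. */
	if (gfp_mask & __GFP_NOLOCKDEP)
		return false;

	mark_held_locks_model();
	return true;
}

/* The same FS-capable request, with and without the opt-out flag. */
void nolockdep_example(void)
{
	assert(lockdep_trace_alloc_model(__GFP_FS));
	assert(!lockdep_trace_alloc_model(__GFP_FS | __GFP_NOLOCKDEP));
}
```

The point of the model is that the request keeps __GFP_FS, so reclaim
behaves exactly as before; only the lockdep bookkeeping is skipped.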

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.
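
The bit budget behind that check can be sketched in plain C. This is a
model under assumptions, not the kernel code: GFP_BITS_SHIFT_MODEL(),
root_tag_bit() and gfp_bits_mask() are hypothetical helpers, and
RADIX_TREE_MAX_TAGS is assumed to be 3 with root tags placed directly
above the gfp bits of a 32-bit word.

```c
#include <assert.h>

#define RADIX_TREE_MAX_TAGS	3	/* assumed value */

/* Mirrors "26 + IS_ENABLED(CONFIG_LOCKDEP)"; lockdep is 0 or 1. */
#define GFP_BITS_SHIFT_MODEL(lockdep)	(26 + (lockdep))

/* Compile-time counterpart of the BUILD_BUG_ON in radix_tree_init(). */
_Static_assert(RADIX_TREE_MAX_TAGS + GFP_BITS_SHIFT_MODEL(1) <= 32,
	       "root tags would not fit above the gfp bits");

/* Mask covering every bit that belongs to the gfp flags. */
unsigned int gfp_bits_mask(int lockdep)
{
	return (1u << GFP_BITS_SHIFT_MODEL(lockdep)) - 1;
}

/* Bit used for a root tag, assuming tags sit right above the gfp bits. */
unsigned int root_tag_bit(unsigned int tag, int lockdep)
{
	assert(tag < RADIX_TREE_MAX_TAGS);
	return 1u << (tag + GFP_BITS_SHIFT_MODEL(lockdep));
}
```

In this model, a shift of 26 would put tag 0 on bit 0x4000000, the very
bit the new flag takes, which is why the shift has to grow together with
the flag when lockdep is enabled.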

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..1a934383cc20 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,11 @@ struct vm_area_struct;
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
 #define ___GFP_KSWAPD_RECLAIM	0x2000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -186,8 +191,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 844cd04bb453..59e94ce8a0cf 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 5ccf00277233..c63b719346ae 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2198,6 +2198,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2017-01-06 14:10 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs defined PF_FSTRANS quite some time ago to declare a scoped GFP_NOFS
semantic. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. As a first step, replace all PF_FSTRANS usage in the xfs code
before we introduce a full API for it, as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.
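
For readers not familiar with the scoped semantic, here is a userspace
sketch of how the task flag interacts with the gfp mask. Everything in
it is a simplified model under assumptions: struct task_model,
current_task, set_flags_nested(), restore_flags_nested() and
flags_convert_model() are made-up names that only imitate the xfs
helpers touched by this patch, and the flag values are illustrative.

```c
#include <assert.h>

typedef unsigned int gfp_t;
#define __GFP_IO		0x40u
#define __GFP_FS		0x80u
#define GFP_KERNEL_MODEL	(__GFP_IO | __GFP_FS)	/* simplified */

#define PF_FSTRANS		0x00020000u	/* illustrative value */
#define PF_MEMALLOC_NOFS	PF_FSTRANS	/* same bit, more generic name */

/* Hypothetical stand-in for the per-task flags word. */
struct task_model {
	unsigned int flags;
};
struct task_model current_task;

/* Enter a scope: remember the previous flags, then set the bit. */
void set_flags_nested(unsigned int *saved, unsigned int f)
{
	*saved = current_task.flags;
	current_task.flags |= f;
}

/* Leave a scope: put the bit back to what it was on entry. */
void restore_flags_nested(const unsigned int *saved, unsigned int f)
{
	current_task.flags = (current_task.flags & ~f) | (*saved & f);
}

/* Inside the scope every allocation silently loses __GFP_FS. */
gfp_t flags_convert_model(gfp_t gfp)
{
	if (current_task.flags & PF_MEMALLOC_NOFS)
		gfp &= ~__GFP_FS;
	return gfp;
}

/* Nested scopes: the bit stays set until the outermost scope ends. */
void nested_scope_example(void)
{
	unsigned int outer, inner;

	set_flags_nested(&outer, PF_MEMALLOC_NOFS);
	set_flags_nested(&inner, PF_MEMALLOC_NOFS);
	restore_flags_nested(&inner, PF_MEMALLOC_NOFS);
	assert(!(flags_convert_model(GFP_KERNEL_MODEL) & __GFP_FS));
	restore_flags_nested(&outer, PF_MEMALLOC_NOFS);
	assert(flags_convert_model(GFP_KERNEL_MODEL) & __GFP_FS);
}
```

Since PF_MEMALLOC_NOFS is defined to the same bit as PF_FSTRANS, the
model behaves identically under either name, which matches the "no
functional change" claim.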

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ef382bfb402b..d4094bb55033 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1531c48f56e2..abeb84604d32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-01-06 14:11   ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs defined PF_FSTRANS quite some time ago to declare a scoped GFP_NOFS
semantic. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. As a first step, replace all PF_FSTRANS usage in the xfs code
before we introduce a full API for it, as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index ef382bfb402b..d4094bb55033 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1531c48f56e2..abeb84604d32 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
  2017-01-06 14:10 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is currently used for the following 5 reasons:
	- to prevent deadlocks when a lock held by the allocation
	  context would be needed during memory reclaim
	- to prevent stack overflows during reclaim because the
	  allocation is already performed from a deep context
	- to prevent lockups when the allocation context depends
	  indirectly on other reclaimers to make forward progress
	- just in case, because this would be safe from the fs POV
	- to silence lockdep false positives

Unfortunately, overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used,
and so it might be used unnecessarily. We would like to get rid of
those uses as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary and tracked per task, and all the allocation requests from
within the scope will simply inherit the GFP_NOFS semantic.
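
To illustrate the scope idea, here is a minimal userspace sketch (not
kernel code). Every model_* name is made up for this example, the GFP
bit values are placeholders, and only the PF bit value is taken from the
patch. It models a helper in another layer that always asks for
GFP_KERNEL and still ends up without __GFP_FS when called from inside
the scope:

```c
#include <assert.h>

/* Userspace model only: names are hypothetical, GFP values are placeholders. */
#define MODEL_GFP_IO		0x40u
#define MODEL_GFP_FS		0x80u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_NOFS		0x00040000u	/* same bit as in the patch */

static unsigned int model_task_flags;	/* stands in for current->flags */

static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;

	model_task_flags |= MODEL_PF_NOFS;
	return old;
}

static void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
}

/* The mask an allocation made right now would effectively use. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Hypothetical helper in another layer: knows nothing about the FS scope. */
static unsigned int model_other_layer_alloc(void)
{
	return model_effective_gfp(MODEL_GFP_KERNEL);
}

/* Hypothetical FS path: reports the mask the helper saw inside the scope. */
static unsigned int model_fs_transaction(void)
{
	unsigned int saved = model_nofs_save();
	unsigned int seen = model_other_layer_alloc();

	model_nofs_restore(saved);
	return seen;
}
```

The helper never passes GFP_NOFS itself; the restriction comes only from
the per-task flag, which is the point of declaring a scope.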

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM, etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope
of the GFP_NOFS allocation context. This is basically a copy of the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which has been xfs specific until
recently. There are no PF_FSTRANS users anymore so let's just drop it.
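
The save/restore pair is what makes such scopes nestable. A minimal
userspace sketch of that property follows; it copies the shape of the
helpers added by this patch, but the model_* names and the global
standing in for current->flags are assumptions made for illustration:

```c
#include <assert.h>

#define MODEL_PF_NOFS	0x00040000u	/* same bit as in the patch */

static unsigned int model_task_flags;	/* stands in for current->flags */

/* Returns the previous state of the bit so that the caller can restore it. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = model_task_flags & MODEL_PF_NOFS;

	model_task_flags |= MODEL_PF_NOFS;
	return old;
}

/* Puts the bit back to whatever it was before the matching save. */
static void model_nofs_restore(unsigned int old)
{
	model_task_flags = (model_task_flags & ~MODEL_PF_NOFS) | old;
}

static int model_in_nofs_scope(void)
{
	return (model_task_flags & MODEL_PF_NOFS) != 0;
}

/* Returns 1 if the outer scope is still active after the inner one ends. */
static int model_outer_survives_inner(void)
{
	unsigned int outer = model_nofs_save();	/* bit was clear before */
	unsigned int inner = model_nofs_save();	/* bit was already set */
	int still_set;

	model_nofs_restore(inner);		/* bit stays set */
	still_set = model_in_nofs_scope();
	model_nofs_restore(outer);		/* bit is cleared again */
	return still_set;
}
```

Because restore writes back the saved bit instead of clearing it
unconditionally, an inner scope cannot end the outer one early.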

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs
code paths preserve their semantic. kmem_flags_convert() doesn't need to
evaluate the flag anymore.
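
As a rough userspace sketch of the masking described above (not the
kernel function itself): model_gfp_context() below mirrors the logic of
current_gfp_context() from this patch but takes the task flags as a
parameter. The GFP bit values and MODEL_GFP_OTHER are placeholders
chosen for the example:

```c
#include <assert.h>

/* Userspace model only: GFP values are placeholders, PF bits as in the patch. */
#define MODEL_GFP_IO	0x40u
#define MODEL_GFP_FS	0x80u
#define MODEL_GFP_OTHER	0x100u		/* any unrelated gfp bit */
#define MODEL_PF_NOFS	0x00040000u
#define MODEL_PF_NOIO	0x00080000u

/*
 * NOIO is the more restrictive context: it clears both the IO and the FS
 * bit, so it is checked first. NOFS alone clears only the FS bit.
 */
static unsigned int model_gfp_context(unsigned int task_flags, unsigned int gfp)
{
	if (task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}
```

Unrelated gfp bits pass through untouched, and a task that is in both
scopes gets the NOIO behavior.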

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          |  8 +++++---
 mm/vmscan.c              |  6 +++---
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a934383cc20..bfe53d95c25b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -217,8 +217,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index abeb84604d32..2032fc642a26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,9 +2307,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2320,8 +2320,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2347,13 +2345,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2369,6 +2375,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 59e94ce8a0cf..cdcd6f249ec8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46ad035b955f..5138b46a4295 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3807,10 +3807,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6aa5b01d3e75..4ea6b610f20e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,7 +2949,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
@ 2017-01-06 14:11   ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The GFP_NOFS context is currently used for the following 5 reasons:
	- to prevent deadlocks when a lock held by the allocation
	  context would be needed during memory reclaim
	- to prevent stack overflows during reclaim because the
	  allocation is already performed from a deep context
	- to prevent lockups when the allocation context depends
	  indirectly on other reclaimers to make forward progress
	- just in case, because this would be safe from the fs POV
	- to silence lockdep false positives

Unfortunately, overuse of this allocation context causes problems for
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those uses as much as possible. One way to do that is to use the flag in
scopes rather than in isolated cases. Such a scope is declared only when
really necessary and tracked per task, and all the allocation requests
from within the scope simply inherit the GFP_NOFS semantics.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope
of the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which was xfs specific until
recently. There are no PF_FSTRANS users anymore so let's just drop it.
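
To illustrate why the save side hands back the previous state of the
flag, here is a rough userspace model of the two helpers. It is only a
sketch: task_flags is a hypothetical stand-in for current->flags, and
nested_scope_stays_active() is a made-up helper that exists only for
this illustration.

```c
#include <assert.h>

/* Bit value as assigned by this patch. */
#define PF_MEMALLOC_NOFS 0x00040000u

/* Hypothetical stand-in for current->flags of the running task. */
static unsigned int task_flags;

/* Enter a NOFS scope and return the previous state of the flag. */
static inline unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return flags;
}

/* Put the flag back to the state it had before the matching save. */
static inline void memalloc_nofs_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | flags;
}

/*
 * Made-up helper: nests two scopes and reports whether the outer scope
 * is still active after the inner one has been closed.
 */
static inline int nested_scope_stays_active(void)
{
	unsigned int outer, inner;
	int active_after_inner;

	outer = memalloc_nofs_save();	/* flag was clear before */
	inner = memalloc_nofs_save();	/* flag was already set */
	memalloc_nofs_restore(inner);	/* must not leave the outer scope */
	active_after_inner = !!(task_flags & PF_MEMALLOC_NOFS);
	memalloc_nofs_restore(outer);	/* really leaves the scope */
	assert(!(task_flags & PF_MEMALLOC_NOFS));
	return active_after_inner;
}
```

Because restore writes back the saved bit rather than clearing it
unconditionally, a callee can open and close its own scope without
ending the scope of its caller.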

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs
code paths preserve their semantics. kmem_flags_convert() doesn't need
to evaluate the flag anymore.
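
The masking can be modelled in userspace roughly as follows. This is a
sketch under assumptions: DEMO_GFP_IO and DEMO_GFP_FS are illustrative
stand-ins for __GFP_IO and __GFP_FS (only the relationship between the
bits matters), and gfp_context_for() is a hypothetical variant that
takes the task flags as a parameter, whereas the real helper reads
current->flags itself.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative stand-ins for __GFP_IO and __GFP_FS. */
#define DEMO_GFP_IO 0x40u
#define DEMO_GFP_FS 0x80u

/* Task flag values as assigned by this patch. */
#define PF_MEMALLOC_NOFS 0x00040000u
#define PF_MEMALLOC_NOIO 0x00080000u

/* Hypothetical model of current_gfp_context() with explicit task flags. */
static inline gfp_t gfp_context_for(unsigned int task_flags, gfp_t flags)
{
	/* NOIO is checked first because it also covers what NOFS clears. */
	if (task_flags & PF_MEMALLOC_NOIO)
		flags &= ~(DEMO_GFP_IO | DEMO_GFP_FS);
	else if (task_flags & PF_MEMALLOC_NOFS)
		flags &= ~DEMO_GFP_FS;
	return flags;
}

/* Made-up self-check: a task in both scopes ends up with the NOIO mask. */
static inline void check_noio_precedence(void)
{
	gfp_t full = DEMO_GFP_IO | DEMO_GFP_FS;

	assert(gfp_context_for(PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS, full) == 0);
	assert(gfp_context_for(PF_MEMALLOC_NOFS, full) == DEMO_GFP_IO);
}
```

A caller outside of any scope gets its mask back unchanged, so
allocations which are not inside a marked scope are not affected.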

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          |  8 +++++---
 mm/vmscan.c              |  6 +++---
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a934383cc20..bfe53d95c25b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -217,8 +217,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index abeb84604d32..2032fc642a26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,9 +2307,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2320,8 +2320,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2347,13 +2345,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2369,6 +2375,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 59e94ce8a0cf..cdcd6f249ec8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46ad035b955f..5138b46a4295 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3807,10 +3807,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6aa5b01d3e75..4ea6b610f20e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,7 +2949,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
@ 2017-01-06 14:11   ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The GFP_NOFS context is currently used for the following 5 reasons:
	- to prevent deadlocks when a lock held by the allocation
	  context would be needed during memory reclaim
	- to prevent stack overflows during reclaim because the
	  allocation is already performed from a deep context
	- to prevent lockups when the allocation context depends
	  indirectly on other reclaimers to make forward progress
	- just in case, because this would be safe from the fs POV
	- to silence lockdep false positives

Unfortunately, overuse of this allocation context causes problems for
the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those uses as much as possible. One way to do that is to use the flag in
scopes rather than in isolated cases. Such a scope is declared only when
really necessary and tracked per task, and all the allocation requests
from within the scope simply inherit the GFP_NOFS semantics.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope
of the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which was xfs specific until
recently. There are no PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs
code paths preserve their semantics. kmem_flags_convert() doesn't need
to evaluate the flag anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          |  8 +++++---
 mm/vmscan.c              |  6 +++---
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a934383cc20..bfe53d95c25b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -217,8 +217,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index abeb84604d32..2032fc642a26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,9 +2307,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2320,8 +2320,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2347,13 +2345,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2369,6 +2375,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 59e94ce8a0cf..cdcd6f249ec8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46ad035b955f..5138b46a4295 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3807,10 +3807,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6aa5b01d3e75..4ea6b610f20e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,7 +2949,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [Cluster-devel] [PATCH 3/8] mm: introduce memalloc_nofs_{save, restore} API
@ 2017-01-06 14:11   ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is currently used for the following 5 reasons:
	- to prevent deadlocks when the lock held by the allocation
	  context would be needed during the memory reclaim
	- to prevent stack overflows during the reclaim because
	  the allocation is performed from a deep context already
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make forward progress indirectly
	- just in case, because this would be safe from the fs POV
	- to silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those uses as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary and tracked per task, and all the allocation requests from
within the scope will simply inherit the GFP_NOFS semantic.
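
As an illustration of the scope idea, here is a minimal userspace sketch,
not kernel code. The task_flags variable is a hypothetical stand-in for
current->flags, and in_nofs_scope(), inner_helper() and
outer_transaction() are made-up names; only the save/restore bodies and
the flag value follow the patch. It shows that a nested save/restore
pair leaves the outer scope intact:

```c
#include <assert.h>

/* Flag value as defined by this patch in include/linux/sched.h. */
#define PF_MEMALLOC_NOFS 0x00040000u

/* Hypothetical stand-in for current->flags of a single task. */
static unsigned int task_flags;

/* Enter a NOFS scope; returns the previous state of the flag. */
static unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return flags;
}

/* Leave a NOFS scope by putting back the state returned by save. */
static void memalloc_nofs_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* Returns 1 when the caller currently runs inside a NOFS scope. */
static int in_nofs_scope(void)
{
	return (task_flags & PF_MEMALLOC_NOFS) != 0;
}

/* Hypothetical helper that opens its own (possibly nested) scope. */
static int inner_helper(void)
{
	unsigned int saved = memalloc_nofs_save();
	int seen = in_nofs_scope();

	memalloc_nofs_restore(saved);
	return seen;
}

/*
 * Hypothetical outer path, e.g. a filesystem transaction. Returns 1 if
 * the scope is still active after the nested helper has restored.
 */
static int outer_transaction(void)
{
	unsigned int saved = memalloc_nofs_save();
	int still_scoped;

	inner_helper();
	still_scoped = in_nofs_scope();
	memalloc_nofs_restore(saved);
	return still_scoped;
}
```

Because restore puts back the saved bit rather than clearing it
unconditionally, callers do not need to know whether they are the
outermost scope.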

Not only is this easier to understand and maintain because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc...) and there is no easy way to
convey the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and it is just an alias for PF_FSTRANS which has been xfs specific until
recently. There are no PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO (and __GFP_FS
with it). memalloc_noio_flags is renamed to current_gfp_context because
it now cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.
Xfs code paths preserve their semantic. kmem_flags_convert() doesn't
need to evaluate the flag anymore.
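
The masking rule can be modelled in userspace roughly as follows. This
is only a sketch: gfp_context_for() is a hypothetical name that takes
the task flags as a parameter instead of reading current->flags, and
GFP_IO_BIT/GFP_FS_BIT are illustrative placeholders for __GFP_IO and
__GFP_FS rather than the kernel's definitions. The branch order mirrors
current_gfp_context() from the diff below, so NOIO takes precedence
when both task flags are set:

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Illustrative placeholder bits standing in for __GFP_IO and __GFP_FS. */
#define GFP_IO_BIT 0x40u
#define GFP_FS_BIT 0x80u

/* Task flag values as defined by this patch. */
#define PF_MEMALLOC_NOFS 0x00040000u
#define PF_MEMALLOC_NOIO 0x00080000u

/*
 * Hypothetical pure variant of current_gfp_context(): applies the
 * per-task scope in task_flags to the requested allocation mask.
 */
static gfp_t gfp_context_for(unsigned int task_flags, gfp_t flags)
{
	if (task_flags & PF_MEMALLOC_NOIO)
		flags &= ~(GFP_IO_BIT | GFP_FS_BIT);	/* NOIO drops both bits */
	else if (task_flags & PF_MEMALLOC_NOFS)
		flags &= ~GFP_FS_BIT;			/* NOFS drops FS only */
	return flags;
}
```

A request carrying both bits keeps them outside any scope, loses only
the FS bit under PF_MEMALLOC_NOFS, and loses both under
PF_MEMALLOC_NOIO, regardless of whether PF_MEMALLOC_NOFS is set as
well.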

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          |  8 +++++---
 mm/vmscan.c              |  6 +++---
 6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a934383cc20..bfe53d95c25b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -217,8 +217,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index abeb84604d32..2032fc642a26 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,9 +2307,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2320,8 +2320,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2347,13 +2345,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it takes precedence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2369,6 +2375,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 59e94ce8a0cf..cdcd6f249ec8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 46ad035b955f..5138b46a4295 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3807,10 +3807,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constraints. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6aa5b01d3e75..4ea6b610f20e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,7 +2949,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-01-06 14:10 ` Michal Hocko
                     ` (3 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from NOFS contexts. memalloc_noio_save will enforce the
GFP_NOIO context, which is even weaker than GFP_NOFS, and that seems to
be unnecessary. Let's use memalloc_nofs_{save,restore} instead as it
should provide exactly what we need here - an implicit GFP_NOFS context.
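
The pattern used in kmem_zalloc_large can be sketched in userspace like
this. Everything here is a model under stated assumptions: task_flags
stands in for current->flags, callee_effective_mask() is a hypothetical
stand-in for a callee (such as the vmalloc internals) that always asks
for the full mask, KM_NOFS_MODEL is a made-up flag value, and the
GFP_*_BIT values are illustrative placeholders:

```c
#include <assert.h>

/* Illustrative placeholder bits standing in for __GFP_IO and __GFP_FS. */
#define GFP_IO_BIT 0x40u
#define GFP_FS_BIT 0x80u

#define PF_MEMALLOC_NOFS 0x00040000u	/* value from patch 3/8 */
#define KM_NOFS_MODEL    0x0004u	/* hypothetical caller-side flag */

/* Hypothetical stand-in for current->flags of a single task. */
static unsigned int task_flags;

static unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return flags;
}

static void memalloc_nofs_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | flags;
}

/*
 * Models a callee that unconditionally requests IO and FS reclaim and
 * has no gfp argument. Returns the mask left after the task scope is
 * applied.
 */
static unsigned int callee_effective_mask(void)
{
	unsigned int mask = GFP_IO_BIT | GFP_FS_BIT;

	if (task_flags & PF_MEMALLOC_NOFS)
		mask &= ~GFP_FS_BIT;
	return mask;
}

/* Wraps the callee in a NOFS scope only when the caller asked for it. */
static unsigned int large_alloc_model(unsigned int km_flags)
{
	unsigned int nofs_flag = 0;
	unsigned int mask;

	if (km_flags & KM_NOFS_MODEL)
		nofs_flag = memalloc_nofs_save();

	mask = callee_effective_mask();

	if (km_flags & KM_NOFS_MODEL)
		memalloc_nofs_restore(nofs_flag);

	return mask;
}
```

With the NOFS scope the callee still keeps the IO bit, which is the
point of this patch: the previous NOIO scope would have stripped it as
well.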

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 10 +++++-----
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..d69ed5e76621 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7f0a01f7b592..8cb8dd4cdfd8 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -441,17 +441,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -459,7 +459,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 5/8] jbd2: mark the transaction context with the scope GFP_NOFS context
  2017-01-06 14:10 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Now that we have the memalloc_nofs_{save,restore} API, we can mark the
whole transaction context as implicitly GFP_NOFS. All allocations made
while a handle is open then inherit GFP_NOFS automatically. This means
that we no longer have to mark any of those requests with GFP_NOFS.
Moreover, all the ext4_kv[mz]alloc(GFP_NOFS) calls are now safe as well,
because even the hardcoded GFP_KERNEL allocations deep inside vmalloc
will be NOFS.
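
A minimal userspace sketch of the scope semantics relied on here, not
kernel code. It assumes the NOFS state is a single bit in the task flags,
that save returns the previous state of that bit, and that restore puts it
back. The model_* names and all bit values are hypothetical and chosen
only for illustration.

```c
/* Userspace model of the scoped NOFS API; this is not kernel code. */
#include <assert.h>

/* Illustrative bit values only; the real kernel constants differ. */
#define PF_MEMALLOC_NOFS 0x1u	/* task flag: inside a NOFS scope */
#define GFP_IO_BIT       0x2u	/* stands in for __GFP_IO */
#define GFP_FS_BIT       0x4u	/* stands in for __GFP_FS */
#define GFP_KERNEL_MODEL (GFP_IO_BIT | GFP_FS_BIT)

/* Stands in for current->flags of the running task. */
static unsigned int task_flags;

/* Enter a NOFS scope and return the previous state of the NOFS bit. */
unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope by putting back the state returned by save. */
void model_nofs_restore(unsigned int old)
{
	/* Only the NOFS bit may be carried in the saved value. */
	assert((old & ~PF_MEMALLOC_NOFS) == 0);
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | old;
}

/* Drop the FS bit from a request whenever the task is inside a scope. */
unsigned int model_effective_gfp(unsigned int gfp)
{
	if (task_flags & PF_MEMALLOC_NOFS)
		gfp &= ~GFP_FS_BIT;
	return gfp;
}
```

Because save hands back the previous state, scopes nest: restoring an
inner scope leaves the outer one in force, and only the outermost restore
lets model_effective_gfp() pass the FS bit through again.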

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/transaction.c | 11 +++++++++++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
 	jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocation done while the transaction is
+	 * open is going to recurse back to the fs layer.
+	 */
+	handle->saved_alloc_context = memalloc_nofs_save();
 	return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
 				line_no, nblocks);
+
 	return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_rsv_handle)
 		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+	/*
+	 * scope of the GFP_NOFS context is over here and so we can
+	 * restore the original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
 	jbd2_free_handle(handle);
 	return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
 	unsigned long		h_start_jiffies;
 	unsigned int		h_requested_credits;
+
+	unsigned int		saved_alloc_context;
 };
 
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 6/8] jbd2: make the whole kjournald2 kthread NOFS safe
  2017-01-06 14:10 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kjournald2 is central to transaction commit processing. As such, any
allocation from this kernel thread has to be GFP_NOFS. Mark the whole
kernel thread GFP_NOFS by calling memalloc_nofs_save().
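
A rough userspace sketch of why a single save without a matching restore
is enough for a dedicated thread, not kernel code. It assumes the NOFS
state is a per-task flag, so marking one task does not affect any other.
The struct, the model_* helpers and the flag value are hypothetical.

```c
/* Userspace model of a per-task NOFS marking; this is not kernel code. */
#include <assert.h>
#include <stdbool.h>

#define PF_MEMALLOC_NOFS 0x1u	/* illustrative value only */

/* Hypothetical stand-in for the parts of a task we care about. */
struct model_task {
	const char *comm;	/* thread name */
	unsigned int flags;	/* stands in for the task's flags word */
};

/*
 * Mark a task NOFS for the rest of its life. The previous state is
 * dropped on purpose because the scope is never meant to end.
 */
void model_mark_thread_nofs(struct model_task *t)
{
	t->flags |= PF_MEMALLOC_NOFS;
}

/* May reclaim triggered by an allocation from this task enter the fs? */
bool model_may_enter_fs(const struct model_task *t)
{
	return !(t->flags & PF_MEMALLOC_NOFS);
}
```

In this model a task marked at startup never allows fs recursion again,
while an unrelated task keeps its normal allocation context.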

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/jbd2/journal.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index a097048ed1a3..3a449150f834 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
 	wake_up(&journal->j_wait_done_commit);
 
 	/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the transaction commit.
+	 */
+	memalloc_nofs_save();
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	write_lock(&journal->j_state_lock);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
  2017-01-06 14:10 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
sb_getblk_gfp is not really needed. The call chain is
sb_getblk
  __getblk_gfp
    __getblk_slow
      grow_buffers
        grow_dev_page
          gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp

so __GFP_FS is cleared unconditionally and the above commit therefore
had no real effect.
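
The mask arithmetic can be checked with a small userspace sketch, not
kernel code. It assumes that mapping_gfp_constraint() is a plain AND of
the mapping's mask with its argument, that sb_getblk passes only
__GFP_MOVABLE, and that the block device mapping's mask already contains
the reclaim and IO bits. All M_* values and model_* names are
hypothetical.

```c
/* Userspace model of the grow_dev_page() mask; this is not kernel code. */
#include <assert.h>

/* Illustrative bit values only; the real kernel constants differ. */
#define M_GFP_IO       0x1u
#define M_GFP_FS       0x2u
#define M_GFP_RECLAIM  0x4u
#define M_GFP_MOVABLE  0x8u
#define M_GFP_NOFS     (M_GFP_RECLAIM | M_GFP_IO)
/* Assumed mapping mask: reclaim, IO and FS all allowed. */
#define M_GFP_MAPPING  (M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)

/* Modelled as "mapping mask AND constraint". */
unsigned int model_mapping_gfp_constraint(unsigned int mapping_gfp,
					  unsigned int mask)
{
	return mapping_gfp & mask;
}

/* The mask computation quoted above from grow_dev_page(). */
unsigned int model_grow_dev_page_gfp(unsigned int mapping_gfp,
				     unsigned int gfp)
{
	return model_mapping_gfp_constraint(mapping_gfp, ~M_GFP_FS) | gfp;
}
```

Under these assumptions, passing M_GFP_MOVABLE alone and passing
M_GFP_MOVABLE | M_GFP_NOFS produce the same mask, and neither contains
the FS bit.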

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3e295d3350a9..9867b9e5ad8f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -518,7 +518,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
 	struct buffer_head		*bh;
 	int				err;
 
-	bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, pblk);
 	if (unlikely(!bh))
 		return ERR_PTR(-ENOMEM);
 
@@ -1096,7 +1096,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		err = -EFSCORRUPTED;
 		goto cleanup;
 	}
-	bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, newblock);
 	if (unlikely(!bh)) {
 		err = -ENOMEM;
 		goto cleanup;
@@ -1290,7 +1290,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
 	if (newblock == 0)
 		return err;
 
-	bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, newblock);
 	if (unlikely(!bh))
 		return -ENOMEM;
 	lock_buffer(bh);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
@ 2017-01-06 14:11   ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
sb_getblk_gfp is not really needed as
sb_getblk
  __getblk_gfp
    __getblk_slow
      grow_buffers
        grow_dev_page
	  gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp

so __GFP_FS is cleared unconditionally and therefore the above commit
didn't have any real effect in fact.

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3e295d3350a9..9867b9e5ad8f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -518,7 +518,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
 	struct buffer_head		*bh;
 	int				err;
 
-	bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, pblk);
 	if (unlikely(!bh))
 		return ERR_PTR(-ENOMEM);
 
@@ -1096,7 +1096,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		err = -EFSCORRUPTED;
 		goto cleanup;
 	}
-	bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, newblock);
 	if (unlikely(!bh)) {
 		err = -ENOMEM;
 		goto cleanup;
@@ -1290,7 +1290,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
 	if (newblock == 0)
 		return err;
 
-	bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, newblock);
 	if (unlikely(!bh))
 		return -ENOMEM;
 	lock_buffer(bh);
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
@ 2017-01-06 14:11   ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
sb_getblk_gfp is not really needed here. The call chain is
sb_getblk
  __getblk_gfp
    __getblk_slow
      grow_buffers
        grow_dev_page
          gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp

so __GFP_FS is cleared unconditionally and the above commit therefore
had no real effect.
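
The masking argument can be sketched as a small userspace model of the
last line of the call chain. This is only an illustration: the bit
values, the X_ prefixed names and the model_ helpers are made up (the
real definitions live in include/linux/gfp.h), and the sketch assumes
that the block device mapping allows GFP_USER-style allocations and
that sb_getblk passes only a movable hint.

```c
#include <assert.h>

/* Illustrative bit values only; not the kernel's real flag encoding. */
#define X_GFP_IO       0x01u
#define X_GFP_FS       0x02u
#define X_GFP_RECLAIM  0x04u
#define X_GFP_MOVABLE  0x08u
#define X_GFP_HARDWALL 0x10u

#define X_GFP_NOFS (X_GFP_RECLAIM | X_GFP_IO)
/* Assumed mapping mask of the block device inode. */
#define X_GFP_USER (X_GFP_RECLAIM | X_GFP_IO | X_GFP_FS | X_GFP_HARDWALL)

/* Models mapping_gfp_constraint(): restrict the mapping's mask. */
static unsigned int model_mapping_gfp_constraint(unsigned int mapping_mask,
						 unsigned int constraint)
{
	return mapping_mask & constraint;
}

/* Models the gfp_mask computation quoted from grow_dev_page. */
static unsigned int model_grow_dev_page_mask(unsigned int mapping_mask,
					     unsigned int gfp)
{
	unsigned int mask;

	mask = model_mapping_gfp_constraint(mapping_mask, ~X_GFP_FS) | gfp;
	/* The FS bit can only survive if the caller passed it in gfp. */
	assert((gfp & X_GFP_FS) || !(mask & X_GFP_FS));
	return mask;
}
```

Under these assumptions, model_grow_dev_page_mask(X_GFP_USER,
X_GFP_MOVABLE) and model_grow_dev_page_mask(X_GFP_USER, X_GFP_MOVABLE |
X_GFP_NOFS) yield the same mask, which is the "no real effect" claim in
miniature.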

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/extents.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3e295d3350a9..9867b9e5ad8f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -518,7 +518,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
 	struct buffer_head		*bh;
 	int				err;
 
-	bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, pblk);
 	if (unlikely(!bh))
 		return ERR_PTR(-ENOMEM);
 
@@ -1096,7 +1096,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
 		err = -EFSCORRUPTED;
 		goto cleanup;
 	}
-	bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, newblock);
 	if (unlikely(!bh)) {
 		err = -ENOMEM;
 		goto cleanup;
@@ -1290,7 +1290,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
 	if (newblock == 0)
 		return err;
 
-	bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+	bh = sb_getblk(inode->i_sb, newblock);
 	if (unlikely(!bh))
 		return -ENOMEM;
 	lock_buffer(bh);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-06 14:10 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:11   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:11 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
the transaction context uses memalloc_nofs_save and all allocations
within this context inherit GFP_NOFS automatically, there is no
reason to mark specific allocations explicitly.
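
As a rough illustration of why the explicit flag becomes redundant,
here is a userspace model of a save/restore scope. It is a sketch, not
the kernel implementation: struct model_task, the model_ functions and
the X_ flag values are hypothetical stand-ins for the per-task state
and for memalloc_nofs_save and its restore counterpart.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's real flag encoding. */
#define X_GFP_IO      0x01u
#define X_GFP_FS      0x02u
#define X_GFP_RECLAIM 0x04u
#define X_GFP_KERNEL  (X_GFP_RECLAIM | X_GFP_IO | X_GFP_FS)
#define X_GFP_NOFS    (X_GFP_RECLAIM | X_GFP_IO)

/* Stand-in for the per-task flag that marks a NOFS scope. */
struct model_task {
	bool nofs_scope;
};

/* Enter a NOFS scope; the return value is the state to restore later. */
static bool model_nofs_save(struct model_task *t)
{
	bool old = t->nofs_scope;

	t->nofs_scope = true;
	return old;
}

/* Leave the scope by restoring the saved state, so nesting works. */
static void model_nofs_restore(struct model_task *t, bool old)
{
	t->nofs_scope = old;
}

/* The mask the allocator would effectively use for a given request. */
static unsigned int model_effective_gfp(const struct model_task *t,
					unsigned int gfp)
{
	return t->nofs_scope ? (gfp & ~X_GFP_FS) : gfp;
}

/* Walks through a transaction-like scope and checks the claim above. */
static void model_transaction_demo(void)
{
	struct model_task task = { .nofs_scope = false };
	bool saved;

	assert(model_effective_gfp(&task, X_GFP_KERNEL) == X_GFP_KERNEL);

	saved = model_nofs_save(&task);		/* e.g. transaction start */
	assert(model_effective_gfp(&task, X_GFP_KERNEL) == X_GFP_NOFS);
	model_nofs_restore(&task, saved);	/* e.g. transaction stop */

	assert(model_effective_gfp(&task, X_GFP_KERNEL) == X_GFP_KERNEL);
}
```

In this model a GFP_KERNEL request made inside the scope ends up
equivalent to GFP_NOFS, so the call site no longer has to say so.
Returning the previous state from the save helper is what lets scopes
nest safely.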

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code
to make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/acl.c     | 6 +++---
 fs/ext4/extents.c | 2 +-
 fs/ext4/resize.c  | 4 ++--
 fs/ext4/xattr.c   | 4 ++--
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index fd389935ecd1..9e98092c2a4b 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -32,7 +32,7 @@ ext4_acl_from_disk(const void *value, size_t size)
 		return ERR_PTR(-EINVAL);
 	if (count == 0)
 		return NULL;
-	acl = posix_acl_alloc(count, GFP_NOFS);
+	acl = posix_acl_alloc(count, GFP_KERNEL);
 	if (!acl)
 		return ERR_PTR(-ENOMEM);
 	for (n = 0; n < count; n++) {
@@ -94,7 +94,7 @@ ext4_acl_to_disk(const struct posix_acl *acl, size_t *size)
 
 	*size = ext4_acl_size(acl->a_count);
 	ext_acl = kmalloc(sizeof(ext4_acl_header) + acl->a_count *
-			sizeof(ext4_acl_entry), GFP_NOFS);
+			sizeof(ext4_acl_entry), GFP_KERNEL);
 	if (!ext_acl)
 		return ERR_PTR(-ENOMEM);
 	ext_acl->a_version = cpu_to_le32(EXT4_ACL_VERSION);
@@ -159,7 +159,7 @@ ext4_get_acl(struct inode *inode, int type)
 	}
 	retval = ext4_xattr_get(inode, name_index, "", NULL, 0);
 	if (retval > 0) {
-		value = kmalloc(retval, GFP_NOFS);
+		value = kmalloc(retval, GFP_KERNEL);
 		if (!value)
 			return ERR_PTR(-ENOMEM);
 		retval = ext4_xattr_get(inode, name_index, "", value, retval);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9867b9e5ad8f..0371e7aa7bea 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2933,7 +2933,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 				le16_to_cpu(path[k].p_hdr->eh_entries)+1;
 	} else {
 		path = kzalloc(sizeof(struct ext4_ext_path) * (depth + 1),
-			       GFP_NOFS);
+			       GFP_KERNEL);
 		if (path == NULL) {
 			ext4_journal_stop(handle);
 			return -ENOMEM;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index cf681004b196..e121f4e048b8 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -816,7 +816,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
 
 	n_group_desc = ext4_kvmalloc((gdb_num + 1) *
 				     sizeof(struct buffer_head *),
-				     GFP_NOFS);
+				     GFP_KERNEL);
 	if (!n_group_desc) {
 		err = -ENOMEM;
 		ext4_warning(sb, "not enough memory for %lu groups",
@@ -943,7 +943,7 @@ static int reserve_backup_gdb(handle_t *handle, struct inode *inode,
 	int res, i;
 	int err;
 
-	primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_NOFS);
+	primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL);
 	if (!primary)
 		return -ENOMEM;
 
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 5a94fa52b74f..172317462238 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -875,7 +875,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 
 			unlock_buffer(bs->bh);
 			ea_bdebug(bs->bh, "cloning");
-			s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
+			s->base = kmalloc(bs->bh->b_size, GFP_KERNEL);
 			error = -ENOMEM;
 			if (s->base == NULL)
 				goto cleanup;
@@ -887,7 +887,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 		}
 	} else {
 		/* Allocate a buffer where we construct the new block. */
-		s->base = kzalloc(sb->s_blocksize, GFP_NOFS);
+		s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
 		/* assert(header == s->base) */
 		error = -ENOMEM;
 		if (s->base == NULL)
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [DEBUG PATCH 0/2] debug explicit GFP_NO{FS,IO} usage from the scope context
  2017-01-06 14:10 ` Michal Hocko
                     ` (3 preceding siblings ...)
  (?)
@ 2017-01-06 14:18   ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:18 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

These two patches should help to identify explicit GFP_NO{FS,IO} usage
from within a scope context and reduce such usage as a result. Such
usage can be changed to the full GFP_KERNEL for two reasons. All calls
from within the NO{FS,IO} scope drop __GFP_FS or __GFP_IO,
respectively, automatically. And if the function is called outside of
the scope, we do not need to restrict it to NOFS/NOIO as long as all
the reclaim recursion unsafe contexts are marked properly. This means
that each reported allocation site has to be checked before it is
converted.

The debugging has to be enabled explicitly by a kernel command line
parameter and then it reports the stack trace of the allocation and
also the function which has started the current scope.
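
The debug patches themselves are not quoted in this message, so the
following is only a guess at the kind of check described, written as a
userspace model. Everything in it is hypothetical: struct model_task,
model_debug_enabled (standing in for the command line parameter), the
model_ functions and the X_ flag values. Skipping requests that cannot
reclaim is an assumption of this sketch, not something stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative bit values only; not the kernel's real flag encoding. */
#define X_GFP_IO      0x01u
#define X_GFP_FS      0x02u
#define X_GFP_RECLAIM 0x04u
#define X_GFP_KERNEL  (X_GFP_RECLAIM | X_GFP_IO | X_GFP_FS)
#define X_GFP_NOFS    (X_GFP_RECLAIM | X_GFP_IO)

struct model_task {
	bool nofs_scope;		/* inside a NOFS scope? */
	const char *scope_owner;	/* label of whoever started it */
};

/* Stands in for the kernel command line switch; off by default. */
static bool model_debug_enabled;

/* Start a scope and remember who started it for later reports. */
static void model_scope_enter(struct model_task *t, const char *owner)
{
	assert(owner != NULL);
	t->nofs_scope = true;
	t->scope_owner = owner;
}

/*
 * Returns the scope owner when the request is an explicit NOFS
 * allocation made inside a scope (a candidate for GFP_KERNEL),
 * or NULL when there is nothing to report.
 */
static const char *model_check_alloc(const struct model_task *t,
				     unsigned int gfp)
{
	if (!model_debug_enabled || !t->nofs_scope)
		return NULL;
	/* Assumption: requests that cannot reclaim are not interesting. */
	if (!(gfp & X_GFP_RECLAIM))
		return NULL;
	/* The FS bit is still set, so the caller did not ask for NOFS. */
	if (gfp & X_GFP_FS)
		return NULL;
	return t->scope_owner;
}
```

A reported site is only a candidate: as noted above, each one still
has to be checked by hand before the flag is relaxed.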

These two patches are _not_ intended to be merged and they are only
aimed at debugging.


^ permalink raw reply	[flat|nested] 167+ messages in thread

* [DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
  2017-01-06 14:18   ` Michal Hocko
                       ` (3 preceding siblings ...)
  (?)
@ 2017-01-06 14:18     ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:18 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
to prefer the scope usage defined by the
memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is made from within the
scope context. The printed stack trace should help to identify the
caller and to evaluate whether it can be changed to use a wider context
or whether it is called from another potentially dangerous context which
needs scope protection as well.

The checks have to be enabled explicitly by the debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 14 +++++++++++--
 include/linux/slab.h  |  3 +++
 mm/page_alloc.c       | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2032fc642a26..59428926e989 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,8 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+	unsigned long nofs_caller;
+	unsigned long noio_caller;
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -2345,6 +2347,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+extern void debug_scope_gfp_context(gfp_t gfp_mask);
+
 /*
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
@@ -2363,25 +2367,31 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 	return flags;
 }
 
-static inline unsigned int memalloc_noio_save(void)
+static inline unsigned int __memalloc_noio_save(unsigned long caller)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
 	current->flags |= PF_MEMALLOC_NOIO;
+	current->noio_caller = caller;
 	return flags;
 }
 
+#define memalloc_noio_save()	__memalloc_noio_save(_RET_IP_)
+
 static inline void memalloc_noio_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
-static inline unsigned int memalloc_nofs_save(void)
+static inline unsigned int __memalloc_nofs_save(unsigned long caller)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
 	current->flags |= PF_MEMALLOC_NOFS;
+	current->nofs_caller = caller;
 	return flags;
 }
 
+#define memalloc_nofs_save()	__memalloc_nofs_save(_RET_IP_)
+
 static inline void memalloc_nofs_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..6559668e29db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -477,6 +477,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  */
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+	debug_scope_gfp_context(flags);
 	if (__builtin_constant_p(size)) {
 		if (size > KMALLOC_MAX_CACHE_SIZE)
 			return kmalloc_large(size, flags);
@@ -517,6 +518,7 @@ static __always_inline int kmalloc_size(int n)
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
+	debug_scope_gfp_context(flags);
 #ifndef CONFIG_SLOB
 	if (__builtin_constant_p(size) &&
 		size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
@@ -575,6 +577,7 @@ int memcg_update_all_caches(int num_memcgs);
  */
 static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
 {
+	debug_scope_gfp_context(flags);
 	if (size != 0 && n > SIZE_MAX / size)
 		return NULL;
 	if (__builtin_constant_p(n) && __builtin_constant_p(size))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5138b46a4295..87a2bb5262b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3738,6 +3738,63 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+	debug_scope_gfp = true;
+	return 0;
+}
+
+/*
+ * Dump the stack trace if the given gfp_mask clears flags which the scope
+ * context already clears task wide. Such a caller can drop the explicit flag
+ * clearing and rely on the context wide mask.
+ */
+void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+	gfp_t restrict_mask;
+
+	if (likely(!debug_scope_gfp))
+		return;
+
+	/* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return;
+
+	if (current->flags & PF_MEMALLOC_NOIO)
+		restrict_mask = __GFP_IO;
+	else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+		restrict_mask = __GFP_FS;
+	else
+		return;
+
+	if ((gfp_mask & restrict_mask) != restrict_mask) {
+		/*
+		 * If you see this warning then the code does:
+		 * memalloc_no{fs,io}_save()
+		 * ...
+		 *    foo()
+		 *      alloc_page(GFP_NO{FS,IO})
+		 * ...
+		 * memalloc_no{fs,io}_restore()
+		 *
+		 * allocation which is unnecessary because the scope gfp
+		 * context will do that for all allocation requests already.
+		 * If foo() is called from multiple contexts then make sure other
+		 * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+		 * scope protection into particular paths or change the gfp mask
+		 * to GFP_KERNEL.
+		 */
+		pr_info("Unnecessarily specific gfp mask:%#x(%pGg) for the %s task wide context from %ps\n", gfp_mask, &gfp_mask,
+				(current->flags & PF_MEMALLOC_NOIO)?"NOIO":"NOFS",
+				(void*)((current->flags & PF_MEMALLOC_NOIO)?current->noio_caller:current->nofs_caller));
+		dump_stack();
+	}
+}
+EXPORT_SYMBOL(debug_scope_gfp_context);
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3802,6 +3859,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* First allocation attempt */
+	debug_scope_gfp_context(gfp_mask);
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
-- 
2.11.0
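
The decision made by debug_scope_gfp_context() in the patch above can be
tried out in userspace with a small model. This is only a sketch under
stated assumptions: the flag values are illustrative rather than the
kernel's, the task flags are passed in as a parameter instead of being
read from current, and scope_gfp_is_redundant() is a hypothetical name
for a function that returns true where the patch would print its message
and dump the stack.

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Illustrative bit values only; the kernel's real values differ. */
#define __GFP_IO             0x1u
#define __GFP_FS             0x2u
#define __GFP_DIRECT_RECLAIM 0x4u
#define GFP_KERNEL (__GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_NOFS   (__GFP_DIRECT_RECLAIM | __GFP_IO)
#define GFP_NOIO   (__GFP_DIRECT_RECLAIM)

#define PF_MEMALLOC_NOIO 0x1u
#define PF_MEMALLOC_NOFS 0x2u

/*
 * Hypothetical model of the check: true means the explicit mask repeats
 * a restriction that the task-wide scope already imposes.
 */
static bool scope_gfp_is_redundant(unsigned int task_flags, gfp_t gfp_mask)
{
	gfp_t restrict_mask;

	/* NOFS and NOIO are irrelevant when direct reclaim is disabled. */
	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
		return false;

	if (task_flags & PF_MEMALLOC_NOIO)
		restrict_mask = __GFP_IO;
	else if ((task_flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
		restrict_mask = __GFP_FS;
	else
		return false;

	/* Report only if the caller cleared the bit the scope clears anyway. */
	return (gfp_mask & restrict_mask) != restrict_mask;
}
```

In this model an explicit GFP_NOFS inside a NOFS scope and an explicit
GFP_NOIO inside a NOIO scope are reported, while GFP_KERNEL requests and
requests without direct reclaim are not. In the real patch the check
does nothing unless debug_scope_gfp is present on the kernel command
line.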


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context
@ 2017-01-06 14:18     ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:18 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is done from the scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
a scope protection as well.

The checks have to be enabled explicitly by debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 14 +++++++++++--
 include/linux/slab.h  |  3 +++
 mm/page_alloc.c       | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2032fc642a26..59428926e989 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,8 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+	unsigned long nofs_caller;
+	unsigned long noio_caller;
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -2345,6 +2347,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+extern void debug_scope_gfp_context(gfp_t gfp_mask);
+
 /*
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
@@ -2363,25 +2367,31 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 	return flags;
 }
 
-static inline unsigned int memalloc_noio_save(void)
+static inline unsigned int __memalloc_noio_save(unsigned long caller)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
 	current->flags |= PF_MEMALLOC_NOIO;
+	current->noio_caller = caller;
 	return flags;
 }
 
+#define memalloc_noio_save()	__memalloc_noio_save(_RET_IP_)
+
 static inline void memalloc_noio_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
-static inline unsigned int memalloc_nofs_save(void)
+static inline unsigned int __memalloc_nofs_save(unsigned long caller)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
 	current->flags |= PF_MEMALLOC_NOFS;
+	current->nofs_caller = caller;
 	return flags;
 }
 
+#define memalloc_nofs_save()	__memalloc_nofs_save(_RET_IP_)
+
 static inline void memalloc_nofs_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..6559668e29db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -477,6 +477,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  */
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+	debug_scope_gfp_context(flags);
 	if (__builtin_constant_p(size)) {
 		if (size > KMALLOC_MAX_CACHE_SIZE)
 			return kmalloc_large(size, flags);
@@ -517,6 +518,7 @@ static __always_inline int kmalloc_size(int n)
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
+	debug_scope_gfp_context(flags);
 #ifndef CONFIG_SLOB
 	if (__builtin_constant_p(size) &&
 		size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
@@ -575,6 +577,7 @@ int memcg_update_all_caches(int num_memcgs);
  */
 static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
 {
+	debug_scope_gfp_context(flags);
 	if (size != 0 && n > SIZE_MAX / size)
 		return NULL;
 	if (__builtin_constant_p(n) && __builtin_constant_p(size))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5138b46a4295..87a2bb5262b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3738,6 +3738,63 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+	debug_scope_gfp = true;
+	return 0;
+}
+
+/*
+ * spit the stack trace if the given gfp_mask clears flags which are context
+ * wide cleared. Such a caller can remove special flags clearing and rely on
+ * the context wide mask.
+ */
+void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+	gfp_t restrict_mask;
+
+	if (likely(!debug_scope_gfp))
+		return;
+
+	/* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return;
+
+	if (current->flags & PF_MEMALLOC_NOIO)
+		restrict_mask = __GFP_IO;
+	else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+		restrict_mask = __GFP_FS;
+	else
+		return;
+
+	if ((gfp_mask & restrict_mask) != restrict_mask) {
+		/*
+		 * If you see this this warning then the code does:
+		 * memalloc_no{fs,io}_save()
+		 * ...
+		 *    foo()
+		 *      alloc_page(GFP_NO{FS,IO})
+		 * ...
+		 * memalloc_no{fs,io}_restore()
+		 *
+		 * allocation which is unnecessary because the scope gfp
+		 * context will do that for all allocation requests already.
+		 * If foo() is called from multiple contexts then make sure other
+		 * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+		 * scope protection into particular paths or change the gfp mask
+		 * to GFP_KERNEL.
+		 */
+		pr_info("Unnecessarily specific gfp mask:%#x(%pGg) for the %s task wide context from %ps\n", gfp_mask, &gfp_mask,
+				(current->flags & PF_MEMALLOC_NOIO)?"NOIO":"NOFS",
+				(void*)((current->flags & PF_MEMALLOC_NOIO)?current->noio_caller:current->nofs_caller));
+		dump_stack();
+	}
+}
+EXPORT_SYMBOL(debug_scope_gfp_context);
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3802,6 +3859,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* First allocation attempt */
+	debug_scope_gfp_context(gfp_mask);
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS, IO} is used explicitly from memalloc_no{fs, io}_{save, restore} context
@ 2017-01-06 14:18     ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:18 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
prefer the scope API defined by memalloc_no{fs,io}_{save,restore}.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is done from the scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
a scope protection as well.

The checks have to be enabled explicitly with the debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/sched.h | 14 +++++++++++--
 include/linux/slab.h  |  3 +++
 mm/page_alloc.c       | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2032fc642a26..59428926e989 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,8 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+	unsigned long nofs_caller;
+	unsigned long noio_caller;
 /* CPU-specific state of this task */
 	struct thread_struct thread;
 /*
@@ -2345,6 +2347,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
+extern void debug_scope_gfp_context(gfp_t gfp_mask);
+
 /*
  * Applies per-task gfp context to the given allocation flags.
  * PF_MEMALLOC_NOIO implies GFP_NOIO
@@ -2363,25 +2367,31 @@ static inline gfp_t current_gfp_context(gfp_t flags)
 	return flags;
 }
 
-static inline unsigned int memalloc_noio_save(void)
+static inline unsigned int __memalloc_noio_save(unsigned long caller)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
 	current->flags |= PF_MEMALLOC_NOIO;
+	current->noio_caller = caller;
 	return flags;
 }
 
+#define memalloc_noio_save()	__memalloc_noio_save(_RET_IP_)
+
 static inline void memalloc_noio_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
-static inline unsigned int memalloc_nofs_save(void)
+static inline unsigned int __memalloc_nofs_save(unsigned long caller)
 {
 	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
 	current->flags |= PF_MEMALLOC_NOFS;
+	current->nofs_caller = caller;
 	return flags;
 }
 
+#define memalloc_nofs_save()	__memalloc_nofs_save(_RET_IP_)
+
 static inline void memalloc_nofs_restore(unsigned int flags)
 {
 	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..6559668e29db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -477,6 +477,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  */
 static __always_inline void *kmalloc(size_t size, gfp_t flags)
 {
+	debug_scope_gfp_context(flags);
 	if (__builtin_constant_p(size)) {
 		if (size > KMALLOC_MAX_CACHE_SIZE)
 			return kmalloc_large(size, flags);
@@ -517,6 +518,7 @@ static __always_inline int kmalloc_size(int n)
 
 static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
+	debug_scope_gfp_context(flags);
 #ifndef CONFIG_SLOB
 	if (__builtin_constant_p(size) &&
 		size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
@@ -575,6 +577,7 @@ int memcg_update_all_caches(int num_memcgs);
  */
 static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
 {
+	debug_scope_gfp_context(flags);
 	if (size != 0 && n > SIZE_MAX / size)
 		return NULL;
 	if (__builtin_constant_p(n) && __builtin_constant_p(size))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5138b46a4295..87a2bb5262b2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3738,6 +3738,63 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+	debug_scope_gfp = true;
+	return 0;
+}
+
+/*
+ * Dump a stack trace if the given gfp_mask clears flags which are already
+ * cleared context wide. Such a caller can drop the explicit flag clearing and
+ * rely on the context wide mask.
+ */
+void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+	gfp_t restrict_mask;
+
+	if (likely(!debug_scope_gfp))
+		return;
+
+	/* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+		return;
+
+	if (current->flags & PF_MEMALLOC_NOIO)
+		restrict_mask = __GFP_IO;
+	else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+		restrict_mask = __GFP_FS;
+	else
+		return;
+
+	if ((gfp_mask & restrict_mask) != restrict_mask) {
+		/*
+		 * If you see this warning then the code does:
+		 * memalloc_no{fs,io}_save()
+		 * ...
+		 *    foo()
+		 *      alloc_page(GFP_NO{FS,IO})
+		 * ...
+		 * memalloc_no{fs,io}_restore()
+		 *
+		 * allocation which is unnecessary because the scope gfp
+		 * context will do that for all allocation requests already.
+		 * If foo() is called from multiple contexts then make sure other
+		 * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+		 * scope protection into particular paths or change the gfp mask
+		 * to GFP_KERNEL.
+		 */
+		pr_info("Unnecessarily specific gfp mask:%#x(%pGg) for the %s task wide context from %ps\n", gfp_mask, &gfp_mask,
+				(current->flags & PF_MEMALLOC_NOIO)?"NOIO":"NOFS",
+				(void*)((current->flags & PF_MEMALLOC_NOIO)?current->noio_caller:current->nofs_caller));
+		dump_stack();
+	}
+}
+EXPORT_SYMBOL(debug_scope_gfp_context);
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
 /*
  * This is the 'heart' of the zoned buddy allocator.
  */
@@ -3802,6 +3859,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/* First allocation attempt */
+	debug_scope_gfp_context(gfp_mask);
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))
 		goto out;
-- 
2.11.0
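
The decision logic of debug_scope_gfp_context() can be tried out in
userspace. The following is a minimal sketch, not kernel code: the MODEL_*
bit values and the scope_check_would_warn() and run_examples() names are
made up for illustration, and the composite masks keep only the three bits
the check looks at (the real GFP_* masks carry more). It mirrors the early
return when direct reclaim is not allowed, the precedence of the NOIO scope
over the NOFS scope, and the final mask comparison.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit values for this sketch; NOT the kernel's real values. */
#define MODEL_GFP_IO             0x1u
#define MODEL_GFP_FS             0x2u
#define MODEL_GFP_DIRECT_RECLAIM 0x4u

/* Simplified composite masks: only the bits relevant to the check. */
#define MODEL_GFP_KERNEL (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS   (MODEL_GFP_DIRECT_RECLAIM | MODEL_GFP_IO)
#define MODEL_GFP_NOIO   (MODEL_GFP_DIRECT_RECLAIM)

/* Stand-ins for PF_MEMALLOC_NOIO and PF_MEMALLOC_NOFS. */
#define MODEL_PF_NOIO 0x1u
#define MODEL_PF_NOFS 0x2u

/* Returns true where the debug patch would print its message. */
bool scope_check_would_warn(unsigned int gfp_mask, unsigned int task_flags)
{
	unsigned int restrict_mask;

	/* NOFS and NOIO are irrelevant when direct reclaim is disabled. */
	if (!(gfp_mask & MODEL_GFP_DIRECT_RECLAIM))
		return false;

	if (task_flags & MODEL_PF_NOIO)
		restrict_mask = MODEL_GFP_IO;
	else if ((task_flags & MODEL_PF_NOFS) && (gfp_mask & MODEL_GFP_IO))
		restrict_mask = MODEL_GFP_FS;
	else
		return false;

	/* Warn when the caller already cleared what the scope clears. */
	return (gfp_mask & restrict_mask) != restrict_mask;
}

void run_examples(void)
{
	/* Explicit GFP_NOFS inside a NOFS scope is redundant: reported. */
	assert(scope_check_would_warn(MODEL_GFP_NOFS, MODEL_PF_NOFS));
	/* GFP_KERNEL inside a NOFS scope is the desired end state. */
	assert(!scope_check_would_warn(MODEL_GFP_KERNEL, MODEL_PF_NOFS));
	/* Explicit GFP_NOIO inside a NOIO scope is redundant: reported. */
	assert(scope_check_would_warn(MODEL_GFP_NOIO, MODEL_PF_NOIO));
	/* Outside of any scope nothing is reported. */
	assert(!scope_check_would_warn(MODEL_GFP_NOFS, 0));
	/* Requests that cannot enter direct reclaim are skipped. */
	assert(!scope_check_would_warn(MODEL_GFP_NOFS & ~MODEL_GFP_DIRECT_RECLAIM,
				       MODEL_PF_NOFS));
}
```

Under this model an explicit GFP_NOIO request inside a NOFS scope is not
reported, because the NOFS branch is only taken when the request still
allows __GFP_IO.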

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* [DEBUG PATCH 2/2] silent warnings which we cannot do anything about
  2017-01-06 14:18   ` Michal Hocko
                       ` (2 preceding siblings ...)
  (?)
@ 2017-01-06 14:18     ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-06 14:18 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

There are some code paths used by all the filesystems which we cannot
change to drop the GFP_NOFS, yet they generate a lot of warnings.
Provide {disable,enable}_scope_gfp_check to silence those.
alloc_page_buffers and grow_dev_page are silenced right away.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/buffer.c           |  4 ++++
 include/linux/sched.h | 11 +++++++++++
 mm/page_alloc.c       |  3 +++
 3 files changed, 18 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 28484b3ebc98..dbe529e7881b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -873,7 +873,9 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
 	head = NULL;
 	offset = PAGE_SIZE;
 	while ((offset -= size) >= 0) {
+		disable_scope_gfp_check();
 		bh = alloc_buffer_head(GFP_NOFS);
+		enable_scope_gfp_check();
 		if (!bh)
 			goto no_grow;
 
@@ -1003,7 +1005,9 @@ grow_dev_page(struct block_device *bdev, sector_t block,
 	 */
 	gfp_mask |= __GFP_NOFAIL;
 
+	disable_scope_gfp_check();
 	page = find_or_create_page(inode->i_mapping, index, gfp_mask);
+	enable_scope_gfp_check();
 	if (!page)
 		return ret;
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 59428926e989..f60294732ed5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,7 @@ struct task_struct {
 	/* A live task holds one reference. */
 	atomic_t stack_refcount;
 #endif
+	bool disable_scope_gfp_warn;
 	unsigned long nofs_caller;
 	unsigned long noio_caller;
 /* CPU-specific state of this task */
@@ -2390,6 +2391,16 @@ static inline unsigned int __memalloc_nofs_save(unsigned long caller)
 	return flags;
 }
 
+static inline void disable_scope_gfp_check(void)
+{
+	current->disable_scope_gfp_warn = true;
+}
+
+static inline void enable_scope_gfp_check(void)
+{
+	current->disable_scope_gfp_warn = false;
+}
+
 #define memalloc_nofs_save()	__memalloc_nofs_save(_RET_IP_)
 
 static inline void memalloc_nofs_restore(unsigned int flags)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 87a2bb5262b2..5405278bd733 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3762,6 +3762,9 @@ void debug_scope_gfp_context(gfp_t gfp_mask)
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;
 
+	if (current->disable_scope_gfp_warn)
+		return;
+
 	if (current->flags & PF_MEMALLOC_NOIO)
 		restrict_mask = __GFP_IO;
 	else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
-- 
2.11.0
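
One property of the helpers above is that they toggle a plain boolean, so
the pair does not nest: an inner enable_scope_gfp_check() would re-enable
the warning for the rest of an outer silenced section. The two call sites
converted by this patch do not nest, so this only matters if more callers
are added. The sketch below is a userspace model under stated assumptions:
struct model_task stands in for the single task_struct field, and
model_check_save()/model_check_restore() are a hypothetical alternative
(not part of the patch) that follows the save/restore pattern of
memalloc_nofs_save().

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the one task_struct field the patch adds. */
struct model_task {
	bool disable_scope_gfp_warn;
};

/* Same behavior as the patch's helpers, with the task passed explicitly. */
void model_disable_check(struct model_task *t)
{
	t->disable_scope_gfp_warn = true;
}

void model_enable_check(struct model_task *t)
{
	t->disable_scope_gfp_warn = false;
}

/* Hypothetical nestable variant: hand the previous state to the caller. */
bool model_check_save(struct model_task *t)
{
	bool old = t->disable_scope_gfp_warn;

	t->disable_scope_gfp_warn = true;
	return old;
}

void model_check_restore(struct model_task *t, bool old)
{
	t->disable_scope_gfp_warn = old;
}

void run_nesting_examples(void)
{
	struct model_task t = { false };
	bool outer, inner;

	/* Patch helpers: the inner enable ends the outer suppression early. */
	model_disable_check(&t);		/* outer section starts */
	model_disable_check(&t);		/* inner section starts */
	model_enable_check(&t);			/* inner section ends */
	assert(!t.disable_scope_gfp_warn);	/* outer is no longer silenced */
	model_enable_check(&t);			/* outer section ends */

	/* Save/restore variant: the outer suppression survives the inner one. */
	outer = model_check_save(&t);
	inner = model_check_save(&t);
	model_check_restore(&t, inner);
	assert(t.disable_scope_gfp_warn);	/* still silenced for outer */
	model_check_restore(&t, outer);
	assert(!t.disable_scope_gfp_warn);	/* back to the initial state */
}
```

For a testing-only patch the simple boolean is probably sufficient; the
variant is shown only to make the nesting behavior explicit.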


^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH 1/8] lockdep: allow to disable reclaim lockup detection
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 12:56     ` Vlastimil Babka
  -1 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 12:56 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> The current implementation of the reclaim lockup detection can lead to
> false positives and those even happen and usually lead to tweak the
> code to silence the lockdep by using GFP_NOFS even though the context
> can use __GFP_FS just fine. See
> http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.
> 
> =================================
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G           O
> ---------------------------------
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> 
> (&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
> 
> {RECLAIM_FS-ON-R} state was registered at:
>   [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
>   [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
>   [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
>   [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
>   [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
>   [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
>   [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
>   [<ffffffff81244459>] SyS_ioctl+0x79/0x90
>   [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f
> 
>        CPU0
>        ----
>   lock(&xfs_nondir_ilock_class);
>   <Interrupt>
>     lock(&xfs_nondir_ilock_class);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kswapd0/543:
> 
> stack backtrace:
> CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
> 
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> 
>  ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
>  ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
>  ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
> Call Trace:
>  [<ffffffff813a14f9>] dump_stack+0x4b/0x72
>  [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
>  [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
>  [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
>  [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
>  [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
>  [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
>  [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
>  [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
>  [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
>  [<ffffffff8124d7d5>] evict+0xc5/0x190
>  [<ffffffff8124d8d9>] dispose_list+0x39/0x60
>  [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
>  [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
>  [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
>  [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
>  [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
>  [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
>  [<ffffffff810e6882>] kthread+0xf2/0x110
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
>  [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
>  [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
> 
> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.
> 
> So, because the reclaim annotations overload the interrupt level
> detections and it's seen the inode ilock been taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
> 
> This sounds like a fundamental problem of the reclaim lock detection.
> It is really impossible to annotate such a special usecase IMHO unless
> the reclaim lockup detection is reworked completely. Until then it
> is much better to provide a way to add an "I know what I am doing" flag
> and mark problematic places. This would prevent abuse of the GFP_NOFS
> flag, which has a runtime effect even on configurations which have
> lockdep disabled.
> 
> Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
> skip the current allocation request.
> 
> While we are at it also make sure that the radix tree doesn't
> accidentally override tags stored in the upper part of the gfp_mask.
> 
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
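
For readers following the thread, the effect of the new flag can be
modelled in user space. The sketch below is an illustration only, not the
posted patch: the LD_* names, the bit values and ld_lockdep_tracks_alloc()
are assumptions made up for the example, and the real check lives in the
lockdep allocation hook in the kernel.

```c
#include <stdbool.h>

/* Hypothetical bit values for this sketch; the real flags are defined
 * in include/linux/gfp.h and have different values. */
typedef unsigned int ld_gfp_t;
#define LD_GFP_DIRECT_RECLAIM 0x1u
#define LD_GFP_FS             0x2u
#define LD_GFP_NOLOCKDEP      0x4u

/* Returns true when the modelled reclaim tracking would inspect this
 * allocation request, false when it would skip it. */
static inline bool ld_lockdep_tracks_alloc(ld_gfp_t gfp_mask)
{
	/* Requests that cannot enter direct reclaim are not interesting. */
	if (!(gfp_mask & LD_GFP_DIRECT_RECLAIM))
		return false;

	/* Without the FS bit, reclaim does not recurse into the filesystem. */
	if (!(gfp_mask & LD_GFP_FS))
		return false;

	/* The opt-out: the caller states that this context is known to be safe. */
	if (gfp_mask & LD_GFP_NOLOCKDEP)
		return false;

	return true;
}
```

An allocation site known to trigger the false positive would OR in the
opt-out bit instead of dropping __GFP_FS, so reclaim behaviour is the same
whether or not lockdep is enabled.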


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 12:59     ` Vlastimil Babka
  -1 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 12:59 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name PF_MEMALLOC_NOFS which is in line with an existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

A nit:

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
>  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
>  
> +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */

I don't see why this transition is needed, as there are already no users
of PF_FSTRANS after this patch. The next patch doesn't remove any more,
so this is just extra churn IMHO. But not a strong objection.

> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 13:04     ` Vlastimil Babka
  -1 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 13:04 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> GFP_NOFS context is currently used for the following 5 reasons
> 	- to prevent deadlocks when the lock held by the allocation
> 	  context would be needed during the memory reclaim
> 	- to prevent stack overflows during the reclaim because
> 	  the allocation is performed from a deep context already
> 	- to prevent lockups when the allocation context depends on
> 	  other reclaimers to make forward progress indirectly
> 	- just in case because this would be safe from the fs POV
> 	- to silence lockdep false positives
> 
> Unfortunately overuse of this allocation context brings some problems
> to the MM. Memory reclaim is much weaker (especially during heavy FS
> metadata workloads) and the OOM killer cannot be invoked because the MM
> layer doesn't have enough information about how much memory is freeable
> by the FS layer.
> 
> In many cases it is far from clear why the weaker context is even used
> and so it might be used unnecessarily. We would like to get rid of
> those as much as possible. One way to do that is to use the flag in
> scopes rather than isolated cases. Such a scope is declared when really
> necessary, tracked per task and all the allocation requests from within
> the context will simply inherit the GFP_NOFS semantic.
> 
> Not only is this easier to understand and maintain because there are
> far fewer problematic contexts than specific allocation requests, it
> also helps code paths where the FS layer interacts with other layers (e.g.
> crypto, security modules, MM etc...) and there is no easy way to convey
> the allocation context between the layers.
> 
> Introduce the memalloc_nofs_{save,restore} API to control the scope
> of the GFP_NOFS allocation context. This basically copies the
> memalloc_noio_{save,restore} API we have for the other restricted
> allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
> and it is just an alias for PF_FSTRANS which has been xfs specific until
> recently. There are no PF_FSTRANS users anymore so let's just drop it.
> 
> PF_MEMALLOC_NOFS is now checked in the MM layer and implicitly drops
> __GFP_FS, the same way PF_MEMALLOC_NOIO drops __GFP_IO.
> memalloc_noio_flags is renamed to current_gfp_context because it now
> cares about both PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs
> code paths preserve their semantic. kmem_flags_convert() doesn't need to
> evaluate the flag anymore.
> 
> This patch shouldn't introduce any functional changes.
> 
> Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> usage as much as possible and only use properly documented
> memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>


[...]

> +static inline unsigned int memalloc_nofs_save(void)
> +{
> +	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> +	current->flags |= PF_MEMALLOC_NOFS;

So this is not new, as the same goes for memalloc_noio_save, but I've
noticed that e.g. exit_signal() does tsk->flags |= PF_EXITING;
So is it possible that there's a read-modify-write hazard here?

> +	return flags;
> +}
> +
> +static inline void memalloc_nofs_restore(unsigned int flags)
> +{
> +	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
> +}
> +
>  /* Per-process atomic flags. */
>  #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
>  #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */

[...]

> @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  	int nid;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> +		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |

So this function didn't do memalloc_noio_flags() before? Is it a bug
that should be fixed separately or at least mentioned? Because that
looks like a functional change...

Thanks!

>  				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
>  		.reclaim_idx = MAX_NR_ZONES - 1,
>  		.target_mem_cgroup = memcg,
> @@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	int classzone_idx = gfp_zone(gfp_mask);
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
> +		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
>  		.order = order,
>  		.priority = NODE_RECLAIM_PRIORITY,
>  		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> 
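
To make the scope semantics discussed above concrete, here is a small
user-space model of the save/restore pair and of the mask filtering. It is
a sketch under stated assumptions rather than the kernel code: the sk_*
names and all bit values are invented for the example, and sk_task_flags
stands in for current->flags.

```c
/* Hypothetical bit values for this sketch only. */
typedef unsigned int sk_gfp_t;
#define SK_PF_MEMALLOC_NOIO 0x1u
#define SK_PF_MEMALLOC_NOFS 0x2u
#define SK_GFP_IO           0x10u
#define SK_GFP_FS           0x20u
#define SK_GFP_KERNEL       (SK_GFP_IO | SK_GFP_FS)

/* Stand-in for current->flags; only the owning task writes to it. */
static unsigned int sk_task_flags;

/* Open a NOFS scope and return the previous state of the NOFS bit. */
static inline unsigned int sk_memalloc_nofs_save(void)
{
	unsigned int old = sk_task_flags & SK_PF_MEMALLOC_NOFS;

	sk_task_flags |= SK_PF_MEMALLOC_NOFS;
	return old;
}

/* Close a scope by putting back the bit state returned by the save. */
static inline void sk_memalloc_nofs_restore(unsigned int old)
{
	sk_task_flags = (sk_task_flags & ~SK_PF_MEMALLOC_NOFS) | old;
}

/* Filter a request mask according to the scope the task is in. */
static inline sk_gfp_t sk_current_gfp_context(sk_gfp_t gfp)
{
	if (sk_task_flags & SK_PF_MEMALLOC_NOIO)
		gfp &= ~(SK_GFP_IO | SK_GFP_FS);	/* NOIO implies NOFS */
	else if (sk_task_flags & SK_PF_MEMALLOC_NOFS)
		gfp &= ~SK_GFP_FS;
	return gfp;
}

/* Nested scopes: closing the inner one must leave the outer one active.
 * Returns the mask an allocation sees between the two restores. */
static inline sk_gfp_t sk_nested_demo(void)
{
	unsigned int outer, inner;
	sk_gfp_t seen;

	outer = sk_memalloc_nofs_save();
	inner = sk_memalloc_nofs_save();
	sk_memalloc_nofs_restore(inner);
	seen = sk_current_gfp_context(SK_GFP_KERNEL);
	sk_memalloc_nofs_restore(outer);
	return seen;
}
```

Returning the old bit from the save, instead of clearing the bit
unconditionally on restore, is what lets scopes nest: the inner restore
re-applies the bit that the outer scope had already set.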


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
@ 2017-01-09 13:04     ` Vlastimil Babka
  0 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 13:04 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> GFP_NOFS context is used for the following 5 reasons currently
> 	- to prevent from deadlocks when the lock held by the allocation
> 	  context would be needed during the memory reclaim
> 	- to prevent from stack overflows during the reclaim because
> 	  the allocation is performed from a deep context already
> 	- to prevent lockups when the allocation context depends on
> 	  other reclaimers to make a forward progress indirectly
> 	- just in case because this would be safe from the fs POV
> 	- silence lockdep false positives
> 
> Unfortunately overuse of this allocation context brings some problems
> to the MM. Memory reclaim is much weaker (especially during heavy FS
> metadata workloads), OOM killer cannot be invoked because the MM layer
> doesn't have enough information about how much memory is freeable by the
> FS layer.
> 
> In many cases it is far from clear why the weaker context is even used
> and so it might be used unnecessarily. We would like to get rid of
> those as much as possible. One way to do that is to use the flag in
> scopes rather than isolated cases. Such a scope is declared when really
> necessary, tracked per task and all the allocation requests from within
> the context will simply inherit the GFP_NOFS semantic.
> 
> Not only this is easier to understand and maintain because there are
> much less problematic contexts than specific allocation requests, this
> also helps code paths where FS layer interacts with other layers (e.g.
> crypto, security modules, MM etc...) and there is no easy way to convey
> the allocation context between the layers.
> 
> Introduce memalloc_nofs_{save,restore} API to control the scope
> of GFP_NOFS allocation context. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
> just an alias for PF_FSTRANS which has been xfs specific until recently.
> There are no more PF_FSTRANS users anymore so let's just drop it.
> 
> PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
> implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
> is renamed to current_gfp_context because it now cares about both
> PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
> their semantic. kmem_flags_convert() doesn't need to evaluate the flag
> anymore.
> 
> This patch shouldn't introduce any functional changes.
> 
> Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> usage as much as possible and only use a properly documented
> memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>


[...]

> +static inline unsigned int memalloc_nofs_save(void)
> +{
> +	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> +	current->flags |= PF_MEMALLOC_NOFS;

So this is not new, as same goes for memalloc_noio_save, but I've
noticed that e.g. exit_signal() does tsk->flags |= PF_EXITING;
So is it possible that there's a r-m-w hazard here?

> +	return flags;
> +}
> +
> +static inline void memalloc_nofs_restore(unsigned int flags)
> +{
> +	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
> +}
> +
>  /* Per-process atomic flags. */
>  #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
>  #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */

[...]

> @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  	int nid;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> +		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |

So this function didn't do memalloc_noio_flags() before? Is it a bug
that should be fixed separately or at least mentioned? Because that
looks like a functional change...

Thanks!

>  				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
>  		.reclaim_idx = MAX_NR_ZONES - 1,
>  		.target_mem_cgroup = memcg,
> @@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	int classzone_idx = gfp_zone(gfp_mask);
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
> +		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
>  		.order = order,
>  		.priority = NODE_RECLAIM_PRIORITY,
>  		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> 

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
  2017-01-09 13:04     ` [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API Vlastimil Babka
  (?)
@ 2017-01-09 13:42       ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-09 13:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
[...]
> > +static inline unsigned int memalloc_nofs_save(void)
> > +{
> > +	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> > +	current->flags |= PF_MEMALLOC_NOFS;
> 
> So this is not new, as the same goes for memalloc_noio_save, but I've
> noticed that e.g. exit_signals() does tsk->flags |= PF_EXITING;
> So is it possible that there's a read-modify-write hazard here?

exit_signals() operates on current, and task_struct::flags should only
ever be modified by the current task itself.
[...]
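
For anyone following along, the nesting behaviour of the save/restore
pair can be modelled in userspace. This is only a sketch under stated
assumptions: task_flags stands in for current->flags, the bit values
are made up for illustration, and in_nofs_scope() is a hypothetical
helper that does not exist in the patch.

```c
#include <assert.h>

/* Illustrative bit values only; the real ones live in the kernel headers. */
#define PF_EXITING        0x00000004u
#define PF_MEMALLOC_NOFS  0x00040000u

/* Stand-in for current->flags. Only the owning task ever writes it. */
static unsigned int task_flags;

/* Enter a NOFS scope and return the previous state of the NOFS bit. */
static unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return flags;
}

/* Leave a NOFS scope by putting the saved NOFS bit back. */
static void memalloc_nofs_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* Hypothetical helper: non-zero while inside a NOFS scope. */
static int in_nofs_scope(void)
{
	return (task_flags & PF_MEMALLOC_NOFS) != 0;
}
```

Because save returns only the old NOFS bit, an inner restore leaves the
outer scope in force, and unrelated bits such as PF_EXITING are never
touched by the pair.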

> > @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> >  	int nid;
> >  	struct scan_control sc = {
> >  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> > -		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> > +		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> 
> So this function didn't do memalloc_noio_flags() before? Is it a bug
> that should be fixed separately or at least mentioned? Because that
> looks like a functional change...

We didn't need it. Kmem charges are opt-in and currently all of them
support GFP_IO. The LRU pages are not charged in NOIO context either.
We need it now because there will be callers that charge with GFP_KERNEL
while being inside the NOFS scope.
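
To make the GFP_KERNEL-inside-NOFS-scope case concrete, here is a
userspace model of the filtering. It is a sketch, not the kernel code:
gfp_context_for() is a hypothetical pure-function variant that takes
the task flags as a parameter, all DEMO_* bit values are invented, and
it assumes the NOIO scope clears both the IO and FS bits while the NOFS
scope clears only the FS bit.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Invented bit values for the model; not the kernel's. */
#define DEMO_GFP_OTHER  0x01u	/* any bit the scopes do not care about */
#define DEMO_GFP_IO     0x40u
#define DEMO_GFP_FS     0x80u

#define DEMO_PF_NOIO    0x1u
#define DEMO_PF_NOFS    0x2u

/* Weaken the requested mask according to the task's scope flags. */
static gfp_t gfp_context_for(unsigned int task_flags, gfp_t gfp)
{
	if (task_flags & DEMO_PF_NOIO)
		gfp &= ~(DEMO_GFP_IO | DEMO_GFP_FS);	/* NOIO implies NOFS */
	else if (task_flags & DEMO_PF_NOFS)
		gfp &= ~DEMO_GFP_FS;
	return gfp;
}
```

A caller that asks for the full mask from inside a NOFS scope ends up
reclaiming without the FS bit, which is the behaviour the memcg charge
path has to pick up as well.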

Now that you have opened this I have noticed that the code is wrong
here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
the removed GFP_FS. I guess it would be better and less error prone
to move the current_gfp_context part into the direct reclaim entry -
do_try_to_free_pages - and put the comment like this
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4ea6b610f20e..df7975185f11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2756,6 +2756,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	int initial_priority = sc->priority;
 	unsigned long total_scanned = 0;
 	unsigned long writeback_threshold;
+
+	/*
+	 * Make sure that the gfp context properly handles scope gfp mask.
+	 * This might weaken the reclaim context (e.g. make it GFP_NOFS or
+	 * GFP_NOIO).
+	 */
+	sc->gfp_mask = current_gfp_context(sc->gfp_mask);
 retry:
 	delayacct_freepages_start();
 
@@ -2949,7 +2956,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
+		.gfp_mask = gfp_mask,
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3029,8 +3036,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
-				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
+		.gfp_mask = GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
 		.priority = DEF_PRIORITY,
@@ -3723,7 +3729,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
+		.gfp_mask = gfp_mask,
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
  2017-01-09 13:42       ` [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
  (?)
@ 2017-01-09 13:59         ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-09 13:59 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 09-01-17 14:42:10, Michal Hocko wrote:
> On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
[...]
> Now that you have opened this I have noticed that the code is wrong
> here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
> the removed GFP_FS.

Blee, it wouldn't, because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK will
contain neither GFP_FS nor GFP_IO. So all is good here.
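
The same argument as a small userspace check. This is a sketch with
invented DEMO_* bit values; the only assumption carried over from the
kernel is that the reclaim mask includes the IO and FS bits.
memcg_reclaim_mask() is a hypothetical name for the expression used in
try_to_free_mem_cgroup_pages.

```c
#include <assert.h>

typedef unsigned int gfp_t;

/* Invented bit values for the model; not the kernel's. */
#define DEMO_GFP_IO       0x01u
#define DEMO_GFP_FS       0x02u
#define DEMO_GFP_MOVABLE  0x10u
#define DEMO_GFP_HIGHMEM  0x20u

/* Assumed: the reclaim mask covers at least the IO and FS bits. */
#define DEMO_RECLAIM_MASK      (DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_HIGHUSER_MOVABLE  (DEMO_GFP_IO | DEMO_GFP_FS | \
				DEMO_GFP_MOVABLE | DEMO_GFP_HIGHMEM)

/*
 * Reclaim bits come from the (already filtered) caller mask, everything
 * else from the HIGHUSER_MOVABLE template.
 */
static gfp_t memcg_reclaim_mask(gfp_t filtered)
{
	return (filtered & DEMO_RECLAIM_MASK) |
	       (DEMO_HIGHUSER_MOVABLE & ~DEMO_RECLAIM_MASK);
}
```

The right-hand operand of the OR has every reclaim bit masked out, so
it cannot put back an FS or IO bit that the scope filtering removed.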

> I guess it would be better and less error prone
> to move the current_gfp_context part into the direct reclaim entry -
> do_try_to_free_pages - and put the comment like this

Well, after thinking about it more, we should probably keep it where it
is. If nothing else, try_to_free_mem_cgroup_pages has a tracepoint which
prints the gfp mask, so we should use the filtered one. So let's just
scratch this follow-up fix.

> ---
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4ea6b610f20e..df7975185f11 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2756,6 +2756,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  	int initial_priority = sc->priority;
>  	unsigned long total_scanned = 0;
>  	unsigned long writeback_threshold;
> +
> +	/*
> +	 * Make sure that the gfp context properly handles scope gfp mask.
> +	 * This might weaken the reclaim context (e.g. make it GFP_NOFS or
> +	 * GFP_NOIO).
> +	 */
> +	sc->gfp_mask = current_gfp_context(sc->gfp_mask);
>  retry:
>  	delayacct_freepages_start();
>  
> @@ -2949,7 +2956,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  	unsigned long nr_reclaimed;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = SWAP_CLUSTER_MAX,
> -		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
> +		.gfp_mask = gfp_mask,
>  		.reclaim_idx = gfp_zone(gfp_mask),
>  		.order = order,
>  		.nodemask = nodemask,
> @@ -3029,8 +3036,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>  	int nid;
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
> -				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
> +		.gfp_mask = GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK,
>  		.reclaim_idx = MAX_NR_ZONES - 1,
>  		.target_mem_cgroup = memcg,
>  		.priority = DEF_PRIORITY,
> @@ -3723,7 +3729,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
>  	int classzone_idx = gfp_zone(gfp_mask);
>  	struct scan_control sc = {
>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> -		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
> +		.gfp_mask = gfp_mask,
>  		.order = order,
>  		.priority = NODE_RECLAIM_PRIORITY,
>  		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API
  2017-01-09 13:42       ` [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
  (?)
@ 2017-01-09 14:04         ` Vlastimil Babka
  -1 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 14:04 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On 01/09/2017 02:42 PM, Michal Hocko wrote:
> On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
> [...]
>>> +static inline unsigned int memalloc_nofs_save(void)
>>> +{
>>> +	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
>>> +	current->flags |= PF_MEMALLOC_NOFS;
>>
>> So this is not new, as the same goes for memalloc_noio_save, but I've
>> noticed that e.g. exit_signals() does tsk->flags |= PF_EXITING;
>> So is it possible that there's a read-modify-write hazard here?
> 
> exit_signals operates on current and all task_struct::flags should be
> used only on the current.
> [...]

Ah, good to know.

> 
>>> @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>>  	int nid;
>>>  	struct scan_control sc = {
>>>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>>> -		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>>> +		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
>>
>> So this function didn't do memalloc_noio_flags() before? Is it a bug
>> that should be fixed separately or at least mentioned? Because that
>> looks like a functional change...
> 
> We didn't need it. Kmem charges are opt-in and currently all of them
> support GFP_IO. The LRU pages are not charged in NOIO context either.
> We need it now because there will be callers to charge GFP_KERNEL while
> being inside the NOFS scope.

I see.

> Now that you have opened this I have noticed that the code is wrong
> here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
> the removed GFP_FS. I guess it would be better and less error prone
> to move the current_gfp_context part into the direct reclaim entry -
> do_try_to_free_pages - and put the comment like this

Agree with your "So let's just scratch this follow up fix" in the next
e-mail.

So for the unchanged patch:

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 167+ messages in thread

* [Cluster-devel] [PATCH 3/8] mm: introduce memalloc_nofs_{save, restore} API
@ 2017-01-09 14:04         ` Vlastimil Babka
  0 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 14:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 01/09/2017 02:42 PM, Michal Hocko wrote:
> On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
> [...]
>>> +static inline unsigned int memalloc_nofs_save(void)
>>> +{
>>> +	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
>>> +	current->flags |= PF_MEMALLOC_NOFS;
>>
>> So this is not new, as same goes for memalloc_noio_save, but I've
>> noticed that e.g. exit_signal() does tsk->flags |= PF_EXITING;
>> So is it possible that there's a r-m-w hazard here?
> 
> exit_signals operates on current and all task_struct::flags should be
> used only on the current.
> [...]

Ah, good to know.

> 
>>> @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>>  	int nid;
>>>  	struct scan_control sc = {
>>>  		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>>> -		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>>> +		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
>>
>> So this function didn't do memalloc_noio_flags() before? Is it a bug
>> that should be fixed separately or at least mentioned? Because that
>> looks like a functional change...
> 
> We didn't need it. Kmem charges are opt-in and current all of them
> support GFP_IO. The LRU pages are not charged in NOIO context either.
> We need it now because there will be callers to charge GFP_KERNEL while
> being inside the NOFS scope.

I see.

> Now that you have opened this I have noticed that the code is wrong
> here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
> the removed GFP_FS. I guess it would be better and less error prone
> to move the current_gfp_context part into the direct reclaim entry -
> do_try_to_free_pages - and put the comment like this

Agree with your "So let's just scratch this follow up fix in the next
e-mail.

So for the unchanged patch.

Acked-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 14:08     ` Vlastimil Babka
  -1 siblings, 0 replies; 167+ messages in thread
From: Vlastimil Babka @ 2017-01-09 14:08 UTC (permalink / raw)
  To: Michal Hocko, linux-mm, linux-fsdevel
  Cc: Andrew Morton, Dave Chinner, djwong, Theodore Ts'o,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
> 
> Changes since v1
> - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
>   as per Brian Foster
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Not an xfs expert, but seems correct.

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Nit below:

> ---
>  fs/xfs/kmem.c    | 10 +++++-----
>  fs/xfs/xfs_buf.c |  8 ++++----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a76a05dae96b..d69ed5e76621 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  void *
>  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  {
> -	unsigned noio_flag = 0;
> +	unsigned nofs_flag = 0;
>  	void	*ptr;
>  	gfp_t	lflags;
>  
> @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  	 * the filesystem here and potentially deadlocking.

The comment above is now largely obsolete, or minimally should be
changed to PF_MEMALLOC_NOFS?

>  	 */
> -	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> -		noio_flag = memalloc_noio_save();
> +	if (flags & KM_NOFS)
> +		nofs_flag = memalloc_nofs_save();
>  
>  	lflags = kmem_flags_convert(flags);
>  	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> -	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> -		memalloc_noio_restore(noio_flag);
> +	if (flags & KM_NOFS)
> +		memalloc_nofs_restore(nofs_flag);
>  
>  	return ptr;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..8cb8dd4cdfd8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */
> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>  				break;
>  			vm_unmap_aliases();
>  		} while (retried++ <= 1);
> -		memalloc_noio_restore(noio_flag);
> +		memalloc_nofs_restore(nofs_flag);
>  
>  		if (!bp->b_addr)
>  			return -ENOMEM;
> 


^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-01-09 14:08     ` [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} " Vlastimil Babka
  (?)
@ 2017-01-09 14:25       ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-09 14:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 09-01-17 15:08:27, Vlastimil Babka wrote:
> On 01/06/2017 03:11 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> > API to prevent reclaim recursion into the fs because vmalloc can
> > invoke unconditional GFP_KERNEL allocations and these functions might be
> > called from the NOFS contexts. The memalloc_noio_save will enforce
> > GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> > unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> > provide exactly what we need here - implicit GFP_NOFS context.
> > 
> > Changes since v1
> > - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
> >   as per Brian Foster
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> > Not an xfs expert, but seems correct.
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
> Nit below:
> 
> > ---
> >  fs/xfs/kmem.c    | 10 +++++-----
> >  fs/xfs/xfs_buf.c |  8 ++++----
> >  2 files changed, 9 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> > index a76a05dae96b..d69ed5e76621 100644
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
> >  void *
> >  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> >  {
> > -	unsigned noio_flag = 0;
> > +	unsigned nofs_flag = 0;
> >  	void	*ptr;
> >  	gfp_t	lflags;
> >  
> > @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> >  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> >  	 * the filesystem here and potentially deadlocking.
> 
> The comment above is now largely obsolete, or minimally should be
> changed to PF_MEMALLOC_NOFS?
---
diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index d69ed5e76621..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -77,7 +77,7 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
 	if (flags & KM_NOFS)

I will fold it into the original patch.

Thanks!
-- 
Michal Hocko
SUSE Labs
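
For readers following the thread, the scoped behaviour that the corrected
comment describes can be modelled in a few lines of userspace C. This is a
sketch under stated assumptions, not the kernel code: the MODEL_* names and
bit values are invented for illustration, task_flags stands in for
current->flags of a single task, and model_gfp_context() only mirrors the
idea behind current_gfp_context().

```c
#include <assert.h>

/* Illustrative bit values only; the kernel's real definitions differ. */
#define MODEL_GFP_IO      0x1u
#define MODEL_GFP_FS      0x2u
#define MODEL_GFP_KERNEL  (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_NOIO     0x1u
#define MODEL_PF_NOFS     0x2u

/* Stand-in for current->flags; only ever touched by the "current" task. */
static unsigned int task_flags;

/* Enter a NOFS scope and return the previous state of the NOFS bit. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & MODEL_PF_NOFS;

	task_flags |= MODEL_PF_NOFS;
	return old;
}

/* Leave a NOFS scope by putting the saved bit back. */
static void model_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~MODEL_PF_NOFS) | old;
}

/* Narrow a requested allocation mask according to the active scope. */
static unsigned int model_gfp_context(unsigned int gfp)
{
	if (task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* Walk through an outer scope with a nested inner scope. */
int scope_demo(void)
{
	unsigned int outer, inner;

	/* Outside any scope the mask is passed through unchanged. */
	assert(model_gfp_context(MODEL_GFP_KERNEL) == MODEL_GFP_KERNEL);

	outer = model_nofs_save();
	/* Inside the scope an unconditional "GFP_KERNEL" loses the FS bit. */
	assert(model_gfp_context(MODEL_GFP_KERNEL) == MODEL_GFP_IO);

	inner = model_nofs_save();	/* nested scope */
	model_nofs_restore(inner);	/* the outer scope is still in force */
	assert(model_gfp_context(MODEL_GFP_KERNEL) == MODEL_GFP_IO);

	model_nofs_restore(outer);
	assert(model_gfp_context(MODEL_GFP_KERNEL) == MODEL_GFP_KERNEL);
	return 0;
}
```

Because the save helper hands back the previous state of the bit, restoring
an inner scope leaves an outer one in force. The model also shows why a NOIO
scope is the stricter of the two: it drops both the IO and the FS bit, while
a NOFS scope drops only the FS bit.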

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2017-01-09 12:59     ` Vlastimil Babka
  (?)
@ 2017-01-09 14:29       ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-09 14:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 09-01-17 13:59:05, Vlastimil Babka wrote:
> On 01/06/2017 03:11 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> > some time ago. We would like to make this concept more generic and use
> > it for other filesystems as well. Let's start by giving the flag a
> > more generic name PF_MEMALLOC_NOFS which is in line with an existing
> > PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> > contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> > step before we introduce a full API for it as xfs uses the flag directly
> > anyway.
> > 
> > This patch doesn't introduce any functional change.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
> A nit:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> >  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
> >  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
> >  
> > +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> 
> I don't see why this transition is needed, as there are already no users
> of PF_FSTRANS after this patch. The next patch doesn't remove any more,
> so this is just extra churn IMHO. But not a strong objection.

I just wanted to keep this transparent for xfs in this patch.
AFAIR Dave wanted to have the xfs and generic parts separated, so it was
easiest to simply keep the flag and then remove it in a separate patch.

-- 
Michal Hocko
SUSE Labs
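
The transitional alias discussed above can be pictured with a small
userspace sketch. The assumptions are labelled here and in the code: the
numeric value of PF_FSTRANS is picked only for the example, struct
model_task and the helper names are hypothetical, and nothing below is
taken from the xfs sources.

```c
#include <assert.h>

/* Bit value chosen only for this sketch. */
#define PF_FSTRANS        0x00020000u
/* Transitional alias: the generic name maps onto the existing bit. */
#define PF_MEMALLOC_NOFS  PF_FSTRANS

/* Hypothetical stand-in for the flags word of a task. */
struct model_task {
	unsigned int flags;
};

/* Converted code tests the generic name. */
int in_nofs_scope(const struct model_task *t)
{
	return (t->flags & PF_MEMALLOC_NOFS) != 0;
}

/* Any code still using the old name would set the very same bit. */
void legacy_enter_transaction(struct model_task *t)
{
	t->flags |= PF_FSTRANS;
}

/* Both names observe one and the same state. */
int alias_demo(void)
{
	struct model_task t = { 0 };

	assert(!in_nofs_scope(&t));
	legacy_enter_transaction(&t);
	assert(in_nofs_scope(&t));
	return 0;
}
```

With the alias in place, the rename inside xfs and the later removal of the
old name can land as separate patches without either step changing
behaviour.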

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-01-09 14:29       ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-09 14:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 09-01-17 13:59:05, Vlastimil Babka wrote:
> On 01/06/2017 03:11 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> > some time ago. We would like to make this concept more generic and use
> > it for other filesystems as well. Let's start by giving the flag a
> > more generic name PF_MEMALLOC_NOFS which is in line with an exiting
> > PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> > contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> > step before we introduce a full API for it as xfs uses the flag directly
> > anyway.
> > 
> > This patch doesn't introduce any functional change.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
> A nit:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> >  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
> >  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
> >  
> > +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> 
> I don't see why this transition is needed, as there are already no users
> of PF_FSTRANS after this patch. The next patch doesn't remove any more,
> so this is just extra churn IMHO. But not a strong objection.

I just wanted to have this transparent for the xfs in this patch.
AFAIR Dave wanted to have xfs and generic parts separated. So it was the
easiest to simply keep the flag and then remove it in a separate patach.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [Cluster-devel] [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-01-09 14:29       ` Michal Hocko
  0 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-09 14:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon 09-01-17 13:59:05, Vlastimil Babka wrote:
> On 01/06/2017 03:11 PM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> > some time ago. We would like to make this concept more generic and use
> > it for other filesystems as well. Let's start by giving the flag a
> > more generic name PF_MEMALLOC_NOFS which is in line with an exiting
> > PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> > contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> > step before we introduce a full API for it as xfs uses the flag directly
> > anyway.
> > 
> > This patch doesn't introduce any functional change.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> 
> A nit:
> 
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> >  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
> >  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
> >  
> > +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> 
> I don't see why this transition is needed, as there are already no users
> of PF_FSTRANS after this patch. The next patch doesn't remove any more,
> so this is just extra churn IMHO. But not a strong objection.

I just wanted to keep this transparent for xfs in this patch.
AFAIR Dave wanted the xfs and generic parts separated, so it was
easiest to simply keep the flag and then remove it in a separate patch.
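
To make the "transparent" part concrete, here is a minimal userspace sketch.
The bit value and both helper names are made up for illustration; only the
alias line itself mirrors the hunk quoted above. Code that sets the flag
through the new name is still seen by code that tests the old name:

```c
/* Illustrative stand-in value; the real definition lives in include/linux/sched.h. */
#define PF_FSTRANS		0x00020000u
/* Alias as added by the patch: both names refer to the same bit. */
#define PF_MEMALLOC_NOFS	PF_FSTRANS

/* Hypothetical helper: an old-style check that still tests PF_FSTRANS. */
static int in_fs_transaction(unsigned int task_flags)
{
	return (task_flags & PF_FSTRANS) != 0;
}

/* Hypothetical helper: new-style code that sets the flag via the new name. */
static unsigned int enter_nofs_scope(unsigned int task_flags)
{
	return task_flags | PF_MEMALLOC_NOFS;
}
```

With this, in_fs_transaction(enter_nofs_scope(0)) is true, which is all the
transitional define is meant to guarantee.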

-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 15:56     ` Brian Foster
  -1 siblings, 0 replies; 167+ messages in thread
From: Brian Foster @ 2017-01-09 15:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Fri, Jan 06, 2017 at 03:11:03PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
> API to prevent reclaim recursion into the fs, because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from NOFS contexts. memalloc_noio_save enforces a GFP_NOIO
> context, which is even more restrictive than GFP_NOFS, and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead, as it should
> provide exactly what we need here: an implicit GFP_NOFS context.
> 
> Changes since v1
> - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
>   as per Brian Foster
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---

Looks fine to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/kmem.c    | 10 +++++-----
>  fs/xfs/xfs_buf.c |  8 ++++----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a76a05dae96b..d69ed5e76621 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  void *
>  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  {
> -	unsigned noio_flag = 0;
> +	unsigned nofs_flag = 0;
>  	void	*ptr;
>  	gfp_t	lflags;
>  
> @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  	 * the filesystem here and potentially deadlocking.
>  	 */
> -	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> -		noio_flag = memalloc_noio_save();
> +	if (flags & KM_NOFS)
> +		nofs_flag = memalloc_nofs_save();
>  
>  	lflags = kmem_flags_convert(flags);
>  	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> -	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> -		memalloc_noio_restore(noio_flag);
> +	if (flags & KM_NOFS)
> +		memalloc_nofs_restore(nofs_flag);
>  
>  	return ptr;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..8cb8dd4cdfd8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */
> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>  				break;
>  			vm_unmap_aliases();
>  		} while (retried++ <= 1);
> -		memalloc_noio_restore(noio_flag);
> +		memalloc_nofs_restore(nofs_flag);
>  
>  		if (!bp->b_addr)
>  			return -ENOMEM;
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
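
For anyone following along, the scope semantic the patch above relies on can
be modelled in userspace roughly as follows. This is only a sketch: the
function names mirror the ones in the patch, but the bodies, the bit value
and the task_flags variable (a stand-in for current->flags) are assumptions
made for illustration, not the kernel implementation.

```c
/* Illustrative stand-in bit; not the kernel's actual value. */
#define PF_MEMALLOC_NOFS	0x00040000u

/* Hypothetical stand-in for current->flags of the running task. */
static unsigned int task_flags;

/* Enter an implicit-NOFS scope; return the previous state of the bit. */
static unsigned int memalloc_nofs_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope, putting the bit back to what save() observed. */
static void memalloc_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | old;
}

/* Would an allocation made right now have to avoid fs reclaim? */
static int in_nofs_scope(void)
{
	return (task_flags & PF_MEMALLOC_NOFS) != 0;
}
```

Because save() hands back the old state rather than a boolean "I set it",
scopes nest: restoring an inner scope leaves the outer one in force, and only
the outermost restore clears the bit.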

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 20:58     ` Darrick J. Wong
  -1 siblings, 0 replies; 167+ messages in thread
From: Darrick J. Wong @ 2017-01-09 20:58 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Fri, Jan 06, 2017 at 03:11:01PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name PF_MEMALLOC_NOFS which is in line with an existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
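
For readers skimming the diff, the kmem_flags_convert() hunk quoted below
boils down to "drop __GFP_FS if either the task flag or KM_NOFS is set". A
small userspace model of that decision, with made-up MODEL_* bit values and a
hypothetical function name (the KM_NOSLEEP/GFP_ATOMIC branch and
__GFP_NOWARN are left out for brevity):

```c
/* All values below are illustrative stand-ins, not the kernel's. */
#define MODEL_GFP_IO			0x1u
#define MODEL_GFP_FS			0x2u
#define MODEL_GFP_KERNEL		(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_KM_NOFS			0x4u
#define MODEL_PF_MEMALLOC_NOFS		0x00040000u

/*
 * Hypothetical model of the sleeping-allocation branch: start from a
 * GFP_KERNEL-like mask and clear the FS bit when the task is inside a
 * NOFS scope or the caller asked for KM_NOFS explicitly.
 */
static unsigned int model_flags_convert(unsigned int task_flags,
					unsigned int km_flags)
{
	unsigned int lflags = MODEL_GFP_KERNEL;

	if ((task_flags & MODEL_PF_MEMALLOC_NOFS) || (km_flags & MODEL_KM_NOFS))
		lflags &= ~MODEL_GFP_FS;

	return lflags;
}
```

Either trigger alone is enough, which is why the rename from PF_FSTRANS to
PF_MEMALLOC_NOFS in that condition is not a functional change.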

> ---
>  fs/xfs/kmem.c             |  4 ++--
>  fs/xfs/kmem.h             |  2 +-
>  fs/xfs/libxfs/xfs_btree.c |  2 +-
>  fs/xfs/xfs_aops.c         |  6 +++---
>  fs/xfs/xfs_trans.c        | 12 ++++++------
>  include/linux/sched.h     |  2 ++
>  6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..a76a05dae96b 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  	 * the filesystem here and potentially deadlocking.
>  	 */
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  		noio_flag = memalloc_noio_save();
>  
>  	lflags = kmem_flags_convert(flags);
>  	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  		memalloc_noio_restore(noio_flag);
>  
>  	return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..d973dbfc2bfa 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  		lflags = GFP_ATOMIC | __GFP_NOWARN;
>  	} else {
>  		lflags = GFP_KERNEL | __GFP_NOWARN;
> -		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  			lflags &= ~__GFP_FS;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6ab6b9a..a2672ba4dc33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
>  	struct xfs_btree_split_args	*args = container_of(work,
>  						struct xfs_btree_split_args, work);
>  	unsigned long		pflags;
> -	unsigned long		new_pflags = PF_FSTRANS;
> +	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
>  
>  	/*
>  	 * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index ef382bfb402b..d4094bb55033 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
>  	 * We hand off the transaction to the completion thread now, so
>  	 * clear the flag here.
>  	 */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	return 0;
>  }
>  
> @@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
>  	 * thus we need to mark ourselves as being in a transaction manually.
>  	 * Similarly for freeze protection.
>  	 */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
>  	/* we abort the update if there was an IO error */
> @@ -1015,7 +1015,7 @@ xfs_do_writepage(
>  	 * Given that we do not allow direct reclaim to call us, we should
>  	 * never be called while in a filesystem transaction.
>  	 */
> -	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		goto redirty;
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 70f42ea86dfb..f5969c8274fc 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -134,7 +134,7 @@ xfs_trans_reserve(
>  	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>  
>  	/* Mark this thread as being in a transaction */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	/*
>  	 * Attempt to reserve the needed disk blocks by decrementing
> @@ -144,7 +144,7 @@ xfs_trans_reserve(
>  	if (blocks > 0) {
>  		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
>  		if (error != 0) {
> -			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  			return -ENOSPC;
>  		}
>  		tp->t_blk_res += blocks;
> @@ -221,7 +221,7 @@ xfs_trans_reserve(
>  		tp->t_blk_res = 0;
>  	}
>  
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	return error;
>  }
> @@ -914,7 +914,7 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
>  	/*
> @@ -944,7 +944,7 @@ __xfs_trans_commit(
>  		if (commit_lsn == -1 && !error)
>  			error = -EIO;
>  	}
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
>  	xfs_trans_free(tp);
>  
> @@ -998,7 +998,7 @@ xfs_trans_cancel(
>  		xfs_log_done(mp, tp->t_ticket, NULL, false);
>  
>  	/* mark this thread as no longer being in a transaction */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
>  	xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1531c48f56e2..abeb84604d32 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
>  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
>  
> +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> -- 
> 2.11.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
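
The xfs_trans.c hunks above all go through the
current_set_flags_nested()/current_restore_flags_nested() pair with
&tp->t_pflags as the save slot. As I understand their semantics, they can be
sketched in userspace like this; the function names, the bit value and the
task_flags variable (standing in for current->flags) are assumptions for
illustration only:

```c
/* Illustrative stand-in bit; not the kernel's actual value. */
#define PF_MEMALLOC_NOFS	0x00040000u

/* Hypothetical stand-in for current->flags of the running task. */
static unsigned int task_flags;

/* Remember the current flags in *saved, then set the bits in f. */
static void set_flags_nested(unsigned int *saved, unsigned int f)
{
	*saved = task_flags;
	task_flags |= f;
}

/*
 * Put only the bits in f back to their saved state; every other bit
 * keeps whatever value it has now.
 */
static void restore_flags_nested(const unsigned int *saved, unsigned int f)
{
	task_flags = (task_flags & ~f) | (*saved & f);
}
```

Restoring only the masked bits is what makes this safe to nest and to mix
with unrelated flag changes made while the transaction was running.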

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [Cluster-devel] [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-01-09 20:58     ` Darrick J. Wong
  0 siblings, 0 replies; 167+ messages in thread
From: Darrick J. Wong @ 2017-01-09 20:58 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Jan 06, 2017 at 03:11:01PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name PF_MEMALLOC_NOFS which is in line with an existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
>  fs/xfs/kmem.c             |  4 ++--
>  fs/xfs/kmem.h             |  2 +-
>  fs/xfs/libxfs/xfs_btree.c |  2 +-
>  fs/xfs/xfs_aops.c         |  6 +++---
>  fs/xfs/xfs_trans.c        | 12 ++++++------
>  include/linux/sched.h     |  2 ++
>  6 files changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..a76a05dae96b 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  	 * the filesystem here and potentially deadlocking.
>  	 */
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  		noio_flag = memalloc_noio_save();
>  
>  	lflags = kmem_flags_convert(flags);
>  	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> -	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  		memalloc_noio_restore(noio_flag);
>  
>  	return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..d973dbfc2bfa 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  		lflags = GFP_ATOMIC | __GFP_NOWARN;
>  	} else {
>  		lflags = GFP_KERNEL | __GFP_NOWARN;
> -		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> +		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
>  			lflags &= ~__GFP_FS;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6ab6b9a..a2672ba4dc33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
>  	struct xfs_btree_split_args	*args = container_of(work,
>  						struct xfs_btree_split_args, work);
>  	unsigned long		pflags;
> -	unsigned long		new_pflags = PF_FSTRANS;
> +	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
>  
>  	/*
>  	 * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index ef382bfb402b..d4094bb55033 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
>  	 * We hand off the transaction to the completion thread now, so
>  	 * clear the flag here.
>  	 */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	return 0;
>  }
>  
> @@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
>  	 * thus we need to mark ourselves as being in a transaction manually.
>  	 * Similarly for freeze protection.
>  	 */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>  
>  	/* we abort the update if there was an IO error */
> @@ -1015,7 +1015,7 @@ xfs_do_writepage(
>  	 * Given that we do not allow direct reclaim to call us, we should
>  	 * never be called while in a filesystem transaction.
>  	 */
> -	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> +	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
>  		goto redirty;
>  
>  	/*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 70f42ea86dfb..f5969c8274fc 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -134,7 +134,7 @@ xfs_trans_reserve(
>  	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>  
>  	/* Mark this thread as being in a transaction */
> -	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	/*
>  	 * Attempt to reserve the needed disk blocks by decrementing
> @@ -144,7 +144,7 @@ xfs_trans_reserve(
>  	if (blocks > 0) {
>  		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
>  		if (error != 0) {
> -			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  			return -ENOSPC;
>  		}
>  		tp->t_blk_res += blocks;
> @@ -221,7 +221,7 @@ xfs_trans_reserve(
>  		tp->t_blk_res = 0;
>  	}
>  
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	return error;
>  }
> @@ -914,7 +914,7 @@ __xfs_trans_commit(
>  
>  	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>  
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free(tp);
>  
>  	/*
> @@ -944,7 +944,7 @@ __xfs_trans_commit(
>  		if (commit_lsn == -1 && !error)
>  			error = -EIO;
>  	}
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
>  	xfs_trans_free(tp);
>  
> @@ -998,7 +998,7 @@ xfs_trans_cancel(
>  		xfs_log_done(mp, tp->t_ticket, NULL, false);
>  
>  	/* mark this thread as no longer being in a transaction */
> -	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> +	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>  
>  	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
>  	xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 1531c48f56e2..abeb84604d32 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
>  #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
>  #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
>  
> +#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> -- 
> 2.11.0

^ permalink raw reply	[flat|nested] 167+ messages in thread
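
The rename in the patch above is mechanical, because PF_MEMALLOC_NOFS is
defined as PF_FSTRANS. A minimal userspace sketch of the else branch of
kmem_flags_convert() is given below to make that visible. Every MODEL_*
name and bit value is made up for illustration and does not match the
kernel's constants; __GFP_NOWARN is left out.

```c
#include <assert.h>

/* Illustrative bit values only; they are not the kernel's constants. */
#define MODEL_GFP_IO		0x01u
#define MODEL_GFP_FS		0x02u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_KM_NOFS		0x04u
#define MODEL_PF_FSTRANS	0x10u
/* Same bit under a new name, mirroring the sched.h hunk. */
#define MODEL_PF_MEMALLOC_NOFS	MODEL_PF_FSTRANS

/* Hypothetical stand-in for the else branch of kmem_flags_convert(). */
static unsigned int model_flags_convert(unsigned int task_flags,
					unsigned int km_flags)
{
	unsigned int lflags = MODEL_GFP_KERNEL;

	if ((task_flags & MODEL_PF_MEMALLOC_NOFS) || (km_flags & MODEL_KM_NOFS))
		lflags &= ~MODEL_GFP_FS;
	return lflags;
}

/* Old and new flag names select the same behaviour. */
static void model_flags_convert_check(void)
{
	assert(model_flags_convert(0, 0) == MODEL_GFP_KERNEL);
	assert(model_flags_convert(MODEL_PF_FSTRANS, 0) == MODEL_GFP_IO);
	assert(model_flags_convert(MODEL_PF_MEMALLOC_NOFS, 0) == MODEL_GFP_IO);
	assert(model_flags_convert(0, MODEL_KM_NOFS) == MODEL_GFP_IO);
}
```

Either the task flag or the per-call KM_NOFS flag is enough to drop the
FS bit; with neither set the allocation keeps the full mask.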

* Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-09 20:59     ` Darrick J. Wong
  -1 siblings, 0 replies; 167+ messages in thread
From: Darrick J. Wong @ 2017-01-09 20:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Fri, Jan 06, 2017 at 03:11:03PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
> API to prevent reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from NOFS contexts. memalloc_noio_save enforces a GFP_NOIO
> context, which is even weaker (more restrictive) than GFP_NOFS, and that
> seems to be unnecessary. Let's use memalloc_nofs_{save,restore} instead as
> it should provide exactly what we need here: an implicit GFP_NOFS context.
> 
> Changes since v1
> - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
>   as per Brian Foster
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

> ---
>  fs/xfs/kmem.c    | 10 +++++-----
>  fs/xfs/xfs_buf.c |  8 ++++----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a76a05dae96b..d69ed5e76621 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  void *
>  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  {
> -	unsigned noio_flag = 0;
> +	unsigned nofs_flag = 0;
>  	void	*ptr;
>  	gfp_t	lflags;
>  
> @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>  	 * the filesystem here and potentially deadlocking.
>  	 */
> -	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> -		noio_flag = memalloc_noio_save();
> +	if (flags & KM_NOFS)
> +		nofs_flag = memalloc_nofs_save();
>  
>  	lflags = kmem_flags_convert(flags);
>  	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> -	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> -		memalloc_noio_restore(noio_flag);
> +	if (flags & KM_NOFS)
> +		memalloc_nofs_restore(nofs_flag);
>  
>  	return ptr;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..8cb8dd4cdfd8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */
> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>  				break;
>  			vm_unmap_aliases();
>  		} while (retried++ <= 1);
> -		memalloc_noio_restore(noio_flag);
> +		memalloc_nofs_restore(nofs_flag);
>  
>  		if (!bp->b_addr)
>  			return -ENOMEM;
> -- 
> 2.11.0

^ permalink raw reply	[flat|nested] 167+ messages in thread
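
To illustrate why a NOIO scope is more than these two XFS call sites
need, here is a small userspace model of the two scopes. It assumes, as
I understand the kernel of that time, that a NOIO scope masks both
__GFP_IO and __GFP_FS while a NOFS scope masks only __GFP_FS. All
model_* and MODEL_* names and the bit values are hypothetical; this is
a sketch, not the kernel implementation.

```c
#include <assert.h>

/* Illustrative bit values only; they are not the kernel's constants. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_NOIO		0x10u
#define MODEL_PF_NOFS		0x20u

static unsigned int model_task_flags;	/* stands in for current->flags */

/* Enter a scope and return the previous state of that scope bit. */
static unsigned int model_scope_save(unsigned int bit)
{
	unsigned int old = model_task_flags & bit;

	model_task_flags |= bit;
	return old;
}

/* Leave a scope by putting the saved bit state back. */
static void model_scope_restore(unsigned int bit, unsigned int old)
{
	model_task_flags = (model_task_flags & ~bit) | old;
}

/* Mask an allocation request according to the active scope. */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (model_task_flags & MODEL_PF_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);
	else if (model_task_flags & MODEL_PF_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

static void model_scope_demo(void)
{
	unsigned int old;

	old = model_scope_save(MODEL_PF_NOIO);
	assert(model_effective_gfp(MODEL_GFP_KERNEL) == 0);
	model_scope_restore(MODEL_PF_NOIO, old);

	old = model_scope_save(MODEL_PF_NOFS);
	assert(model_effective_gfp(MODEL_GFP_KERNEL) == MODEL_GFP_IO);
	model_scope_restore(MODEL_PF_NOFS, old);

	assert(model_effective_gfp(MODEL_GFP_KERNEL) == MODEL_GFP_KERNEL);
}
```

In the model, a nested GFP_KERNEL request made under the NOFS scope may
still start I/O, which is the behaviour the commit message asks for.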

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-17  2:56     ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2017-01-17  2:56 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> the transaction context uses memalloc_nofs_save and all allocations
> within this context inherit GFP_NOFS automatically, there is no
> reason to mark specific allocations explicitly.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> to make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

Changes in the jbd2 layer aren't going to guarantee that
memalloc_nofs_save() will be executed if we are running ext4 without a
journal (aka in no journal mode).  And this is a *very* common
configuration; it's how ext4 is used inside Google in our production
servers.

So that means the earlier patches will probably need to be changed so
the NOFS scope is done in the ext4_journal_{start,stop} functions in
fs/ext4/ext4_jbd2.c.

					- Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread
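
The objection above can be sketched with a toy model: if the scope is
entered only inside the journal layer, a filesystem running without a
journal never enters it. The names below are hypothetical and the real
ext4 no-journal path is more involved than a plain "skip the journal"
branch; this only illustrates the ordering problem.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_nofs;		/* stands in for the task's NOFS flag */

/* Journal layer: marks the scope, but only runs when a journal exists. */
static void model_journal_start(void)
{
	model_nofs = true;
}

/* Variant 1: the scope is entered only inside the journal layer. */
static void model_fs_start_v1(bool has_journal)
{
	if (has_journal)
		model_journal_start();
}

/* Variant 2: the filesystem wrapper enters the scope in both modes. */
static void model_fs_start_v2(bool has_journal)
{
	model_nofs = true;
	if (has_journal)
		model_journal_start();
}

static void model_nojournal_check(void)
{
	model_nofs = false;
	model_fs_start_v1(true);
	assert(model_nofs);	/* journal present: scope entered */

	model_nofs = false;
	model_fs_start_v1(false);
	assert(!model_nofs);	/* no journal: scope never entered */

	model_nofs = false;
	model_fs_start_v2(false);
	assert(model_nofs);	/* wrapper-level scope covers both modes */
	model_nofs = false;
}
```

Variant 2 corresponds to moving the scope into the filesystem's own
start/stop wrappers, as suggested in the message above.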

* Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
  2017-01-06 14:11   ` Michal Hocko
  (?)
@ 2017-01-17  3:01     ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2017-01-17  3:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> sb_getblk_gfp is not really needed as
> sb_getblk
>   __getblk_gfp
>     __getblk_slow
>       grow_buffers
>         grow_dev_page
> 	  gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> 
> so __GFP_FS is cleared unconditionally and therefore the above commit
> didn't have any real effect in fact.
> 
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> make the review of the remaining usage easier.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Jan Kara <jack@suse.cz>

If I'm not mistaken, this patch is not dependent on any of the other
patches in this series (and the other patches are not dependent on
this one).  Hence, I could take this patch via the ext4 tree, correct?

     	    	     	   	     - Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread
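
The mask computation quoted in the commit message can be checked with a
few lines of bit arithmetic. This sketch assumes that plain sb_getblk
passes only __GFP_MOVABLE as the extra gfp argument and that the
mapping's own mask already allows I/O. The MODEL_* values are made up,
and the reclaim bits that are part of the real GFP_NOFS are left out.

```c
#include <assert.h>

/* Illustrative bit values only; they are not the kernel's constants. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_MOVABLE	0x4u
/* Reclaim bits are left out of this model, so NOFS is just "IO allowed". */
#define MODEL_GFP_NOFS		MODEL_GFP_IO

/* Models: mapping_gfp_constraint(mapping, ~__GFP_FS) | gfp */
static unsigned int model_dev_page_mask(unsigned int mapping_gfp,
					unsigned int gfp)
{
	return (mapping_gfp & ~MODEL_GFP_FS) | gfp;
}

static void model_dev_page_mask_check(void)
{
	unsigned int mapping_gfp = MODEL_GFP_IO | MODEL_GFP_FS;
	unsigned int plain = model_dev_page_mask(mapping_gfp, MODEL_GFP_MOVABLE);
	unsigned int nofs = model_dev_page_mask(mapping_gfp,
					MODEL_GFP_MOVABLE | MODEL_GFP_NOFS);

	assert(!(plain & MODEL_GFP_FS));	/* FS bit is already gone */
	assert(plain == nofs);			/* explicit NOFS adds nothing */
}
```

Under those assumptions both call forms end up with the same mask,
which matches the claim that the reverted commit had no real effect.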

* Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
  2017-01-17  3:01     ` Theodore Ts'o
  (?)
@ 2017-01-17  7:54       ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-17  7:54 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 16-01-17 22:01:18, Theodore Ts'o wrote:
> On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> > sb_getblk_gfp is not really needed as
> > sb_getblk
> >   __getblk_gfp
> >     __getblk_slow
> >       grow_buffers
> >         grow_dev_page
> > 	  gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> > 
> > so __GFP_FS is cleared unconditionally and therefore the above commit
> > didn't have any real effect in fact.
> > 
> > This patch should not introduce any functional change. The main point
> > of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> > make the review of the remaining usage easier.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> If I'm not mistaken, this patch is not dependent on any of the other
> patches in this series (and the other patches are not dependent on
> this one).  Hence, I could take this patch via the ext4 tree, correct?

Yes, that is correct.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17  2:56     ` Theodore Ts'o
  (?)
@ 2017-01-17  8:24       ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-17  8:24 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 16-01-17 21:56:07, Theodore Ts'o wrote:
> On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> > the transaction context uses memalloc_nofs_save and all allocations
> > within this context inherit GFP_NOFS automatically, there is no
> > reason to mark specific allocations explicitly.
> > 
> > This patch should not introduce any functional change. The main point
> > of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> > to make the review of the remaining usage easier.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > Reviewed-by: Jan Kara <jack@suse.cz>
> 
> Changes in the jbd2 layer aren't going to guarantee that
> memalloc_nofs_save() will be executed if we are running ext4 without a
> journal (aka in no journal mode).  And this is a *very* common
> configuration; it's how ext4 is used inside Google in our production
> servers.

OK, I wasn't aware of that.

> So that means the earlier patches will probably need to be changed so
> the NOFS scope is done in the ext4_journal_{start,stop} functions in
> fs/ext4/ext4_jbd2.c.

I would definitely appreciate some help here. The call paths are rather
complex and I am not familiar enough with the code. One of the biggest
problems I currently have is that there doesn't seem to be an easy place
to store the old allocation context. The original patch had it inside
the journal handle. I was thinking about putting it into the superblock, but
ext4_journal_stop doesn't seem to have access to the sb if there is no
handle. Now, if ext4_journal_start is never called from a nested context
then this is not a big deal, but there are just too many callers to
check...
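
To make the storage problem concrete, here is a minimal userspace sketch
of the save/restore pattern. It is not the kernel code: struct model_task,
struct model_handle and every model_* name are hypothetical stand-ins, and
only the idea (the start side saves the old state, the stop side needs
somewhere to find it again) comes from the discussion above.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-task "no FS reclaim" flag. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

/* Hypothetical stand-in for the task; only the flags word matters here. */
struct model_task {
	unsigned int flags;
};

/* Enter a NOFS scope and return the previous state of the flag. */
unsigned int model_nofs_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_MEMALLOC_NOFS;

	t->flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope by putting back whatever the matching save returned. */
void model_nofs_restore(struct model_task *t, unsigned int old)
{
	assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
	t->flags = (t->flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* With a handle there is an obvious place to keep the saved state. */
struct model_handle {
	unsigned int saved_nofs;
};

void model_journal_start(struct model_task *t, struct model_handle *h)
{
	h->saved_nofs = model_nofs_save(t);
}

void model_journal_stop(struct model_task *t, struct model_handle *h)
{
	model_nofs_restore(t, h->saved_nofs);
}
```

Because each handle carries its own saved value, a nested start/stop pair
leaves the outer scope intact. Without a handle there is no such slot,
which is exactly the gap in no journal mode.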
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17  8:24       ` Michal Hocko
  (?)
  (?)
@ 2017-01-17 15:18         ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-17 15:18 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Chris Mason, David Sterba, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Tue 17-01-17 09:24:25, Michal Hocko wrote:
> On Mon 16-01-17 21:56:07, Theodore Ts'o wrote:
> > On Fri, Jan 06, 2017 at 03:11:07PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> > > the transaction context uses memalloc_nofs_save and all allocations
> > > within this context inherit GFP_NOFS automatically, there is no
> > > reason to mark specific allocations explicitly.
> > > 
> > > This patch should not introduce any functional change. The main point
> > > of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> > > to make the review of the remaining usage easier.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > Changes in the jbd2 layer aren't going to guarantee that
> > memalloc_nofs_save() will be executed if we are running ext4 without a
> > journal (aka in no journal mode).  And this is a *very* common
> > configuration; it's how ext4 is used inside Google in our production
> > servers.
> 
> OK, I wasn't aware of that.
> 
> > So that means the earlier patches will probably need to be changed so
> > the NOFS scope is done in the ext4_journal_{start,stop} functions in
> > fs/ext4/ext4_jbd2.c.
> 
> I would definitely appreciate some help here. The call paths are rather
> complex and I am not familiar enough with the code. One of the biggest
> problems I currently have is that there doesn't seem to be an easy place
> to store the old allocation context.

OK, so I've been staring at the code and AFAIU current->journal_info
can hold my stored information. I could either hijack part of the
word, as the ref counting only consumes the low 12 bits, but that looks
too ugly to live, or I could allocate some placeholder.
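
For reference, the "hijack part of the word" option would look roughly
like the sketch below. This is only an illustration under assumptions:
the 12-bit reference count width is taken from the paragraph above, while
the bit chosen for the saved context and all model_* names are made up
for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Low 12 bits hold the reference count, as described above. */
#define MODEL_REF_BITS 12u
#define MODEL_REF_MASK (((uintptr_t)1 << MODEL_REF_BITS) - 1)
/* Hypothetical: the first bit above the count remembers the old NOFS state. */
#define MODEL_CTX_NOFS ((uintptr_t)1 << MODEL_REF_BITS)

/* Build the word from a reference count and the saved context bit. */
uintptr_t model_pack(uintptr_t refcount, int nofs_was_set)
{
	assert(refcount <= MODEL_REF_MASK);
	return refcount | (nofs_was_set ? MODEL_CTX_NOFS : 0);
}

uintptr_t model_refcount(uintptr_t word)
{
	return word & MODEL_REF_MASK;
}

int model_saved_nofs(uintptr_t word)
{
	return (word & MODEL_CTX_NOFS) != 0;
}

/* Take a reference; the context bit above the count is left untouched. */
uintptr_t model_get(uintptr_t word)
{
	assert(model_refcount(word) < MODEL_REF_MASK);
	return word + 1;
}

/* Drop a reference; the context bit above the count is left untouched. */
uintptr_t model_put(uintptr_t word)
{
	assert(model_refcount(word) > 0);
	return word - 1;
}
```

The count and the context share one word, so every user of the count has
to mask before comparing it, which is a large part of why this looks ugly.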

But before playing with that, I am really wondering whether we need
all this with no journal at all. AFAIU from what Jack told me, it is the
journal lock(s) which are the biggest problem from the reclaim recursion
point of view. What would cause a deadlock in no journal mode?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17 15:18         ` Michal Hocko
  (?)
@ 2017-01-17 15:59           ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2017-01-17 15:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> 
> OK, so I've been staring into the code and AFAIU current->journal_info
> can contain my stored information. I could either hijack part of the
> word as the ref counting is only consuming low 12b. But that looks too
> ugly to live. Or I can allocate some placeholder.

Yeah, I was looking at something similar.  Can you guarantee that the
context will only take one or two bits?  (Looks like it only needs one
bit ATM, even though at the moment you're storing the whole GFP mask,
correct?)

> But before going to play with that I am really wondering whether we need
> all this with no journal at all. AFAIU what Jack told me it is the
> journal lock(s) which is the biggest problem from the reclaim recursion
> point of view. What would cause a deadlock in no journal mode?

We still have the original problem for why we need GFP_NOFS even in
ext2.  If we are in a writeback path, and we need to allocate memory,
we don't want to recurse back into the file system's writeback path.
Certainly not for the same inode, and while we could make it work if
the mm was writing back another inode, or another superblock, there
are also stack depth considerations that would make this be a bad
idea.  So we do need to be able to assert GFP_NOFS even in no journal
mode, and for any file system including ext2, for that matter.

Because we're going to have to play games with
current->journal_info, maybe this is something that I should take
responsibility for, and have it go through the ext4 tree after the main
patch series goes through?  Maybe you could use xfs and ext2 as sample
(simple) implementations?

My only ask is that the memalloc nofs context be a well defined N
bits, where N < 16, and I'll find some place to put them (probably
journal_info).

Thanks,

					- Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17 15:59           ` Theodore Ts'o
  (?)
@ 2017-01-17 16:16             ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-17 16:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Tue 17-01-17 10:59:16, Theodore Ts'o wrote:
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> > 
> > OK, so I've been staring into the code and AFAIU current->journal_info
> > can contain my stored information. I could either hijack part of the
> > word as the ref counting is only consuming low 12b. But that looks too
> > ugly to live. Or I can allocate some placeholder.
> 
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)

No, I am just storing PF_MEMALLOC_NO{FS,IO}, but I assume further changes
might want to pull more flags into the scope context.
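
As a rough illustration of what those two task flags are meant to do to
an allocation request, here is a small userspace model. The flag values
and the model_* names are invented for the example; only the behaviour
(a NOFS scope drops the FS bit, a NOIO scope drops both the IO and FS
bits) is what the scope API is intended to provide.

```c
#include <assert.h>

/* Hypothetical stand-ins for the gfp bits and the per-task scope flags. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u
#define MODEL_PF_MEMALLOC_NOIO 0x1u
#define MODEL_PF_MEMALLOC_NOFS 0x2u

/* Filter the caller's gfp mask through the scope recorded in the task. */
unsigned int model_apply_scope(unsigned int task_flags, unsigned int gfp)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOIO)
		gfp &= ~(MODEL_GFP_IO | MODEL_GFP_FS);	/* NOIO implies NOFS */
	else if (task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}

/* A few cases spelled out as assertions. */
void model_scope_selfcheck(void)
{
	unsigned int full = MODEL_GFP_IO | MODEL_GFP_FS;

	assert(model_apply_scope(0, full) == full);
	assert(model_apply_scope(MODEL_PF_MEMALLOC_NOFS, full) == MODEL_GFP_IO);
	assert(model_apply_scope(MODEL_PF_MEMALLOC_NOIO, full) == 0);
}
```

In this model the saved context is at most two bits, which would fit the
"well defined N bits" request.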

> > But before going to play with that I am really wondering whether we need
> > all this with no journal at all. AFAIU what Jack told me it is the
> > journal lock(s) which is the biggest problem from the reclaim recursion
> > point of view. What would cause a deadlock in no journal mode?
> 
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.

But we do not enter the writeback path from direct reclaim. Or do
you mean something other than pageout()'s mapping->a_ops->writepage?
There is only try_to_release_page where we get back to the filesystem,
but I do not see any NOFS protection in ext4_releasepage.

> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this be a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no journal
> mode, and for any file system including ext2, for that matter.
> 
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and have it go through the ext4 tree after the main
> patch series goes through?

What do you think about handling nojournal mode on
top of "[PATCH 5/8] jbd2: mark the transaction context with the scope
GFP_NOFS context" in a separate patch?

But anyway, I agree that we should go with the API sooner rather than
later.

>   Maybe you could use xfs and ext2 as sample
> (simple) implementations?
> 
> My only ask is that the memalloc nofs context be a well defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I am pretty sure that we won't need more than a bit or two in the
foreseeable future (I can think of GFP_NOWAIT being one candidate).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17 16:16             ` Michal Hocko
  (?)
@ 2017-01-17 17:29               ` Jan Kara
  -1 siblings, 0 replies; 167+ messages in thread
From: Jan Kara @ 2017-01-17 17:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Theodore Ts'o, Jan Kara, linux-mm, linux-fsdevel,
	Andrew Morton, Dave Chinner, djwong, Chris Mason, David Sterba,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > But before going to play with that I am really wondering whether we need
> > > all this with no journal at all. AFAIU what Jack told me it is the
> > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > point of view. What would cause a deadlock in no journal mode?
> > 
> > We still have the original problem for why we need GFP_NOFS even in
> > ext2.  If we are in a writeback path, and we need to allocate memory,
> > we don't want to recurse back into the file system's writeback path.
> 
> But we do not enter the writeback path from the direct reclaim. Or do
> you mean something other than pageout()'s mapping->a_ops->writepage?
> There is only try_to_release_page where we get back to the filesystems
> but I do not see any NOFS protection in ext4_releasepage.

Maybe to expand a bit: These days, direct reclaim can only call the
->releasepage() callback, the ->evict_inode() callback (and only for
inodes with i_nlink > 0), and shrinkers. That's it. So the recursion
possibilities are much more limited than they were several years ago,
and we likely do not need as much GFP_NOFS protection as we used to.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
@ 2017-01-17 21:04             ` Andreas Dilger
  0 siblings, 0 replies; 167+ messages in thread
From: Andreas Dilger @ 2017-01-17 21:04 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Michal Hocko, Jan Kara, linux-mm, linux-fsdevel, Andrew Morton,
	Dave Chinner, djwong, Chris Mason, David Sterba, ceph-devel,
	cluster-devel, linux-nfs, logfs, linux-xfs, linux-ext4,
	linux-btrfs, linux-mtd, reiserfs-devel, linux-ntfs-dev,
	linux-f2fs-devel, linux-afs, LKML

On Jan 17, 2017, at 8:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> 
> On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
>> 
>> OK, so I've been staring into the code and AFAIU current->journal_info
>> can contain my stored information. I could either hijack part of the
>> word as the ref counting is only consuming low 12b. But that looks too
>> ugly to live. Or I can allocate some placeholder.
> 
> Yeah, I was looking at something similar.  Can you guarantee that the
> context will only take one or two bits?  (Looks like it only needs one
> bit ATM, even though at the moment you're storing the whole GFP mask,
> correct?)
> 
>> But before going to play with that I am really wondering whether we need
>> all this with no journal at all. AFAIU what Jack told me it is the
>> journal lock(s) which is the biggest problem from the reclaim recursion
>> point of view. What would cause a deadlock in no journal mode?
> 
> We still have the original problem for why we need GFP_NOFS even in
> ext2.  If we are in a writeback path, and we need to allocate memory,
> we don't want to recurse back into the file system's writeback path.
> Certainly not for the same inode, and while we could make it work if
> the mm was writing back another inode, or another superblock, there
> are also stack depth considerations that would make this be a bad
> idea.  So we do need to be able to assert GFP_NOFS even in no journal
> mode, and for any file system including ext2, for that matter.
> 
> Because of the fact that we're going to have to play games with
> current->journal_info, maybe this is something that I should take
> responsibility for, and to go through the ext4 tree after the main
> patch series goes through?  Maybe you could use xfs and ext2 as sample
> (simple) implementations?
> 
> My only ask is that the memalloc nofs context be a well defined N
> bits, where N < 16, and I'll find some place to put them (probably
> journal_info).

I think Dave was suggesting that the NOFS context allow a pointer to
an arbitrary struct, so that it is possible to dereference this in
the filesystem itself to determine if the recursion is safe or not.
That way, ext2 could store an inode pointer (if that is what it cares
about) and verify that writeback is not recursing on the same inode,
and XFS can store something different.  It would also need to store
some additional info (e.g. fstype or superblock pointer) so that it
can determine how to interpret the NOFS context pointer.

I think it makes sense to add a couple of void * pointers to the task
struct along with journal_info and leave it up to the filesystem to
determine how to use them.
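
As a rough illustration of that idea, here is a user-space sketch. Every
type and function name in it is hypothetical (nothing here is an existing
kernel interface), and treating a context owned by a different filesystem
as "not safe" is a conservative assumption made for the sketch, not
something settled in this thread. It models a per-task context that
carries an owner tag (a superblock pointer) next to an opaque pointer
that only the owning filesystem interprets:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's super_block and inode. */
struct super_block_model { int id; };
struct inode_model {
	struct super_block_model *sb;
	unsigned long ino;
};

/* Hypothetical per-task NOFS context: owner tag plus opaque pointer. */
struct nofs_ctx_model {
	const struct super_block_model *owner;	/* who may interpret ->data */
	const void *data;			/* filesystem-private payload */
};

/* Stand-in for a field that would live in the task struct. */
static struct nofs_ctx_model task_nofs_ctx;

/* Enter a scope: record which filesystem owns the context and its payload. */
static void nofs_ctx_set(const struct super_block_model *sb, const void *data)
{
	assert(sb != NULL);
	task_nofs_ctx.owner = sb;
	task_nofs_ctx.data = data;
}

/* Leave the scope. */
static void nofs_ctx_clear(void)
{
	task_nofs_ctx.owner = NULL;
	task_nofs_ctx.data = NULL;
}

/*
 * ext2-style policy: the payload is the inode under writeback, and
 * recursion is refused only for that same inode.  A context owned by
 * another superblock cannot be interpreted, so it is refused as well
 * (assumption of this sketch).
 */
static int recursion_is_safe(const struct inode_model *inode)
{
	if (task_nofs_ctx.owner == NULL)
		return 1;	/* no scope active */
	if (task_nofs_ctx.owner != inode->sb)
		return 0;	/* foreign context: be conservative */
	return task_nofs_ctx.data != (const void *)inode;
}
```

A different filesystem could store something else behind the same
pointer and supply its own check; the owner tag is what makes that safe.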

Cheers, Andreas

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17 21:04             ` Andreas Dilger
  (?)
@ 2017-01-18  8:29               ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-18  8:29 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Theodore Ts'o, Jan Kara, linux-mm, linux-fsdevel,
	Andrew Morton, Dave Chinner, djwong, Chris Mason, David Sterba,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Tue 17-01-17 14:04:03, Andreas Dilger wrote:
> On Jan 17, 2017, at 8:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> > 
> > On Tue, Jan 17, 2017 at 04:18:17PM +0100, Michal Hocko wrote:
> >> 
> >> OK, so I've been staring into the code and AFAIU current->journal_info
> >> can contain my stored information. I could either hijack part of the
> >> word as the ref counting is only consuming low 12b. But that looks too
> >> ugly to live. Or I can allocate some placeholder.
> > 
> > Yeah, I was looking at something similar.  Can you guarantee that the
> > context will only take one or two bits?  (Looks like it only needs one
> > bit ATM, even though at the moment you're storing the whole GFP mask,
> > correct?)
> > 
> >> But before going to play with that I am really wondering whether we need
> >> all this with no journal at all. AFAIU what Jack told me it is the
> >> journal lock(s) which is the biggest problem from the reclaim recursion
> >> point of view. What would cause a deadlock in no journal mode?
> > 
> > We still have the original problem for why we need GFP_NOFS even in
> > ext2.  If we are in a writeback path, and we need to allocate memory,
> > we don't want to recurse back into the file system's writeback path.
> > Certainly not for the same inode, and while we could make it work if
> > the mm was writing back another inode, or another superblock, there
> > are also stack depth considerations that would make this be a bad
> > idea.  So we do need to be able to assert GFP_NOFS even in no journal
> > mode, and for any file system including ext2, for that matter.
> > 
> > Because of the fact that we're going to have to play games with
> > current->journal_info, maybe this is something that I should take
> > responsibility for, and to go through the ext4 tree after the main
> > patch series goes through?  Maybe you could use xfs and ext2 as sample
> > (simple) implementations?
> > 
> > My only ask is that the memalloc nofs context be a well defined N
> > bits, where N < 16, and I'll find some place to put them (probably
> > journal_info).
> 
> I think Dave was suggesting that the NOFS context allow a pointer to
> an arbitrary struct, so that it is possible to dereference this in
> the filesystem itself to determine if the recursion is safe or not.

Yes, but can we start with a simpler approach first? Even the simpler
approach will take quite some time to be adopted.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-17 17:29               ` Jan Kara
  (?)
@ 2017-01-19  8:39                 ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-19  8:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, linux-mm, linux-fsdevel, Andrew Morton,
	Dave Chinner, djwong, Chris Mason, David Sterba, ceph-devel,
	cluster-devel, linux-nfs, logfs, linux-xfs, linux-ext4,
	linux-btrfs, linux-mtd, reiserfs-devel, linux-ntfs-dev,
	linux-f2fs-devel, linux-afs, LKML

On Tue 17-01-17 18:29:25, Jan Kara wrote:
> On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > But before going to play with that I am really wondering whether we need
> > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > point of view. What would cause a deadlock in no journal mode?
> > > 
> > > We still have the original problem for why we need GFP_NOFS even in
> > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > we don't want to recurse back into the file system's writeback path.
> > 
> > But we do not enter the writeback path from the direct reclaim. Or do
> > you mean something other than pageout()'s mapping->a_ops->writepage?
> > There is only try_to_release_page where we get back to the filesystems
> > but I do not see any NOFS protection in ext4_releasepage.
> 
> Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> shrinkers. That's it. So the recursion possibilities are rather more limited
> than they used to be several years ago and we likely do not need as much
> GFP_NOFS protection as we used to.

Thanks for making my remark clearer, Jack! I would just add that I have
been playing with the patch below (it is basically GFP_NOFS->GFP_KERNEL
for all allocations which trigger a warning from the debugging patch,
which means they are called from within a transaction) and it didn't
trigger any lockdep warning when running xfstests, either with or
without the journal enabled.
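
For context on why such a conversion is plausible inside a transaction:
with the scope API the NOFS restriction is carried by the task rather
than by each call site. The following is only a minimal user-space model
of that behaviour. All names and flag values in it are made up for
illustration and are not the kernel's actual definitions:

```c
#include <assert.h>

/* Illustrative allocation-flag bits; the real GFP values differ. */
#define GFP_IO_MODEL		0x1u
#define GFP_FS_MODEL		0x2u
#define GFP_KERNEL_MODEL	(GFP_IO_MODEL | GFP_FS_MODEL)
#define GFP_NOFS_MODEL		(GFP_IO_MODEL)

/* Illustrative per-task flag marking an active NOFS scope. */
#define PF_NOFS_MODEL		0x1u

/* Stand-in for current->flags; this model has a single "task". */
static unsigned int task_flags;

/* Enter a NOFS scope and return the previous state so scopes can nest. */
static unsigned int nofs_scope_save(void)
{
	unsigned int old = task_flags & PF_NOFS_MODEL;

	task_flags |= PF_NOFS_MODEL;
	return old;
}

/* Leave a NOFS scope, restoring whatever the enclosing scope had set. */
static void nofs_scope_restore(unsigned int old)
{
	assert((old & ~PF_NOFS_MODEL) == 0);
	task_flags = (task_flags & ~PF_NOFS_MODEL) | old;
}

/* What the allocator would apply: drop the FS bit while in a scope. */
static unsigned int effective_gfp(unsigned int gfp)
{
	if (task_flags & PF_NOFS_MODEL)
		gfp &= ~GFP_FS_MODEL;
	return gfp;
}
```

In this model a caller that passes GFP_KERNEL_MODEL between
nofs_scope_save() and nofs_scope_restore() still gets GFP_NOFS_MODEL
behaviour, and an inner save/restore pair does not end the outer scope.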

So am I still missing something, or is the nojournal mode safe and the
current series OK wrt. ext*?

The following patch in its current form is WIP and needs a proper review
before I post it.
---
 fs/ext4/inode.c       |  4 ++--
 fs/ext4/mballoc.c     | 14 +++++++-------
 fs/ext4/xattr.c       |  2 +-
 fs/jbd2/journal.c     |  4 ++--
 fs/jbd2/revoke.c      |  2 +-
 fs/jbd2/transaction.c |  2 +-
 6 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b7d141c3b810..841cb8c4cb5e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
 		return __ext4_journalled_writepage(page, len);
 
 	ext4_io_submit_init(&io_submit, wbc);
-	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
+	io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
 	if (!io_submit.io_end) {
 		redirty_page_for_writepage(wbc, page);
 		unlock_page(page);
@@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
 	int err = 0;
 
 	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
-				   mapping_gfp_constraint(mapping, ~__GFP_FS));
+				   mapping_gfp_mask(mapping));
 	if (!page)
 		return -ENOMEM;
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index d9fd184b049e..67b97cd6e3d6 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
 static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
 			      struct ext4_buddy *e4b)
 {
-	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
+	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
 }
 
 static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
@@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
 
 	/* We only do this if the grp has never been initialized */
 	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
-		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
+		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
 		if (ret)
 			return ret;
 	}
@@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
 	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
 	if (pa == NULL)
 		return -ENOMEM;
 
@@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
 	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
 
 	BUG_ON(ext4_pspace_cachep == NULL);
-	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
+	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
 	if (pa == NULL)
 		return -ENOMEM;
 
@@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 		}
 	}
 
-	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
+	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
 	if (!ac) {
 		ar->len = 0;
 		*errp = -ENOMEM;
@@ -4813,7 +4813,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 
 	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
 	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
-				     GFP_NOFS|__GFP_NOFAIL);
+				     GFP_KERNEL|__GFP_NOFAIL);
 	if (err)
 		goto error_return;
 
@@ -4832,7 +4832,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 		 * to fail.
 		 */
 		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
-				GFP_NOFS|__GFP_NOFAIL);
+				GFP_KERNEL|__GFP_NOFAIL);
 		new_entry->efd_start_cluster = bit;
 		new_entry->efd_group = block_group;
 		new_entry->efd_count = count_clusters;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 172317462238..f68e8c87f9f2 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1650,7 +1650,7 @@ ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
 		       EXT4_XATTR_REFCOUNT_MAX;
 	int error;
 
-	error = mb_cache_entry_create(ext4_mb_cache, GFP_NOFS, hash,
+	error = mb_cache_entry_create(ext4_mb_cache, GFP_KERNEL, hash,
 				      bh->b_blocknr, reusable);
 	if (error) {
 		if (error == -EBUSY)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 3a449150f834..bd29daa975a5 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -379,7 +379,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
 	 */
 	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
 
-	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
+	new_bh = alloc_buffer_head(GFP_KERNEL|__GFP_NOFAIL);
 
 	/* keep subsequent assertions sane */
 	atomic_set(&new_bh->b_count, 1);
@@ -2375,7 +2375,7 @@ static struct journal_head *journal_alloc_journal_head(void)
 #ifdef CONFIG_JBD2_DEBUG
 	atomic_inc(&nr_journal_heads);
 #endif
-	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
+	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_KERNEL);
 	if (!ret) {
 		jbd_debug(1, "out of memory for journal_head\n");
 		pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
index cfc38b552118..c9c347468c5b 100644
--- a/fs/jbd2/revoke.c
+++ b/fs/jbd2/revoke.c
@@ -141,7 +141,7 @@ static int insert_revoke_hash(journal_t *journal, unsigned long long blocknr,
 {
 	struct list_head *hash_list;
 	struct jbd2_revoke_record_s *record;
-	gfp_t gfp_mask = GFP_NOFS;
+	gfp_t gfp_mask = GFP_KERNEL;
 
 	if (journal_oom_retry)
 		gfp_mask |= __GFP_NOFAIL;
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 35a5d3d76182..a7e50eb330a8 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -974,7 +974,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
 			JBUFFER_TRACE(jh, "allocate memory for buffer");
 			jbd_unlock_bh_state(bh);
 			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
-						   GFP_NOFS | __GFP_NOFAIL);
+						   GFP_KERNEL | __GFP_NOFAIL);
 			goto repeat;
 		}
 		jh->b_frozen_data = frozen_buffer;
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-19  8:39                 ` Michal Hocko
  (?)
@ 2017-01-19  9:22                   ` Jan Kara
  -1 siblings, 0 replies; 167+ messages in thread
From: Jan Kara @ 2017-01-19  9:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, Theodore Ts'o, linux-mm, linux-fsdevel,
	Andrew Morton, Dave Chinner, djwong, Chris Mason, David Sterba,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > But before going to play with that I am really wondering whether we need
> > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > point of view. What would cause a deadlock in no journal mode?
> > > > 
> > > > We still have the original problem for why we need GFP_NOFS even in
> > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > we don't want to recurse back into the file system's writeback path.
> > > 
> > > But we do not enter the writeback path from the direct reclaim. Or do
> > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > There is only try_to_release_page where we get back to the filesystems
> > > but I do not see any NOFS protection in ext4_releasepage.
> > 
> > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > shrinkers. That's it. So the recursion possibilities are rather more limited
> > than they used to be several years ago and we likely do not need as much
> > GFP_NOFS protection as we used to.
> 
> Thanks for making my remark more clear Jack! I would just want to add
> that I was playing with the patch below (it is basically
> GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> debugging patch which means they are called from within transaction) and
> it didn't hit the lockdep when running xfstests both with or without the
> enabled journal.
> 
> So am I still missing something or the nojournal mode is safe and the
> current series is OK wrt. ext*?

I'm convinced the current series is OK, only real life will tell us whether
we missed something or not ;)

> The following patch in its current form is WIP and needs a proper review
> before I post it.

The jbd2 changes look confusing (although technically correct) to me - we
should *always* run in NOFS context in those places, so having GFP_KERNEL
there looks like it is unnecessarily hiding what is going on. In those
places I'd prefer to keep GFP_NOFS, or otherwise make it very clear that
these allocations are expected to be GFP_NOFS (and assert that). Otherwise
the changes look good to me.

								Honza

> ---
>  fs/ext4/inode.c       |  4 ++--
>  fs/ext4/mballoc.c     | 14 +++++++-------
>  fs/ext4/xattr.c       |  2 +-
>  fs/jbd2/journal.c     |  4 ++--
>  fs/jbd2/revoke.c      |  2 +-
>  fs/jbd2/transaction.c |  2 +-
>  6 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b7d141c3b810..841cb8c4cb5e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
>  		return __ext4_journalled_writepage(page, len);
>  
>  	ext4_io_submit_init(&io_submit, wbc);
> -	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>  	if (!io_submit.io_end) {
>  		redirty_page_for_writepage(wbc, page);
>  		unlock_page(page);
> @@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  	int err = 0;
>  
>  	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
> -				   mapping_gfp_constraint(mapping, ~__GFP_FS));
> +				   mapping_gfp_mask(mapping));
>  	if (!page)
>  		return -ENOMEM;
>  
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d9fd184b049e..67b97cd6e3d6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
>  static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
>  			      struct ext4_buddy *e4b)
>  {
> -	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
> +	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
>  }
>  
>  static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
> @@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  
>  	/* We only do this if the grp has never been initialized */
>  	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> -		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> +		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
>  		if (ret)
>  			return ret;
>  	}
> @@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
>  	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
> -	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
> +	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
>  	if (pa == NULL)
>  		return -ENOMEM;
>  
> @@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
>  	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
>  	BUG_ON(ext4_pspace_cachep == NULL);
> -	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
> +	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
>  	if (pa == NULL)
>  		return -ENOMEM;
>  
> @@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  		}
>  	}
>  
> -	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
> +	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
>  	if (!ac) {
>  		ar->len = 0;
>  		*errp = -ENOMEM;
> @@ -4813,7 +4813,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  
>  	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
>  	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
> -				     GFP_NOFS|__GFP_NOFAIL);
> +				     GFP_KERNEL|__GFP_NOFAIL);
>  	if (err)
>  		goto error_return;
>  
> @@ -4832,7 +4832,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  		 * to fail.
>  		 */
>  		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
> -				GFP_NOFS|__GFP_NOFAIL);
> +				GFP_KERNEL|__GFP_NOFAIL);
>  		new_entry->efd_start_cluster = bit;
>  		new_entry->efd_group = block_group;
>  		new_entry->efd_count = count_clusters;
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 172317462238..f68e8c87f9f2 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1650,7 +1650,7 @@ ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
>  		       EXT4_XATTR_REFCOUNT_MAX;
>  	int error;
>  
> -	error = mb_cache_entry_create(ext4_mb_cache, GFP_NOFS, hash,
> +	error = mb_cache_entry_create(ext4_mb_cache, GFP_KERNEL, hash,
>  				      bh->b_blocknr, reusable);
>  	if (error) {
>  		if (error == -EBUSY)
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 3a449150f834..bd29daa975a5 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -379,7 +379,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  	 */
>  	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
>  
> -	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
> +	new_bh = alloc_buffer_head(GFP_KERNEL|__GFP_NOFAIL);
>  
>  	/* keep subsequent assertions sane */
>  	atomic_set(&new_bh->b_count, 1);
> @@ -2375,7 +2375,7 @@ static struct journal_head *journal_alloc_journal_head(void)
>  #ifdef CONFIG_JBD2_DEBUG
>  	atomic_inc(&nr_journal_heads);
>  #endif
> -	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
> +	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_KERNEL);
>  	if (!ret) {
>  		jbd_debug(1, "out of memory for journal_head\n");
>  		pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index cfc38b552118..c9c347468c5b 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -141,7 +141,7 @@ static int insert_revoke_hash(journal_t *journal, unsigned long long blocknr,
>  {
>  	struct list_head *hash_list;
>  	struct jbd2_revoke_record_s *record;
> -	gfp_t gfp_mask = GFP_NOFS;
> +	gfp_t gfp_mask = GFP_KERNEL;
>  
>  	if (journal_oom_retry)
>  		gfp_mask |= __GFP_NOFAIL;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 35a5d3d76182..a7e50eb330a8 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -974,7 +974,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  			JBUFFER_TRACE(jh, "allocate memory for buffer");
>  			jbd_unlock_bh_state(bh);
>  			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
> -						   GFP_NOFS | __GFP_NOFAIL);
> +						   GFP_KERNEL | __GFP_NOFAIL);
>  			goto repeat;
>  		}
>  		jh->b_frozen_data = frozen_buffer;
> -- 
> 2.11.0
> 
> -- 
> Michal Hocko
> SUSE Labs
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
@ 2017-01-19  9:22                   ` Jan Kara
  0 siblings, 0 replies; 167+ messages in thread
From: Jan Kara @ 2017-01-19  9:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, Theodore Ts'o, linux-mm, linux-fsdevel,
	Andrew Morton, Dave Chinner, djwong, Chris Mason, David Sterba,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > But before going to play with that I am really wondering whether we need
> > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > point of view. What would cause a deadlock in no journal mode?
> > > > 
> > > > We still have the original problem for why we need GFP_NOFS even in
> > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > we don't want to recurse back into the file system's writeback path.
> > > 
> > > But we do not enter the writeback path from the direct reclaim. Or do
> > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > There is only try_to_release_page where we get back to the filesystems
> > > but I do not see any NOFS protection in ext4_releasepage.
> > 
> > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > shrinkers. That's it. So the recursion possibilities are rather more limited
> > than they used to be several years ago and we likely do not need as much
> > GFP_NOFS protection as we used to.
> 
> Thanks for making my remark more clear Jack! I would just want to add
> that I was playing with the patch below (it is basically
> GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> debugging patch which means they are called from within transaction) and
> it didn't hit the lockdep when running xfstests both with or without the
> enabled journal.
> 
> So am I still missing something or the nojournal mode is safe and the
> current series is OK wrt. ext*?

I'm convinced the current series is OK, only real life will tell us whether
we missed something or not ;)

> The following patch in its current form is WIP and needs a proper review
> before I post it.

So the jbd2 changes look confusing (although technically correct) to me - we
*always* should run in NOFS context in those places, so having GFP_KERNEL
there looks like it is unnecessarily hiding what is going on. So in those
places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
allocations are expected to be GFP_NOFS (and assert that). Otherwise the
changes look good to me.
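
For illustration only, the "assert that" idea could look roughly like the
userspace model below. Everything in it is an assumption made for the sketch:
the flag bit value, the task_flags variable standing in for current->flags,
and the helper names model_nofs_save(), model_nofs_restore(),
model_in_nofs_scope() and model_journal_alloc(). It is not the jbd2 code; it
only shows an allocation site that passes no explicit NOFS flag while still
documenting and checking that it runs inside a NOFS scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical bit for this model; the real kernel value differs. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

/* Stand-in for current->flags of the running task. */
static unsigned int task_flags;

/* Enter a NOFS scope; returns the previous state of the bit. */
static unsigned int model_nofs_save(void)
{
	unsigned int old = task_flags & MODEL_PF_MEMALLOC_NOFS;

	task_flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope by putting the saved bit back. */
static void model_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* True while the task is inside a NOFS scope. */
static bool model_in_nofs_scope(void)
{
	return (task_flags & MODEL_PF_MEMALLOC_NOFS) != 0;
}

/*
 * Hypothetical allocation site: no NOFS argument is passed; instead the
 * expectation "we are always in NOFS context here" is asserted.
 */
static void *model_journal_alloc(size_t size)
{
	assert(model_in_nofs_scope());
	return malloc(size);
}
```

A caller would bracket the transaction with model_nofs_save() and
model_nofs_restore(); calling model_journal_alloc() outside such a scope
trips the assertion, which is the visibility asked for above.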

								Honza

> ---
>  fs/ext4/inode.c       |  4 ++--
>  fs/ext4/mballoc.c     | 14 +++++++-------
>  fs/ext4/xattr.c       |  2 +-
>  fs/jbd2/journal.c     |  4 ++--
>  fs/jbd2/revoke.c      |  2 +-
>  fs/jbd2/transaction.c |  2 +-
>  6 files changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b7d141c3b810..841cb8c4cb5e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2085,7 +2085,7 @@ static int ext4_writepage(struct page *page,
>  		return __ext4_journalled_writepage(page, len);
>  
>  	ext4_io_submit_init(&io_submit, wbc);
> -	io_submit.io_end = ext4_init_io_end(inode, GFP_NOFS);
> +	io_submit.io_end = ext4_init_io_end(inode, GFP_KERNEL);
>  	if (!io_submit.io_end) {
>  		redirty_page_for_writepage(wbc, page);
>  		unlock_page(page);
> @@ -3794,7 +3794,7 @@ static int __ext4_block_zero_page_range(handle_t *handle,
>  	int err = 0;
>  
>  	page = find_or_create_page(mapping, from >> PAGE_SHIFT,
> -				   mapping_gfp_constraint(mapping, ~__GFP_FS));
> +				   mapping_gfp_mask(mapping));
>  	if (!page)
>  		return -ENOMEM;
>  
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index d9fd184b049e..67b97cd6e3d6 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1251,7 +1251,7 @@ ext4_mb_load_buddy_gfp(struct super_block *sb, ext4_group_t group,
>  static int ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group,
>  			      struct ext4_buddy *e4b)
>  {
> -	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_NOFS);
> +	return ext4_mb_load_buddy_gfp(sb, group, e4b, GFP_KERNEL);
>  }
>  
>  static void ext4_mb_unload_buddy(struct ext4_buddy *e4b)
> @@ -2054,7 +2054,7 @@ static int ext4_mb_good_group(struct ext4_allocation_context *ac,
>  
>  	/* We only do this if the grp has never been initialized */
>  	if (unlikely(EXT4_MB_GRP_NEED_INIT(grp))) {
> -		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_NOFS);
> +		int ret = ext4_mb_init_group(ac->ac_sb, group, GFP_KERNEL);
>  		if (ret)
>  			return ret;
>  	}
> @@ -3600,7 +3600,7 @@ ext4_mb_new_inode_pa(struct ext4_allocation_context *ac)
>  	BUG_ON(ac->ac_status != AC_STATUS_FOUND);
>  	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
> -	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
> +	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
>  	if (pa == NULL)
>  		return -ENOMEM;
>  
> @@ -3694,7 +3694,7 @@ ext4_mb_new_group_pa(struct ext4_allocation_context *ac)
>  	BUG_ON(!S_ISREG(ac->ac_inode->i_mode));
>  
>  	BUG_ON(ext4_pspace_cachep == NULL);
> -	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_NOFS);
> +	pa = kmem_cache_alloc(ext4_pspace_cachep, GFP_KERNEL);
>  	if (pa == NULL)
>  		return -ENOMEM;
>  
> @@ -4479,7 +4479,7 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
>  		}
>  	}
>  
> -	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_NOFS);
> +	ac = kmem_cache_zalloc(ext4_ac_cachep, GFP_KERNEL);
>  	if (!ac) {
>  		ar->len = 0;
>  		*errp = -ENOMEM;
> @@ -4813,7 +4813,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  
>  	/* __GFP_NOFAIL: retry infinitely, ignore TIF_MEMDIE and memcg limit. */
>  	err = ext4_mb_load_buddy_gfp(sb, block_group, &e4b,
> -				     GFP_NOFS|__GFP_NOFAIL);
> +				     GFP_KERNEL|__GFP_NOFAIL);
>  	if (err)
>  		goto error_return;
>  
> @@ -4832,7 +4832,7 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
>  		 * to fail.
>  		 */
>  		new_entry = kmem_cache_alloc(ext4_free_data_cachep,
> -				GFP_NOFS|__GFP_NOFAIL);
> +				GFP_KERNEL|__GFP_NOFAIL);
>  		new_entry->efd_start_cluster = bit;
>  		new_entry->efd_group = block_group;
>  		new_entry->efd_count = count_clusters;
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 172317462238..f68e8c87f9f2 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -1650,7 +1650,7 @@ ext4_xattr_cache_insert(struct mb_cache *ext4_mb_cache, struct buffer_head *bh)
>  		       EXT4_XATTR_REFCOUNT_MAX;
>  	int error;
>  
> -	error = mb_cache_entry_create(ext4_mb_cache, GFP_NOFS, hash,
> +	error = mb_cache_entry_create(ext4_mb_cache, GFP_KERNEL, hash,
>  				      bh->b_blocknr, reusable);
>  	if (error) {
>  		if (error == -EBUSY)
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 3a449150f834..bd29daa975a5 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -379,7 +379,7 @@ int jbd2_journal_write_metadata_buffer(transaction_t *transaction,
>  	 */
>  	J_ASSERT_BH(bh_in, buffer_jbddirty(bh_in));
>  
> -	new_bh = alloc_buffer_head(GFP_NOFS|__GFP_NOFAIL);
> +	new_bh = alloc_buffer_head(GFP_KERNEL|__GFP_NOFAIL);
>  
>  	/* keep subsequent assertions sane */
>  	atomic_set(&new_bh->b_count, 1);
> @@ -2375,7 +2375,7 @@ static struct journal_head *journal_alloc_journal_head(void)
>  #ifdef CONFIG_JBD2_DEBUG
>  	atomic_inc(&nr_journal_heads);
>  #endif
> -	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_NOFS);
> +	ret = kmem_cache_zalloc(jbd2_journal_head_cache, GFP_KERNEL);
>  	if (!ret) {
>  		jbd_debug(1, "out of memory for journal_head\n");
>  		pr_notice_ratelimited("ENOMEM in %s, retrying.\n", __func__);
> diff --git a/fs/jbd2/revoke.c b/fs/jbd2/revoke.c
> index cfc38b552118..c9c347468c5b 100644
> --- a/fs/jbd2/revoke.c
> +++ b/fs/jbd2/revoke.c
> @@ -141,7 +141,7 @@ static int insert_revoke_hash(journal_t *journal, unsigned long long blocknr,
>  {
>  	struct list_head *hash_list;
>  	struct jbd2_revoke_record_s *record;
> -	gfp_t gfp_mask = GFP_NOFS;
> +	gfp_t gfp_mask = GFP_KERNEL;
>  
>  	if (journal_oom_retry)
>  		gfp_mask |= __GFP_NOFAIL;
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 35a5d3d76182..a7e50eb330a8 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -974,7 +974,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh,
>  			JBUFFER_TRACE(jh, "allocate memory for buffer");
>  			jbd_unlock_bh_state(bh);
>  			frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size,
> -						   GFP_NOFS | __GFP_NOFAIL);
> +						   GFP_KERNEL | __GFP_NOFAIL);
>  			goto repeat;
>  		}
>  		jh->b_frozen_data = frozen_buffer;
> -- 
> 2.11.0
> 
> -- 
> Michal Hocko
> SUSE Labs
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-19  9:22                   ` Jan Kara
  (?)
@ 2017-01-19  9:44                     ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-19  9:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, linux-mm, linux-fsdevel, Andrew Morton,
	Dave Chinner, djwong, Chris Mason, David Sterba, ceph-devel,
	cluster-devel, linux-nfs, logfs, linux-xfs, linux-ext4,
	linux-btrfs, linux-mtd, reiserfs-devel, linux-ntfs-dev,
	linux-f2fs-devel, linux-afs, LKML

On Thu 19-01-17 10:22:36, Jan Kara wrote:
> On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > But before going to play with that I am really wondering whether we need
> > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > 
> > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > > we don't want to recurse back into the file system's writeback path.
> > > > 
> > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > There is only try_to_release_page where we get back to the filesystems
> > > > but I do not see any NOFS protection in ext4_releasepage.
> > > 
> > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > > shrinkers. That's it. So the recursion possibilities are rather more limited
> > > than they used to be several years ago and we likely do not need as much
> > > GFP_NOFS protection as we used to.
> > 
> > Thanks for making my remark more clear Jack! I would just want to add
> > that I was playing with the patch below (it is basically
> > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > debugging patch which means they are called from within transaction) and
> > it didn't hit the lockdep when running xfstests both with or without the
> > enabled journal.
> > 
> > So am I still missing something or the nojournal mode is safe and the
> > current series is OK wrt. ext*?
> 
> I'm convinced the current series is OK, only real life will tell us whether
> we missed something or not ;)

I would like to extend the changelog of "jbd2: mark the transaction
context with the scope GFP_NOFS context".

"
Please note that setups without a journal do not suffer from potential
recursion problems and so do not need the scope protection, because
neither ->releasepage nor ->evict_inode (which are the only fs entry
points from direct reclaim) can reenter a locked context which is
currently doing the allocation.
"
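
A rough userspace model of what "mark the transaction context with the scope"
means, under stated assumptions: the bit value, the task_flags variable
(standing in for current->flags), struct model_handle and both helpers are
hypothetical names for this sketch and not the jbd2 implementation. A handle
opened on a journal sets the NOFS bit and remembers the previous state so that
nested handles unwind correctly; a handle opened without a journal leaves the
flags alone, matching the note above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit for this model; the real kernel value differs. */
#define MODEL_PF_MEMALLOC_NOFS 0x1u

/* Stand-in for current->flags of the running task. */
static unsigned int task_flags;

/* Hypothetical handle: remembers the scope state seen when it was started. */
struct model_handle {
	bool has_journal;
	unsigned int saved_nofs;
};

static void model_handle_start(struct model_handle *h, bool has_journal)
{
	h->has_journal = has_journal;
	h->saved_nofs = 0;
	if (!has_journal)
		return;	/* no journal locks to recurse on: no scope needed */

	h->saved_nofs = task_flags & MODEL_PF_MEMALLOC_NOFS;
	task_flags |= MODEL_PF_MEMALLOC_NOFS;
}

static void model_handle_stop(const struct model_handle *h)
{
	if (!h->has_journal)
		return;

	/* Restore whatever the enclosing context had, set or clear. */
	task_flags = (task_flags & ~MODEL_PF_MEMALLOC_NOFS) | h->saved_nofs;
}
```

Saving the old bit instead of unconditionally clearing it on stop is what
keeps an outer scope intact when an inner handle finishes first.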
 
> > The following patch in its current form is WIP and needs a proper review
> > before I post it.
> 
> So jbd2 changes look confusing (although technically correct) to me - we
> *always* should run in NOFS context in those place so having GFP_KERNEL
> there looks like it is unnecessarily hiding what is going on. So in those
> places I'd prefer to keep GFP_NOFS or somehow else make it very clear these
> allocations are expected to be GFP_NOFS (and assert that). Otherwise the
> changes look good to me.

I would really like to get rid of most direct NOFS usage and only
dictate it via the scope API; otherwise I suspect we will just grow more
users over time and end up in the same situation we are in now.
In principle only the context which changes the reclaim reentrancy policy
should care about NOFS and everybody else should just pretend nothing
like that exists. There might be a few exceptions of course; I am not yet
sure whether jbd2 is such a case. But I am not proposing this change yet
(thanks for checking anyway)...
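
To make the principle concrete, here is a minimal userspace model of how a
caller can pass a GFP_KERNEL-like mask everywhere and let the allocator narrow
it when a scope is active. The bit values, the task_flags variable and
model_effective_gfp() are assumptions for this sketch only; the real masks
contain more bits than the two modelled here.

```c
#include <assert.h>

/* Hypothetical bit values, chosen only for this model. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_KERNEL	(MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS		(MODEL_GFP_IO)
#define MODEL_PF_MEMALLOC_NOFS	0x1u

/* Stand-in for current->flags of the running task. */
static unsigned int task_flags;

/*
 * Applied on the allocator side, not by callers: inside a NOFS scope the
 * FS bit is dropped, so reclaim will not reenter the filesystem.
 */
static unsigned int model_effective_gfp(unsigned int gfp)
{
	if (task_flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}
```

With this shape only the code that opens the scope needs to know about NOFS;
every allocation site below it keeps asking for the plain kernel mask.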
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-19  9:44                     ` Michal Hocko
  (?)
@ 2017-01-26  7:44                       ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-26  7:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Thu 19-01-17 10:44:05, Michal Hocko wrote:
> On Thu 19-01-17 10:22:36, Jan Kara wrote:
> > On Thu 19-01-17 09:39:56, Michal Hocko wrote:
> > > On Tue 17-01-17 18:29:25, Jan Kara wrote:
> > > > On Tue 17-01-17 17:16:19, Michal Hocko wrote:
> > > > > > > But before going to play with that I am really wondering whether we need
> > > > > > > all this with no journal at all. AFAIU what Jack told me it is the
> > > > > > > journal lock(s) which is the biggest problem from the reclaim recursion
> > > > > > > point of view. What would cause a deadlock in no journal mode?
> > > > > > 
> > > > > > We still have the original problem for why we need GFP_NOFS even in
> > > > > > ext2.  If we are in a writeback path, and we need to allocate memory,
> > > > > > we don't want to recurse back into the file system's writeback path.
> > > > > 
> > > > > But we do not enter the writeback path from the direct reclaim. Or do
> > > > > you mean something other than pageout()'s mapping->a_ops->writepage?
> > > > > There is only try_to_release_page where we get back to the filesystems
> > > > > but I do not see any NOFS protection in ext4_releasepage.
> > > > 
> > > > Maybe to expand a bit: These days, direct reclaim can call ->releasepage()
> > > > callback, ->evict_inode() callback (and only for inodes with i_nlink > 0),
> > > > shrinkers. That's it. So the recursion possibilities are rather more limited
> > > > than they used to be several years ago and we likely do not need as much
> > > > GFP_NOFS protection as we used to.
> > > 
> > > Thanks for making my remark more clear Jack! I would just want to add
> > > that I was playing with the patch below (it is basically
> > > GFP_NOFS->GFP_KERNEL for all allocations which trigger warning from the
> > > debugging patch which means they are called from within transaction) and
> > > it didn't hit the lockdep when running xfstests both with or without the
> > > enabled journal.
> > > 
> > > So am I still missing something or the nojournal mode is safe and the
> > > current series is OK wrt. ext*?
> > 
> > I'm convinced the current series is OK, only real life will tell us whether
> > we missed something or not ;)
> 
> I would like to extend the changelog of "jbd2: mark the transaction
> context with the scope GFP_NOFS context".
> 
> "
> Please note that setups without journal do not suffer from potential
> recursion problems and so they do not need the scope protection because
> neither ->releasepage nor ->evict_inode (which are the only fs entry
> points from the direct reclaim) can reenter a locked context which is
> doing the allocation currently.
> "

Could you comment on this Ted, please?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-26  7:44                       ` Michal Hocko
  (?)
  (?)
@ 2017-01-27  6:13                         ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2017-01-27  6:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > I'm convinced the current series is OK, only real life will tell us whether
> > > we missed something or not ;)
> > 
> > I would like to extend the changelog of "jbd2: mark the transaction
> > context with the scope GFP_NOFS context".
> > 
> > "
> > Please note that setups without journal do not suffer from potential
> > recursion problems and so they do not need the scope protection because
> > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > points from the direct reclaim) can reenter a locked context which is
> > doing the allocation currently.
> > "
> 
> Could you comment on this Ted, please?

I guess....   so there still is one way this could screw us, and it's this reason for GFP_NOFS:

        - to prevent from stack overflows during the reclaim because
          the allocation is performed from a deep context already

The writepages call stack can be pretty deep.  (Especially if we're
using ext4 in no journal mode over, say, iSCSI.)

How much stack space can get consumed by a reclaim?

						- Ted
    		 

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-27  6:13                         ` Theodore Ts'o
  (?)
@ 2017-01-27  9:37                           ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-27  9:37 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Fri 27-01-17 01:13:18, Theodore Ts'o wrote:
> On Thu, Jan 26, 2017 at 08:44:55AM +0100, Michal Hocko wrote:
> > > > I'm convinced the current series is OK, only real life will tell us whether
> > > > we missed something or not ;)
> > > 
> > > I would like to extend the changelog of "jbd2: mark the transaction
> > > context with the scope GFP_NOFS context".
> > > 
> > > "
> > > Please note that setups without journal do not suffer from potential
> > > recursion problems and so they do not need the scope protection because
> > > neither ->releasepage nor ->evict_inode (which are the only fs entry
> > > points from the direct reclaim) can reenter a locked context which is
> > > doing the allocation currently.
> > > "
> > 
> > Could you comment on this Ted, please?
> 
> I guess....   so there still is one way this could screw us, and it's this reason for GFP_NOFS:
> 
>         - to prevent from stack overflows during the reclaim because
>           the allocation is performed from a deep context already
> 
> The writepages call stack can be pretty deep.  (Especially if we're
> using ext4 in no journal mode over, say, iSCSI.)
> 
> How much stack space can get consumed by a reclaim?

./scripts/stackusage with allyesconfig says:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node             352     static
./mm/vmscan.c:2317      shrink_node_memcg       560     static
./mm/vmscan.c:1692      shrink_inactive_list    688     static
./mm/vmscan.c:908       shrink_page_list        608     static

So this would be 3512 bytes for the standard LRU reclaim, whether we have
GFP_FS or not. shrink_page_list can recurse to releasepage but there is
no NOFS protection there, so it doesn't make much sense to check this
path. So we are left with the slab shrinkers path:

./mm/page_alloc.c:3745  __alloc_pages_nodemask  264     static
./mm/page_alloc.c:3531  __alloc_pages_slowpath  520     static
./mm/vmscan.c:2946      try_to_free_pages       216     static
./mm/vmscan.c:2753      do_try_to_free_pages    304     static
./mm/vmscan.c:2517      shrink_node             352     static
./mm/vmscan.c:427       shrink_slab             336     static
./fs/super.c:56         super_cache_scan        104     static << here we have the NOFS protection
./fs/dcache.c:1089      prune_dcache_sb         152     static
./fs/dcache.c:939       shrink_dentry_list      96      static
./fs/dcache.c:509       __dentry_kill           72      static
./fs/dcache.c:323       dentry_unlink_inode     64      static
./fs/inode.c:1527       iput                    80      static
./fs/inode.c:532        evict                   72      static

This is where the fs specific callbacks play a role, and I am not sure
which paths can pass through for ext4 in nojournal mode and how much
of the stack they can eat. But currently we are at +536 wrt. the NOFS
context. This is quite a lot but still much less (2632 vs. 3512) than
the regular reclaim. So there is quite some stack space to eat... I am
wondering whether we really have to treat nojournal mode specially
just because of the stack usage?
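
The arithmetic above can be double-checked with a small userspace C
sketch. Only the byte counts come from the two stackusage listings; the
array names, the NFRAMES() macro and sum_frames() are made up for this
illustration and are not part of any kernel tree.

```c
#include <assert.h>
#include <stddef.h>

/* Frame sizes in bytes, copied from the two stackusage listings. */

/* LRU path: __alloc_pages_nodemask() down to shrink_page_list(). */
static const unsigned int lru_frames[] = {
	264, 520, 216, 304, 352, 560, 688, 608
};

/* Shrinker path: __alloc_pages_nodemask() down to super_cache_scan(). */
static const unsigned int shrinker_frames[] = {
	264, 520, 216, 304, 352, 336, 104
};

/* Below super_cache_scan(): prune_dcache_sb() down to evict(). */
static const unsigned int fs_frames[] = {
	152, 96, 72, 64, 80, 72
};

/* Number of elements in a fixed-size array (hypothetical helper). */
#define NFRAMES(a) (sizeof(a) / sizeof((a)[0]))

/* Add up the per-function frame sizes of one call chain. */
unsigned int sum_frames(const unsigned int *frames, size_t n)
{
	unsigned int total = 0;

	assert(frames != NULL);
	for (size_t i = 0; i < n; i++)
		total += frames[i];
	return total;
}
```

Summing lru_frames gives the 3512 bytes of the LRU path, fs_frames gives
the +536 below the NOFS check, and shrinker_frames plus fs_frames gives
the 2632 bytes of the whole shrinker path.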

If this ever turns out to be a problem (and with the vmapped stacks we
have a good chance of getting a proper stack trace on a potential
overflow), we can add the scope API around the problematic code path
with an explanation of why it is needed.
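
To make "add the scope API around the problematic code path" concrete,
here is a userspace toy model of the save/restore pattern, loosely
modelled on memalloc_nofs_save()/memalloc_nofs_restore(). Everything
with a toy_/TOY_ prefix is hypothetical: the flag values are arbitrary
and a plain global stands in for the per-task flags. It is a sketch of
the semantics only, not kernel code.

```c
#include <assert.h>

/* Arbitrary stand-ins for the real gfp bits (assumption). */
#define TOY_GFP_IO     0x1u
#define TOY_GFP_FS     0x2u
#define TOY_GFP_KERNEL (TOY_GFP_IO | TOY_GFP_FS)
#define TOY_GFP_NOFS   (TOY_GFP_IO)

/* Stand-in for a per-task "inside a NOFS scope" flag (assumption). */
#define TOY_PF_MEMALLOC_NOFS 0x1u

static unsigned int toy_task_flags; /* models the current task's flags */

/* Enter a NOFS scope; returns the previous state for the restore call. */
unsigned int toy_nofs_save(void)
{
	unsigned int old = toy_task_flags & TOY_PF_MEMALLOC_NOFS;

	toy_task_flags |= TOY_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope by putting back the state returned by save. */
void toy_nofs_restore(unsigned int old)
{
	assert((old & ~TOY_PF_MEMALLOC_NOFS) == 0);
	toy_task_flags = (toy_task_flags & ~TOY_PF_MEMALLOC_NOFS) | old;
}

/* What the allocator would see: the FS bit is dropped inside a scope. */
unsigned int toy_effective_gfp(unsigned int gfp)
{
	if (toy_task_flags & TOY_PF_MEMALLOC_NOFS)
		gfp &= ~TOY_GFP_FS;
	return gfp;
}
```

Because save returns the previous state, scopes nest: restoring an inner
scope leaves the outer one in force, and a TOY_GFP_KERNEL request only
regains its FS bit once the outermost scope has been restored.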

Does that make sense to you?

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-27  9:37                           ` Michal Hocko
  (?)
@ 2017-01-27 16:40                             ` Theodore Ts'o
  -1 siblings, 0 replies; 167+ messages in thread
From: Theodore Ts'o @ 2017-01-27 16:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> If this ever turns out to be a problem (and with the vmapped stacks we
> have a good chance of getting a proper stack trace on a potential
> overflow), we can add the scope API around the problematic code path
> with an explanation of why it is needed.

Yeah, or maybe we can automate it?  Can the reclaim code check how
much stack space is left and do the right thing automatically?
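
One way to picture such an automatic check is the userspace sketch
below. It is purely hypothetical: the toy_ names, the flag value and the
idea of a fixed reserve are assumptions made for illustration, and it
assumes a stack that grows downwards. Nothing here is an existing kernel
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Arbitrary stand-in for the "may enter fs code" gfp bit (assumption). */
#define TOY_GFP_FS 0x2u

/*
 * Bytes left between the current stack pointer and the lowest usable
 * stack address, assuming the stack grows towards lower addresses.
 */
size_t toy_stack_left(uintptr_t sp, uintptr_t stack_lowest)
{
	return sp > stack_lowest ? (size_t)(sp - stack_lowest) : 0;
}

/*
 * Drop the FS bit when fewer than 'reserve' bytes of stack remain, so
 * that reclaim would skip the fs callbacks instead of going deeper.
 */
unsigned int toy_gfp_for_reclaim(unsigned int gfp, size_t stack_left,
				 size_t reserve)
{
	if (stack_left < reserve)
		gfp &= ~TOY_GFP_FS;
	return gfp;
}
```

The open question in such a scheme is the value of the reserve: it would
have to cover the deepest fs callback chain reachable from reclaim,
which is exactly the number nobody has measured for nojournal mode yet.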

The reason why I'm nervous is that nojournal mode is not a common
configuration, and "wait until production systems start failing" is
not a strategy that I or many SRE-types find.... comforting.

So the more we can assure ourselves that the right thing will happen
automatically, or that lockdep will detect a required GFP_NOFS when
running tests, the happier I'll be.

					- Ted

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-27 16:40                             ` Theodore Ts'o
  (?)
@ 2017-01-28  7:32                               ` Christoph Hellwig
  -1 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2017-01-28  7:32 UTC (permalink / raw)
  To: Theodore Ts'o, Michal Hocko, Jan Kara, linux-mm,
	linux-fsdevel, Andrew Morton, Dave Chinner, djwong, Chris Mason,
	David Sterba, ceph-devel, cluster-devel, linux-nfs, logfs,
	linux-xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Fri, Jan 27, 2017 at 11:40:42AM -0500, Theodore Ts'o wrote:
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find.... comforting.

What does SRE stand for?

^ permalink raw reply	[flat|nested] 167+ messages in thread

* [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
@ 2017-01-28  7:32                               ` Christoph Hellwig
  0 siblings, 0 replies; 167+ messages in thread
From: Christoph Hellwig @ 2017-01-28  7:32 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Fri, Jan 27, 2017 at 11:40:42AM -0500, Theodore Ts'o wrote:
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find.... comforting.

What does SRE stand for?



^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [Cluster-devel] [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-28  7:32                               ` Christoph Hellwig
  (?)
@ 2017-01-28  8:17                                 ` David Lang
  -1 siblings, 0 replies; 167+ messages in thread
From: David Lang @ 2017-01-28  8:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Theodore Ts'o, Michal Hocko, Jan Kara, linux-mm,
	linux-fsdevel, Andrew Morton, Dave Chinner, djwong, Chris Mason,
	David Sterba, ceph-devel, cluster-devel, linux-nfs, logfs,
	linux-xfs, linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Fri, 27 Jan 2017, Christoph Hellwig wrote:

> On Fri, Jan 27, 2017 at 11:40:42AM -0500, Theodore Ts'o wrote:
>> The reason why I'm nervous is that nojournal mode is not a common
>> configuration, and "wait until production systems start failing" is
>> not a strategy that I or many SRE-types find.... comforting.
>
> What does SRE stand for?

Site Reliability Engineer, a mix of operations and engineering (DevOps++)

David Lang

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-27 16:40                             ` Theodore Ts'o
  (?)
@ 2017-01-30  8:12                               ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-01-30  8:12 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > If this ever turn out to be a problem and with the vmapped stacks we
> > have good chances to get a proper stack traces on a potential overflow
> > we can add the scope API around the problematic code path with the
> > explanation why it is needed.
> 
> Yeah, or maybe we can automate it?  Can the reclaim code check how
> much stack space is left and do the right thing automatically?

I am not sure how to do that. Checking for some magic value sounds quite
fragile to me. It also sounds a bit strange to focus only on the reclaim
while other code paths might suffer from the same problem.

What is actually the deepest possible call chain from the slab reclaim
where I stopped? I have tried to follow that path but hit the callback
wall quite early.
 
> The reason why I'm nervous is that nojournal mode is not a common
> configuration, and "wait until production systems start failing" is
> not a strategy that I or many SRE-types find.... comforting.

I understand that, but I would be much happier if we made the
decision based on actual data rather than on a fear that something
would break down.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread
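
The "scope API" mentioned above is the memalloc_nofs_{save,restore} pair
introduced by patch 3/8 of this series. The following is only a userspace
sketch of the idea, not the kernel implementation: the struct, the MODEL_*
bit values and the model_* function names are all assumptions made for
illustration. It shows how a per-task flag can strip the FS bit from every
allocation request made inside the scope, and why saving the old state lets
scopes nest.

```c
#include <assert.h>

/* Illustrative bit values only; these are NOT the kernel's definitions. */
#define MODEL_GFP_IO            0x1u
#define MODEL_GFP_FS            0x2u
#define MODEL_GFP_KERNEL        (MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_PF_MEMALLOC_NOFS  0x1u

/* Hypothetical stand-in for the per-task state the scope is recorded in. */
struct model_task {
	unsigned int flags;
};

/* Enter a NOFS scope; return the previous state so scopes can nest. */
static unsigned int model_nofs_save(struct model_task *t)
{
	unsigned int old = t->flags & MODEL_PF_MEMALLOC_NOFS;

	t->flags |= MODEL_PF_MEMALLOC_NOFS;
	return old;
}

/* Leave a NOFS scope by putting back whatever state was saved on entry. */
static void model_nofs_restore(struct model_task *t, unsigned int old)
{
	/* Only the scope bit may be handed back. */
	assert((old & ~MODEL_PF_MEMALLOC_NOFS) == 0);
	t->flags = (t->flags & ~MODEL_PF_MEMALLOC_NOFS) | old;
}

/* What an allocation inside the scope would effectively request. */
static unsigned int model_effective_gfp(const struct model_task *t,
					unsigned int gfp)
{
	if (t->flags & MODEL_PF_MEMALLOC_NOFS)
		gfp &= ~MODEL_GFP_FS;
	return gfp;
}
```

In this model a caller would wrap the problematic code path in
model_nofs_save()/model_nofs_restore() and keep passing MODEL_GFP_KERNEL
inside it; the comment next to the save call is the natural place for the
explanation of why the scope is needed.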

* Re: [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction"
  2017-01-30  8:12                               ` Michal Hocko
  (?)
@ 2017-02-03 15:32                                 ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-02-03 15:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Jan Kara, linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner,
	djwong, Chris Mason, David Sterba, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 30-01-17 09:12:10, Michal Hocko wrote:
> On Fri 27-01-17 11:40:42, Theodore Ts'o wrote:
> > On Fri, Jan 27, 2017 at 10:37:35AM +0100, Michal Hocko wrote:
> > > If this ever turn out to be a problem and with the vmapped stacks we
> > > have good chances to get a proper stack traces on a potential overflow
> > > we can add the scope API around the problematic code path with the
> > > explanation why it is needed.
> > 
> > Yeah, or maybe we can automate it?  Can the reclaim code check how
> > much stack space is left and do the right thing automatically?
> 
> I am not sure how to do that. Checking for some magic value sounds quite
> fragile to me. It also sounds a bit strange to focus only on the reclaim
> while other code paths might suffer from the same problem.
> 
> What is actually the deepest possible call chain from the slab reclaim
> where I stopped? I have tried to follow that path but hit the callback
> wall quite early.
>  
> > The reason why I'm nervous is that nojournal mode is not a common
> > configuration, and "wait until production systems start failing" is
> > not a strategy that I or many SRE-types find.... comforting.
> 
> I understand that, but I would be much happier if we made the
> decision based on actual data rather than on a fear that something
> would break down.

Ping on this. I would really like to move forward here and target the
4.11 merge window. Is your concern serious enough to block this patch?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread

* Re: [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
  2017-01-17  7:54       ` Michal Hocko
  (?)
@ 2017-03-06 11:59         ` Michal Hocko
  -1 siblings, 0 replies; 167+ messages in thread
From: Michal Hocko @ 2017-03-06 11:59 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: linux-mm, linux-fsdevel, Andrew Morton, Dave Chinner, djwong,
	Chris Mason, David Sterba, Jan Kara, ceph-devel, cluster-devel,
	linux-nfs, logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Tue 17-01-17 08:54:50, Michal Hocko wrote:
> On Mon 16-01-17 22:01:18, Theodore Ts'o wrote:
> > On Fri, Jan 06, 2017 at 03:11:06PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > > 
> > > This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> > > sb_getblk_gfp is not really needed as
> > > sb_getblk
> > >   __getblk_gfp
> > >     __getblk_slow
> > >       grow_buffers
> > >         grow_dev_page
> > > 	  gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
> > > 
> > > so __GFP_FS is cleared unconditionally and therefore the above commit
> > > didn't have any real effect in fact.
> > > 
> > > This patch should not introduce any functional change. The main point
> > > of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> > > make the review of the remaining usage easier.
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > 
> > If I'm not mistaken, this patch is not dependent on any of the other
> > patches in this series (and the other patches are not dependent on
> > this one).  Hence, I could take this patch via the ext4 tree, correct?
> 
> Yes, that is correct

Hi Ted,
This doesn't seem to be in any of the branches [1]. I plan to resend the
whole scope NOFS series; should I add this to the pile, or are you going
to route it via your tree?

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 167+ messages in thread
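
The call chain quoted above ends in the expression
mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp. As a rough
userspace sketch of why that makes an explicit GFP_NOFS redundant here
(the MODEL_* bit values and model_* helpers are assumptions for
illustration, not kernel definitions), the FS bit is masked out of the
mapping's mask before the caller's gfp is OR-ed in, so the result has no FS
bit as long as the caller's own gfp does not carry one:

```c
/* Illustrative bit values only; these are NOT the kernel's definitions. */
#define MODEL_GFP_IO       0x1u
#define MODEL_GFP_FS       0x2u
#define MODEL_GFP_MOVABLE  0x4u
#define MODEL_GFP_NOFS     MODEL_GFP_IO	/* reclaim allowed, but no FS bit */

/* Models mapping_gfp_constraint(): restrict the mapping's mask. */
static unsigned int model_mapping_gfp_constraint(unsigned int mapping_mask,
						 unsigned int gfp_mask)
{
	return mapping_mask & gfp_mask;
}

/* Models the mask computation quoted from grow_dev_page(). */
static unsigned int model_grow_dev_page_mask(unsigned int mapping_mask,
					     unsigned int gfp)
{
	return model_mapping_gfp_constraint(mapping_mask, ~MODEL_GFP_FS) | gfp;
}
```

With a mapping mask that allows IO, FS and MOVABLE, passing only
MODEL_GFP_MOVABLE or passing MODEL_GFP_NOFS | MODEL_GFP_MOVABLE yields the
same result in this model, which matches the claim that the reverted commit
had no real effect. Both caller masks are assumed values chosen for the
sketch.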

end of thread, other threads:[~2017-03-06 11:59 UTC | newest]

Thread overview: 167+ messages
2017-01-06 14:10 [PATCH 0/8 v3] scope GFP_NOFS api Michal Hocko
2017-01-06 14:11 ` [PATCH 1/8] lockdep: allow to disable reclaim lockup detection Michal Hocko
2017-01-09 12:56   ` Vlastimil Babka
2017-01-06 14:11 ` [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS Michal Hocko
2017-01-09 12:59   ` Vlastimil Babka
2017-01-09 14:29     ` Michal Hocko
2017-01-09 20:58   ` Darrick J. Wong
2017-01-06 14:11 ` [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2017-01-09 13:04   ` Vlastimil Babka
2017-01-09 13:42     ` Michal Hocko
2017-01-09 13:59       ` Michal Hocko
2017-01-09 14:04       ` Vlastimil Babka
2017-01-06 14:11 ` [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio* Michal Hocko
2017-01-09 14:08   ` Vlastimil Babka
2017-01-09 14:25     ` Michal Hocko
2017-01-09 15:56   ` Brian Foster
2017-01-09 20:59   ` Darrick J. Wong
2017-01-06 14:11 ` [PATCH 5/8] jbd2: mark the transaction context with the scope GFP_NOFS context Michal Hocko
2017-01-06 14:11 ` [PATCH 6/8] jbd2: make the whole kjournald2 kthread NOFS safe Michal Hocko
2017-01-06 14:11 ` [PATCH 7/8] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp" Michal Hocko
2017-01-17  3:01   ` Theodore Ts'o
2017-01-17  7:54     ` Michal Hocko
2017-03-06 11:59       ` Michal Hocko
2017-01-06 14:11 ` [PATCH 8/8] Revert "ext4: fix wrong gfp type under transaction" Michal Hocko
2017-01-17  2:56   ` Theodore Ts'o
2017-01-17  8:24     ` Michal Hocko
2017-01-17 15:18       ` Michal Hocko
2017-01-17 15:59         ` Theodore Ts'o
2017-01-17 16:16           ` Michal Hocko
2017-01-17 17:29             ` Jan Kara
2017-01-19  8:39               ` Michal Hocko
2017-01-19  9:22                 ` Jan Kara
2017-01-19  9:44                   ` Michal Hocko
2017-01-26  7:44                     ` Michal Hocko
2017-01-27  6:13                       ` Theodore Ts'o
2017-01-27  9:37                         ` Michal Hocko
2017-01-27 16:40                           ` Theodore Ts'o
2017-01-28  7:32                             ` Christoph Hellwig
2017-01-28  8:17                               ` David Lang
2017-01-30  8:12                             ` Michal Hocko
2017-02-03 15:32                               ` Michal Hocko
2017-01-17 21:04           ` Andreas Dilger
2017-01-18  8:29             ` Michal Hocko
2017-01-06 14:18 ` [DEBUG PATCH 0/2] debug explicit GFP_NO{FS,IO} usage from the scope context Michal Hocko
2017-01-06 14:18   ` [DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context Michal Hocko
2017-01-06 14:18   ` [DEBUG PATCH 2/2] silent warnings which we cannot do anything about Michal Hocko
