All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Brian Foster, Darrick J. Wong, Michal Hocko,
	Peter Zijlstra (Intel),
	Vlastimil Babka

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb@thunk.org


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Brian Foster, Darrick J. Wong, Michal Hocko,
	Peter Zijlstra (Intel),
	Vlastimil Babka

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb@thunk.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Dave Chinner,
	djwong-DgEjT+Ai2ygdnm+yROfE0A, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel-u79uwXL29TY76Z2rM5mHXA,
	cluster-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, logfs-PCqxUs/MD9bYtjvyW6yDsg,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	reiserfs-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-ntfs-dev-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-f2fs-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-afs-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, LKML, Brian Foster,
	Darrick J. Wong, Michal Hocko, Peter Zijlstra (Intel),
	Vlastimil Babka

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Brian Foster, Darrick J. Wong, Michal Hocko,
	Peter Zijlstra (Intel),
	Vlastimil Babka

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb@thunk.org


^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, Dave Chinner,
	djwong-DgEjT+Ai2ygdnm+yROfE0A, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel-u79uwXL29TY76Z2rM5mHXA,
	cluster-devel-H+wXaHxf7aLQT0dZR+AlfA,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA, logfs-PCqxUs/MD9bYtjvyW6yDsg,
	linux-xfs-u79uwXL29TY76Z2rM5mHXA,
	linux-ext4-u79uwXL29TY76Z2rM5mHXA,
	linux-btrfs-u79uwXL29TY76Z2rM5mHXA,
	linux-mtd-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	reiserfs-devel-u79uwXL29TY76Z2rM5mHXA,
	linux-ntfs-dev-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-f2fs-devel-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-afs-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r, LKML, Brian Foster,
	Darrick J. Wong, Michal Hocko, Peter Zijlstra (Intel),
	Vlastimil Babka

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Brian Foster, Darrick J. Wong, Michal Hocko,
	Peter Zijlstra (Intel),
	Vlastimil Babka

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko@kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb@thunk.org

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 0/6 v4] scope GFP_NOFS api
@ 2017-02-06 14:07 ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,
I have posted the previous version here [1]. There are no real changes
in the implementation since then. Few acks added and one new user of
memalloc_noio_flags (in alloc_contig_range) converted. I have decided
to drop the last two ext4 related patches. One of them will be picked up
by Ted [2] and the other one will probably need more time to settle down.
I believe it is OK as is but let's not block the whole thing just because
of it.

There didn't seem to be any real objections and so I think we should
go and merge this to mmotm tree and target the next merge window. The
risk of regressions is really small because we do not remove any real
GFP_NOFS users yet.

I hope to get ext4 parts resolved in the follow up patches as well as
pull other filesystems in. There is still a lot work to do but having
the infrastructure in place should be very useful already.

The patchset is based on next-20170206

Diffstat says
 fs/jbd2/journal.c         |  7 +++++++
 fs/jbd2/transaction.c     | 11 +++++++++++
 fs/xfs/kmem.c             | 12 ++++++------
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_buf.c          |  8 ++++----
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/gfp.h       | 18 +++++++++++++++++-
 include/linux/jbd2.h      |  2 ++
 include/linux/sched.h     | 32 ++++++++++++++++++++++++++------
 kernel/locking/lockdep.c  |  6 +++++-
 lib/radix-tree.c          |  2 ++
 mm/page_alloc.c           | 10 ++++++----
 mm/vmscan.c               |  6 +++---
 15 files changed, 100 insertions(+), 36 deletions(-)

Shortlog:
Michal Hocko (6):
      lockdep: allow to disable reclaim lockup detection
      xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
      mm: introduce memalloc_nofs_{save,restore} API
      xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
      jbd2: mark the transaction context with the scope GFP_NOFS context
      jbd2: make the whole kjournald2 kthread NOFS safe

[1] http://lkml.kernel.org/r/20170106141107.23953-1-mhocko at kernel.org
[2] http://lkml.kernel.org/r/20170117030118.727jqyamjhojzajb at thunk.org



^ permalink raw reply	[flat|nested] 74+ messages in thread

* [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
  2017-02-06 14:07 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-02-06 14:07   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_WRITE		0x800000u
 #define ___GFP_KSWAPD_RECLAIM	0x1000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 45acb86d3f56..7dc2375c3ec7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9dc093d5ef39..7550be09f9d6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_WRITE		0x800000u
 #define ___GFP_KSWAPD_RECLAIM	0x1000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 45acb86d3f56..7dc2375c3ec7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9dc093d5ef39..7550be09f9d6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_WRITE		0x800000u
 #define ___GFP_KSWAPD_RECLAIM	0x1000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 45acb86d3f56..7dc2375c3ec7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9dc093d5ef39..7550be09f9d6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_WRITE		0x800000u
 #define ___GFP_KSWAPD_RECLAIM	0x1000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 45acb86d3f56..7dc2375c3ec7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9dc093d5ef39..7550be09f9d6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the
code to silence the lockdep by using GFP_NOFS even though the context
can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496 at dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G           O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
  [<ffffffff8110f369>] mark_held_locks+0x79/0xa0
  [<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
  [<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
  [<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
  [<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
  [<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
  [<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
  [<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
  [<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
  [<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
  [<ffffffff81244459>] SyS_ioctl+0x79/0x90
  [<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

       CPU0
       ----
  lock(&xfs_nondir_ilock_class);
  <Interrupt>
    lock(&xfs_nondir_ilock_class);

 *** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

 ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
 ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
 ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
 [<ffffffff813a14f9>] dump_stack+0x4b/0x72
 [<ffffffff8110ec65>] print_usage_bug+0x215/0x240
 [<ffffffff8110ee85>] mark_lock+0x1f5/0x660
 [<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
 [<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
 [<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
 [<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
 [<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffff8110aace>] down_write_nested+0x5e/0xc0
 [<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
 [<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
 [<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
 [<ffffffff8124d7d5>] evict+0xc5/0x190
 [<ffffffff8124d8d9>] dispose_list+0x39/0x60
 [<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
 [<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
 [<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
 [<ffffffff811e50ee>] shrink_zone+0x15e/0x170
 [<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
 [<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
 [<ffffffff810e6882>] kthread+0xf2/0x110
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
 [<ffffffff8184803f>] ret_from_fork+0x3f/0x70
 [<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add "I know what I am doing flag"
and mark problematic places. This would prevent from abusing GFP_NOFS
flag which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h      | 10 +++++++++-
 kernel/locking/lockdep.c |  4 ++++
 lib/radix-tree.c         |  2 ++
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index db373b9d3223..978232a3b4ae 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -40,6 +40,11 @@ struct vm_area_struct;
 #define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_WRITE		0x800000u
 #define ___GFP_KSWAPD_RECLAIM	0x1000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP	0x4000000u
+#else
+#define ___GFP_NOLOCKDEP	0
+#endif
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -179,8 +184,11 @@ struct vm_area_struct;
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
 /* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 25
+#define __GFP_BITS_SHIFT (25 + IS_ENABLED(CONFIG_LOCKDEP))
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 45acb86d3f56..7dc2375c3ec7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
 		return;
 
+	/* Disable lockdep if explicitly requested */
+	if (gfp_mask & __GFP_NOLOCKDEP)
+		return;
+
 	mark_held_locks(curr, RECLAIM_FS);
 }
 
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 9dc093d5ef39..7550be09f9d6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
 void __init radix_tree_init(void)
 {
 	int ret;
+
+	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
 	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
 			sizeof(struct radix_tree_node), 0,
 			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
  2017-02-06 14:07 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-02-06 14:07   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a4434297697..b3d41c1d67ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9159a1e4e838..5be9818e9bd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2312,6 +2312,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a4434297697..b3d41c1d67ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9159a1e4e838..5be9818e9bd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2312,6 +2312,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a4434297697..b3d41c1d67ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9159a1e4e838..5be9818e9bd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2312,6 +2312,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a4434297697..b3d41c1d67ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9159a1e4e838..5be9818e9bd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2312,6 +2312,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name PF_MEMALLOC_NOFS which is in line with an exiting
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage from the xfs code in the first
step before we introduce a full API for it as xfs uses the flag directly
anyway.

This patch doesn't introduce any functional change.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c             |  4 ++--
 fs/xfs/kmem.h             |  2 +-
 fs/xfs/libxfs/xfs_btree.c |  2 +-
 fs/xfs/xfs_aops.c         |  6 +++---
 fs/xfs/xfs_trans.c        | 12 ++++++------
 include/linux/sched.h     |  2 ++
 6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		noio_flag = memalloc_noio_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 		memalloc_noio_restore(noio_flag);
 
 	return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..d973dbfc2bfa 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_FSTRANS;
+	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 3a4434297697..b3d41c1d67ab 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	return 0;
 }
 
@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
 	 * Given that we do not allow direct reclaim to call us, we should
 	 * never be called while in a filesystem transaction.
 	 */
-	if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
 		goto redirty;
 
 	/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
 	bool		rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
 	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
 		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 			return -ENOSPC;
 		}
 		tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
 		tp->t_blk_res = 0;
 	}
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	return error;
 }
@@ -914,7 +914,7 @@ __xfs_trans_commit(
 
 	xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
 
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
 		if (commit_lsn == -1 && !error)
 			error = -EIO;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
 	xfs_trans_free(tp);
 
@@ -998,7 +998,7 @@ xfs_trans_cancel(
 		xfs_log_done(mp, tp->t_ticket, NULL, false);
 
 	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 
 	xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
 	xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9159a1e4e838..5be9818e9bd9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2312,6 +2312,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
+#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
+
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API
  2017-02-06 14:07 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-02-06 14:07   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
	- to prevent from deadlocks when the lock held by the allocation
	  context would be needed during the memory reclaim
	- to prevent from stack overflows during the reclaim because
	  the allocation is performed from a deep context already
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make a forward progress indirectly
	- just in case because this would be safe from the fs POV
	- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          | 10 ++++++----
 mm/vmscan.c              |  6 +++---
 6 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5be9818e9bd9..6573e9f04aed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2299,9 +2299,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2312,8 +2312,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2339,13 +2337,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2361,6 +2367,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc2375c3ec7..8ab2319dc398 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3358d4f7932..e35f3a750fb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3964,10 +3964,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constrains. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
@@ -7425,7 +7427,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
-		.gfp_mask = memalloc_noio_flags(gfp_mask),
+		.gfp_mask = current_gfp_context(gfp_mask),
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e60b58f37033..5119b8d983b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2947,7 +2947,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3027,7 +3027,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3721,7 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
	- to prevent from deadlocks when the lock held by the allocation
	  context would be needed during the memory reclaim
	- to prevent from stack overflows during the reclaim because
	  the allocation is performed from a deep context already
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make a forward progress indirectly
	- just in case because this would be safe from the fs POV
	- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          | 10 ++++++----
 mm/vmscan.c              |  6 +++---
 6 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5be9818e9bd9..6573e9f04aed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2299,9 +2299,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2312,8 +2312,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2339,13 +2337,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2361,6 +2367,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc2375c3ec7..8ab2319dc398 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3358d4f7932..e35f3a750fb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3964,10 +3964,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constrains. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
@@ -7425,7 +7427,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
-		.gfp_mask = memalloc_noio_flags(gfp_mask),
+		.gfp_mask = current_gfp_context(gfp_mask),
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e60b58f37033..5119b8d983b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2947,7 +2947,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3027,7 +3027,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3721,7 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
	- to prevent from deadlocks when the lock held by the allocation
	  context would be needed during the memory reclaim
	- to prevent from stack overflows during the reclaim because
	  the allocation is performed from a deep context already
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make a forward progress indirectly
	- just in case because this would be safe from the fs POV
	- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          | 10 ++++++----
 mm/vmscan.c              |  6 +++---
 6 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5be9818e9bd9..6573e9f04aed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2299,9 +2299,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2312,8 +2312,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2339,13 +2337,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2361,6 +2367,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc2375c3ec7..8ab2319dc398 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3358d4f7932..e35f3a750fb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3964,10 +3964,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constrains. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
@@ -7425,7 +7427,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
-		.gfp_mask = memalloc_noio_flags(gfp_mask),
+		.gfp_mask = current_gfp_context(gfp_mask),
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e60b58f37033..5119b8d983b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2947,7 +2947,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3027,7 +3027,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3721,7 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
	- to prevent from deadlocks when the lock held by the allocation
	  context would be needed during the memory reclaim
	- to prevent from stack overflows during the reclaim because
	  the allocation is performed from a deep context already
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make a forward progress indirectly
	- just in case because this would be safe from the fs POV
	- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          | 10 ++++++----
 mm/vmscan.c              |  6 +++---
 6 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5be9818e9bd9..6573e9f04aed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2299,9 +2299,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2312,8 +2312,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2339,13 +2337,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2361,6 +2367,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc2375c3ec7..8ab2319dc398 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3358d4f7932..e35f3a750fb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3964,10 +3964,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constrains. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
@@ -7425,7 +7427,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
-		.gfp_mask = memalloc_noio_flags(gfp_mask),
+		.gfp_mask = current_gfp_context(gfp_mask),
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e60b58f37033..5119b8d983b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2947,7 +2947,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3027,7 +3027,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3721,7 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 3/6] mm: introduce memalloc_nofs_{save, restore} API
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

GFP_NOFS context is used for the following 5 reasons currently
	- to prevent from deadlocks when the lock held by the allocation
	  context would be needed during the memory reclaim
	- to prevent from stack overflows during the reclaim because
	  the allocation is performed from a deep context already
	- to prevent lockups when the allocation context depends on
	  other reclaimers to make a forward progress indirectly
	- just in case because this would be safe from the fs POV
	- silence lockdep false positives

Unfortunately overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), OOM killer cannot be invoked because the MM layer
doesn't have enough information about how much memory is freeable by the
FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only this is easier to understand and maintain because there are
much less problematic contexts than specific allocation requests, this
also helps code paths where FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce memalloc_nofs_{save,restore} API to control the scope
of GFP_NOFS allocation context. This is basically copying
memalloc_noio_{save,restore} API we have for other restricted allocation
context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
just an alias for PF_FSTRANS which has been xfs specific until recently.
There are no more PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use a properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h            |  2 +-
 include/linux/gfp.h      |  8 ++++++++
 include/linux/sched.h    | 34 ++++++++++++++++++++++++++--------
 kernel/locking/lockdep.c |  2 +-
 mm/page_alloc.c          | 10 ++++++----
 mm/vmscan.c              |  6 +++---
 6 files changed, 45 insertions(+), 17 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d973dbfc2bfa..ae08cfd9552a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
 		lflags = GFP_ATOMIC | __GFP_NOWARN;
 	} else {
 		lflags = GFP_KERNEL | __GFP_NOWARN;
-		if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+		if (flags & KM_NOFS)
 			lflags &= ~__GFP_FS;
 	}
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 978232a3b4ae..2bfcfd33e476 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
  *
  * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
  *   that do not require the starting of any physical IO.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_noio_{save,restore} to mark the whole scope which cannot
+ *   perform any IO with a short explanation why. All allocation requests
+ *   will inherit GFP_NOIO implicitly.
  *
  * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *   Please try to avoid using this flag directly and instead use
+ *   memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ *   recurse into the FS layer with a short explanation why. All allocation
+ *   requests will inherit GFP_NOFS implicitly.
  *
  * GFP_USER is for userspace allocations that also need to be directly
  *   accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5be9818e9bd9..6573e9f04aed 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2299,9 +2299,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_USED_ASYNC	0x00004000	/* used async_schedule*(), used by module init */
 #define PF_NOFREEZE	0x00008000	/* this thread should not be frozen */
 #define PF_FROZEN	0x00010000	/* frozen for system suspend */
-#define PF_FSTRANS	0x00020000	/* inside a filesystem transaction */
-#define PF_KSWAPD	0x00040000	/* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000	/* Allocating memory without IO involved */
+#define PF_KSWAPD	0x00020000	/* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000	/* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000	/* All allocation requests will inherit GFP_NOIO */
 #define PF_LESS_THROTTLE 0x00100000	/* Throttle me less: I clean memory */
 #define PF_KTHREAD	0x00200000	/* I am a kernel thread */
 #define PF_RANDOMIZE	0x00400000	/* randomize virtual address space */
@@ -2312,8 +2312,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define PF_FREEZER_SKIP	0x40000000	/* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK 0x80000000      /* this thread called freeze_processes and should not be frozen */
 
-#define PF_MEMALLOC_NOFS PF_FSTRANS	/* Transition to a more generic GFP_NOFS scope semantic */
-
 /*
  * Only the _current_ task can read/write to tsk->flags, but other
  * tasks can access tsk->flags in readonly mode for example
@@ -2339,13 +2337,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
 #define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
 #define used_math() tsk_used_math(current)
 
-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
  */
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
 {
+	/*
+	 * NOIO implies both NOIO and NOFS and it is a weaker context
+	 * so always make sure it makes precendence
+	 */
 	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
 		flags &= ~(__GFP_IO | __GFP_FS);
+	else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+		flags &= ~__GFP_FS;
 	return flags;
 }
 
@@ -2361,6 +2367,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
 	current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
 }
 
+static inline unsigned int memalloc_nofs_save(void)
+{
+	unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+	current->flags |= PF_MEMALLOC_NOFS;
+	return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+	current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
 /* Per-process atomic flags. */
 #define PFA_NO_NEW_PRIVS 0	/* May not gain new privileges. */
 #define PFA_SPREAD_PAGE  1      /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7dc2375c3ec7..8ab2319dc398 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-	if (!(gfp_mask & __GFP_FS))
+	if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
 		return;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3358d4f7932..e35f3a750fb9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3964,10 +3964,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		goto out;
 
 	/*
-	 * Runtime PM, block IO and its error handling path can deadlock
-	 * because I/O on the device might not complete.
+	 * Apply scoped allocation constrains. This is mainly about
+	 * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+	 * allocation requests from a particular context which has
+	 * been marked by memalloc_no{fs,io}_{save,restore}
 	 */
-	alloc_mask = memalloc_noio_flags(gfp_mask);
+	alloc_mask = current_gfp_context(gfp_mask);
 	ac.spread_dirty_pages = false;
 
 	/*
@@ -7425,7 +7427,7 @@ int alloc_contig_range(unsigned long start, unsigned long end,
 		.zone = page_zone(pfn_to_page(start)),
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
-		.gfp_mask = memalloc_noio_flags(gfp_mask),
+		.gfp_mask = current_gfp_context(gfp_mask),
 	};
 	INIT_LIST_HEAD(&cc.migratepages);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e60b58f37033..5119b8d983b2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2947,7 +2947,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 	unsigned long nr_reclaimed;
 	struct scan_control sc = {
 		.nr_to_reclaim = SWAP_CLUSTER_MAX,
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.reclaim_idx = gfp_zone(gfp_mask),
 		.order = order,
 		.nodemask = nodemask,
@@ -3027,7 +3027,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	int nid;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.target_mem_cgroup = memcg,
@@ -3721,7 +3721,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	int classzone_idx = gfp_zone(gfp_mask);
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
-		.gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+		.gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
 		.order = order,
 		.priority = NODE_RECLAIM_PRIORITY,
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 14:07 ` Michal Hocko
                     ` (3 preceding siblings ...)
  (?)
@ 2017-02-06 14:07   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 12 ++++++------
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 12 ++++++------
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 12 ++++++------
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 12 ++++++------
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 12 ++++++------
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
API to prevent from reclaim recursion into the fs because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from the NOFS contexts. The memalloc_noio_save will enforce
GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
provide exactly what we need here - implicit GFP_NOFS context.

Changes since v1
- s at memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
  as per Brian Foster

Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.c    | 12 ++++++------
 fs/xfs/xfs_buf.c |  8 ++++----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..0c9f94f41b6c 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
 void *
 kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 {
-	unsigned noio_flag = 0;
+	unsigned nofs_flag = 0;
 	void	*ptr;
 	gfp_t	lflags;
 
@@ -77,17 +77,17 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
 	 * __vmalloc() will allocate data pages and auxillary structures (e.g.
 	 * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
 	 * here. Hence we need to tell memory reclaim that we are in such a
-	 * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
+	 * context via PF_MEMALLOC_NOFS to prevent memory reclaim re-entering
 	 * the filesystem here and potentially deadlocking.
 	 */
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		noio_flag = memalloc_noio_save();
+	if (flags & KM_NOFS)
+		nofs_flag = memalloc_nofs_save();
 
 	lflags = kmem_flags_convert(flags);
 	ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
 
-	if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
-		memalloc_noio_restore(noio_flag);
+	if (flags & KM_NOFS)
+		memalloc_nofs_restore(nofs_flag);
 
 	return ptr;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8c7d01b75922..676a9ae75b9a 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -442,17 +442,17 @@ _xfs_buf_map_pages(
 		bp->b_addr = NULL;
 	} else {
 		int retried = 0;
-		unsigned noio_flag;
+		unsigned nofs_flag;
 
 		/*
 		 * vm_map_ram() will allocate auxillary structures (e.g.
 		 * pagetables) with GFP_KERNEL, yet we are likely to be under
 		 * GFP_NOFS context here. Hence we need to tell memory reclaim
-		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
 		 * memory reclaim re-entering the filesystem here and
 		 * potentially deadlocking.
 		 */
-		noio_flag = memalloc_noio_save();
+		nofs_flag = memalloc_nofs_save();
 		do {
 			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
 						-1, PAGE_KERNEL);
@@ -460,7 +460,7 @@ _xfs_buf_map_pages(
 				break;
 			vm_unmap_aliases();
 		} while (retried++ <= 1);
-		memalloc_noio_restore(noio_flag);
+		memalloc_nofs_restore(nofs_flag);
 
 		if (!bp->b_addr)
 			return -ENOMEM;
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context
  2017-02-06 14:07 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-02-06 14:07   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/transaction.c | 11 +++++++++++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
 	jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocations done while the transaction is
+	 * open is going to recurse back to the fs layer.
+	 */
+	handle->saved_alloc_context = memalloc_nofs_save();
 	return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
 				line_no, nblocks);
+
 	return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_rsv_handle)
 		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+	/*
+	 * scope of th GFP_NOFS context is over here and so we can
+	 * restore the original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
 	jbd2_free_handle(handle);
 	return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
 	unsigned long		h_start_jiffies;
 	unsigned int		h_requested_credits;
+
+	unsigned int		saved_alloc_context;
 };
 
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/transaction.c | 11 +++++++++++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
 	jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocations done while the transaction is
+	 * open is going to recurse back to the fs layer.
+	 */
+	handle->saved_alloc_context = memalloc_nofs_save();
 	return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
 				line_no, nblocks);
+
 	return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_rsv_handle)
 		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+	/*
+	 * scope of th GFP_NOFS context is over here and so we can
+	 * restore the original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
 	jbd2_free_handle(handle);
 	return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
 	unsigned long		h_start_jiffies;
 	unsigned int		h_requested_credits;
+
+	unsigned int		saved_alloc_context;
 };
 
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/transaction.c | 11 +++++++++++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
 	jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocations done while the transaction is
+	 * open is going to recurse back to the fs layer.
+	 */
+	handle->saved_alloc_context = memalloc_nofs_save();
 	return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
 				line_no, nblocks);
+
 	return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_rsv_handle)
 		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+	/*
+	 * scope of th GFP_NOFS context is over here and so we can
+	 * restore the original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
 	jbd2_free_handle(handle);
 	return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
 	unsigned long		h_start_jiffies;
 	unsigned int		h_requested_credits;
+
+	unsigned int		saved_alloc_context;
 };
 
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/transaction.c | 11 +++++++++++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
 	jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocations done while the transaction is
+	 * open is going to recurse back to the fs layer.
+	 */
+	handle->saved_alloc_context = memalloc_nofs_save();
 	return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
 				line_no, nblocks);
+
 	return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_rsv_handle)
 		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+	/*
+	 * scope of th GFP_NOFS context is over here and so we can
+	 * restore the original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
 	jbd2_free_handle(handle);
 	return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
 	unsigned long		h_start_jiffies;
 	unsigned int		h_requested_credits;
+
+	unsigned int		saved_alloc_context;
 };
 
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

now that we have memalloc_nofs_{save,restore} api we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/transaction.c | 11 +++++++++++
 include/linux/jbd2.h  |  2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
 
 	rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
 	jbd2_journal_free_transaction(new_transaction);
+	/*
+	 * Make sure that no allocations done while the transaction is
+	 * open is going to recurse back to the fs layer.
+	 */
+	handle->saved_alloc_context = memalloc_nofs_save();
 	return 0;
 }
 
@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
 	trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
 				handle->h_transaction->t_tid, type,
 				line_no, nblocks);
+
 	return handle;
 }
 EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
 	if (handle->h_rsv_handle)
 		jbd2_journal_free_reserved(handle->h_rsv_handle);
 free_and_exit:
+	/*
+	 * scope of th GFP_NOFS context is over here and so we can
+	 * restore the original alloc context.
+	 */
+	memalloc_nofs_restore(handle->saved_alloc_context);
 	jbd2_free_handle(handle);
 	return err;
 }
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle
 
 	unsigned long		h_start_jiffies;
 	unsigned int		h_requested_credits;
+
+	unsigned int		saved_alloc_context;
 };
 
 
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe
  2017-02-06 14:07 ` Michal Hocko
                     ` (2 preceding siblings ...)
  (?)
@ 2017-02-06 14:07   ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 704139625fbe..662531a70ce1 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
 	wake_up(&journal->j_wait_done_commit);
 
 	/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the trasn. commit.
+	 */
+	memalloc_nofs_save();
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	write_lock(&journal->j_state_lock);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 704139625fbe..662531a70ce1 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
 	wake_up(&journal->j_wait_done_commit);
 
 	/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the trasn. commit.
+	 */
+	memalloc_nofs_save();
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	write_lock(&journal->j_state_lock);
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 704139625fbe..662531a70ce1 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
 	wake_up(&journal->j_wait_done_commit);
 
 	/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the trasn. commit.
+	 */
+	memalloc_nofs_save();
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	write_lock(&journal->j_state_lock);
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel
  Cc: linux-mm, Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 704139625fbe..662531a70ce1 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
 	wake_up(&journal->j_wait_done_commit);
 
 	/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the trasn. commit.
+	 */
+	memalloc_nofs_save();
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	write_lock(&journal->j_state_lock);
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe
@ 2017-02-06 14:07   ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

From: Michal Hocko <mhocko@suse.com>

kjournald2 is central to the transaction commit processing. As such any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.

Suggested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/jbd2/journal.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 704139625fbe..662531a70ce1 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
 	wake_up(&journal->j_wait_done_commit);
 
 	/*
+	 * Make sure that no allocations from this kernel thread will ever recurse
+	 * to the fs layer because we are responsible for the transaction commit
+	 * and any fs involvement might get stuck waiting for the trasn. commit.
+	 */
+	memalloc_nofs_save();
+
+	/*
 	 * And now, wait forever for commit wakeup events.
 	 */
 	write_lock(&journal->j_state_lock);
-- 
2.11.0



^ permalink raw reply related	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
  2017-02-06 14:07   ` Michal Hocko
  (?)
@ 2017-02-06 14:26     ` Matthew Wilcox
  -1 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 14:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> While we are at it also make sure that the radix tree doesn't
> accidentaly override tags stored in the upper part of the gfp_mask.

> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 9dc093d5ef39..7550be09f9d6 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
>  void __init radix_tree_init(void)
>  {
>  	int ret;
> +
> +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
>  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
>  			sizeof(struct radix_tree_node), 0,
>  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,

That's going to have a conceptual conflict with some patches I have
in flight.  I'll take this part through my radix tree patch collection.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:26     ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 14:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> While we are at it also make sure that the radix tree doesn't
> accidentaly override tags stored in the upper part of the gfp_mask.

> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 9dc093d5ef39..7550be09f9d6 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
>  void __init radix_tree_init(void)
>  {
>  	int ret;
> +
> +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
>  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
>  			sizeof(struct radix_tree_node), 0,
>  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,

That's going to have a conceptual conflict with some patches I have
in flight.  I'll take this part through my radix tree patch collection.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:26     ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 14:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> While we are at it also make sure that the radix tree doesn't
> accidentaly override tags stored in the upper part of the gfp_mask.

> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 9dc093d5ef39..7550be09f9d6 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
>  void __init radix_tree_init(void)
>  {
>  	int ret;
> +
> +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
>  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
>  			sizeof(struct radix_tree_node), 0,
>  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,

That's going to have a conceptual conflict with some patches I have
in flight.  I'll take this part through my radix tree patch collection.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
  2017-02-06 14:26     ` Matthew Wilcox
  (?)
@ 2017-02-06 14:34       ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 06-02-17 06:26:41, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> > While we are at it also make sure that the radix tree doesn't
> > accidentaly override tags stored in the upper part of the gfp_mask.
> 
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 9dc093d5ef39..7550be09f9d6 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
> >  void __init radix_tree_init(void)
> >  {
> >  	int ret;
> > +
> > +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
> >  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
> >  			sizeof(struct radix_tree_node), 0,
> >  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> 
> That's going to have a conceptual conflict with some patches I have
> in flight.  I'll take this part through my radix tree patch collection.

This part is not needed for the patch, strictly speaking but I wanted to
make the code more future proof.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:34       ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 06-02-17 06:26:41, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> > While we are at it also make sure that the radix tree doesn't
> > accidentaly override tags stored in the upper part of the gfp_mask.
> 
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 9dc093d5ef39..7550be09f9d6 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
> >  void __init radix_tree_init(void)
> >  {
> >  	int ret;
> > +
> > +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
> >  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
> >  			sizeof(struct radix_tree_node), 0,
> >  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> 
> That's going to have a conceptual conflict with some patches I have
> in flight.  I'll take this part through my radix tree patch collection.

This part is not needed for the patch, strictly speaking but I wanted to
make the code more future proof.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 14:34       ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 14:34 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon 06-02-17 06:26:41, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:13PM +0100, Michal Hocko wrote:
> > While we are at it also make sure that the radix tree doesn't
> > accidentaly override tags stored in the upper part of the gfp_mask.
> 
> > diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> > index 9dc093d5ef39..7550be09f9d6 100644
> > --- a/lib/radix-tree.c
> > +++ b/lib/radix-tree.c
> > @@ -2274,6 +2274,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
> >  void __init radix_tree_init(void)
> >  {
> >  	int ret;
> > +
> > +	BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
> >  	radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
> >  			sizeof(struct radix_tree_node), 0,
> >  			SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
> 
> That's going to have a conceptual conflict with some patches I have
> in flight.  I'll take this part through my radix tree patch collection.

This part is not needed for the patch, strictly speaking but I wanted to
make the code more future proof.

-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
  2017-02-06 14:34       ` Michal Hocko
  (?)
@ 2017-02-06 15:24         ` Matthew Wilcox
  -1 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> This part is not needed for the patch, strictly speaking but I wanted to
> make the code more future proof.

Understood.  I took an extra bit myself for marking the radix tree as
being used for an IDR (so the radix tree now uses 4 bits).  I see you
already split out the address space GFP mask from the other flags :-)
I would prefer not to do that with the radix tree, but I understand
your desire for more GFP bits.  I'm not entirely sure that an implicit
gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
to change all the callers.  Anyway, I'm going to update your line here
for my current tree and add the build bug so we'll know if we ever hit
any problems.

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 15:24         ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:24 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> This part is not needed for the patch, strictly speaking but I wanted to
> make the code more future proof.

Understood.  I took an extra bit myself for marking the radix tree as
being used for an IDR (so the radix tree now uses 4 bits).  I see you
already split out the address space GFP mask from the other flags :-)
I would prefer not to do that with the radix tree, but I understand
your desire for more GFP bits.  I'm not entirely sure that an implicit
gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
to change all the callers.  Anyway, I'm going to update your line here
for my current tree and add the build bug so we'll know if we ever hit
any problems.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 15:24         ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:24 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> This part is not needed for the patch, strictly speaking but I wanted to
> make the code more future proof.

Understood.  I took an extra bit myself for marking the radix tree as
being used for an IDR (so the radix tree now uses 4 bits).  I see you
already split out the address space GFP mask from the other flags :-)
I would prefer not to do that with the radix tree, but I understand
your desire for more GFP bits.  I'm not entirely sure that an implicit
gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
to change all the callers.  Anyway, I'm going to update your line here
for my current tree and add the build bug so we'll know if we ever hit
any problems.



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
  2017-02-06 15:24         ` Matthew Wilcox
  (?)
@ 2017-02-06 15:30           ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 15:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 06-02-17 07:24:00, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> > This part is not needed for the patch, strictly speaking but I wanted to
> > make the code more future proof.
> 
> Understood.  I took an extra bit myself for marking the radix tree as
> being used for an IDR (so the radix tree now uses 4 bits).  I see you
> already split out the address space GFP mask from the other flags :-)
> I would prefer not to do that with the radix tree, but I understand
> your desire for more GFP bits.  I'm not entirely sure that an implicit
> gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
> to change all the callers.  Anyway, I'm going to update your line here
> for my current tree and add the build bug so we'll know if we ever hit
> any problems.

OK, do I get it right that the patch can stay as is and go to Andrew?
I would really like to not rebase the patch again for something that is
not merged yet. I really hope for getting this merged finally...

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 15:30           ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 15:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 06-02-17 07:24:00, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> > This part is not needed for the patch, strictly speaking but I wanted to
> > make the code more future proof.
> 
> Understood.  I took an extra bit myself for marking the radix tree as
> being used for an IDR (so the radix tree now uses 4 bits).  I see you
> already split out the address space GFP mask from the other flags :-)
> I would prefer not to do that with the radix tree, but I understand
> your desire for more GFP bits.  I'm not entirely sure that an implicit
> gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
> to change all the callers.  Anyway, I'm going to update your line here
> for my current tree and add the build bug so we'll know if we ever hit
> any problems.

OK, do I get it right that the patch can stay as is and go to Andrew?
I would really like to not rebase the patch again for something that is
not merged yet. I really hope for getting this merged finally...

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 1/6] lockdep: allow to disable reclaim lockup detection
@ 2017-02-06 15:30           ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 15:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon 06-02-17 07:24:00, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:34:50PM +0100, Michal Hocko wrote:
> > This part is not needed for the patch, strictly speaking but I wanted to
> > make the code more future proof.
> 
> Understood.  I took an extra bit myself for marking the radix tree as
> being used for an IDR (so the radix tree now uses 4 bits).  I see you
> already split out the address space GFP mask from the other flags :-)
> I would prefer not to do that with the radix tree, but I understand
> your desire for more GFP bits.  I'm not entirely sure that an implicit
> gfpmask makes a lot of sense for the radix tree, but it'd be a big effort
> to change all the callers.  Anyway, I'm going to update your line here
> for my current tree and add the build bug so we'll know if we ever hit
> any problems.

OK, do I get it right that the patch can stay as is and go to Andrew?
I would really like to not rebase the patch again for something that is
not merged yet. I really hope for getting this merged finally...

-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 14:07   ` Michal Hocko
  (?)
@ 2017-02-06 15:39     ` Matthew Wilcox
  -1 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> +++ b/fs/xfs/xfs_buf.c
> @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */

This comment feels out of date ... how about:

		/*
		 * vm_map_ram will allocate auxiliary structures (eg page
		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
		 * by calling back into this filesystem, we may deadlock.
		 * Prevent that by setting the NOFS flag.
		 */

> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);

Also, I think it shows that this is the wrong place in XFS to be calling
memalloc_nofs_save().  I'm not arguing against including this patch;
it's a step towards where we want to be.  I also don't know XFS well
enough to know where to set that flag ;-)  Presumably when we start a
transaction ... ?

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 15:39     ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML, Michal Hocko

On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> +++ b/fs/xfs/xfs_buf.c
> @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */

This comment feels out of date ... how about:

		/*
		 * vm_map_ram will allocate auxiliary structures (eg page
		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
		 * by calling back into this filesystem, we may deadlock.
		 * Prevent that by setting the NOFS flag.
		 */

> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);

Also, I think it shows that this is the wrong place in XFS to be calling
memalloc_nofs_save().  I'm not arguing against including this patch;
it's a step towards where we want to be.  I also don't know XFS well
enough to know where to set that flag ;-)  Presumably when we start a
transaction ... ?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 15:39     ` Matthew Wilcox
  0 siblings, 0 replies; 74+ messages in thread
From: Matthew Wilcox @ 2017-02-06 15:39 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> +++ b/fs/xfs/xfs_buf.c
> @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
>  		bp->b_addr = NULL;
>  	} else {
>  		int retried = 0;
> -		unsigned noio_flag;
> +		unsigned nofs_flag;
>  
>  		/*
>  		 * vm_map_ram() will allocate auxillary structures (e.g.
>  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
>  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>  		 * memory reclaim re-entering the filesystem here and
>  		 * potentially deadlocking.
>  		 */

This comment feels out of date ... how about:

		/*
		 * vm_map_ram will allocate auxiliary structures (eg page
		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
		 * by calling back into this filesystem, we may deadlock.
		 * Prevent that by setting the NOFS flag.
		 */

> -		noio_flag = memalloc_noio_save();
> +		nofs_flag = memalloc_nofs_save();
>  		do {
>  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>  						-1, PAGE_KERNEL);

Also, I think it shows that this is the wrong place in XFS to be calling
memalloc_nofs_save().  I'm not arguing against including this patch;
it's a step towards where we want to be.  I also don't know XFS well
enough to know where to set that flag ;-)  Presumably when we start a
transaction ... ?



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 15:39     ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Matthew Wilcox
  (?)
@ 2017-02-06 17:44       ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 17:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> >  		bp->b_addr = NULL;
> >  	} else {
> >  		int retried = 0;
> > -		unsigned noio_flag;
> > +		unsigned nofs_flag;
> >  
> >  		/*
> >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> >  		 * memory reclaim re-entering the filesystem here and
> >  		 * potentially deadlocking.
> >  		 */
> 
> This comment feels out of date ... how about:

which part is out of date?

> 
> 		/*
> 		 * vm_map_ram will allocate auxiliary structures (eg page
> 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> 		 * by calling back into this filesystem, we may deadlock.
> 		 * Prevent that by setting the NOFS flag.
> 		 */

dunno, the previous wording seems clear enough to me. Maybe little bit
more chatty than yours but I am not sure this is worth changing.

> 
> > -		noio_flag = memalloc_noio_save();
> > +		nofs_flag = memalloc_nofs_save();
> >  		do {
> >  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> >  						-1, PAGE_KERNEL);
> 
> Also, I think it shows that this is the wrong place in XFS to be calling
> memalloc_nofs_save().  I'm not arguing against including this patch;
> it's a step towards where we want to be.  I also don't know XFS well
> enough to know where to set that flag ;-)  Presumably when we start a
> transaction ... ?

Yes that is what I would like to achieve longterm. And the reason why I
didn't want to mimic this pattern in kvmalloc as some have suggested.
It just takes much more time to get there from the past experience and
we should really start somewhere.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 17:44       ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 17:44 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-fsdevel, linux-mm, Dave Chinner, djwong,
	Theodore Ts'o, Chris Mason, David Sterba, Jan Kara,
	ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> >  		bp->b_addr = NULL;
> >  	} else {
> >  		int retried = 0;
> > -		unsigned noio_flag;
> > +		unsigned nofs_flag;
> >  
> >  		/*
> >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> >  		 * memory reclaim re-entering the filesystem here and
> >  		 * potentially deadlocking.
> >  		 */
> 
> This comment feels out of date ... how about:

which part is out of date?

> 
> 		/*
> 		 * vm_map_ram will allocate auxiliary structures (eg page
> 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> 		 * by calling back into this filesystem, we may deadlock.
> 		 * Prevent that by setting the NOFS flag.
> 		 */

dunno, the previous wording seems clear enough to me. Maybe little bit
more chatty than yours but I am not sure this is worth changing.

> 
> > -		noio_flag = memalloc_noio_save();
> > +		nofs_flag = memalloc_nofs_save();
> >  		do {
> >  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> >  						-1, PAGE_KERNEL);
> 
> Also, I think it shows that this is the wrong place in XFS to be calling
> memalloc_nofs_save().  I'm not arguing against including this patch;
> it's a step towards where we want to be.  I also don't know XFS well
> enough to know where to set that flag ;-)  Presumably when we start a
> transaction ... ?

Yes that is what I would like to achieve longterm. And the reason why I
didn't want to mimic this pattern in kvmalloc as some have suggested.
It just takes much more time to get there from the past experience and
we should really start somewhere.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 17:44       ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 17:44 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> >  		bp->b_addr = NULL;
> >  	} else {
> >  		int retried = 0;
> > -		unsigned noio_flag;
> > +		unsigned nofs_flag;
> >  
> >  		/*
> >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> >  		 * memory reclaim re-entering the filesystem here and
> >  		 * potentially deadlocking.
> >  		 */
> 
> This comment feels out of date ... how about:

which part is out of date?

> 
> 		/*
> 		 * vm_map_ram will allocate auxiliary structures (eg page
> 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> 		 * by calling back into this filesystem, we may deadlock.
> 		 * Prevent that by setting the NOFS flag.
> 		 */

dunno, the previous wording seems clear enough to me. Maybe little bit
more chatty than yours but I am not sure this is worth changing.

> 
> > -		noio_flag = memalloc_noio_save();
> > +		nofs_flag = memalloc_nofs_save();
> >  		do {
> >  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> >  						-1, PAGE_KERNEL);
> 
> Also, I think it shows that this is the wrong place in XFS to be calling
> memalloc_nofs_save().  I'm not arguing against including this patch;
> it's a step towards where we want to be.  I also don't know XFS well
> enough to know where to set that flag ;-)  Presumably when we start a
> transaction ... ?

Yes that is what I would like to achieve longterm. And the reason why I
didn't want to mimic this pattern in kvmalloc as some have suggested.
It just takes much more time to get there from the past experience and
we should really start somewhere.
-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 17:44       ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
  (?)
@ 2017-02-06 18:32         ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2017-02-06 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > >  		bp->b_addr = NULL;
> > >  	} else {
> > >  		int retried = 0;
> > > -		unsigned noio_flag;
> > > +		unsigned nofs_flag;
> > >  
> > >  		/*
> > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > >  		 * memory reclaim re-entering the filesystem here and
> > >  		 * potentially deadlocking.
> > >  		 */
> > 
> > This comment feels out of date ... how about:
> 
> which part is out of date?
> 
> > 
> > 		/*
> > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > 		 * by calling back into this filesystem, we may deadlock.
> > 		 * Prevent that by setting the NOFS flag.
> > 		 */
> 
> dunno, the previous wording seems clear enough to me. Maybe little bit
> more chatty than yours but I am not sure this is worth changing.

I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
wording of the old comment because it captures the uncertainty of
whether or not we actually are already under NOFS.  If someone actually
has audited this code well enough to know for sure then yes let's change
the comment, but I haven't gone that far.

The way the kmem_zalloc_large code is structured suggests to me that
callers don't have to be especially aware of the NOFS state -- they can
just call the function and it'll take care of making it work.

> > 
> > > -		noio_flag = memalloc_noio_save();
> > > +		nofs_flag = memalloc_nofs_save();
> > >  		do {
> > >  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > >  						-1, PAGE_KERNEL);
> > 
> > Also, I think it shows that this is the wrong place in XFS to be calling
> > memalloc_nofs_save().  I'm not arguing against including this patch;
> > it's a step towards where we want to be.  I also don't know XFS well
> > enough to know where to set that flag ;-)  Presumably when we start a
> > transaction ... ?

None of the current kmem_zalloc_large callers actually have a
transaction, at least not at that point.

> Yes that is what I would like to achieve longterm. And the reason why I
> didn't want to mimic this pattern in kvmalloc as some have suggested.
> It just takes much more time to get there from the past experience and
> we should really start somewhere.

--D

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 18:32         ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2017-02-06 18:32 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > >  		bp->b_addr = NULL;
> > >  	} else {
> > >  		int retried = 0;
> > > -		unsigned noio_flag;
> > > +		unsigned nofs_flag;
> > >  
> > >  		/*
> > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > >  		 * memory reclaim re-entering the filesystem here and
> > >  		 * potentially deadlocking.
> > >  		 */
> > 
> > This comment feels out of date ... how about:
> 
> which part is out of date?
> 
> > 
> > 		/*
> > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > 		 * by calling back into this filesystem, we may deadlock.
> > 		 * Prevent that by setting the NOFS flag.
> > 		 */
> 
> dunno, the previous wording seems clear enough to me. Maybe little bit
> more chatty than yours but I am not sure this is worth changing.

I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
wording of the old comment because it captures the uncertainty of
whether or not we actually are already under NOFS.  If someone actually
has audited this code well enough to know for sure then yes let's change
the comment, but I haven't gone that far.

The way the kmem_zalloc_large code is structured suggests to me that
callers don't have to be especially aware of the NOFS state -- they can
just call the function and it'll take care of making it work.

> > 
> > > -		noio_flag = memalloc_noio_save();
> > > +		nofs_flag = memalloc_nofs_save();
> > >  		do {
> > >  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > >  						-1, PAGE_KERNEL);
> > 
> > Also, I think it shows that this is the wrong place in XFS to be calling
> > memalloc_nofs_save().  I'm not arguing against including this patch;
> > it's a step towards where we want to be.  I also don't know XFS well
> > enough to know where to set that flag ;-)  Presumably when we start a
> > transaction ... ?

None of the current kmem_zalloc_large callers actually have a
transaction, at least not at that point.

> Yes that is what I would like to achieve longterm. And the reason why I
> didn't want to mimic this pattern in kvmalloc as some have suggested.
> It just takes much more time to get there from the past experience and
> we should really start somewhere.

--D

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 18:32         ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2017-02-06 18:32 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > +++ b/fs/xfs/xfs_buf.c
> > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > >  		bp->b_addr = NULL;
> > >  	} else {
> > >  		int retried = 0;
> > > -		unsigned noio_flag;
> > > +		unsigned nofs_flag;
> > >  
> > >  		/*
> > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > >  		 * memory reclaim re-entering the filesystem here and
> > >  		 * potentially deadlocking.
> > >  		 */
> > 
> > This comment feels out of date ... how about:
> 
> which part is out of date?
> 
> > 
> > 		/*
> > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > 		 * by calling back into this filesystem, we may deadlock.
> > 		 * Prevent that by setting the NOFS flag.
> > 		 */
> 
> dunno, the previous wording seems clear enough to me. Maybe little bit
> more chatty than yours but I am not sure this is worth changing.

I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
wording of the old comment because it captures the uncertainty of
whether or not we actually are already under NOFS.  If someone actually
has audited this code well enough to know for sure then yes let's change
the comment, but I haven't gone that far.

The way the kmem_zalloc_large code is structured suggests to me that
callers don't have to be especially aware of the NOFS state -- they can
just call the function and it'll take care of making it work.

> > 
> > > -		noio_flag = memalloc_noio_save();
> > > +		nofs_flag = memalloc_nofs_save();
> > >  		do {
> > >  			bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> > >  						-1, PAGE_KERNEL);
> > 
> > Also, I think it shows that this is the wrong place in XFS to be calling
> > memalloc_nofs_save().  I'm not arguing against including this patch;
> > it's a step towards where we want to be.  I also don't know XFS well
> > enough to know where to set that flag ;-)  Presumably when we start a
> > transaction ... ?

None of the current kmem_zalloc_large callers actually have a
transaction, at least not at that point.

> Yes that is what I would like to achieve longterm. And the reason why I
> didn't want to mimic this pattern in kvmalloc as some have suggested.
> It just takes much more time to get there from the past experience and
> we should really start somewhere.

--D

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 18:32         ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Darrick J. Wong
  (?)
@ 2017-02-06 18:47           ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 18:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > >  		bp->b_addr = NULL;
> > > >  	} else {
> > > >  		int retried = 0;
> > > > -		unsigned noio_flag;
> > > > +		unsigned nofs_flag;
> > > >  
> > > >  		/*
> > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > >  		 * memory reclaim re-entering the filesystem here and
> > > >  		 * potentially deadlocking.
> > > >  		 */
> > > 
> > > This comment feels out of date ... how about:
> > 
> > which part is out of date?
> > 
> > > 
> > > 		/*
> > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > 		 * by calling back into this filesystem, we may deadlock.
> > > 		 * Prevent that by setting the NOFS flag.
> > > 		 */
> > 
> > dunno, the previous wording seems clear enough to me. Maybe little bit
> > more chatty than yours but I am not sure this is worth changing.
> 
> I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> wording of the old comment because it captures the uncertainty of
> whether or not we actually are already under NOFS.  If someone actually
> has audited this code well enough to know for sure then yes let's change
> the comment, but I haven't gone that far.

I believe we can drop the memalloc_nofs_save then as well because either
we are called from a potentially dangerous context and thus we are in
the nofs scope we we do not need the protection at all.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 18:47           ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 18:47 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > >  		bp->b_addr = NULL;
> > > >  	} else {
> > > >  		int retried = 0;
> > > > -		unsigned noio_flag;
> > > > +		unsigned nofs_flag;
> > > >  
> > > >  		/*
> > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > >  		 * memory reclaim re-entering the filesystem here and
> > > >  		 * potentially deadlocking.
> > > >  		 */
> > > 
> > > This comment feels out of date ... how about:
> > 
> > which part is out of date?
> > 
> > > 
> > > 		/*
> > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > 		 * by calling back into this filesystem, we may deadlock.
> > > 		 * Prevent that by setting the NOFS flag.
> > > 		 */
> > 
> > dunno, the previous wording seems clear enough to me. Maybe little bit
> > more chatty than yours but I am not sure this is worth changing.
> 
> I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> wording of the old comment because it captures the uncertainty of
> whether or not we actually are already under NOFS.  If someone actually
> has audited this code well enough to know for sure then yes let's change
> the comment, but I haven't gone that far.

I believe we can drop the memalloc_nofs_save then as well because either
we are called from a potentially dangerous context and thus we are in
the nofs scope we we do not need the protection at all.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 18:47           ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 18:47 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > +++ b/fs/xfs/xfs_buf.c
> > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > >  		bp->b_addr = NULL;
> > > >  	} else {
> > > >  		int retried = 0;
> > > > -		unsigned noio_flag;
> > > > +		unsigned nofs_flag;
> > > >  
> > > >  		/*
> > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > >  		 * memory reclaim re-entering the filesystem here and
> > > >  		 * potentially deadlocking.
> > > >  		 */
> > > 
> > > This comment feels out of date ... how about:
> > 
> > which part is out of date?
> > 
> > > 
> > > 		/*
> > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > 		 * by calling back into this filesystem, we may deadlock.
> > > 		 * Prevent that by setting the NOFS flag.
> > > 		 */
> > 
> > dunno, the previous wording seems clear enough to me. Maybe little bit
> > more chatty than yours but I am not sure this is worth changing.
> 
> I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> wording of the old comment because it captures the uncertainty of
> whether or not we actually are already under NOFS.  If someone actually
> has audited this code well enough to know for sure then yes let's change
> the comment, but I haven't gone that far.

I believe we can drop the memalloc_nofs_save then as well because either
we are called from a potentially dangerous context and thus we are in
the nofs scope we we do not need the protection at all.
-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 18:47           ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
  (?)
@ 2017-02-06 19:51             ` Darrick J. Wong
  -1 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2017-02-06 19:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >  		bp->b_addr = NULL;
> > > > >  	} else {
> > > > >  		int retried = 0;
> > > > > -		unsigned noio_flag;
> > > > > +		unsigned nofs_flag;
> > > > >  
> > > > >  		/*
> > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > >  		 * potentially deadlocking.
> > > > >  		 */
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > 		/*
> > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > 		 * Prevent that by setting the NOFS flag.
> > > > 		 */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.

Ugh, /me hands himself another cup of coffee...

Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
discussion.  Probably because they have similar code snippets with very
similar comments to two totally different parts of xfs.

The _xfs_buf_map_pages can be called inside or outside of
transaction context, so I think we still have to memalloc_nofs_save for
that to avoid the lockdep complaints and deadlocks referenced in the
commit that added all that (to _xfs_buf_map_pages) in the first place.
ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

My comments about kmem_zalloc_large still stand, even though the part
of the patch you two were discussing was the _xfs_buf_map_pages.  I
probably should have clarified that I think both functions actually
/are/ doing the right thing wrt calling (or not calling)
memalloc_nofs_save().

> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

Uh, now that I've muddied up the waters, which part are you referring to?

--D

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 19:51             ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2017-02-06 19:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >  		bp->b_addr = NULL;
> > > > >  	} else {
> > > > >  		int retried = 0;
> > > > > -		unsigned noio_flag;
> > > > > +		unsigned nofs_flag;
> > > > >  
> > > > >  		/*
> > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > >  		 * potentially deadlocking.
> > > > >  		 */
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > 		/*
> > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > 		 * Prevent that by setting the NOFS flag.
> > > > 		 */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.

Ugh, /me hands himself another cup of coffee...

Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
discussion.  Probably because they have similar code snippets with very
similar comments to two totally different parts of xfs.

The _xfs_buf_map_pages can be called inside or outside of
transaction context, so I think we still have to memalloc_nofs_save for
that to avoid the lockdep complaints and deadlocks referenced in the
commit that added all that (to _xfs_buf_map_pages) in the first place.
ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

My comments about kmem_zalloc_large still stand, even though the part
of the patch you two were discussing was the _xfs_buf_map_pages.  I
probably should have clarified that I think both functions actually
/are/ doing the right thing wrt calling (or not calling)
memalloc_nofs_save().

> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

Uh, now that I've muddied up the waters, which part are you referring to?

--D

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 19:51             ` Darrick J. Wong
  0 siblings, 0 replies; 74+ messages in thread
From: Darrick J. Wong @ 2017-02-06 19:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >  		bp->b_addr = NULL;
> > > > >  	} else {
> > > > >  		int retried = 0;
> > > > > -		unsigned noio_flag;
> > > > > +		unsigned nofs_flag;
> > > > >  
> > > > >  		/*
> > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > >  		 * potentially deadlocking.
> > > > >  		 */
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > 		/*
> > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > 		 * Prevent that by setting the NOFS flag.
> > > > 		 */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.

Ugh, /me hands himself another cup of coffee...

Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
discussion.  Probably because they have similar code snippets with very
similar comments to two totally different parts of xfs.

The _xfs_buf_map_pages can be called inside or outside of
transaction context, so I think we still have to memalloc_nofs_save for
that to avoid the lockdep complaints and deadlocks referenced in the
commit that added all that (to _xfs_buf_map_pages) in the first place.
ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

My comments about kmem_zalloc_large still stand, even though the part
of the patch you two were discussing was the _xfs_buf_map_pages.  I
probably should have clarified that I think both functions actually
/are/ doing the right thing wrt calling (or not calling)
memalloc_nofs_save().

> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

Uh, now that I've muddied up the waters, which part are you referring to?

--D

> -- 
> Michal Hocko
> SUSE Labs
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 19:51             ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Darrick J. Wong
  (?)
@ 2017-02-06 21:18               ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 21:18 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 06-02-17 11:51:11, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > > >  		bp->b_addr = NULL;
> > > > > >  	} else {
> > > > > >  		int retried = 0;
> > > > > > -		unsigned noio_flag;
> > > > > > +		unsigned nofs_flag;
> > > > > >  
> > > > > >  		/*
> > > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > > >  		 * potentially deadlocking.
> > > > > >  		 */
> > > > > 
> > > > > This comment feels out of date ... how about:
> > > > 
> > > > which part is out of date?
> > > > 
> > > > > 
> > > > > 		/*
> > > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > > 		 * Prevent that by setting the NOFS flag.
> > > > > 		 */
> > > > 
> > > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > > more chatty than yours but I am not sure this is worth changing.
> > > 
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> 
> Ugh, /me hands himself another cup of coffee...
> 
> Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
> discussion.  Probably because they have similar code snippets with very
> similar comments to two totally different parts of xfs.
> 
> The _xfs_buf_map_pages can be called inside or outside of
> transaction context, so I think we still have to memalloc_nofs_save for
> that to avoid the lockdep complaints and deadlocks referenced in the
> commit that added all that (to _xfs_buf_map_pages) in the first place.
> ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

Yes, and that memalloc_nofs_save would start with the transaction
context so this (_xfs_buf_map_pages) call would be already covered so
additional memalloc_nofs_save would be unnecessary. Right now I am not
sure whether this is always the case so I have kept this "just to be
sure" measure. Checking that would be in the next step when I would like
to remove other KM_NOFS usage so that we would always rely on the scope
inside the transaction or other potentially dangerous (e.g. from the
stack usage POV and who knows what else) contexts.

Does that make more sense now?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 21:18               ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 21:18 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Matthew Wilcox, Andrew Morton, linux-fsdevel, linux-mm,
	Dave Chinner, djwong, Theodore Ts'o, Chris Mason,
	David Sterba, Jan Kara, ceph-devel, cluster-devel, linux-nfs,
	logfs, linux-xfs, linux-ext4, linux-btrfs, linux-mtd,
	reiserfs-devel, linux-ntfs-dev, linux-f2fs-devel, linux-afs,
	LKML

On Mon 06-02-17 11:51:11, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > > >  		bp->b_addr = NULL;
> > > > > >  	} else {
> > > > > >  		int retried = 0;
> > > > > > -		unsigned noio_flag;
> > > > > > +		unsigned nofs_flag;
> > > > > >  
> > > > > >  		/*
> > > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > > >  		 * potentially deadlocking.
> > > > > >  		 */
> > > > > 
> > > > > This comment feels out of date ... how about:
> > > > 
> > > > which part is out of date?
> > > > 
> > > > > 
> > > > > 		/*
> > > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > > 		 * Prevent that by setting the NOFS flag.
> > > > > 		 */
> > > > 
> > > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > > more chatty than yours but I am not sure this is worth changing.
> > > 
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> 
> Ugh, /me hands himself another cup of coffee...
> 
> Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
> discussion.  Probably because they have similar code snippets with very
> similar comments to two totally different parts of xfs.
> 
> The _xfs_buf_map_pages can be called inside or outside of
> transaction context, so I think we still have to memalloc_nofs_save for
> that to avoid the lockdep complaints and deadlocks referenced in the
> commit that added all that (to _xfs_buf_map_pages) in the first place.
> ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

Yes, and that memalloc_nofs_save would start with the transaction
context so this (_xfs_buf_map_pages) call would be already covered so
additional memalloc_nofs_save would be unnecessary. Right now I am not
sure whether this is always the case so I have kept this "just to be
sure" measure. Checking that would be in the next step when I would like
to remove other KM_NOFS usage so that we would always rely on the scope
inside the transaction or other potentially dangerous (e.g. from the
stack usage POV and who knows what else) contexts.

Does that make more sense now?
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 21:18               ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-06 21:18 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon 06-02-17 11:51:11, Darrick J. Wong wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > > >  		bp->b_addr = NULL;
> > > > > >  	} else {
> > > > > >  		int retried = 0;
> > > > > > -		unsigned noio_flag;
> > > > > > +		unsigned nofs_flag;
> > > > > >  
> > > > > >  		/*
> > > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > > >  		 * potentially deadlocking.
> > > > > >  		 */
> > > > > 
> > > > > This comment feels out of date ... how about:
> > > > 
> > > > which part is out of date?
> > > > 
> > > > > 
> > > > > 		/*
> > > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > > 		 * Prevent that by setting the NOFS flag.
> > > > > 		 */
> > > > 
> > > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > > more chatty than yours but I am not sure this is worth changing.
> > > 
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> 
> Ugh, /me hands himself another cup of coffee...
> 
> Somehow I mixed up _xfs_buf_map_pages and kmem_zalloc_large in this
> discussion.  Probably because they have similar code snippets with very
> similar comments to two totally different parts of xfs.
> 
> The _xfs_buf_map_pages can be called inside or outside of
> transaction context, so I think we still have to memalloc_nofs_save for
> that to avoid the lockdep complaints and deadlocks referenced in the
> commit that added all that (to _xfs_buf_map_pages) in the first place.
> ae687e58b3 ("xfs: use NOIO contexts for vm_map_ram")

Yes, and that memalloc_nofs_save would start with the transaction
context so this (_xfs_buf_map_pages) call would be already covered so
additional memalloc_nofs_save would be unnecessary. Right now I am not
sure whether this is always the case so I have kept this "just to be
sure" measure. Checking that would be in the next step when I would like
to remove other KM_NOFS usage so that we would always rely on the scope
inside the transaction or other potentially dangerous (e.g. from the
stack usage POV and who knows what else) contexts.

Does that make more sense now?
-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 18:47           ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
  (?)
@ 2017-02-06 22:51             ` Dave Chinner
  -1 siblings, 0 replies; 74+ messages in thread
From: Dave Chinner @ 2017-02-06 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Darrick J. Wong, Matthew Wilcox, Andrew Morton, linux-fsdevel,
	linux-mm, djwong, Theodore Ts'o, Chris Mason, David Sterba,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >  		bp->b_addr = NULL;
> > > > >  	} else {
> > > > >  		int retried = 0;
> > > > > -		unsigned noio_flag;
> > > > > +		unsigned nofs_flag;
> > > > >  
> > > > >  		/*
> > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > >  		 * potentially deadlocking.
> > > > >  		 */
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > 		/*
> > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > 		 * Prevent that by setting the NOFS flag.
> > > > 		 */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.
> 
> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

No, absolutely not. "Belief" is not a sufficient justification for
removing low level deadlock avoidance infrastructure. This code
needs to remain in _xfs_buf_map_pages() until a full audit of the
caller paths is done and we're 100% certain that there are no
lurking deadlocks.

For example, I'm pretty sure we can call into _xfs_buf_map_pages()
outside of a transaction context but with an inode ILOCK held
exclusively. If we then recurse into memory reclaim and try to run a
transaction during reclaim, we have an inverted ILOCK vs transaction
locking order. i.e. we are not allowed to call xfs_trans_reserve()
with an ILOCK held as that can deadlock the log:  log full, locked
inode pins tail of log, inode cannot be flushed because ILOCK is
held by caller waiting for log space to become available....

i.e. there are certain situations where holding a ILOCK is a
deadlock vector. See xfs_lock_inodes() for an example of the lengths
we go to avoid ILOCK based log deadlocks like this...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-06 22:51             ` Dave Chinner
  0 siblings, 0 replies; 74+ messages in thread
From: Dave Chinner @ 2017-02-06 22:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Darrick J. Wong, Matthew Wilcox, Andrew Morton, linux-fsdevel,
	linux-mm, djwong, Theodore Ts'o, Chris Mason, David Sterba,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >  		bp->b_addr = NULL;
> > > > >  	} else {
> > > > >  		int retried = 0;
> > > > > -		unsigned noio_flag;
> > > > > +		unsigned nofs_flag;
> > > > >  
> > > > >  		/*
> > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > >  		 * potentially deadlocking.
> > > > >  		 */
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > 		/*
> > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > 		 * Prevent that by setting the NOFS flag.
> > > > 		 */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.
> 
> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

No, absolutely not. "Belief" is not a sufficient justification for
removing low level deadlock avoidance infrastructure. This code
needs to remain in _xfs_buf_map_pages() until a full audit of the
caller paths is done and we're 100% certain that there are no
lurking deadlocks.

For example, I'm pretty sure we can call into _xfs_buf_map_pages()
outside of a transaction context but with an inode ILOCK held
exclusively. If we then recurse into memory reclaim and try to run a
transaction during reclaim, we have an inverted ILOCK vs transaction
locking order. i.e. we are not allowed to call xfs_trans_reserve()
with an ILOCK held as that can deadlock the log:  log full, locked
inode pins tail of log, inode cannot be flushed because ILOCK is
held by caller waiting for log space to become available....

i.e. there are certain situations where holding a ILOCK is a
deadlock vector. See xfs_lock_inodes() for an example of the lengths
we go to avoid ILOCK based log deadlocks like this...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-06 22:51             ` Dave Chinner
  0 siblings, 0 replies; 74+ messages in thread
From: Dave Chinner @ 2017-02-06 22:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
> > On Mon, Feb 06, 2017 at 06:44:15PM +0100, Michal Hocko wrote:
> > > On Mon 06-02-17 07:39:23, Matthew Wilcox wrote:
> > > > On Mon, Feb 06, 2017 at 03:07:16PM +0100, Michal Hocko wrote:
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -442,17 +442,17 @@ _xfs_buf_map_pages(
> > > > >  		bp->b_addr = NULL;
> > > > >  	} else {
> > > > >  		int retried = 0;
> > > > > -		unsigned noio_flag;
> > > > > +		unsigned nofs_flag;
> > > > >  
> > > > >  		/*
> > > > >  		 * vm_map_ram() will allocate auxillary structures (e.g.
> > > > >  		 * pagetables) with GFP_KERNEL, yet we are likely to be under
> > > > >  		 * GFP_NOFS context here. Hence we need to tell memory reclaim
> > > > > -		 * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> > > > > +		 * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> > > > >  		 * memory reclaim re-entering the filesystem here and
> > > > >  		 * potentially deadlocking.
> > > > >  		 */
> > > > 
> > > > This comment feels out of date ... how about:
> > > 
> > > which part is out of date?
> > > 
> > > > 
> > > > 		/*
> > > > 		 * vm_map_ram will allocate auxiliary structures (eg page
> > > > 		 * tables) with GFP_KERNEL.  If that tries to reclaim memory
> > > > 		 * by calling back into this filesystem, we may deadlock.
> > > > 		 * Prevent that by setting the NOFS flag.
> > > > 		 */
> > > 
> > > dunno, the previous wording seems clear enough to me. Maybe little bit
> > > more chatty than yours but I am not sure this is worth changing.
> > 
> > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > wording of the old comment because it captures the uncertainty of
> > whether or not we actually are already under NOFS.  If someone actually
> > has audited this code well enough to know for sure then yes let's change
> > the comment, but I haven't gone that far.
> 
> I believe we can drop the memalloc_nofs_save then as well because either
> we are called from a potentially dangerous context and thus we are in
> the nofs scope we we do not need the protection at all.

No, absolutely not. "Belief" is not a sufficient justification for
removing low level deadlock avoidance infrastructure. This code
needs to remain in _xfs_buf_map_pages() until a full audit of the
caller paths is done and we're 100% certain that there are no
lurking deadlocks.

For example, I'm pretty sure we can call into _xfs_buf_map_pages()
outside of a transaction context but with an inode ILOCK held
exclusively. If we then recurse into memory reclaim and try to run a
transaction during reclaim, we have an inverted ILOCK vs transaction
locking order. i.e. we are not allowed to call xfs_trans_reserve()
with an ILOCK held as that can deadlock the log:  log full, locked
inode pins tail of log, inode cannot be flushed because ILOCK is
held by caller waiting for log space to become available....

i.e. there are certain situations where holding a ILOCK is a
deadlock vector. See xfs_lock_inodes() for an example of the lengths
we go to avoid ILOCK based log deadlocks like this...

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com



^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
  2017-02-06 22:51             ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Dave Chinner
  (?)
@ 2017-02-07  7:17               ` Michal Hocko
  -1 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-07  7:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Matthew Wilcox, Andrew Morton, linux-fsdevel,
	linux-mm, djwong, Theodore Ts'o, Chris Mason, David Sterba,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Tue 07-02-17 09:51:50, Dave Chinner wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
[...]
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> > 
> > I believe we can drop the memalloc_nofs_save then as well because either
> > we are called from a potentially dangerous context and thus we are in
> > the nofs scope we we do not need the protection at all.
> 
> No, absolutely not. "Belief" is not a sufficient justification for
> removing low level deadlock avoidance infrastructure. This code
> needs to remain in _xfs_buf_map_pages() until a full audit of the
> caller paths is done and we're 100% certain that there are no
> lurking deadlocks.

Exactly. I was actually refering to "If someone actually has audited
this code" above... So I definitely do not want to justify anything
based on the belief

> For example, I'm pretty sure we can call into _xfs_buf_map_pages()
> outside of a transaction context but with an inode ILOCK held
> exclusively. If we then recurse into memory reclaim and try to run a
> transaction during reclaim, we have an inverted ILOCK vs transaction
> locking order. i.e. we are not allowed to call xfs_trans_reserve()
> with an ILOCK held as that can deadlock the log:  log full, locked
> inode pins tail of log, inode cannot be flushed because ILOCK is
> held by caller waiting for log space to become available....
> 
> i.e. there are certain situations where holding a ILOCK is a
> deadlock vector. See xfs_lock_inodes() for an example of the lengths
> we go to avoid ILOCK based log deadlocks like this...

Thanks for the reference. This is really helpful!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 74+ messages in thread

* Re: [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
@ 2017-02-07  7:17               ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-07  7:17 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Darrick J. Wong, Matthew Wilcox, Andrew Morton, linux-fsdevel,
	linux-mm, djwong, Theodore Ts'o, Chris Mason, David Sterba,
	Jan Kara, ceph-devel, cluster-devel, linux-nfs, logfs, linux-xfs,
	linux-ext4, linux-btrfs, linux-mtd, reiserfs-devel,
	linux-ntfs-dev, linux-f2fs-devel, linux-afs, LKML

On Tue 07-02-17 09:51:50, Dave Chinner wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
[...]
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> > 
> > I believe we can drop the memalloc_nofs_save then as well because either
> > we are called from a potentially dangerous context and thus we are in
> > the nofs scope we we do not need the protection at all.
> 
> No, absolutely not. "Belief" is not a sufficient justification for
> removing low level deadlock avoidance infrastructure. This code
> needs to remain in _xfs_buf_map_pages() until a full audit of the
> caller paths is done and we're 100% certain that there are no
> lurking deadlocks.

Exactly. I was actually refering to "If someone actually has audited
this code" above... So I definitely do not want to justify anything
based on the belief

> For example, I'm pretty sure we can call into _xfs_buf_map_pages()
> outside of a transaction context but with an inode ILOCK held
> exclusively. If we then recurse into memory reclaim and try to run a
> transaction during reclaim, we have an inverted ILOCK vs transaction
> locking order. i.e. we are not allowed to call xfs_trans_reserve()
> with an ILOCK held as that can deadlock the log:  log full, locked
> inode pins tail of log, inode cannot be flushed because ILOCK is
> held by caller waiting for log space to become available....
> 
> i.e. there are certain situations where holding a ILOCK is a
> deadlock vector. See xfs_lock_inodes() for an example of the lengths
> we go to avoid ILOCK based log deadlocks like this...

Thanks for the reference. This is really helpful!

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 74+ messages in thread

* [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} instead of memalloc_noio*
@ 2017-02-07  7:17               ` Michal Hocko
  0 siblings, 0 replies; 74+ messages in thread
From: Michal Hocko @ 2017-02-07  7:17 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue 07-02-17 09:51:50, Dave Chinner wrote:
> On Mon, Feb 06, 2017 at 07:47:43PM +0100, Michal Hocko wrote:
> > On Mon 06-02-17 10:32:37, Darrick J. Wong wrote:
[...]
> > > I prefer to keep the "...yet we are likely to be under GFP_NOFS..."
> > > wording of the old comment because it captures the uncertainty of
> > > whether or not we actually are already under NOFS.  If someone actually
> > > has audited this code well enough to know for sure then yes let's change
> > > the comment, but I haven't gone that far.
> > 
> > I believe we can drop the memalloc_nofs_save then as well because either
> > we are called from a potentially dangerous context and thus we are in
> > the nofs scope we we do not need the protection at all.
> 
> No, absolutely not. "Belief" is not a sufficient justification for
> removing low level deadlock avoidance infrastructure. This code
> needs to remain in _xfs_buf_map_pages() until a full audit of the
> caller paths is done and we're 100% certain that there are no
> lurking deadlocks.

Exactly. I was actually refering to "If someone actually has audited
this code" above... So I definitely do not want to justify anything
based on the belief

> For example, I'm pretty sure we can call into _xfs_buf_map_pages()
> outside of a transaction context but with an inode ILOCK held
> exclusively. If we then recurse into memory reclaim and try to run a
> transaction during reclaim, we have an inverted ILOCK vs transaction
> locking order. i.e. we are not allowed to call xfs_trans_reserve()
> with an ILOCK held as that can deadlock the log:  log full, locked
> inode pins tail of log, inode cannot be flushed because ILOCK is
> held by caller waiting for log space to become available....
> 
> i.e. there are certain situations where holding a ILOCK is a
> deadlock vector. See xfs_lock_inodes() for an example of the lengths
> we go to avoid ILOCK based log deadlocks like this...

Thanks for the reference. This is really helpful!

-- 
Michal Hocko
SUSE Labs



^ permalink raw reply	[flat|nested] 74+ messages in thread

end of thread, other threads:[~2017-02-07  7:17 UTC | newest]

Thread overview: 74+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-06 14:07 [PATCH 0/6 v4] scope GFP_NOFS api Michal Hocko
2017-02-06 14:07 ` [Cluster-devel] " Michal Hocko
2017-02-06 14:07 ` Michal Hocko
2017-02-06 14:07 ` Michal Hocko
2017-02-06 14:07 ` Michal Hocko
2017-02-06 14:07 ` Michal Hocko
2017-02-06 14:07 ` Michal Hocko
2017-02-06 14:07 ` [PATCH 1/6] lockdep: allow to disable reclaim lockup detection Michal Hocko
2017-02-06 14:07   ` [Cluster-devel] " Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:26   ` Matthew Wilcox
2017-02-06 14:26     ` [Cluster-devel] " Matthew Wilcox
2017-02-06 14:26     ` Matthew Wilcox
2017-02-06 14:34     ` Michal Hocko
2017-02-06 14:34       ` [Cluster-devel] " Michal Hocko
2017-02-06 14:34       ` Michal Hocko
2017-02-06 15:24       ` Matthew Wilcox
2017-02-06 15:24         ` [Cluster-devel] " Matthew Wilcox
2017-02-06 15:24         ` Matthew Wilcox
2017-02-06 15:30         ` Michal Hocko
2017-02-06 15:30           ` [Cluster-devel] " Michal Hocko
2017-02-06 15:30           ` Michal Hocko
2017-02-06 14:07 ` [PATCH 2/6] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS Michal Hocko
2017-02-06 14:07   ` [Cluster-devel] " Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07 ` [PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2017-02-06 14:07   ` [Cluster-devel] [PATCH 3/6] mm: introduce memalloc_nofs_{save, restore} API Michal Hocko
2017-02-06 14:07   ` [PATCH 3/6] mm: introduce memalloc_nofs_{save,restore} API Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07 ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio* Michal Hocko
2017-02-06 14:07   ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 15:39   ` Matthew Wilcox
2017-02-06 15:39     ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Matthew Wilcox
2017-02-06 15:39     ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Matthew Wilcox
2017-02-06 17:44     ` Michal Hocko
2017-02-06 17:44       ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Michal Hocko
2017-02-06 17:44       ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
2017-02-06 18:32       ` Darrick J. Wong
2017-02-06 18:32         ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Darrick J. Wong
2017-02-06 18:32         ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Darrick J. Wong
2017-02-06 18:47         ` Michal Hocko
2017-02-06 18:47           ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Michal Hocko
2017-02-06 18:47           ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
2017-02-06 19:51           ` Darrick J. Wong
2017-02-06 19:51             ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Darrick J. Wong
2017-02-06 19:51             ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Darrick J. Wong
2017-02-06 21:18             ` Michal Hocko
2017-02-06 21:18               ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Michal Hocko
2017-02-06 21:18               ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
2017-02-06 22:51           ` Dave Chinner
2017-02-06 22:51             ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Dave Chinner
2017-02-06 22:51             ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Dave Chinner
2017-02-07  7:17             ` Michal Hocko
2017-02-07  7:17               ` [Cluster-devel] [PATCH 4/6] xfs: use memalloc_nofs_{save, restore} " Michal Hocko
2017-02-07  7:17               ` [PATCH 4/6] xfs: use memalloc_nofs_{save,restore} " Michal Hocko
2017-02-06 14:07 ` [PATCH 5/6] jbd2: mark the transaction context with the scope GFP_NOFS context Michal Hocko
2017-02-06 14:07   ` [Cluster-devel] " Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07 ` [PATCH 6/6] jbd2: make the whole kjournald2 kthread NOFS safe Michal Hocko
2017-02-06 14:07   ` [Cluster-devel] " Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko
2017-02-06 14:07   ` Michal Hocko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.