* [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4
@ 2015-09-21 10:52 Mel Gorman
  2015-09-21 10:52 ` [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
                   ` (9 more replies)
  0 siblings, 10 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

Changelog since V3
o Rebase to 4.3-rc1
o Consistent style for __GFP_WAIT				(joonsoo)
o Restored cpuset static checking behaviour			(vbabka)
o Fix cpusets check in allocator fastpath			(vbabka)
o Applied acks

Changelog since V2
o Covered cases where __GFP_KSWAPD_RECLAIM is needed		(vbabka)
o Cleaned up trailing references to zlc				(vbabka)
o Fixed a subtle problem with GFP_TRANSHUGE checks		(vbabka)
o Split out an unrelated change to its own patch		(vbabka)
o Reordered series to put GFP flag modifications at start	(mhocko)
o Added a number of clarifications on reclaim modifications	(mhocko)
o Only check cpusets when one exists that can limit memory	(rientjes)
o Applied acks

Changelog since V1
o Improve cpusets checks as suggested				(rientjes)
o Add various acks and reviewed-bys
o Rebase to 4.2-rc6

Changelog since RFC
o Rebase to 4.2-rc5
o Distinguish between high priority callers and callers that avoid sleep
o Remove jump label related damage patches

Overall, the intent of this series is to remove the zonelist cache which
was introduced to avoid high overhead in the page allocator. Once this is
done, it is necessary to reduce the cost of watermark checks.

The series starts with minor micro-optimisations.

Next it notes that the GFP flags that affect watermark checks are
abused. Historically, clearing __GFP_WAIT identified callers that could
not sleep and could access reserves. This was later abused by callers that
simply prefer to avoid sleeping and have other options. A patch
distinguishes between atomic callers, high-priority callers and those that
simply wish to avoid sleep.
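
To make the distinction concrete, here is a minimal, hypothetical sketch of
the three caller categories as they could look at call sites once the series
is applied. The flag names come from the patches below; the exact composition
of the GFP_* groups is an assumption, not something defined by this cover
letter:

	#include <linux/gfp.h>
	#include <linux/slab.h>

	static void gfp_caller_examples(void)
	{
		/* Truly atomic: cannot sleep, may use the atomic reserves. */
		void *a = kmalloc(64, GFP_ATOMIC);

		/* Opportunistic high-order attempt: fail rather than enter
		 * direct reclaim, but still allow kswapd to be woken. */
		struct page *p = alloc_pages((GFP_KERNEL | __GFP_NOWARN |
				__GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM, 2);

		/* Ordinary sleeping allocation: direct reclaim plus kswapd. */
		void *b = kmalloc(4096, GFP_KERNEL);

		kfree(b);
		if (p)
			__free_pages(p, 2);
		kfree(a);
	}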

The zonelist cache has been around for a long time but it is of dubious
merit, carrying a lot of complexity and some issues that are explained.
The most important issue is that a failed THP allocation can cause a
zone to be treated as "full". This potentially causes unnecessary stalls,
reclaim activity or remote fallbacks. The issues could be fixed but it's
not worth it. The series places a small number of other micro-optimisations
on top before examining the high-order watermark checks.

High-order watermark enforcement can cause high-order allocations to fail
even though pages are free. The watermark checks both protect high-order
atomic allocations and make kswapd aware of high-order pages, but both
can be handled better using migrate types. This series uses page grouping
by mobility to reserve pageblocks for high-order allocations with the
size of the reservation depending on demand. kswapd awareness is maintained
by examining the free lists. By the end of the series there are no
high-order watermark checks while the properties that motivated their
introduction are preserved.
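
As a rough, hypothetical sketch (not code from this series) of what the
free-list examination could look like, assuming the usual struct zone and
free_area layout, kswapd could detect whether a free page of at least a
given order exists like this:

	#include <linux/mmzone.h>
	#include <linux/list.h>

	static bool zone_has_free_order(struct zone *zone, unsigned int order)
	{
		unsigned int o;
		int mt;

		for (o = order; o < MAX_ORDER; o++) {
			struct free_area *area = &zone->free_area[o];

			for (mt = 0; mt < MIGRATE_TYPES; mt++)
				if (!list_empty(&area->free_list[mt]))
					return true;
		}
		return false;
	}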

 Documentation/vm/balance                           |  14 +-
 arch/arm/mm/dma-mapping.c                          |   6 +-
 arch/arm/xen/mm.c                                  |   2 +-
 arch/arm64/mm/dma-mapping.c                        |   4 +-
 arch/x86/kernel/pci-dma.c                          |   2 +-
 block/bio.c                                        |  26 +-
 block/blk-core.c                                   |  16 +-
 block/blk-ioc.c                                    |   2 +-
 block/blk-mq-tag.c                                 |   2 +-
 block/blk-mq.c                                     |   8 +-
 block/scsi_ioctl.c                                 |   6 +-
 drivers/block/drbd/drbd_bitmap.c                   |   2 +-
 drivers/block/drbd/drbd_receiver.c                 |   3 +-
 drivers/block/mtip32xx/mtip32xx.c                  |   2 +-
 drivers/block/nvme-core.c                          |   4 +-
 drivers/block/osdblk.c                             |   2 +-
 drivers/block/paride/pd.c                          |   2 +-
 drivers/block/pktcdvd.c                            |   4 +-
 drivers/connector/connector.c                      |   3 +-
 drivers/firewire/core-cdev.c                       |   2 +-
 drivers/gpu/drm/i915/i915_gem.c                    |   4 +-
 drivers/ide/ide-atapi.c                            |   2 +-
 drivers/ide/ide-cd.c                               |   2 +-
 drivers/ide/ide-cd_ioctl.c                         |   2 +-
 drivers/ide/ide-devsets.c                          |   2 +-
 drivers/ide/ide-disk.c                             |   2 +-
 drivers/ide/ide-ioctls.c                           |   4 +-
 drivers/ide/ide-park.c                             |   2 +-
 drivers/ide/ide-pm.c                               |   4 +-
 drivers/ide/ide-tape.c                             |   4 +-
 drivers/ide/ide-taskfile.c                         |   4 +-
 drivers/infiniband/core/sa_query.c                 |   2 +-
 drivers/infiniband/hw/qib/qib_init.c               |   2 +-
 drivers/iommu/amd_iommu.c                          |   2 +-
 drivers/iommu/intel-iommu.c                        |   2 +-
 drivers/md/dm-crypt.c                              |   6 +-
 drivers/md/dm-kcopyd.c                             |   2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |   2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2.c         |   2 +-
 drivers/media/pci/tw68/tw68-video.c                |   2 +-
 drivers/misc/vmw_balloon.c                         |   2 +-
 drivers/mtd/mtdcore.c                              |   3 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |   2 +-
 drivers/scsi/scsi_error.c                          |   2 +-
 drivers/scsi/scsi_lib.c                            |   4 +-
 drivers/staging/android/ion/ion_system_heap.c      |   2 +-
 .../lustre/include/linux/libcfs/libcfs_private.h   |   2 +-
 drivers/staging/rdma/hfi1/init.c                   |   2 +-
 drivers/staging/rdma/ipath/ipath_file_ops.c        |   2 +-
 drivers/usb/host/u132-hcd.c                        |   2 +-
 drivers/video/fbdev/vermilion/vermilion.c          |   2 +-
 fs/btrfs/disk-io.c                                 |   2 +-
 fs/btrfs/extent_io.c                               |  14 +-
 fs/btrfs/volumes.c                                 |   4 +-
 fs/cachefiles/internal.h                           |   2 +-
 fs/direct-io.c                                     |   2 +-
 fs/ext4/super.c                                    |   2 +-
 fs/fscache/cookie.c                                |   2 +-
 fs/fscache/page.c                                  |   6 +-
 fs/jbd2/transaction.c                              |   4 +-
 fs/nfs/file.c                                      |   6 +-
 fs/nilfs2/mdt.h                                    |   2 +-
 fs/xfs/xfs_qm.c                                    |   2 +-
 include/linux/cpuset.h                             |   6 +
 include/linux/gfp.h                                |  70 ++-
 include/linux/mmzone.h                             |  88 +--
 include/linux/skbuff.h                             |   6 +-
 include/net/sock.h                                 |   2 +-
 include/trace/events/gfpflags.h                    |   5 +-
 kernel/audit.c                                     |   6 +-
 kernel/cgroup.c                                    |   2 +-
 kernel/locking/lockdep.c                           |   2 +-
 kernel/power/snapshot.c                            |   2 +-
 kernel/power/swap.c                                |  14 +-
 kernel/smp.c                                       |   2 +-
 lib/idr.c                                          |   4 +-
 lib/percpu_ida.c                                   |   2 +-
 lib/radix-tree.c                                   |  10 +-
 mm/backing-dev.c                                   |   2 +-
 mm/dmapool.c                                       |   2 +-
 mm/failslab.c                                      |   8 +-
 mm/filemap.c                                       |   2 +-
 mm/huge_memory.c                                   |   4 +-
 mm/internal.h                                      |   1 +
 mm/memcontrol.c                                    |   8 +-
 mm/mempool.c                                       |  10 +-
 mm/migrate.c                                       |   4 +-
 mm/page_alloc.c                                    | 599 +++++++--------------
 mm/slab.c                                          |  18 +-
 mm/slub.c                                          |  10 +-
 mm/vmalloc.c                                       |   2 +-
 mm/vmscan.c                                        |   8 +-
 mm/vmstat.c                                        |   2 +-
 mm/zswap.c                                         |   5 +-
 net/core/skbuff.c                                  |   8 +-
 net/core/sock.c                                    |   6 +-
 net/netlink/af_netlink.c                           |   2 +-
 net/rds/ib_recv.c                                  |   4 +-
 net/rxrpc/ar-connection.c                          |   2 +-
 net/sctp/associola.c                               |   2 +-
 security/integrity/ima/ima_crypto.c                |   2 +-
 101 files changed, 466 insertions(+), 705 deletions(-)

-- 
2.4.6


^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-24 20:01   ` Johannes Weiner
  2015-09-21 10:52 ` [PATCH 02/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
removes the unnecessary parameter.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h | 2 +-
 mm/page_alloc.c        | 5 +++--
 mm/vmscan.c            | 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d94347737292..8687467a6e84 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -817,7 +817,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool zone_watermark_ok(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, int alloc_flags);
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
-		unsigned long mark, int classzone_idx, int alloc_flags);
+		unsigned long mark, int classzone_idx);
 enum memmap_context {
 	MEMMAP_EARLY,
 	MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..7a3199313622 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2249,6 +2249,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
+
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
@@ -2278,14 +2279,14 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 }
 
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
-			unsigned long mark, int classzone_idx, int alloc_flags)
+			unsigned long mark, int classzone_idx)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
 		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
 
-	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+	return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
 								free_pages);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d978b28a410..8b2786fd42b5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2480,7 +2480,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
 			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
 	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
-	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
 
 	/*
 	 * If compaction is deferred, reclaim up to a point where
@@ -2963,7 +2963,7 @@ static bool zone_balanced(struct zone *zone, int order,
 			  unsigned long balance_gap, int classzone_idx)
 {
 	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx, 0))
+				    balance_gap, classzone_idx))
 		return false;
 
 	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 02/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
  2015-09-21 10:52 ` [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-24 20:05   ` Johannes Weiner
  2015-09-21 10:52 ` [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

File-backed pages that will be immediately written are balanced between
zones.  This heuristic tries to avoid having a single zone filled with
recently dirtied pages but the checks are unnecessarily expensive. Move
the consider_zone_dirty check into struct alloc_context instead of checking
bitmaps multiple times. The patch also gives the field a more meaningful
name.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/internal.h   |  1 +
 mm/page_alloc.c | 11 +++++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index bc0fa9a69e46..83fb0bfffc13 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,6 +129,7 @@ struct alloc_context {
 	int classzone_idx;
 	int migratetype;
 	enum zone_type high_zoneidx;
+	bool spread_dirty_pages;
 };
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7a3199313622..4793bddb6b2a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2478,8 +2478,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
-				(gfp_mask & __GFP_WRITE);
 	int nr_fair_skipped = 0;
 	bool zonelist_rescan;
 
@@ -2534,14 +2532,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 *
 		 * XXX: For now, allow allocations to potentially
 		 * exceed the per-zone dirty limit in the slowpath
-		 * (ALLOC_WMARK_LOW unset) before going into reclaim,
+		 * (spread_dirty_pages unset) before going into reclaim,
 		 * which is important when on a NUMA setup the allowed
 		 * zones are together not big enough to reach the
 		 * global limit.  The proper fix for these situations
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (consider_zone_dirty && !zone_dirty_ok(zone))
+		if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -3232,6 +3230,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	/* We set it here, as __alloc_pages_slowpath might have changed it */
 	ac.zonelist = zonelist;
+
+	/* Dirty zone balancing only done in the fast path */
+	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
+
 	/* The preferred zone is used for statistics later */
 	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
 				ac.nodemask ? : &cpuset_current_mems_allowed,
@@ -3250,6 +3252,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		 * complete.
 		 */
 		alloc_mask = memalloc_noio_flags(gfp_mask);
+		ac.spread_dirty_pages = false;
 
 		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 	}
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
  2015-09-21 10:52 ` [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
  2015-09-21 10:52 ` [PATCH 02/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-24 20:06   ` Johannes Weiner
  2015-09-30 22:22   ` David Rientjes
  2015-09-21 10:52 ` [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

There is a seqcounter that protects against spurious allocation failures
when a task is changing the allowed nodes in a cpuset. There is no need
to check the seqcounter until a cpuset exists.
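
For context, the allocator's reader side follows the usual seqcount retry
pattern. A simplified sketch (not the exact page_alloc.c code) of how the
cookie is used:

	unsigned int cpuset_mems_cookie;
	struct page *page;

retry_cpuset:
	cpuset_mems_cookie = read_mems_allowed_begin();

	page = get_page_from_freelist(gfp_mask, order, alloc_flags, &ac);

	/*
	 * If no page was found and a concurrent cpuset update changed
	 * mems_allowed while we were looking, retry with the new mask.
	 */
	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
		goto retry_cpuset;

With the cpusets_enabled() checks added below, both helpers become no-ops
when cpusets are not in use, so the retry loop costs nothing in the common
case.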

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/cpuset.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 1b357997cac5..6eb27cb480b7 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
  */
 static inline unsigned int read_mems_allowed_begin(void)
 {
+	if (!cpusets_enabled())
+		return 0;
+
 	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
@@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
  */
 static inline bool read_mems_allowed_retry(unsigned int seq)
 {
+	if (!cpusets_enabled())
+		return false;
+
 	return read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (2 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-24 20:34   ` Johannes Weiner
  2015-09-21 10:52 ` [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

This patch redefines which GFP bits are used for specifying mobility and
the order of the migrate types. Once redefined it's possible to convert
GFP flags to a migrate type with a simple mask and shift. The only downside
is that readers of OOM kill messages and allocation failures may be
accustomed to the existing values, but scripts/gfp-translate will help.
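
For illustration, with the bit values used in the hunks below
(___GFP_MOVABLE 0x08, ___GFP_RECLAIMABLE 0x10, GFP_MOVABLE_SHIFT 3), the
mask-and-shift conversion works out as:

	(gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT
	0x00 >> 3 == 0  ->  MIGRATE_UNMOVABLE
	0x08 >> 3 == 1  ->  MIGRATE_MOVABLE
	0x10 >> 3 == 2  ->  MIGRATE_RECLAIMABLE

which is why MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE swap places in the
mmzone.h hunk.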

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h    | 12 +++++++-----
 include/linux/mmzone.h |  2 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index f92cbd2f4450..440fca3e7e5d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -14,7 +14,7 @@ struct vm_area_struct;
 #define ___GFP_HIGHMEM		0x02u
 #define ___GFP_DMA32		0x04u
 #define ___GFP_MOVABLE		0x08u
-#define ___GFP_WAIT		0x10u
+#define ___GFP_RECLAIMABLE	0x10u
 #define ___GFP_HIGH		0x20u
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
@@ -29,7 +29,7 @@ struct vm_area_struct;
 #define ___GFP_NOMEMALLOC	0x10000u
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
-#define ___GFP_RECLAIMABLE	0x80000u
+#define ___GFP_WAIT		0x80000u
 #define ___GFP_NOACCOUNT	0x100000u
 #define ___GFP_NOTRACK		0x200000u
 #define ___GFP_NO_KSWAPD	0x400000u
@@ -126,6 +126,7 @@ struct vm_area_struct;
 
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
+#define GFP_MOVABLE_SHIFT 3
 
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
@@ -152,14 +153,15 @@ struct vm_area_struct;
 /* Convert GFP flags to their corresponding migrate type */
 static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 {
-	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
+	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
 
 	if (unlikely(page_group_by_mobility_disabled))
 		return MIGRATE_UNMOVABLE;
 
 	/* Group based on mobility */
-	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
-		((gfp_flags & __GFP_RECLAIMABLE) != 0);
+	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
 }
 
 #ifdef CONFIG_HIGHMEM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8687467a6e84..b489e0b5ab48 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,8 +37,8 @@
 
 enum {
 	MIGRATE_UNMOVABLE,
-	MIGRATE_RECLAIMABLE,
 	MIGRATE_MOVABLE,
+	MIGRATE_RECLAIMABLE,
 	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
 	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (3 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-24 13:51   ` Michal Hocko
  2015-09-24 20:55   ` Johannes Weiner
  2015-09-21 10:52 ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
                   ` (4 subsequent siblings)
  9 siblings, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

The absence of __GFP_WAIT has been used to identify atomic context in
callers that hold spinlocks or are in interrupts. They are expected to be
high priority and have access to one of two watermarks lower than "min"
which can be referred to as the "atomic reserve". __GFP_HIGH users get
access to the first lower watermark and can be called the "high priority
reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some abused this by clearing __GFP_WAIT, leading to a
situation where an optimistic allocation with a fallback option can access
atomic reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
to mean that the caller is willing to enter direct reclaim and wake kswapd
for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically can now trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. In most cases it's almost
certainly harmless if it's missed, as other activity will wake kswapd.
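
As an approximation of the semantics described above (the include/linux/gfp.h
hunk in this patch is authoritative; the following definitions are an
assumption-laden sketch only):

	/* Approximate semantics only, not the authoritative definitions */
	#define __GFP_WAIT	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
	#define GFP_ATOMIC	(__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)

	static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
	{
		return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
	}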

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 Documentation/vm/balance                           | 14 ++++---
 arch/arm/mm/dma-mapping.c                          |  6 +--
 arch/arm/xen/mm.c                                  |  2 +-
 arch/arm64/mm/dma-mapping.c                        |  4 +-
 arch/x86/kernel/pci-dma.c                          |  2 +-
 block/bio.c                                        | 26 ++++++------
 block/blk-core.c                                   | 16 ++++----
 block/blk-ioc.c                                    |  2 +-
 block/blk-mq-tag.c                                 |  2 +-
 block/blk-mq.c                                     |  8 ++--
 drivers/block/drbd/drbd_receiver.c                 |  3 +-
 drivers/block/osdblk.c                             |  2 +-
 drivers/connector/connector.c                      |  3 +-
 drivers/firewire/core-cdev.c                       |  2 +-
 drivers/gpu/drm/i915/i915_gem.c                    |  2 +-
 drivers/infiniband/core/sa_query.c                 |  2 +-
 drivers/iommu/amd_iommu.c                          |  2 +-
 drivers/iommu/intel-iommu.c                        |  2 +-
 drivers/md/dm-crypt.c                              |  6 +--
 drivers/md/dm-kcopyd.c                             |  2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |  2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2.c         |  2 +-
 drivers/media/pci/tw68/tw68-video.c                |  2 +-
 drivers/mtd/mtdcore.c                              |  3 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |  2 +-
 drivers/staging/android/ion/ion_system_heap.c      |  2 +-
 .../lustre/include/linux/libcfs/libcfs_private.h   |  2 +-
 drivers/usb/host/u132-hcd.c                        |  2 +-
 drivers/video/fbdev/vermilion/vermilion.c          |  2 +-
 fs/btrfs/disk-io.c                                 |  2 +-
 fs/btrfs/extent_io.c                               | 14 +++----
 fs/btrfs/volumes.c                                 |  4 +-
 fs/ext4/super.c                                    |  2 +-
 fs/fscache/cookie.c                                |  2 +-
 fs/fscache/page.c                                  |  6 +--
 fs/jbd2/transaction.c                              |  4 +-
 fs/nfs/file.c                                      |  6 +--
 fs/xfs/xfs_qm.c                                    |  2 +-
 include/linux/gfp.h                                | 46 ++++++++++++++++------
 include/linux/skbuff.h                             |  6 +--
 include/net/sock.h                                 |  2 +-
 include/trace/events/gfpflags.h                    |  5 ++-
 kernel/audit.c                                     |  6 +--
 kernel/cgroup.c                                    |  2 +-
 kernel/locking/lockdep.c                           |  2 +-
 kernel/power/snapshot.c                            |  2 +-
 kernel/smp.c                                       |  2 +-
 lib/idr.c                                          |  4 +-
 lib/radix-tree.c                                   | 10 ++---
 mm/backing-dev.c                                   |  2 +-
 mm/dmapool.c                                       |  2 +-
 mm/memcontrol.c                                    |  8 ++--
 mm/mempool.c                                       | 10 ++---
 mm/migrate.c                                       |  2 +-
 mm/page_alloc.c                                    | 43 ++++++++++++--------
 mm/slab.c                                          | 18 ++++-----
 mm/slub.c                                          | 10 ++---
 mm/vmalloc.c                                       |  2 +-
 mm/vmscan.c                                        |  4 +-
 mm/zswap.c                                         |  5 ++-
 net/core/skbuff.c                                  |  8 ++--
 net/core/sock.c                                    |  6 ++-
 net/netlink/af_netlink.c                           |  2 +-
 net/rds/ib_recv.c                                  |  4 +-
 net/rxrpc/ar-connection.c                          |  2 +-
 net/sctp/associola.c                               |  2 +-
 66 files changed, 212 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index c46e68cf9344..964595481af6 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -1,12 +1,14 @@
 Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
 
-Memory balancing is needed for non __GFP_WAIT as well as for non
-__GFP_IO allocations.
+Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
+well as for non __GFP_IO allocations.
 
-There are two reasons to be requesting non __GFP_WAIT allocations:
-the caller can not sleep (typically intr context), or does not want
-to incur cost overheads of page stealing and possible swap io for
-whatever reasons.
+The first reason why a caller may avoid reclaim is that the caller can not
+sleep due to holding a spinlock or is in interrupt context. The second may
+be that the caller is willing to fail the allocation without incurring the
+overhead of page reclaim. This may happen for opportunistic high-order
+allocation requests that have order-0 fallback options. In such cases,
+the caller may also wish to avoid waking kswapd.
 
 __GFP_IO allocation requests are made to prevent file system deadlocks.
 
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 1a7815e5421b..38307d8312ac 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -651,12 +651,12 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
 
 	if (nommu())
 		addr = __alloc_simple_buffer(dev, size, gfp, &page);
-	else if (dev_get_cma_area(dev) && (gfp & __GFP_WAIT))
+	else if (dev_get_cma_area(dev) && (gfp & __GFP_DIRECT_RECLAIM))
 		addr = __alloc_from_contiguous(dev, size, prot, &page,
 					       caller, want_vaddr);
 	else if (is_coherent)
 		addr = __alloc_simple_buffer(dev, size, gfp, &page);
-	else if (!(gfp & __GFP_WAIT))
+	else if (!gfpflags_allow_blocking(gfp))
 		addr = __alloc_from_pool(size, &page);
 	else
 		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page,
@@ -1363,7 +1363,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
 	*handle = DMA_ERROR_CODE;
 	size = PAGE_ALIGN(size);
 
-	if (!(gfp & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(gfp))
 		return __iommu_alloc_atomic(dev, size, handle);
 
 	/*
diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index 6dd911d1f0ac..99eec9063f68 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -25,7 +25,7 @@
 unsigned long xen_get_swiotlb_free_pages(unsigned int order)
 {
 	struct memblock_region *reg;
-	gfp_t flags = __GFP_NOWARN;
+	gfp_t flags = __GFP_NOWARN|__GFP_KSWAPD_RECLAIM;
 
 	for_each_memblock(memory, reg) {
 		if (reg->base < (phys_addr_t)0xffffffff) {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index 99224dcebdc5..478234383c2c 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
 	if (IS_ENABLED(CONFIG_ZONE_DMA) &&
 	    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
 		flags |= GFP_DMA;
-	if (dev_get_cma_area(dev) && (flags & __GFP_WAIT)) {
+	if (dev_get_cma_area(dev) && gfpflags_allow_blocking(flags)) {
 		struct page *page;
 		void *addr;
 
@@ -148,7 +148,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
 
 	size = PAGE_ALIGN(size);
 
-	if (!coherent && !(flags & __GFP_WAIT)) {
+	if (!coherent && !gfpflags_allow_blocking(flags)) {
 		struct page *page = NULL;
 		void *addr = __alloc_from_pool(size, &page, flags);
 
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 1b55de1267cf..a8e618b16a66 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -90,7 +90,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 again:
 	page = NULL;
 	/* CMA can be used only in the context which permits sleeping */
-	if (flag & __GFP_WAIT) {
+	if (gfpflags_allow_blocking(flag)) {
 		page = dma_alloc_from_contiguous(dev, count, get_order(size));
 		if (page && page_to_phys(page) + size > dma_mask) {
 			dma_release_from_contiguous(dev, page, count);
diff --git a/block/bio.c b/block/bio.c
index ad3f276d74bc..4f184d938942 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 		bvl = mempool_alloc(pool, gfp_mask);
 	} else {
 		struct biovec_slab *bvs = bvec_slabs + *idx;
-		gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
 
 		/*
 		 * Make this allocation restricted and don't dump info on
@@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 		__gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
 
 		/*
-		 * Try a slab allocation. If this fails and __GFP_WAIT
+		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
 		 * is set, retry with the 1-entry mempool
 		 */
 		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
-		if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
+		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
 			*idx = BIOVEC_MAX_IDX;
 			goto fallback;
 		}
@@ -395,12 +395,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  *   If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
  *   backed by the @bs's mempool.
  *
- *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
- *   able to allocate a bio. This is due to the mempool guarantees. To make this
- *   work, callers must never allocate more than 1 bio at a time from this pool.
- *   Callers that need to allocate more than 1 bio must always submit the
- *   previously allocated bio for IO before attempting to allocate a new one.
- *   Failure to do so can cause deadlocks under memory pressure.
+ *   When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
+ *   always be able to allocate a bio. This is due to the mempool guarantees.
+ *   To make this work, callers must never allocate more than 1 bio at a time
+ *   from this pool. Callers that need to allocate more than 1 bio must always
+ *   submit the previously allocated bio for IO before attempting to allocate
+ *   a new one. Failure to do so can cause deadlocks under memory pressure.
  *
  *   Note that when running under generic_make_request() (i.e. any block
  *   driver), bios are not submitted until after you return - see the code in
@@ -459,13 +459,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * We solve this, and guarantee forward progress, with a rescuer
 		 * workqueue per bio_set. If we go to allocate and there are
 		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
+		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
+		 * bios we would be blocking to the rescuer workqueue before
+		 * we retry with the original gfp_flags.
 		 */
 
 		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
+			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
 		if (!p && gfp_mask != saved_gfp) {
diff --git a/block/blk-core.c b/block/blk-core.c
index 2eb722d48773..0391206868e9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1160,8 +1160,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
  * @bio: bio to allocate request for (can be %NULL)
  * @gfp_mask: allocation mask
  *
- * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
- * function keeps retrying under memory pressure and fails iff @q is dead.
+ * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
+ * this function keeps retrying under memory pressure and fails iff @q is dead.
  *
  * Must be called with @q->queue_lock held and,
  * Returns ERR_PTR on failure, with @q->queue_lock held.
@@ -1181,7 +1181,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (!IS_ERR(rq))
 		return rq;
 
-	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
+	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
 	}
@@ -1259,11 +1259,11 @@ EXPORT_SYMBOL(blk_get_request);
  * BUG.
  *
  * WARNING: When allocating/cloning a bio-chain, careful consideration should be
- * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
- * anything but the first bio in the chain. Otherwise you risk waiting for IO
- * completion of a bio that hasn't been submitted yet, thus resulting in a
- * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
- * of bio_alloc(), as that avoids the mempool deadlock.
+ * given to how you allocate bios. In particular, you cannot use
+ * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
+ * you risk waiting for IO completion of a bio that hasn't been submitted yet,
+ * thus resulting in a deadlock. Alternatively bios should be allocated using
+ * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
  * If possible a big IO should be split into smaller parts when allocation
  * fails. Partial allocation should not be an error, or you risk a live-lock.
  */
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 1a27f45ec776..381cb50a673c 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
 {
 	struct io_context *ioc;
 
-	might_sleep_if(gfp_flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
 
 	do {
 		task_lock(task);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9115c6d59948..f6020c624967 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 	if (tag != -1)
 		return tag;
 
-	if (!(data->gfp & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(data->gfp))
 		return -1;
 
 	bs = bt_wait_ptr(bt, hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f2d67b4047a0..7c322cea838f 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
 		if (percpu_ref_tryget_live(&q->mq_usage_counter))
 			return 0;
 
-		if (!(gfp & __GFP_WAIT))
+		if (!gfpflags_allow_blocking(gfp))
 			return -EBUSY;
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
@@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
+	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
 			reserved, ctx, hctx);
 
 	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	if (!rq && (gfp & __GFP_WAIT)) {
+	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
 		__blk_mq_run_hw_queue(hctx);
 		blk_mq_put_ctx(ctx);
 
@@ -1207,7 +1207,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q,
-				__GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
+				__GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 		hctx = alloc_data.hctx;
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index c097909c589c..b4b5680ac6ad 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -357,7 +357,8 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
 	}
 
 	if (has_payload && data_size) {
-		page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
+		page = drbd_alloc_pages(peer_device, nr_pages,
+					gfpflags_allow_blocking(gfp_mask));
 		if (!page)
 			goto fail;
 	}
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index e22942596207..1b709a4e3b5e 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
 			goto err_out;
 
 		tmp->bi_bdev = NULL;
-		gfpmask &= ~__GFP_WAIT;
+		gfpmask &= ~__GFP_DIRECT_RECLAIM;
 		tmp->bi_next = NULL;
 
 		if (!new_chain)
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 30f522848c73..d7373ca69c99 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
 	if (group)
 		return netlink_broadcast(dev->nls, skb, portid, group,
 					 gfp_mask);
-	return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
+	return netlink_unicast(dev->nls, skb, portid,
+			!gfpflags_allow_blocking(gfp_mask));
 }
 EXPORT_SYMBOL_GPL(cn_netlink_send_mult);
 
diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
index 2a3973a7c441..36a7c2d89a01 100644
--- a/drivers/firewire/core-cdev.c
+++ b/drivers/firewire/core-cdev.c
@@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
 static int add_client_resource(struct client *client,
 			       struct client_resource *resource, gfp_t gfp_mask)
 {
-	bool preload = !!(gfp_mask & __GFP_WAIT);
+	bool preload = gfpflags_allow_blocking(gfp_mask);
 	unsigned long flags;
 	int ret;
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 4d631a946481..d58cb9e034fe 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2215,7 +2215,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	 */
 	mapping = file_inode(obj->base.filp)->i_mapping;
 	gfp = mapping_gfp_mask(mapping);
-	gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
+	gfp |= __GFP_NORETRY | __GFP_NOWARN;
 	gfp &= ~(__GFP_IO | __GFP_WAIT);
 	sg = st->sgl;
 	st->nents = 0;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 8c014b33d8e0..59ab264c99c4 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -1083,7 +1083,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
 
 static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
-	bool preload = !!(gfp_mask & __GFP_WAIT);
+	bool preload = gfpflags_allow_blocking(gfp_mask);
 	unsigned long flags;
 	int ret, id;
 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index f82060e778a2..1c0006e1ba4a 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
 
 	page = alloc_pages(flag | __GFP_NOWARN,  get_order(size));
 	if (!page) {
-		if (!(flag & __GFP_WAIT))
+		if (!gfpflags_allow_blocking(flag))
 			return NULL;
 
 		page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 2d7349a3ee14..ecdafbe81a5e 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3533,7 +3533,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
 			flags |= GFP_DMA32;
 	}
 
-	if (flags & __GFP_WAIT) {
+	if (gfpflags_allow_blocking(flags)) {
 		unsigned int count = size >> PAGE_SHIFT;
 
 		page = dma_alloc_from_contiguous(dev, count, order);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index d60c88df5234..55ec935de2b4 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 	struct bio_vec *bvec;
 
 retry:
-	if (unlikely(gfp_mask & __GFP_WAIT))
+	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
 		mutex_lock(&cc->bio_alloc_lock);
 
 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
@@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 		if (!page) {
 			crypt_free_buffer_pages(cc, clone);
 			bio_put(clone);
-			gfp_mask |= __GFP_WAIT;
+			gfp_mask |= __GFP_DIRECT_RECLAIM;
 			goto retry;
 		}
 
@@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 	}
 
 return_clone:
-	if (unlikely(gfp_mask & __GFP_WAIT))
+	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
 		mutex_unlock(&cc->bio_alloc_lock);
 
 	return clone;
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade5e27d..1452ed9aacb4 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -244,7 +244,7 @@ static int kcopyd_get_pages(struct dm_kcopyd_client *kc,
 	*pages = NULL;
 
 	do {
-		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY);
+		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
 		if (unlikely(!pl)) {
 			/* Use reserved pages */
 			pl = kc->pages;
diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
index 53fff5425c13..fb2cb4bdc0c1 100644
--- a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
+++ b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
@@ -1291,7 +1291,7 @@ static struct solo_enc_dev *solo_enc_alloc(struct solo_dev *solo_dev,
 	solo_enc->vidq.ops = &solo_enc_video_qops;
 	solo_enc->vidq.mem_ops = &vb2_dma_sg_memops;
 	solo_enc->vidq.drv_priv = solo_enc;
-	solo_enc->vidq.gfp_flags = __GFP_DMA32;
+	solo_enc->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
 	solo_enc->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
 	solo_enc->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
 	solo_enc->vidq.lock = &solo_enc->lock;
diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2.c b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
index 63ae8a61f603..bde77b22340c 100644
--- a/drivers/media/pci/solo6x10/solo6x10-v4l2.c
+++ b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
@@ -675,7 +675,7 @@ int solo_v4l2_init(struct solo_dev *solo_dev, unsigned nr)
 	solo_dev->vidq.mem_ops = &vb2_dma_contig_memops;
 	solo_dev->vidq.drv_priv = solo_dev;
 	solo_dev->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
-	solo_dev->vidq.gfp_flags = __GFP_DMA32;
+	solo_dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
 	solo_dev->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
 	solo_dev->vidq.lock = &solo_dev->lock;
 	ret = vb2_queue_init(&solo_dev->vidq);
diff --git a/drivers/media/pci/tw68/tw68-video.c b/drivers/media/pci/tw68/tw68-video.c
index 8355e55b4e8e..e556f989aaab 100644
--- a/drivers/media/pci/tw68/tw68-video.c
+++ b/drivers/media/pci/tw68/tw68-video.c
@@ -975,7 +975,7 @@ int tw68_video_init2(struct tw68_dev *dev, int video_nr)
 	dev->vidq.ops = &tw68_video_qops;
 	dev->vidq.mem_ops = &vb2_dma_sg_memops;
 	dev->vidq.drv_priv = dev;
-	dev->vidq.gfp_flags = __GFP_DMA32;
+	dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
 	dev->vidq.buf_struct_size = sizeof(struct tw68_buf);
 	dev->vidq.lock = &dev->lock;
 	dev->vidq.min_buffers_needed = 2;
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 8bbbb751bf45..2dfb291a47c6 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
  */
 void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
 {
-	gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
-		       __GFP_NORETRY | __GFP_NO_KSWAPD;
+	gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
 	size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
 	void *kbuf;
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 44173be5cbf0..f8d7a2f06950 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -691,7 +691,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
 {
 	if (fp->rx_frag_size) {
 		/* GFP_KERNEL allocations are used only during initialization */
-		if (unlikely(gfp_mask & __GFP_WAIT))
+		if (unlikely(gfpflags_allow_blocking(gfp_mask)))
 			return (void *)__get_free_page(gfp_mask);
 
 		return netdev_alloc_frag(fp->rx_frag_size);
diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
index 7a7a9a047230..d4cdbf28dbb6 100644
--- a/drivers/staging/android/ion/ion_system_heap.c
+++ b/drivers/staging/android/ion/ion_system_heap.c
@@ -27,7 +27,7 @@
 #include "ion_priv.h"
 
 static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
-				     __GFP_NORETRY) & ~__GFP_WAIT;
+				     __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
 static gfp_t low_order_gfp_flags  = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
 static const unsigned int orders[] = {8, 4, 0};
 static const int num_orders = ARRAY_SIZE(orders);
diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
index 9544860e3292..78bde2c11b50 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
@@ -95,7 +95,7 @@ do {								    \
 do {									    \
 	LASSERT(!in_interrupt() ||					    \
 		((size) <= LIBCFS_VMALLOC_SIZE &&			    \
-		 ((mask) & __GFP_WAIT) == 0));				    \
+		 !gfpflags_allow_blocking(mask)));			    \
 } while (0)
 
 #define LIBCFS_ALLOC_POST(ptr, size)					    \
diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
index a67bd5090330..67b3b9d9dfd1 100644
--- a/drivers/usb/host/u132-hcd.c
+++ b/drivers/usb/host/u132-hcd.c
@@ -2244,7 +2244,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
 {
 	struct u132 *u132 = hcd_to_u132(hcd);
 	if (irqs_disabled()) {
-		if (__GFP_WAIT & mem_flags) {
+		if (gfpflags_allow_blocking(mem_flags)) {
 			printk(KERN_ERR "invalid context for function that migh"
 				"t sleep\n");
 			return -EINVAL;
diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c
index 6b70d7f62b2f..1c1e95a0b8fa 100644
--- a/drivers/video/fbdev/vermilion/vermilion.c
+++ b/drivers/video/fbdev/vermilion/vermilion.c
@@ -99,7 +99,7 @@ static int vmlfb_alloc_vram_area(struct vram_area *va, unsigned max_order,
 		 * below the first 16MB.
 		 */
 
-		flags = __GFP_DMA | __GFP_HIGH;
+		flags = __GFP_DMA | __GFP_HIGH | __GFP_KSWAPD_RECLAIM;
 		va->logical =
 			 __get_free_pages(flags, --max_order);
 	} while (va->logical == 0 && max_order > min_order);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0d98aee34fee..5632ba60c8f5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2572,7 +2572,7 @@ int open_ctree(struct super_block *sb,
 	fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 	fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
 	/* readahead state */
-	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
+	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	spin_lock_init(&fs_info->reada_lock);
 
 	fs_info->thread_pool_size = min_t(unsigned long,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f1018cfbfefa..7956b310c194 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
 		clear = 1;
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
+	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		/*
 		 * Don't care for allocation failure here because we might end
 		 * up not needing the pre-allocated extent state at all, which
@@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(mask))
 		cond_resched();
 	goto again;
 }
@@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 
 	bits |= EXTENT_FIRST_DELALLOC;
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
+	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		prealloc = alloc_extent_state(mask);
 		BUG_ON(!prealloc);
 	}
@@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(mask))
 		cond_resched();
 	goto again;
 }
@@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	btrfs_debug_check_extent_io_range(tree, start, end);
 
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
+	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		/*
 		 * Best effort, don't worry if extent state allocation fails
 		 * here for the first iteration. We might have a cached state
@@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(mask))
 		cond_resched();
 	first_iteration = false;
 	goto again;
@@ -4267,7 +4267,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_CACHE_SIZE - 1;
 
-	if ((mask & __GFP_WAIT) &&
+	if (gfpflags_allow_blocking(mask) &&
 	    page->mapping->host->i_size > 16 * 1024 * 1024) {
 		u64 len;
 		while (start <= end) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6fc735869c18..e023919b4470 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
 	spin_lock_init(&dev->reada_lock);
 	atomic_set(&dev->reada_in_flight, 0);
 	atomic_set(&dev->dev_stats_ccnt, 0);
-	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
-	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
+	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 
 	return dev;
 }
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a63c7b0a10cf..49f6c78ee3af 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1058,7 +1058,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 		return 0;
 	if (journal)
 		return jbd2_journal_try_to_free_buffers(journal, page,
-							wait & ~__GFP_WAIT);
+						wait & ~__GFP_DIRECT_RECLAIM);
 	return try_to_free_buffers(page);
 }
 
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index d403c69bee08..4304072161aa 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(
 
 	/* radix tree insertion won't use the preallocation pool unless it's
 	 * told it may not wait */
-	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
+	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 
 	switch (cookie->def->type) {
 	case FSCACHE_COOKIE_TYPE_INDEX:
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 483bbc613bf0..79483b3d8c6f 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)
 
 /*
  * decide whether a page can be released, possibly by cancelling a store to it
- * - we're allowed to sleep if __GFP_WAIT is flagged
+ * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
  */
 bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
 				  struct page *page,
@@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
 	 * allocator as the work threads writing to the cache may all end up
 	 * sleeping on memory allocation, so we may need to impose a timeout
 	 * too. */
-	if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
+	if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
 		fscache_stat(&fscache_n_store_vmscan_busy);
 		return false;
 	}
@@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
 		_debug("fscache writeout timeout page: %p{%lx}",
 			page, page->index);
 
-	gfp &= ~__GFP_WAIT;
+	gfp &= ~__GFP_DIRECT_RECLAIM;
 	goto try_again;
 }
 EXPORT_SYMBOL(__fscache_maybe_release_page);
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 6b8338ec2464..89463eee6791 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1937,8 +1937,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
  * @journal: journal for operation
  * @page: to try and free
  * @gfp_mask: we use the mask to detect how hard should we try to release
- * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
- * release the buffers.
+ * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS are set, we wait for commit
+ * code to release the buffers.
  *
  *
  * For all the buffers on this page,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index c0f9b1ed12b9..17d3417c8a74 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -473,8 +473,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
 	/* Always try to initiate a 'commit' if relevant, but only
-	 * wait for it if __GFP_WAIT is set.  Even then, only wait 1
-	 * second and only if the 'bdi' is not congested.
+	 * wait for it if the caller allows blocking.  Even then,
+	 * only wait 1 second and only if the 'bdi' is not congested.
 	 * Waiting indefinitely can cause deadlocks when the NFS
 	 * server is on this machine, when a new TCP connection is
 	 * needed and in other rare cases.  There is no particular
@@ -484,7 +484,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	if (mapping) {
 		struct nfs_server *nfss = NFS_SERVER(mapping->host);
 		nfs_commit_inode(mapping->host, 0);
-		if ((gfp & __GFP_WAIT) &&
+		if (gfpflags_allow_blocking(gfp) &&
 		    !bdi_write_congested(&nfss->backing_dev_info)) {
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index eac9549efd52..587174fd4f2c 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
 	unsigned long		freed;
 	int			error;
 
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
 		return 0;
 
 	INIT_LIST_HEAD(&isol.buffers);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 440fca3e7e5d..b56e811b6f7c 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -29,12 +29,13 @@ struct vm_area_struct;
 #define ___GFP_NOMEMALLOC	0x10000u
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
-#define ___GFP_WAIT		0x80000u
+#define ___GFP_ATOMIC		0x80000u
 #define ___GFP_NOACCOUNT	0x100000u
 #define ___GFP_NOTRACK		0x200000u
-#define ___GFP_NO_KSWAPD	0x400000u
+#define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_KSWAPD_RECLAIM	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -71,7 +72,7 @@ struct vm_area_struct;
  * __GFP_MOVABLE: Flag that this page will be movable by the page migration
  * mechanism or reclaimed
  */
-#define __GFP_WAIT	((__force gfp_t)___GFP_WAIT)	/* Can wait and reschedule? */
+#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
 #define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
@@ -94,23 +95,37 @@ struct vm_area_struct;
 #define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
-#define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 /*
+ * A caller that is willing to wait may enter direct reclaim and will
+ * wake kswapd to reclaim pages in the background until the high
+ * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
+ * avoid unnecessary delays when a fallback option is available but
+ * still allow kswapd to reclaim in the background. The kswapd flag
+ * can be cleared when the reclaiming of pages would cause unnecessary
+ * disruption.
+ */
+#define __GFP_WAIT ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
+#define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
+#define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
+
+/*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
-/* This equals 0, but use constants in case they ever change */
-#define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
-/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
-#define GFP_ATOMIC	(__GFP_HIGH)
+/*
+ * GFP_ATOMIC callers cannot sleep and need the allocation to succeed.
+ * A lower watermark is applied to allow access to "atomic reserves"
+ */
+#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
@@ -119,10 +134,10 @@ struct vm_area_struct;
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_IOFS	(__GFP_IO | __GFP_FS)
-#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
-			 __GFP_NO_KSWAPD)
+#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
+#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
+			 ~__GFP_KSWAPD_RECLAIM)
 
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -164,6 +179,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
 }
 
+static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
+{
+	return gfp_flags & __GFP_DIRECT_RECLAIM;
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2738d355cdf9..6f1f5a813554 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1215,7 +1215,7 @@ static inline int skb_cloned(const struct sk_buff *skb)
 
 static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
 {
-	might_sleep_if(pri & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(pri));
 
 	if (skb_cloned(skb))
 		return pskb_expand_head(skb, 0, 0, pri);
@@ -1299,7 +1299,7 @@ static inline int skb_shared(const struct sk_buff *skb)
  */
 static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
 {
-	might_sleep_if(pri & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(pri));
 	if (skb_shared(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, pri);
 
@@ -1335,7 +1335,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
 static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
 					  gfp_t pri)
 {
-	might_sleep_if(pri & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(pri));
 	if (skb_cloned(skb)) {
 		struct sk_buff *nskb = skb_copy(skb, pri);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index 7aa78440559a..e822cdf8b855 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2020,7 +2020,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (sk->sk_allocation & __GFP_WAIT)
+	if (gfpflags_allow_blocking(sk->sk_allocation))
 		return &current->task_frag;
 
 	return &sk->sk_frag;
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index d6fd8e5b14b7..dde6bf092c8a 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -20,7 +20,7 @@
 	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
 	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
 	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_ATOMIC,		"GFP_ATOMIC"},		\
 	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
 	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
 	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
@@ -36,7 +36,8 @@
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
 	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
 	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
-	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
+	{(unsigned long)__GFP_DIRECT_RECLAIM,	"GFP_DIRECT_RECLAIM"},	\
+	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"GFP_KSWAPD_RECLAIM"},	\
 	{(unsigned long)__GFP_OTHER_NODE,	"GFP_OTHER_NODE"}	\
 	) : "GFP_NOWAIT"
 
diff --git a/kernel/audit.c b/kernel/audit.c
index 662c007635fb..6ae6e2b62e3e 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
 	if (unlikely(audit_filter_type(type)))
 		return NULL;
 
-	if (gfp_mask & __GFP_WAIT) {
+	if (gfp_mask & __GFP_DIRECT_RECLAIM) {
 		if (audit_pid && audit_pid == current->pid)
-			gfp_mask &= ~__GFP_WAIT;
+			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 		else
 			reserve = 0;
 	}
 
 	while (audit_backlog_limit
 	       && skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
-		if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
+		if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
 			long sleep_time;
 
 			sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2cf0f79f1fc9..e843dffa7b87 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -211,7 +211,7 @@ static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
 
 	idr_preload(gfp_mask);
 	spin_lock_bh(&cgroup_idr_lock);
-	ret = idr_alloc(idr, ptr, start, end, gfp_mask & ~__GFP_WAIT);
+	ret = idr_alloc(idr, ptr, start, end, gfp_mask & ~__GFP_DIRECT_RECLAIM);
 	spin_unlock_bh(&cgroup_idr_lock);
 	idr_preload_end();
 	return ret;
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 8acfbf773e06..9aa39f20f593 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* no reclaim without waiting on it */
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;
 
 	/* this guy won't enter reclaim */
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 5235dd4e1e2f..3a970604308f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1779,7 +1779,7 @@ alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
 	while (to_alloc-- > 0) {
 		struct page *page;
 
-		page = alloc_image_page(__GFP_HIGHMEM);
+		page = alloc_image_page(__GFP_HIGHMEM|__GFP_KSWAPD_RECLAIM);
 		memory_bm_set_bit(bm, page_to_pfn(page));
 	}
 	return nr_highmem;
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..d903c02223af 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	cpumask_var_t cpus;
 	int cpu, ret;
 
-	might_sleep_if(gfp_flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
 
 	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
 		preempt_disable();
diff --git a/lib/idr.c b/lib/idr.c
index 5335c43adf46..6098336df267 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
 	 * allocation guarantee.  Disallow usage from those contexts.
 	 */
 	WARN_ON_ONCE(in_interrupt());
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
 
 	preempt_disable();
 
@@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
 	struct idr_layer *pa[MAX_IDR_LEVEL + 1];
 	int id;
 
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
 
 	/* sanity checks */
 	if (WARN_ON_ONCE(start < 0))
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f9ebe1c82060..fcf5d98574ce 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
 	 * preloading in the interrupt anyway as all the allocations have to
 	 * be atomic. So just do normal allocation when in interrupt.
 	 */
-	if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
+	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
 		struct radix_tree_preload *rtp;
 
 		/*
@@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
- * __GFP_WAIT being passed to INIT_RADIX_TREE().
+ * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
 static int __radix_tree_preload(gfp_t gfp_mask)
 {
@@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
- * __GFP_WAIT being passed to INIT_RADIX_TREE().
+ * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
 int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
-	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
+	WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
 	return __radix_tree_preload(gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
@@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
  */
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
-	if (gfp_mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(gfp_mask))
 		return __radix_tree_preload(gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 2df8ddcb0ca0..e7781eb35fd1 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 {
 	struct bdi_writeback *wb;
 
-	might_sleep_if(gfp & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp));
 
 	if (!memcg_css->parent)
 		return &bdi->wb;
diff --git a/mm/dmapool.c b/mm/dmapool.c
index 71a8998cd03a..55b53cffd9f6 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -326,7 +326,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	size_t offset;
 	void *retval;
 
-	might_sleep_if(mem_flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(mem_flags));
 
 	spin_lock_irqsave(&pool->lock, flags);
 	list_for_each_entry(page, &pool->page_list, page_list) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ddaeba34e09..2c65980c0a00 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2012,7 +2012,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
 	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
@@ -2071,7 +2071,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(gfp_mask))
 		goto done;
 	/*
 	 * If the hierarchy is above the normal consumption range,
@@ -4396,8 +4396,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
 {
 	int ret;
 
-	/* Try a single bulk charge without reclaim first */
-	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
+	/* Try a single bulk charge without reclaim first, kswapd may wake */
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
 	if (!ret) {
 		mc.precharge += count;
 		return ret;
diff --git a/mm/mempool.c b/mm/mempool.c
index 4c533bc51d73..004d42b1dfaf 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -320,13 +320,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	gfp_t gfp_temp;
 
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
-	gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
+	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
 
@@ -349,7 +349,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	}
 
 	/*
-	 * We use gfp mask w/o __GFP_WAIT or IO for the first round.  If
+	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
 	if (gfp_temp != gfp_mask) {
@@ -358,8 +358,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 		goto repeat_alloc;
 	}
 
-	/* We must not sleep if !__GFP_WAIT */
-	if (!(gfp_mask & __GFP_WAIT)) {
+	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		return NULL;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index c3cb566af3e2..a1c82b65dcad 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1565,7 +1565,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					 (GFP_HIGHUSER_MOVABLE |
 					  __GFP_THISNODE | __GFP_NOMEMALLOC |
 					  __GFP_NORETRY | __GFP_NOWARN) &
-					 ~GFP_IOFS, 0);
+					 ~(__GFP_IO | __GFP_FS), 0);
 
 	return newpage;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4793bddb6b2a..b32081b02c49 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -169,12 +169,12 @@ void pm_restrict_gfp_mask(void)
 	WARN_ON(!mutex_is_locked(&pm_mutex));
 	WARN_ON(saved_gfp_mask);
 	saved_gfp_mask = gfp_allowed_mask;
-	gfp_allowed_mask &= ~GFP_IOFS;
+	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
 }
 
 bool pm_suspended_storage(void)
 {
-	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
+	if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		return false;
 	return true;
 }
@@ -2183,7 +2183,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 		return false;
 	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
 		return false;
-	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
+	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_DIRECT_RECLAIM))
 		return false;
 
 	return should_fail(&fail_page_alloc.attr, 1 << order);
@@ -2685,7 +2685,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		if (test_thread_flag(TIF_MEMDIE) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
-	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
+	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
 		filter &= ~SHOW_MEM_FILTER_NODES;
 
 	if (fmt) {
@@ -2945,7 +2945,6 @@ static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2954,11 +2953,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (atomic) {
+	if (gfp_mask & __GFP_ATOMIC) {
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
@@ -2995,11 +2994,16 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
+{
+	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
 	unsigned long pages_reclaimed = 0;
@@ -3020,15 +3024,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
+	 * We also sanity check to catch abuse of atomic reserves being used by
+	 * callers that are not in atomic context.
+	 */
+	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
+				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
+		gfp_mask &= ~__GFP_ATOMIC;
+
+	/*
 	 * If this allocation cannot block and it is for a specific node, then
 	 * fail early.  There's no need to wakeup kswapd or retry for a
 	 * speculative node-specific allocation.
 	 */
-	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
+	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
 		goto nopage;
 
 retry:
-	if (!(gfp_mask & __GFP_NO_KSWAPD))
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
 	/*
@@ -3071,8 +3083,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		}
 	}
 
-	/* Atomic allocations - we can't balance anything */
-	if (!wait) {
+	/* Caller is not willing to reclaim, we can't balance anything */
+	if (!can_direct_reclaim) {
 		/*
 		 * All existing users of the deprecated __GFP_NOFAIL are
 		 * blockable, so warn of any new users that actually allow this
@@ -3102,7 +3114,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Checks for THP-specific high-order allocations */
-	if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
+	if (is_thp_gfp_mask(gfp_mask)) {
 		/*
 		 * If compaction is deferred for high-order allocations, it is
 		 * because sync compaction recently failed. If this is the case
@@ -3137,8 +3149,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * fault, so use asynchronous memory compaction for THP unless it is
 	 * khugepaged trying to collapse.
 	 */
-	if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
-						(current->flags & PF_KTHREAD))
+	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
 		migration_mode = MIGRATE_SYNC_LIGHT;
 
 	/* Try direct reclaim and then allocating */
@@ -3209,7 +3220,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	lockdep_trace_alloc(gfp_mask);
 
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
diff --git a/mm/slab.c b/mm/slab.c
index c77ebe6cc87c..3ff59926bf19 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 }
 
 /*
- * Construct gfp mask to allocate from a specific node but do not invoke reclaim
- * or warn about failures.
+ * Construct gfp mask to allocate from a specific node but do not direct reclaim
+ * or warn about failures. kswapd may still wake to reclaim in the background.
  */
 static inline gfp_t gfp_exact_node(gfp_t flags)
 {
-	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
 }
 #endif
 
@@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	offset *= cachep->colour_off;
 
-	if (local_flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(local_flags))
 		local_irq_enable();
 
 	/*
@@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	cache_init_objs(cachep, page);
 
-	if (local_flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 	check_irq_off();
 	spin_lock(&n->list_lock);
@@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
 opps1:
 	kmem_freepages(cachep, page);
 failed:
-	if (local_flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 	return 0;
 }
@@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
 						gfp_t flags)
 {
-	might_sleep_if(flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(flags));
 #if DEBUG
 	kmem_flagcheck(cachep, flags);
 #endif
@@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 		 */
 		struct page *page;
 
-		if (local_flags & __GFP_WAIT)
+		if (gfpflags_allow_blocking(local_flags))
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
 		page = kmem_getpages(cache, local_flags, numa_mem_id());
-		if (local_flags & __GFP_WAIT)
+		if (gfpflags_allow_blocking(local_flags))
 			local_irq_disable();
 		if (page) {
 			/*
diff --git a/mm/slub.c b/mm/slub.c
index f614b5dc396b..2cdbf5db348e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 {
 	flags &= gfp_allowed_mask;
 	lockdep_trace_alloc(flags);
-	might_sleep_if(flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(flags));
 
 	if (should_failslab(s->object_size, flags, s->flags))
 		return NULL;
@@ -1352,7 +1352,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 
 	flags &= gfp_allowed_mask;
 
-	if (flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(flags))
 		local_irq_enable();
 
 	flags |= s->allocflags;
@@ -1362,8 +1362,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * so we fall-back to the minimum order allocation.
 	 */
 	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
-	if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
-		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
+	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
+		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_DIRECT_RECLAIM;
 
 	page = alloc_slab_page(s, alloc_gfp, node, oo);
 	if (unlikely(!page)) {
@@ -1423,7 +1423,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	page->frozen = 1;
 
 out:
-	if (flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(flags))
 		local_irq_disable();
 	if (!page)
 		return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2faaa2976447..9ad4dcb0631c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			goto fail;
 		}
 		area->pages[i] = page;
-		if (gfp_mask & __GFP_WAIT)
+		if (gfpflags_allow_blocking(gfp_mask))
 			cond_resched();
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b2786fd42b5..30a87ac1af80 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1476,7 +1476,7 @@ static int too_many_isolated(struct zone *zone, int file,
 	 * won't get blocked by normal direct-reclaimers, forming a circular
 	 * deadlock.
 	 */
-	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
 	return isolated > inactive;
@@ -3794,7 +3794,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
-	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
+	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
 		return ZONE_RECLAIM_NOSCAN;
 
 	/*
diff --git a/mm/zswap.c b/mm/zswap.c
index 4043df7c672f..e54166d3732e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -571,7 +571,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
 static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 {
 	struct zswap_pool *pool;
-	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
+	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
 
 	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
 	if (!pool) {
@@ -1011,7 +1011,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 	/* store */
 	len = dlen + sizeof(struct zswap_header);
 	ret = zpool_malloc(entry->pool->zpool, len,
-			   __GFP_NORETRY | __GFP_NOWARN, &handle);
+			   __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM,
+			   &handle);
 	if (ret == -ENOSPC) {
 		zswap_reject_compress_poor++;
 		goto put_dstmem;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index dad4dd37e2aa..905bae96a742 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	len += NET_SKB_PAD;
 
 	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
 	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -4451,7 +4451,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 		return NULL;
 
 	gfp_head = gfp_mask;
-	if (gfp_head & __GFP_WAIT)
+	if (gfp_head & __GFP_DIRECT_RECLAIM)
 		gfp_head |= __GFP_REPEAT;
 
 	*errcode = -ENOBUFS;
@@ -4466,7 +4466,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 		while (order) {
 			if (npages >= 1 << order) {
-				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
+				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
 						   __GFP_COMP |
 						   __GFP_NOWARN |
 						   __GFP_NORETRY,
diff --git a/net/core/sock.c b/net/core/sock.c
index ca2984afe16e..4a61a0add949 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 
 	pfrag->offset = 0;
 	if (SKB_FRAG_PAGE_ORDER) {
-		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
-					  __GFP_NOWARN | __GFP_NORETRY,
+		/* Avoid direct reclaim but allow kswapd to wake */
+		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+					  __GFP_COMP | __GFP_NOWARN |
+					  __GFP_NORETRY,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 7f86d3b55060..173c0abe4094 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2084,7 +2084,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
 	consume_skb(info.skb2);
 
 	if (info.delivered) {
-		if (info.congested && (allocation & __GFP_WAIT))
+		if (info.congested && gfpflags_allow_blocking(allocation))
 			yield();
 		return 0;
 	}
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index f43831e4186a..dcfb59775acc 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -305,7 +305,7 @@ static int rds_ib_recv_refill_one(struct rds_connection *conn,
 	gfp_t slab_mask = GFP_NOWAIT;
 	gfp_t page_mask = GFP_NOWAIT;
 
-	if (gfp & __GFP_WAIT) {
+	if (gfp & __GFP_DIRECT_RECLAIM) {
 		slab_mask = GFP_KERNEL;
 		page_mask = GFP_HIGHUSER;
 	}
@@ -379,7 +379,7 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
 	struct ib_recv_wr *failed_wr;
 	unsigned int posted = 0;
 	int ret = 0;
-	bool can_wait = !!(gfp & __GFP_WAIT);
+	bool can_wait = !!(gfp & __GFP_DIRECT_RECLAIM);
 	u32 pos;
 
 	/* the goal here is to just make sure that someone, somewhere
diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
index 6631f4f1e39b..3b5de4b86058 100644
--- a/net/rxrpc/ar-connection.c
+++ b/net/rxrpc/ar-connection.c
@@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
 		if (bundle->num_conns >= 20) {
 			_debug("too many conns");
 
-			if (!(gfp & __GFP_WAIT)) {
+			if (!gfpflags_allow_blocking(gfp)) {
 				_leave(" = -EAGAIN");
 				return -EAGAIN;
 			}
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 197c3f59ecbf..75369ae8de1e 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
 /* Set an association id for a given association */
 int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
 {
-	bool preload = !!(gfp & __GFP_WAIT);
+	bool preload = gfpflags_allow_blocking(gfp);
 	int ret;
 
 	/* If the id is already assigned, keep it. */
-- 
2.4.6



* [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (4 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-25 19:03   ` Johannes Weiner
                     ` (2 more replies)
  2015-09-21 10:52 ` [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache Mel Gorman
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

The absence of __GFP_WAIT was used to signal that the caller was in atomic
context and could not sleep. Now it is possible to distinguish between true
atomic context and callers that are merely unwilling to sleep; the latter
should clear __GFP_DIRECT_RECLAIM so kswapd will still be woken. As
clearing __GFP_WAIT
behaves differently, there is a risk that people will clear the wrong
flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
what it does -- setting it allows all reclaim activity, clearing it
prevents it.
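
To illustrate the distinction (a sketch only, not part of the diff below;
it relies on the gfpflags_allow_blocking() helper added earlier in this
series), a caller that merely prefers not to block would do something like:

	gfp_t gfp = GFP_KERNEL;

	/* avoid blocking in direct reclaim but still allow kswapd to wake */
	gfp &= ~__GFP_DIRECT_RECLAIM;

	/* clearing __GFP_RECLAIM instead would prevent the kswapd wakeup too */
	if (!gfpflags_allow_blocking(gfp))
		pr_debug("allocation will not enter direct reclaim\n");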

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 block/blk-mq.c                              |  2 +-
 block/scsi_ioctl.c                          |  6 +++---
 drivers/block/drbd/drbd_bitmap.c            |  2 +-
 drivers/block/mtip32xx/mtip32xx.c           |  2 +-
 drivers/block/nvme-core.c                   |  4 ++--
 drivers/block/paride/pd.c                   |  2 +-
 drivers/block/pktcdvd.c                     |  4 ++--
 drivers/gpu/drm/i915/i915_gem.c             |  2 +-
 drivers/ide/ide-atapi.c                     |  2 +-
 drivers/ide/ide-cd.c                        |  2 +-
 drivers/ide/ide-cd_ioctl.c                  |  2 +-
 drivers/ide/ide-devsets.c                   |  2 +-
 drivers/ide/ide-disk.c                      |  2 +-
 drivers/ide/ide-ioctls.c                    |  4 ++--
 drivers/ide/ide-park.c                      |  2 +-
 drivers/ide/ide-pm.c                        |  4 ++--
 drivers/ide/ide-tape.c                      |  4 ++--
 drivers/ide/ide-taskfile.c                  |  4 ++--
 drivers/infiniband/hw/qib/qib_init.c        |  2 +-
 drivers/misc/vmw_balloon.c                  |  2 +-
 drivers/scsi/scsi_error.c                   |  2 +-
 drivers/scsi/scsi_lib.c                     |  4 ++--
 drivers/staging/rdma/hfi1/init.c            |  2 +-
 drivers/staging/rdma/ipath/ipath_file_ops.c |  2 +-
 fs/cachefiles/internal.h                    |  2 +-
 fs/direct-io.c                              |  2 +-
 fs/nilfs2/mdt.h                             |  2 +-
 include/linux/gfp.h                         | 16 ++++++++--------
 kernel/power/swap.c                         | 14 +++++++-------
 lib/percpu_ida.c                            |  2 +-
 mm/failslab.c                               |  8 ++++----
 mm/filemap.c                                |  2 +-
 mm/huge_memory.c                            |  2 +-
 mm/migrate.c                                |  2 +-
 mm/page_alloc.c                             |  6 +++---
 security/integrity/ima/ima_crypto.c         |  2 +-
 36 files changed, 63 insertions(+), 63 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7c322cea838f..b59c646f8e79 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1207,7 +1207,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q,
-				__GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
+				__GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 		hctx = alloc_data.hctx;
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index dda653ce7b24..0774799942e0 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -444,7 +444,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
 
 	}
 
-	rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_WAIT);
+	rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_RECLAIM);
 	if (IS_ERR(rq)) {
 		err = PTR_ERR(rq);
 		goto error_free_buffer;
@@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
 		break;
 	}
 
-	if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_WAIT)) {
+	if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_RECLAIM)) {
 		err = DRIVER_ERROR << 24;
 		goto error;
 	}
@@ -536,7 +536,7 @@ static int __blk_send_generic(struct request_queue *q, struct gendisk *bd_disk,
 	struct request *rq;
 	int err;
 
-	rq = blk_get_request(q, WRITE, __GFP_WAIT);
+	rq = blk_get_request(q, WRITE, __GFP_RECLAIM);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 	blk_rq_set_block_pc(rq);
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index e5e0f19ceda0..3dc53a16ed3a 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1007,7 +1007,7 @@ static void bm_page_io_async(struct drbd_bm_aio_ctx *ctx, int page_nr) __must_ho
 	bm_set_page_unchanged(b->bm_pages[page_nr]);
 
 	if (ctx->flags & BM_AIO_COPY_PAGES) {
-		page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_WAIT);
+		page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_RECLAIM);
 		copy_highpage(page, b->bm_pages[page_nr]);
 		bm_store_page_idx(page, page_nr);
 	} else
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index f504232c1ee7..a28a562f7b7f 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
 {
 	struct request *rq;
 
-	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_WAIT, true);
+	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
 	return blk_mq_rq_to_pdu(rq);
 }
 
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b97fc3fe0916..dc860e085e06 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1032,11 +1032,11 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	req->special = (void *)0;
 
 	if (buffer && bufflen) {
-		ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_WAIT);
+		ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_RECLAIM);
 		if (ret)
 			goto out;
 	} else if (ubuffer && bufflen) {
-		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_WAIT);
+		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_RECLAIM);
 		if (ret)
 			goto out;
 		bio = req->bio;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index b9242d78283d..562b5a4ca7b7 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -723,7 +723,7 @@ static int pd_special_command(struct pd_unit *disk,
 	struct request *rq;
 	int err = 0;
 
-	rq = blk_get_request(disk->gd->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(disk->gd->queue, READ, __GFP_RECLAIM);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 7be2375db7f2..5959c2981cc7 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -704,14 +704,14 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
 	int ret = 0;
 
 	rq = blk_get_request(q, (cgc->data_direction == CGC_DATA_WRITE) ?
-			     WRITE : READ, __GFP_WAIT);
+			     WRITE : READ, __GFP_RECLAIM);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 	blk_rq_set_block_pc(rq);
 
 	if (cgc->buflen) {
 		ret = blk_rq_map_kern(q, rq, cgc->buffer, cgc->buflen,
-				      __GFP_WAIT);
+				      __GFP_RECLAIM);
 		if (ret)
 			goto out;
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index d58cb9e034fe..7e505d4be7c0 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2216,7 +2216,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	mapping = file_inode(obj->base.filp)->i_mapping;
 	gfp = mapping_gfp_mask(mapping);
 	gfp |= __GFP_NORETRY | __GFP_NOWARN;
-	gfp &= ~(__GFP_IO | __GFP_WAIT);
+	gfp &= ~(__GFP_IO | __GFP_RECLAIM);
 	sg = st->sgl;
 	st->nents = 0;
 	for (i = 0; i < page_count; i++) {
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 1362ad80a76c..05352f490d60 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -92,7 +92,7 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk *disk,
 	struct request *rq;
 	int error;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->special = (char *)pc;
 
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 64a6b827b3dd..ef907fd5ba98 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -441,7 +441,7 @@ int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
 		struct request *rq;
 		int error;
 
-		rq = blk_get_request(drive->queue, write, __GFP_WAIT);
+		rq = blk_get_request(drive->queue, write, __GFP_RECLAIM);
 
 		memcpy(rq->cmd, cmd, BLK_MAX_CDB);
 		rq->cmd_type = REQ_TYPE_ATA_PC;
diff --git a/drivers/ide/ide-cd_ioctl.c b/drivers/ide/ide-cd_ioctl.c
index 066e39036518..474173eb31bb 100644
--- a/drivers/ide/ide-cd_ioctl.c
+++ b/drivers/ide/ide-cd_ioctl.c
@@ -303,7 +303,7 @@ int ide_cdrom_reset(struct cdrom_device_info *cdi)
 	struct request *rq;
 	int ret;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd_flags = REQ_QUIET;
 	ret = blk_execute_rq(drive->queue, cd->disk, rq, 0);
diff --git a/drivers/ide/ide-devsets.c b/drivers/ide/ide-devsets.c
index b05a74d78ef5..0dd43b4fcec6 100644
--- a/drivers/ide/ide-devsets.c
+++ b/drivers/ide/ide-devsets.c
@@ -165,7 +165,7 @@ int ide_devset_execute(ide_drive_t *drive, const struct ide_devset *setting,
 	if (!(setting->flags & DS_SYNC))
 		return setting->set(drive, arg);
 
-	rq = blk_get_request(q, READ, __GFP_WAIT);
+	rq = blk_get_request(q, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd_len = 5;
 	rq->cmd[0] = REQ_DEVSET_EXEC;
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 56b9708894a5..37a8a907febe 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -477,7 +477,7 @@ static int set_multcount(ide_drive_t *drive, int arg)
 	if (drive->special_flags & IDE_SFLAG_SET_MULTMODE)
 		return -EBUSY;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
 
 	drive->mult_req = arg;
diff --git a/drivers/ide/ide-ioctls.c b/drivers/ide/ide-ioctls.c
index aa2e9b77b20d..d05db2469209 100644
--- a/drivers/ide/ide-ioctls.c
+++ b/drivers/ide/ide-ioctls.c
@@ -125,7 +125,7 @@ static int ide_cmd_ioctl(ide_drive_t *drive, unsigned long arg)
 	if (NULL == (void *) arg) {
 		struct request *rq;
 
-		rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+		rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 		rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
 		err = blk_execute_rq(drive->queue, NULL, rq, 0);
 		blk_put_request(rq);
@@ -221,7 +221,7 @@ static int generic_drive_reset(ide_drive_t *drive)
 	struct request *rq;
 	int ret = 0;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd_len = 1;
 	rq->cmd[0] = REQ_DRIVE_RESET;
diff --git a/drivers/ide/ide-park.c b/drivers/ide/ide-park.c
index c80868520488..2d7dca56dd24 100644
--- a/drivers/ide/ide-park.c
+++ b/drivers/ide/ide-park.c
@@ -31,7 +31,7 @@ static void issue_park_cmd(ide_drive_t *drive, unsigned long timeout)
 	}
 	spin_unlock_irq(&hwif->lock);
 
-	rq = blk_get_request(q, READ, __GFP_WAIT);
+	rq = blk_get_request(q, READ, __GFP_RECLAIM);
 	rq->cmd[0] = REQ_PARK_HEADS;
 	rq->cmd_len = 1;
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 081e43458d50..e34af488693a 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -18,7 +18,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_PM_SUSPEND;
 	rq->special = &rqpm;
 	rqpm.pm_step = IDE_PM_START_SUSPEND;
@@ -88,7 +88,7 @@ int generic_ide_resume(struct device *dev)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_PM_RESUME;
 	rq->cmd_flags |= REQ_PREEMPT;
 	rq->special = &rqpm;
diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
index f5d51d1d09ee..12fa04997dcc 100644
--- a/drivers/ide/ide-tape.c
+++ b/drivers/ide/ide-tape.c
@@ -852,7 +852,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
 	BUG_ON(cmd != REQ_IDETAPE_READ && cmd != REQ_IDETAPE_WRITE);
 	BUG_ON(size < 0 || size % tape->blk_size);
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd[13] = cmd;
 	rq->rq_disk = tape->disk;
@@ -860,7 +860,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
 
 	if (size) {
 		ret = blk_rq_map_kern(drive->queue, rq, tape->buf, size,
-				      __GFP_WAIT);
+				      __GFP_RECLAIM);
 		if (ret)
 			goto out_put;
 	}
diff --git a/drivers/ide/ide-taskfile.c b/drivers/ide/ide-taskfile.c
index 0979e126fff1..a716693417a3 100644
--- a/drivers/ide/ide-taskfile.c
+++ b/drivers/ide/ide-taskfile.c
@@ -430,7 +430,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
 	int error;
 	int rw = !(cmd->tf_flags & IDE_TFLAG_WRITE) ? READ : WRITE;
 
-	rq = blk_get_request(drive->queue, rw, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, rw, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
 
 	/*
@@ -441,7 +441,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
 	 */
 	if (nsect) {
 		error = blk_rq_map_kern(drive->queue, rq, buf,
-					nsect * SECTOR_SIZE, __GFP_WAIT);
+					nsect * SECTOR_SIZE, __GFP_RECLAIM);
 		if (error)
 			goto put_req;
 	}
diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 7e00470adc30..4ff340fe904f 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1680,7 +1680,7 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
 	 * heavy filesystem activity makes these fail, and we can
 	 * use compound pages.
 	 */
-	gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+	gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
 
 	egrcnt = rcd->rcvegrcnt;
 	egroff = rcd->rcvegr_tid_base;
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index ffb56340d0c7..1b49e53463a2 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -85,7 +85,7 @@ MODULE_LICENSE("GPL");
 
 /*
  * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't
- * allow wait (__GFP_WAIT) for NOSLEEP page allocations. Use
+ * allow wait (__GFP_RECLAIM) for NOSLEEP page allocations. Use
  * __GFP_NOWARN, to suppress page allocation failure warnings.
  */
 #define VMW_PAGE_ALLOC_NOSLEEP		(__GFP_HIGHMEM|__GFP_NOWARN)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 66a96cd98b97..984ddcb4786d 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1970,7 +1970,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
 	struct request *req;
 
 	/*
-	 * blk_get_request with GFP_KERNEL (__GFP_WAIT) sleeps until a
+	 * blk_get_request with GFP_KERNEL (__GFP_RECLAIM) sleeps until a
 	 * request becomes available
 	 */
 	req = blk_get_request(sdev->request_queue, READ, GFP_KERNEL);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index cbfc5990052b..f570b48883e5 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -222,13 +222,13 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	int write = (data_direction == DMA_TO_DEVICE);
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);
+	req = blk_get_request(sdev->request_queue, write, __GFP_RECLAIM);
 	if (IS_ERR(req))
 		return ret;
 	blk_rq_set_block_pc(req);
 
 	if (bufflen &&	blk_rq_map_kern(sdev->request_queue, req,
-					buffer, bufflen, __GFP_WAIT))
+					buffer, bufflen, __GFP_RECLAIM))
 		goto out;
 
 	req->cmd_len = COMMAND_SIZE(cmd[0]);
diff --git a/drivers/staging/rdma/hfi1/init.c b/drivers/staging/rdma/hfi1/init.c
index a877eda8c13c..29fff7f2a45a 100644
--- a/drivers/staging/rdma/hfi1/init.c
+++ b/drivers/staging/rdma/hfi1/init.c
@@ -1564,7 +1564,7 @@ int hfi1_setup_eagerbufs(struct hfi1_ctxtdata *rcd)
 	 * heavy filesystem activity makes these fail, and we can
 	 * use compound pages.
 	 */
-	gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+	gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
 
 	/*
 	 * The minimum size of the eager buffers is a groups of MTU-sized
diff --git a/drivers/staging/rdma/ipath/ipath_file_ops.c b/drivers/staging/rdma/ipath/ipath_file_ops.c
index 450d15965005..c11f6c58ce53 100644
--- a/drivers/staging/rdma/ipath/ipath_file_ops.c
+++ b/drivers/staging/rdma/ipath/ipath_file_ops.c
@@ -905,7 +905,7 @@ static int ipath_create_user_egr(struct ipath_portdata *pd)
 	 * heavy filesystem activity makes these fail, and we can
 	 * use compound pages.
 	 */
-	gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+	gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
 
 	egrcnt = dd->ipath_rcvegrcnt;
 	/* TID number offset for this port */
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index aecd0859eacb..9c4b737a54df 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -30,7 +30,7 @@ extern unsigned cachefiles_debug;
 #define CACHEFILES_DEBUG_KLEAVE	2
 #define CACHEFILES_DEBUG_KDEBUG	4
 
-#define cachefiles_gfp (__GFP_WAIT | __GFP_NORETRY | __GFP_NOMEMALLOC)
+#define cachefiles_gfp (__GFP_RECLAIM | __GFP_NORETRY | __GFP_NOMEMALLOC)
 
 /*
  * node records
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 11256291642e..dbb94a2d6c50 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -360,7 +360,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 
 	/*
 	 * bio_alloc() is guaranteed to return a bio when called with
-	 * __GFP_WAIT and we request a valid number of vectors.
+	 * __GFP_RECLAIM and we request a valid number of vectors.
 	 */
 	bio = bio_alloc(GFP_KERNEL, nr_vecs);
 
diff --git a/fs/nilfs2/mdt.h b/fs/nilfs2/mdt.h
index fe529a87a208..03246cac3338 100644
--- a/fs/nilfs2/mdt.h
+++ b/fs/nilfs2/mdt.h
@@ -72,7 +72,7 @@ static inline struct nilfs_mdt_info *NILFS_MDT(const struct inode *inode)
 }
 
 /* Default GFP flags using highmem */
-#define NILFS_MDT_GFP      (__GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
+#define NILFS_MDT_GFP      (__GFP_RECLAIM | __GFP_IO | __GFP_HIGHMEM)
 
 int nilfs_mdt_get_block(struct inode *, unsigned long, int,
 			void (*init_block)(struct inode *,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index b56e811b6f7c..60b2db94d49d 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -107,7 +107,7 @@ struct vm_area_struct;
  * can be cleared when the reclaiming of pages would cause unnecessary
  * disruption.
  */
-#define __GFP_WAIT ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
+#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 
@@ -126,12 +126,12 @@ struct vm_area_struct;
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
-#define GFP_NOIO	(__GFP_WAIT)
-#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
-#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_TEMPORARY	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
+#define GFP_NOIO	(__GFP_RECLAIM)
+#define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
+#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+#define GFP_TEMPORARY	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
 			 __GFP_RECLAIMABLE)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
@@ -144,12 +144,12 @@ struct vm_area_struct;
 #define GFP_MOVABLE_SHIFT 3
 
 /* Control page allocator reclaim behavior */
-#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
+#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
-#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
+#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index b2066fb5b10f..53d95673d88a 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -257,7 +257,7 @@ static int hib_submit_io(int rw, pgoff_t page_off, void *addr,
 	struct bio *bio;
 	int error = 0;
 
-	bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
+	bio = bio_alloc(__GFP_RECLAIM | __GFP_HIGH, 1);
 	bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
 	bio->bi_bdev = hib_resume_bdev;
 
@@ -356,7 +356,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
 		return -ENOSPC;
 
 	if (hb) {
-		src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
+		src = (void *)__get_free_page(__GFP_RECLAIM | __GFP_NOWARN |
 		                              __GFP_NORETRY);
 		if (src) {
 			copy_page(src, buf);
@@ -364,7 +364,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
 			ret = hib_wait_io(hb); /* Free pages */
 			if (ret)
 				return ret;
-			src = (void *)__get_free_page(__GFP_WAIT |
+			src = (void *)__get_free_page(__GFP_RECLAIM |
 			                              __GFP_NOWARN |
 			                              __GFP_NORETRY);
 			if (src) {
@@ -672,7 +672,7 @@ static int save_image_lzo(struct swap_map_handle *handle,
 	nr_threads = num_online_cpus() - 1;
 	nr_threads = clamp_val(nr_threads, 1, LZO_THREADS);
 
-	page = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH);
+	page = (void *)__get_free_page(__GFP_RECLAIM | __GFP_HIGH);
 	if (!page) {
 		printk(KERN_ERR "PM: Failed to allocate LZO page\n");
 		ret = -ENOMEM;
@@ -975,7 +975,7 @@ static int get_swap_reader(struct swap_map_handle *handle,
 		last = tmp;
 
 		tmp->map = (struct swap_map_page *)
-		           __get_free_page(__GFP_WAIT | __GFP_HIGH);
+		           __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
 		if (!tmp->map) {
 			release_swap_reader(handle);
 			return -ENOMEM;
@@ -1242,8 +1242,8 @@ static int load_image_lzo(struct swap_map_handle *handle,
 
 	for (i = 0; i < read_pages; i++) {
 		page[i] = (void *)__get_free_page(i < LZO_CMP_PAGES ?
-		                                  __GFP_WAIT | __GFP_HIGH :
-		                                  __GFP_WAIT | __GFP_NOWARN |
+		                                  __GFP_RECLAIM | __GFP_HIGH :
+		                                  __GFP_RECLAIM | __GFP_NOWARN |
 		                                  __GFP_NORETRY);
 
 		if (!page[i]) {
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index f75715131f20..6d40944960de 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -135,7 +135,7 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  * TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
  *
  * @gfp indicates whether or not to wait until a free id is available (it's not
- * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
+ * used for internal memory allocations); thus if passed __GFP_RECLAIM we may sleep
  * however long it takes until another thread frees an id (same semantics as a
  * mempool).
  *
diff --git a/mm/failslab.c b/mm/failslab.c
index fefaabaab76d..69f083146a37 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -3,11 +3,11 @@
 
 static struct {
 	struct fault_attr attr;
-	u32 ignore_gfp_wait;
+	u32 ignore_gfp_reclaim;
 	int cache_filter;
 } failslab = {
 	.attr = FAULT_ATTR_INITIALIZER,
-	.ignore_gfp_wait = 1,
+	.ignore_gfp_reclaim = 1,
 	.cache_filter = 0,
 };
 
@@ -16,7 +16,7 @@ bool should_failslab(size_t size, gfp_t gfpflags, unsigned long cache_flags)
 	if (gfpflags & __GFP_NOFAIL)
 		return false;
 
-        if (failslab.ignore_gfp_wait && (gfpflags & __GFP_WAIT))
+        if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
 		return false;
 
 	if (failslab.cache_filter && !(cache_flags & SLAB_FAILSLAB))
@@ -42,7 +42,7 @@ static int __init failslab_debugfs_init(void)
 		return PTR_ERR(dir);
 
 	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
-				&failslab.ignore_gfp_wait))
+				&failslab.ignore_gfp_reclaim))
 		goto fail;
 	if (!debugfs_create_bool("cache-filter", mode, dir,
 				&failslab.cache_filter))
diff --git a/mm/filemap.c b/mm/filemap.c
index 72940fb38666..1a12b7c7474f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2675,7 +2675,7 @@ EXPORT_SYMBOL(generic_file_write_iter);
  * page is known to the local caching routines.
  *
  * The @gfp_mask argument specifies whether I/O may be performed to release
- * this page (__GFP_IO), and whether the call may block (__GFP_WAIT & __GFP_FS).
+ * this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).
  *
  */
 int try_to_release_page(struct page *page, gfp_t gfp_mask)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b06b8db9df2..25c74e2dbc8b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -786,7 +786,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 {
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_RECLAIM)) | extra_gfp;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/migrate.c b/mm/migrate.c
index a1c82b65dcad..b19c03b7f49c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1739,7 +1739,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_WAIT,
+		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b32081b02c49..4418741c78ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2160,11 +2160,11 @@ static struct {
 	struct fault_attr attr;
 
 	u32 ignore_gfp_highmem;
-	u32 ignore_gfp_wait;
+	u32 ignore_gfp_reclaim;
 	u32 min_order;
 } fail_page_alloc = {
 	.attr = FAULT_ATTR_INITIALIZER,
-	.ignore_gfp_wait = 1,
+	.ignore_gfp_reclaim = 1,
 	.ignore_gfp_highmem = 1,
 	.min_order = 1,
 };
@@ -2202,7 +2202,7 @@ static int __init fail_page_alloc_debugfs(void)
 		return PTR_ERR(dir);
 
 	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
-				&fail_page_alloc.ignore_gfp_wait))
+				&fail_page_alloc.ignore_gfp_reclaim))
 		goto fail;
 	if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
 				&fail_page_alloc.ignore_gfp_highmem))
diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
index e24121afb2f2..6eb62936c672 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -126,7 +126,7 @@ static void *ima_alloc_pages(loff_t max_size, size_t *allocated_size,
 {
 	void *ptr;
 	int order = ima_maxorder;
-	gfp_t gfp_mask = __GFP_WAIT | __GFP_NOWARN | __GFP_NORETRY;
+	gfp_t gfp_mask = __GFP_RECLAIM | __GFP_NOWARN | __GFP_NORETRY;
 
 	if (order)
 		order = min(get_order(max_size), order);
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (5 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-25 19:09   ` Johannes Weiner
  2015-09-21 10:52 ` [PATCH 08/10] mm, page_alloc: Remove MIGRATE_RESERVE Mel Gorman
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general
   they are a lot cheaper than they used to be.

2) zone_reclaim is now disabled by default and I suspect that was a large
   source of the cost that zlc wanted to avoid. When it is enabled, it's
   known to be a major source of stalling when nodes fill up and it's
   unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order
   allocation requests. Later patches in this series will reduce the cost
   of the watermark checking.

4) The most important issue is that in the current implementation it
   is possible for a failed THP allocation to mark a zone full for order-0
   allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it.  If stalls
due to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
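
As a very rough illustration (not part of this series), such deferral inside
zone_reclaim() might look like the sketch below. The zone_reclaim_deferred_until
field is hypothetical and the backoff policy is only an example; the point is
that the backoff state lives in zone_reclaim() rather than being cached in the
allocator fast paths.

/*
 * Hypothetical sketch only: skip zone_reclaim() for a while after it fails
 * to make progress. zone_reclaim_deferred_until does not exist in this series.
 */
int zone_reclaim_deferred_sketch(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
	int ret;

	if (time_before(jiffies, zone->zone_reclaim_deferred_until))
		return ZONE_RECLAIM_NOSCAN;	/* failed recently, skip the scan */

	ret = __zone_reclaim(zone, gfp_mask, order);
	if (!ret)
		zone->zone_reclaim_deferred_until = jiffies + HZ; /* back off for a second */

	return ret;
}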

The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. Most page-reclaim related
workloads showed no noticeable difference as a result of the removal.

The impact was noticeable in a workload called "stutter". One part uses a
lot of anonymous memory, a second measures mmap latency and a third copies
a large file. In an ideal world the mmap latency measurements would be
unaffected by the other two parts.  On a 2-node machine the results of this
patch are

stutter
                             4.3.0-rc1             4.3.0-rc1
                              baseline              nozlc-v4
Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)

Note that the maximum stall latency went from 24 seconds to 12 seconds, which
is still bad but an improvement. The mileage varies considerably: an earlier
test on a 2-node machine went from 494 seconds to 47 seconds, and a 4-node
machine that tested an earlier version of this patch went from a worst-case
stall time of 6 seconds to 67ms. The nature of the benchmark is inherently
unpredictable as it hammers the system, and the mileage will vary between
machines.

There is a secondary impact with potentially more direct reclaim because
zones are now being considered instead of being skipped by zlc. In this
particular test run it did not occur so it is not described here. However,
in at least one test the following was observed

1. Direct reclaim rates were higher. This was likely due to direct reclaim
  being entered instead of the zlc disabling a zone and busy looping.
  Busy looping may have the effect of allowing kswapd to make more
  progress and in some cases may be better overall. If this is found then
  the correct action is to put direct reclaimers to sleep on a waitqueue
  and allow kswapd to make forward progress. Busy looping on the zlc is even
  worse than when the allocator used to blindly call congestion_wait().

2. There was higher swap activity as direct reclaim was active.

3. Direct reclaim efficiency was lower. This is related to point 1 as more
  scanning activity also encountered more pages that could not be
  immediately reclaimed.

In that case, the direct page scan and reclaim rates are noticeable but
it is not considered a problem for a few reasons

1. The test is primarily concerned with latency. The mmap attempts are also
   faulted which means there are THP allocation requests. The ZLC could
   cause zones to be disabled, causing the process to busy loop instead
   of reclaiming.  This looks like elevated direct reclaim activity but
   it's the correct action to take based on what the processes requested.

2. The test hammers reclaim and compaction heavily. The number of successful
   THP faults is highly variable but affects the reclaim stats. It's not a
   realistic or reasonable measure of page reclaim activity.

3. No other page-reclaim intensive workload that was tested showed a problem.

4. If a workload is identified that benefitted from the busy looping then it
   should be fixed by having direct reclaimers sleep on a wait queue until
   woken by kswapd instead of busy looping. We had this class of problem before,
   when calling congestion_wait() with a fixed timeout was a brain-damaged
   decision that nevertheless happened to benefit some workloads.

If a workload is identified that relied on the zlc to busy loop then it
should be fixed correctly and have a direct reclaimer sleep on a waitqueue
until woken by kswapd.
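
For illustration only, a minimal sketch of that waitqueue approach follows.
The direct_reclaim_wq queue, the kswapd_made_progress flag and both helpers
are hypothetical names rather than code from this series; the idea is simply
that a throttled direct reclaimer sleeps until kswapd wakes it or a short
timeout expires.

#include <linux/wait.h>
#include <linux/jiffies.h>

/* Hypothetical sketch: throttle direct reclaimers instead of busy looping */
static DECLARE_WAIT_QUEUE_HEAD(direct_reclaim_wq);
static bool kswapd_made_progress;	/* set by kswapd, cleared elsewhere */

/* Direct reclaimer: sleep until kswapd reports progress or a timeout expires */
static void direct_reclaim_throttle_sketch(void)
{
	wait_event_interruptible_timeout(direct_reclaim_wq,
					 kswapd_made_progress, HZ / 10);
}

/* kswapd: wake any throttled direct reclaimers after making progress */
static void kswapd_wake_reclaimers_sketch(void)
{
	kswapd_made_progress = true;
	if (waitqueue_active(&direct_reclaim_wq))
		wake_up_interruptible(&direct_reclaim_wq);
}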

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  74 -----------------
 mm/page_alloc.c        | 212 -------------------------------------------------
 2 files changed, 286 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b489e0b5ab48..f42a8340327f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -589,75 +589,8 @@ static inline bool zone_is_empty(struct zone *zone)
  * [1]	: No fallback (__GFP_THISNODE)
  */
 #define MAX_ZONELISTS 2
-
-
-/*
- * We cache key information from each zonelist for smaller cache
- * footprint when scanning for free pages in get_page_from_freelist().
- *
- * 1) The BITMAP fullzones tracks which zones in a zonelist have come
- *    up short of free memory since the last time (last_fullzone_zap)
- *    we zero'd fullzones.
- * 2) The array z_to_n[] maps each zone in the zonelist to its node
- *    id, so that we can efficiently evaluate whether that node is
- *    set in the current tasks mems_allowed.
- *
- * Both fullzones and z_to_n[] are one-to-one with the zonelist,
- * indexed by a zones offset in the zonelist zones[] array.
- *
- * The get_page_from_freelist() routine does two scans.  During the
- * first scan, we skip zones whose corresponding bit in 'fullzones'
- * is set or whose corresponding node in current->mems_allowed (which
- * comes from cpusets) is not set.  During the second scan, we bypass
- * this zonelist_cache, to ensure we look methodically at each zone.
- *
- * Once per second, we zero out (zap) fullzones, forcing us to
- * reconsider nodes that might have regained more free memory.
- * The field last_full_zap is the time we last zapped fullzones.
- *
- * This mechanism reduces the amount of time we waste repeatedly
- * reexaming zones for free memory when they just came up low on
- * memory momentarilly ago.
- *
- * The zonelist_cache struct members logically belong in struct
- * zonelist.  However, the mempolicy zonelists constructed for
- * MPOL_BIND are intentionally variable length (and usually much
- * shorter).  A general purpose mechanism for handling structs with
- * multiple variable length members is more mechanism than we want
- * here.  We resort to some special case hackery instead.
- *
- * The MPOL_BIND zonelists don't need this zonelist_cache (in good
- * part because they are shorter), so we put the fixed length stuff
- * at the front of the zonelist struct, ending in a variable length
- * zones[], as is needed by MPOL_BIND.
- *
- * Then we put the optional zonelist cache on the end of the zonelist
- * struct.  This optional stuff is found by a 'zlcache_ptr' pointer in
- * the fixed length portion at the front of the struct.  This pointer
- * both enables us to find the zonelist cache, and in the case of
- * MPOL_BIND zonelists, (which will just set the zlcache_ptr to NULL)
- * to know that the zonelist cache is not there.
- *
- * The end result is that struct zonelists come in two flavors:
- *  1) The full, fixed length version, shown below, and
- *  2) The custom zonelists for MPOL_BIND.
- * The custom MPOL_BIND zonelists have a NULL zlcache_ptr and no zlcache.
- *
- * Even though there may be multiple CPU cores on a node modifying
- * fullzones or last_full_zap in the same zonelist_cache at the same
- * time, we don't lock it.  This is just hint data - if it is wrong now
- * and then, the allocator will still function, perhaps a bit slower.
- */
-
-
-struct zonelist_cache {
-	unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];		/* zone->nid */
-	DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);	/* zone full? */
-	unsigned long last_full_zap;		/* when last zap'd (jiffies) */
-};
 #else
 #define MAX_ZONELISTS 1
-struct zonelist_cache;
 #endif
 
 /*
@@ -675,9 +608,6 @@ struct zoneref {
  * allocation, the other zones are fallback zones, in decreasing
  * priority.
  *
- * If zlcache_ptr is not NULL, then it is just the address of zlcache,
- * as explained above.  If zlcache_ptr is NULL, there is no zlcache.
- * *
  * To speed the reading of the zonelist, the zonerefs contain the zone index
  * of the entry being read. Helper functions to access information given
  * a struct zoneref are
@@ -687,11 +617,7 @@ struct zoneref {
  * zonelist_node_idx()	- Return the index of the node for an entry
  */
 struct zonelist {
-	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
 	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
-#ifdef CONFIG_NUMA
-	struct zonelist_cache zlcache;			     // optional ...
-#endif
 };
 
 #ifndef CONFIG_DISCONTIGMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4418741c78ad..1a9a20362251 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2291,122 +2291,6 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 }
 
 #ifdef CONFIG_NUMA
-/*
- * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
- * skip over zones that are not allowed by the cpuset, or that have
- * been recently (in last second) found to be nearly full.  See further
- * comments in mmzone.h.  Reduces cache footprint of zonelist scans
- * that have to skip over a lot of full or unallowed zones.
- *
- * If the zonelist cache is present in the passed zonelist, then
- * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_MEMORY].)
- *
- * If the zonelist cache is not available for this zonelist, does
- * nothing and returns NULL.
- *
- * If the fullzones BITMAP in the zonelist cache is stale (more than
- * a second since last zap'd) then we zap it out (clear its bits.)
- *
- * We hold off even calling zlc_setup, until after we've checked the
- * first zone in the zonelist, on the theory that most allocations will
- * be satisfied from that first zone, so best to examine that zone as
- * quickly as we can.
- */
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-	nodemask_t *allowednodes;	/* zonelist_cache approximation */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return NULL;
-
-	if (time_after(jiffies, zlc->last_full_zap + HZ)) {
-		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-		zlc->last_full_zap = jiffies;
-	}
-
-	allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
-					&cpuset_current_mems_allowed :
-					&node_states[N_MEMORY];
-	return allowednodes;
-}
-
-/*
- * Given 'z' scanning a zonelist, run a couple of quick checks to see
- * if it is worth looking at further for free memory:
- *  1) Check that the zone isn't thought to be full (doesn't have its
- *     bit set in the zonelist_cache fullzones BITMAP).
- *  2) Check that the zones node (obtained from the zonelist_cache
- *     z_to_n[] mapping) is allowed in the passed in allowednodes mask.
- * Return true (non-zero) if zone is worth looking at further, or
- * else return false (zero) if it is not.
- *
- * This check -ignores- the distinction between various watermarks,
- * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ...  If a zone is
- * found to be full for any variation of these watermarks, it will
- * be considered full for up to one second by all requests, unless
- * we are so low on memory on all allowed nodes that we are forced
- * into the second scan of the zonelist.
- *
- * In the second scan we ignore this zonelist cache and exactly
- * apply the watermarks to all zones, even it is slower to do so.
- * We are low on memory in the second scan, and should leave no stone
- * unturned looking for a free page.
- */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
-						nodemask_t *allowednodes)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-	int i;				/* index of *z in zonelist zones */
-	int n;				/* node that zone *z is on */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return 1;
-
-	i = z - zonelist->_zonerefs;
-	n = zlc->z_to_n[i];
-
-	/* This zone is worth trying if it is allowed but not full */
-	return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
-}
-
-/*
- * Given 'z' scanning a zonelist, set the corresponding bit in
- * zlc->fullzones, so that subsequent attempts to allocate a page
- * from that zone don't waste time re-examining it.
- */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-	int i;				/* index of *z in zonelist zones */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return;
-
-	i = z - zonelist->_zonerefs;
-
-	set_bit(i, zlc->fullzones);
-}
-
-/*
- * clear all zones full, called after direct reclaim makes progress so that
- * a zone that was recently full is not skipped over for up to a second
- */
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return;
-
-	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-}
-
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
 	return local_zone->node == zone->node;
@@ -2417,28 +2301,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
 				RECLAIM_DISTANCE;
 }
-
 #else	/* CONFIG_NUMA */
-
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
-	return NULL;
-}
-
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
-				nodemask_t *allowednodes)
-{
-	return 1;
-}
-
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-}
-
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-}
-
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
 	return true;
@@ -2448,7 +2311,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return true;
 }
-
 #endif	/* CONFIG_NUMA */
 
 static void reset_alloc_batches(struct zone *preferred_zone)
@@ -2475,9 +2337,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z;
 	struct page *page = NULL;
 	struct zone *zone;
-	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
-	int zlc_active = 0;		/* set if using zonelist_cache */
-	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	int nr_fair_skipped = 0;
 	bool zonelist_rescan;
 
@@ -2492,9 +2351,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 								ac->nodemask) {
 		unsigned long mark;
 
-		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
-			!zlc_zone_worth_trying(zonelist, z, allowednodes))
-				continue;
 		if (cpusets_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed(zone, gfp_mask))
@@ -2552,28 +2408,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (IS_ENABLED(CONFIG_NUMA) &&
-					!did_zlc_setup && nr_online_nodes > 1) {
-				/*
-				 * we do zlc_setup if there are multiple nodes
-				 * and before considering the first zone allowed
-				 * by the cpuset.
-				 */
-				allowednodes = zlc_setup(zonelist, alloc_flags);
-				zlc_active = 1;
-				did_zlc_setup = 1;
-			}
-
 			if (zone_reclaim_mode == 0 ||
 			    !zone_allows_reclaim(ac->preferred_zone, zone))
-				goto this_zone_full;
-
-			/*
-			 * As we may have just activated ZLC, check if the first
-			 * eligible zone has failed zone_reclaim recently.
-			 */
-			if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
-				!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 
 			ret = zone_reclaim(zone, gfp_mask, order);
@@ -2590,19 +2426,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						ac->classzone_idx, alloc_flags))
 					goto try_this_zone;
 
-				/*
-				 * Failed to reclaim enough to meet watermark.
-				 * Only mark the zone full if checking the min
-				 * watermark or if we failed to reclaim just
-				 * 1<<order pages or else the page allocator
-				 * fastpath will prematurely mark zones full
-				 * when the watermark is between the low and
-				 * min watermarks.
-				 */
-				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
-				    ret == ZONE_RECLAIM_SOME)
-					goto this_zone_full;
-
 				continue;
 			}
 		}
@@ -2615,9 +2438,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				goto try_this_zone;
 			return page;
 		}
-this_zone_full:
-		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
-			zlc_mark_zone_full(zonelist, z);
 	}
 
 	/*
@@ -2638,12 +2458,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			zonelist_rescan = true;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		zonelist_rescan = true;
-	}
-
 	if (zonelist_rescan)
 		goto zonelist_scan;
 
@@ -2888,10 +2702,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!(*did_some_progress)))
 		return NULL;
 
-	/* After successful reclaim, reconsider all zones for allocation */
-	if (IS_ENABLED(CONFIG_NUMA))
-		zlc_clear_zones_full(ac->zonelist);
-
 retry:
 	page = get_page_from_freelist(gfp_mask, order,
 					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -4227,20 +4037,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	build_thisnode_zonelists(pgdat);
 }
 
-/* Construct the zonelist performance cache - see further mmzone.h */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
-	struct zonelist *zonelist;
-	struct zonelist_cache *zlc;
-	struct zoneref *z;
-
-	zonelist = &pgdat->node_zonelists[0];
-	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
-	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-	for (z = zonelist->_zonerefs; z->zone; z++)
-		zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
-}
-
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 /*
  * Return node id of node used for "local" allocations.
@@ -4301,12 +4097,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	zonelist->_zonerefs[j].zone_idx = 0;
 }
 
-/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
-	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-}
-
 #endif	/* CONFIG_NUMA */
 
 /*
@@ -4347,14 +4137,12 @@ static int __build_all_zonelists(void *data)
 
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
-		build_zonelist_cache(self);
 	}
 
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 
 		build_zonelists(pgdat);
-		build_zonelist_cache(pgdat);
 	}
 
 	/*
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 08/10] mm, page_alloc: Remove MIGRATE_RESERVE
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (6 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-21 10:52 ` [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
  2015-09-21 12:03 ` [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  9 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

MIGRATE_RESERVE preserves an old property of the buddy allocator that existed
prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to
remain contiguous until the only alternative was to fail the allocation. At the
time it was discovered that high-order atomic allocations relied on this
property so MIGRATE_RESERVE was introduced. A later patch will introduce
an alternative MIGRATE_HIGHATOMIC so, to make review easier, this patch
deletes MIGRATE_RESERVE and its supporting code. Note that this patch
in isolation may look like a false regression to someone bisecting
high-order atomic allocation failures.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  10 +---
 mm/huge_memory.c       |   2 +-
 mm/page_alloc.c        | 148 +++----------------------------------------------
 mm/vmstat.c            |   1 -
 4 files changed, 11 insertions(+), 150 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f42a8340327f..40a856d28764 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,8 +39,6 @@ enum {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
 	MIGRATE_RECLAIMABLE,
-	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
-	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
 	/*
 	 * MIGRATE_CMA migration type is designed to mimic the way
@@ -63,6 +61,8 @@ enum {
 	MIGRATE_TYPES
 };
 
+#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
+
 #ifdef CONFIG_CMA
 #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
 #else
@@ -429,12 +429,6 @@ struct zone {
 
 	const char		*name;
 
-	/*
-	 * Number of MIGRATE_RESERVE page block. To maintain for just
-	 * optimization. Protected by zone->lock.
-	 */
-	int			nr_migrate_reserve_block;
-
 #ifdef CONFIG_MEMORY_ISOLATION
 	/*
 	 * Number of isolated pageblock. It is used to solve incorrect
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25c74e2dbc8b..63d0afc37aad 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -116,7 +116,7 @@ static void set_recommended_min_free_kbytes(void)
 	for_each_populated_zone(zone)
 		nr_zones++;
 
-	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
+	/* Ensure 2 pageblocks are free to assist fragmentation avoidance */
 	recommended_min = pageblock_nr_pages * nr_zones * 2;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a9a20362251..ae01a2c1e863 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -817,7 +817,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			if (unlikely(has_isolate_pageblock(zone)))
 				mt = get_pageblock_migratetype(page);
 
-			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--to_free && --batch_free && !list_empty(list));
@@ -1417,15 +1416,14 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
  * the free lists for the desirable migrate type are depleted
  */
 static int fallbacks[MIGRATE_TYPES][4] = {
-	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,     MIGRATE_RESERVE },
-	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,     MIGRATE_RESERVE },
-	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE,   MIGRATE_RESERVE },
+	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
+	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
+	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
 #ifdef CONFIG_CMA
-	[MIGRATE_CMA]         = { MIGRATE_RESERVE }, /* Never used */
+	[MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
 #endif
-	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE }, /* Never used */
 #ifdef CONFIG_MEMORY_ISOLATION
-	[MIGRATE_ISOLATE]     = { MIGRATE_RESERVE }, /* Never used */
+	[MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
 #endif
 };
 
@@ -1598,7 +1596,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 	*can_steal = false;
 	for (i = 0;; i++) {
 		fallback_mt = fallbacks[migratetype][i];
-		if (fallback_mt == MIGRATE_RESERVE)
+		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
 		if (list_empty(&area->free_list[fallback_mt]))
@@ -1676,25 +1674,13 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 {
 	struct page *page;
 
-retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
-
-	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+	if (unlikely(!page)) {
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);
 
 		if (!page)
 			page = __rmqueue_fallback(zone, order, migratetype);
-
-		/*
-		 * Use MIGRATE_RESERVE rather than fail an allocation. goto
-		 * is used because __rmqueue_smallest is an inline function
-		 * and we want just one call site
-		 */
-		if (!page) {
-			migratetype = MIGRATE_RESERVE;
-			goto retry_reserve;
-		}
 	}
 
 	trace_mm_page_alloc_zone_locked(page, order, migratetype);
@@ -3491,7 +3477,6 @@ static void show_migration_types(unsigned char type)
 		[MIGRATE_UNMOVABLE]	= 'U',
 		[MIGRATE_RECLAIMABLE]	= 'E',
 		[MIGRATE_MOVABLE]	= 'M',
-		[MIGRATE_RESERVE]	= 'R',
 #ifdef CONFIG_CMA
 		[MIGRATE_CMA]		= 'C',
 #endif
@@ -4302,120 +4287,6 @@ static inline unsigned long wait_table_bits(unsigned long size)
 }
 
 /*
- * Check if a pageblock contains reserved pages
- */
-static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
-			return 1;
-	}
-	return 0;
-}
-
-/*
- * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on min_wmark_pages(zone). The memory within
- * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
- * higher will lead to a bigger reserve which will get freed as contiguous
- * blocks as reclaim kicks in
- */
-static void setup_zone_migrate_reserve(struct zone *zone)
-{
-	unsigned long start_pfn, pfn, end_pfn, block_end_pfn;
-	struct page *page;
-	unsigned long block_migratetype;
-	int reserve;
-	int old_reserve;
-
-	/*
-	 * Get the start pfn, end pfn and the number of blocks to reserve
-	 * We have to be careful to be aligned to pageblock_nr_pages to
-	 * make sure that we always check pfn_valid for the first page in
-	 * the block.
-	 */
-	start_pfn = zone->zone_start_pfn;
-	end_pfn = zone_end_pfn(zone);
-	start_pfn = roundup(start_pfn, pageblock_nr_pages);
-	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
-							pageblock_order;
-
-	/*
-	 * Reserve blocks are generally in place to help high-order atomic
-	 * allocations that are short-lived. A min_free_kbytes value that
-	 * would result in more than 2 reserve blocks for atomic allocations
-	 * is assumed to be in place to help anti-fragmentation for the
-	 * future allocation of hugepages at runtime.
-	 */
-	reserve = min(2, reserve);
-	old_reserve = zone->nr_migrate_reserve_block;
-
-	/* When memory hot-add, we almost always need to do nothing */
-	if (reserve == old_reserve)
-		return;
-	zone->nr_migrate_reserve_block = reserve;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
-		if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
-			return;
-
-		if (!pfn_valid(pfn))
-			continue;
-		page = pfn_to_page(pfn);
-
-		/* Watch out for overlapping nodes */
-		if (page_to_nid(page) != zone_to_nid(zone))
-			continue;
-
-		block_migratetype = get_pageblock_migratetype(page);
-
-		/* Only test what is necessary when the reserves are not met */
-		if (reserve > 0) {
-			/*
-			 * Blocks with reserved pages will never free, skip
-			 * them.
-			 */
-			block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
-			if (pageblock_is_reserved(pfn, block_end_pfn))
-				continue;
-
-			/* If this block is reserved, account for it */
-			if (block_migratetype == MIGRATE_RESERVE) {
-				reserve--;
-				continue;
-			}
-
-			/* Suitable for reserving if this block is movable */
-			if (block_migratetype == MIGRATE_MOVABLE) {
-				set_pageblock_migratetype(page,
-							MIGRATE_RESERVE);
-				move_freepages_block(zone, page,
-							MIGRATE_RESERVE);
-				reserve--;
-				continue;
-			}
-		} else if (!old_reserve) {
-			/*
-			 * At boot time we don't need to scan the whole zone
-			 * for turning off MIGRATE_RESERVE.
-			 */
-			break;
-		}
-
-		/*
-		 * If the reserve is met and this is a previous reserved block,
-		 * take it back
-		 */
-		if (block_migratetype == MIGRATE_RESERVE) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			move_freepages_block(zone, page, MIGRATE_MOVABLE);
-		}
-	}
-}
-
-/*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
  * done. Non-atomic initialization, single-pass.
@@ -4454,9 +4325,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * movable at startup. This will force kernel allocations
 		 * to reserve their blocks rather than leaking throughout
 		 * the address space during boot when many long-lived
-		 * kernel allocations are made. Later some blocks near
-		 * the start are marked MIGRATE_RESERVE by
-		 * setup_zone_migrate_reserve()
+		 * kernel allocations are made.
 		 *
 		 * bitmap is created for zone's valid pfn range. but memmap
 		 * can be created for invalid pages (for alignment)
@@ -6012,7 +5881,6 @@ static void __setup_per_zone_wmarks(void)
 			high_wmark_pages(zone) - low_wmark_pages(zone) -
 			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
 
-		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..49963aa2dff3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,7 +901,6 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
 	"Unmovable",
 	"Reclaimable",
 	"Movable",
-	"Reserve",
 #ifdef CONFIG_CMA
 	"CMA",
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (7 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 08/10] mm, page_alloc: Remove MIGRATE_RESERVE Mel Gorman
@ 2015-09-21 10:52 ` Mel Gorman
  2015-09-24 13:50   ` Michal Hocko
                     ` (2 more replies)
  2015-09-21 12:03 ` [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  9 siblings, 3 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 10:52 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

High-order watermark checking exists for two reasons --  kswapd high-order
awareness and protection for high-order atomic requests. Historically the
kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
that reserves pageblocks for high-order atomic allocations on demand and
avoids using those blocks for order-0 allocations. This is more flexible
and reliable than MIGRATE_RESERVE was.

A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order allocation
request steals a pageblock but limits the total number to 1% of the zone.
Callers that speculatively abuse atomic allocations for long-lived
high-order allocations to access the reserve will quickly fail. Note that
SLUB is currently not such an abuser as it reclaims at least once.  It is
possible that the pageblock stolen has few suitable high-order pages and
will need to steal again in the near future but there would need to be
strong justification to search all pageblocks for an ideal candidate.

The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.

The watermark checks account for the reserved pageblocks when the allocation
request is not a high-order atomic allocation.

The reserved pageblocks can not be used for order-0 allocations. This may
allow temporary wastage until a failed reclaim reassigns the pageblock. This
is deliberate as the intent of the reservation is to satisfy a limited
number of atomic high-order short-lived requests if the system requires them.

The stutter benchmark was used to evaluate this. While it was running, a
systemtap script randomly allocated between 1 high-order page and 12.5% of
memory's worth of order-3 pages using GFP_ATOMIC. This is much larger than
the potential reserve and it does not attempt to be realistic.  It is
intended to stress random high-order allocations from an unknown source and
to show that there is a reduction in failures without introducing an anomaly
where atomic allocations are more reliable than regular allocations.  The
amount of memory reserved varied throughout the workload as reserves were
created and reclaimed under memory pressure. The allocation failures once
the workload warmed up were as follows:

4.2-rc5-vanilla		70%
4.2-rc5-atomic-reserve	56%

The failure rate was also measured while building multiple kernels. The
failure rate was 14% without the patch and 6% with it applied.

Overall, this is a small reduction but the reserves are small relative
to the number of allocation requests. In early versions of the patch,
the failure rate reduced by a much larger amount but that required much
larger reserves and perversely made atomic allocations seem more reliable
than regular allocations.
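
For reference, the kind of request the reserve is intended to serve looks
roughly like the snippet below. It is illustrative only -- the helper names
and the order-3 size are arbitrary and it is not the systemtap script used
for testing.

#include <linux/gfp.h>

/*
 * Illustrative only: a short-lived, high-order atomic allocation of the
 * kind MIGRATE_HIGHATOMIC is intended to serve.
 */
static struct page *grab_atomic_buffer(void)
{
	return alloc_pages(GFP_ATOMIC, 3);	/* order-3, may not sleep or reclaim */
}

static void release_atomic_buffer(struct page *page)
{
	if (page)
		__free_pages(page, 3);
}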

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |   6 ++-
 mm/page_alloc.c        | 125 +++++++++++++++++++++++++++++++++++++++++++++----
 mm/vmstat.c            |   1 +
 3 files changed, 122 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 40a856d28764..ad8cf52de55b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,8 @@ enum {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
 	MIGRATE_RECLAIMABLE,
+	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
+	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
 	/*
 	 * MIGRATE_CMA migration type is designed to mimic the way
@@ -61,8 +63,6 @@ enum {
 	MIGRATE_TYPES
 };
 
-#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
-
 #ifdef CONFIG_CMA
 #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
 #else
@@ -334,6 +334,8 @@ struct zone {
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
 
+	unsigned long nr_reserved_highatomic;
+
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ae01a2c1e863..811d6fc4ad5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1615,6 +1615,88 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 	return -1;
 }
 
+/*
+ * Reserve a pageblock for exclusive use of high-order atomic allocations if
+ * there are no empty page blocks that contain a page with a suitable order
+ */
+static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
+				unsigned int alloc_order)
+{
+	int mt;
+	unsigned long max_managed, flags;
+
+	/*
+	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
+	 * Check is race-prone but harmless.
+	 */
+	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
+	if (zone->nr_reserved_highatomic >= max_managed)
+		return;
+
+	/* Yoink! */
+	spin_lock_irqsave(&zone->lock, flags);
+
+	mt = get_pageblock_migratetype(page);
+	if (mt != MIGRATE_HIGHATOMIC &&
+			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
+		zone->nr_reserved_highatomic += pageblock_nr_pages;
+		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * Used when an allocation is about to fail under memory pressure. This
+ * potentially hurts the reliability of high-order allocations when under
+ * intense memory pressure but failed atomic allocations should be easier
+ * to recover from than an OOM.
+ */
+static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
+{
+	struct zonelist *zonelist = ac->zonelist;
+	unsigned long flags;
+	struct zoneref *z;
+	struct zone *zone;
+	struct page *page;
+	int order;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
+								ac->nodemask) {
+		/* Preserve at least one pageblock */
+		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+			continue;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		for (order = 0; order < MAX_ORDER; order++) {
+			struct free_area *area = &(zone->free_area[order]);
+
+			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
+				continue;
+
+			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
+						struct page, lru);
+
+			zone->nr_reserved_highatomic -= pageblock_nr_pages;
+
+			/*
+			 * Convert to ac->migratetype and avoid the normal
+			 * pageblock stealing heuristics. Minimally, the caller
+			 * is doing the work and needs the pages. More
+			 * importantly, if the block was always converted to
+			 * MIGRATE_UNMOVABLE or another type then the number
+			 * of pageblocks that cannot be completely freed
+			 * may increase.
+			 */
+			set_pageblock_migratetype(page, ac->migratetype);
+			move_freepages_block(zone, page, ac->migratetype);
+			spin_unlock_irqrestore(&zone->lock, flags);
+			return;
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static inline struct page *
 __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
@@ -1670,7 +1752,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
  * Call me with the zone->lock already held.
  */
 static struct page *__rmqueue(struct zone *zone, unsigned int order,
-						int migratetype)
+				int migratetype, gfp_t gfp_flags)
 {
 	struct page *page;
 
@@ -1700,7 +1782,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype, 0);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2072,7 +2154,7 @@ int split_free_page(struct page *page)
 static inline
 struct page *buffered_rmqueue(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int alloc_flags, int migratetype)
 {
 	unsigned long flags;
 	struct page *page;
@@ -2115,7 +2197,15 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 			WARN_ON_ONCE(order > 1);
 		}
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order, migratetype);
+
+		page = NULL;
+		if (unlikely(order) && (alloc_flags & ALLOC_HARDER)) {
+			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+			if (page)
+				trace_mm_page_alloc_zone_locked(page, order, migratetype);
+		}
+		if (!page)
+			page = __rmqueue(zone, order, migratetype, gfp_flags);
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -2225,15 +2315,24 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
 			long free_pages)
 {
-	/* free_pages may go negative - that's OK */
 	long min = mark;
 	int o;
 	long free_cma = 0;
 
+	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
+
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
-	if (alloc_flags & ALLOC_HARDER)
+
+	/*
+	 * If the caller does not have rights to ALLOC_HARDER then subtract
+	 * the high-atomic reserves. This will over-estimate the size of the
+	 * atomic reserve but it avoids a search.
+	 */
+	if (likely(!(alloc_flags & ALLOC_HARDER)))
+		free_pages -= z->nr_reserved_highatomic;
+	else
 		min -= min / 4;
 
 #ifdef CONFIG_CMA
@@ -2418,10 +2517,18 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 
 try_this_zone:
 		page = buffered_rmqueue(ac->preferred_zone, zone, order,
-						gfp_mask, ac->migratetype);
+				gfp_mask, alloc_flags, ac->migratetype);
 		if (page) {
 			if (prep_new_page(page, order, gfp_mask, alloc_flags))
 				goto try_this_zone;
+
+			/*
+			 * If this is a high-order atomic allocation then check
+			 * if the pageblock should be reserved for the future
+			 */
+			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+				reserve_highatomic_pageblock(page, zone, order);
+
 			return page;
 		}
 	}
@@ -2694,9 +2801,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * pages are pinned on the per-cpu lists or in high alloc reserves.
+	 * Shrink them and try again
 	 */
 	if (!page && !drained) {
+		unreserve_highatomic_pageblock(ac);
 		drain_all_pages(NULL);
 		drained = true;
 		goto retry;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 49963aa2dff3..3427a155f85e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
 	"Unmovable",
 	"Reclaimable",
 	"Movable",
+	"HighAtomic",
 #ifdef CONFIG_CMA
 	"CMA",
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
                   ` (8 preceding siblings ...)
  2015-09-21 10:52 ` [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
@ 2015-09-21 12:03 ` Mel Gorman
  2015-09-25 19:32   ` Johannes Weiner
                     ` (2 more replies)
  9 siblings, 3 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-21 12:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

The primary purpose of watermarks is to ensure that reclaim can always
make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
These assume that order-0 allocations are all that is necessary for
forward progress.

High-order watermarks serve a different purpose. Kswapd
had no high-order awareness before they were introduced
(https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au).  This was
particularly important when there were high-order atomic requests.
The watermarks both gave kswapd awareness and made a reserve for those
atomic requests.

There are two important side-effects of this. The most important is that
a non-atomic high-order request can fail even though free pages are available
and the order-0 watermarks are ok. The second is that high-order watermark
checks are expensive as the free list counts for every order up to the
requested order must be examined.
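
To make the first side-effect concrete, below is a small standalone sketch of
the per-order loop this patch removes, using made-up numbers: 1024 free pages
that are all order-0, a min watermark of 256 and an order-3 request. The
order-0 watermark is easily met, yet the loop rejects the allocation on its
first iteration.

#include <stdbool.h>

/*
 * Illustrative only: the pre-patch high-order watermark loop with invented
 * numbers showing how an order-3 request fails despite 1024 free pages.
 */
static bool old_highorder_check_example(void)
{
	long free_pages = 1024, min = 256;
	long nr_free[3] = { 1024, 0, 0 };	/* free counts for orders 0..2 */
	int o;

	for (o = 0; o < 3; o++) {
		free_pages -= nr_free[o] << o;	/* order-o pages cannot serve order-3 */
		min >>= 1;			/* require fewer higher-order pages */
		if (free_pages <= min)
			return false;		/* fails at o == 0: 0 <= 128 */
	}
	return true;
}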

With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
have high-order watermarks. Kswapd and compaction still need high-order
awareness which is handled by checking that at least one suitable high-order
page is free.

With the patch applied, there was little difference in the allocation
failure rates as the atomic reserves are small relative to the number of
allocation attempts. The expected impact is that there will never be an
allocation failure report that shows suitable pages on the free lists.

The one potential side-effect of this is that in a vanilla kernel, the
watermark checks may have kept a free page available for an atomic allocation.
Now we rely entirely on the HighAtomic reserves and on an early high-order
atomic allocation having created them.  If the first high-order atomic
allocation happens after the system is already heavily fragmented then it
will fail.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 51 +++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 37 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 811d6fc4ad5d..ee379d3b6cc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2308,8 +2308,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return true if free pages are above 'mark'. This takes into account the order
- * of the allocation.
+ * Return true if free base pages are above 'mark'. For high-order checks it
+ * will return true if the order-0 watermark is reached and there is at least
+ * one free page of a suitable size. Checking now avoids taking the zone lock
+ * to check in the allocation paths if no pages are free.
  */
 static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
@@ -2317,7 +2319,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	long min = mark;
 	int o;
-	long free_cma = 0;
+	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2330,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 	 * the high-atomic reserves. This will over-estimate the size of the
 	 * atomic reserve but it avoids a search.
 	 */
-	if (likely(!(alloc_flags & ALLOC_HARDER)))
+	if (likely(!alloc_harder))
 		free_pages -= z->nr_reserved_highatomic;
 	else
 		min -= min / 4;
@@ -2338,22 +2340,43 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
-		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
+		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
-	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return false;
-	for (o = 0; o < order; o++) {
-		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
 
-		/* Require fewer higher order pages to be free */
-		min >>= 1;
+	/* order-0 watermarks are ok */
+	if (!order)
+		return true;
+
+	/* Check at least one high-order page is free */
+	for (o = order; o < MAX_ORDER; o++) {
+		struct free_area *area = &z->free_area[o];
+		int mt;
+
+		if (!area->nr_free)
+			continue;
+
+		if (alloc_harder) {
+			if (area->nr_free)
+				return true;
+			continue;
+		}
 
-		if (free_pages <= min)
-			return false;
+		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+			if (!list_empty(&area->free_list[mt]))
+				return true;
+		}
+
+#ifdef CONFIG_CMA
+		if ((alloc_flags & ALLOC_CMA) &&
+		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+			return true;
+		}
+#endif
 	}
-	return true;
+	return false;
 }
 
 bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-21 10:52 ` [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
@ 2015-09-24 13:50   ` Michal Hocko
  2015-09-25 19:22   ` Johannes Weiner
  2015-09-29 21:01   ` Andrew Morton
  2 siblings, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2015-09-24 13:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 21-09-15 11:52:41, Mel Gorman wrote:
> High-order watermark checking exists for two reasons --  kswapd high-order
> awareness and protection for high-order atomic requests. Historically the
> kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
> free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations on demand and
> avoids using those blocks for order-0 allocations. This is more flexible
> and reliable than MIGRATE_RESERVE was.
> 
> A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order allocation
> request steals a pageblock but limits the total number to 1% of the zone.
> Callers that speculatively abuse atomic allocations for long-lived
> high-order allocations to access the reserve will quickly fail. Note that
> SLUB is currently not such an abuser as it reclaims at least once.  It is
> possible that the pageblock stolen has few suitable high-order pages and
> will need to steal again in the near future but there would need to be
> strong justification to search all pageblocks for an ideal candidate.
> 
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
> 
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
> 
> The reserved pageblocks can not be used for order-0 allocations. This may
> allow temporary wastage until a failed reclaim reassigns the pageblock. This
> is deliberate as the intent of the reservation is to satisfy a limited
> number of atomic high-order short-lived requests if the system requires them.
> 
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 high-order
> page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
> is much larger than the potential reserve and it does not attempt to be
> realistic.  It is intended to stress random high-order allocations from
> an unknown source and to show that there is a reduction in failures without
> introducing an anomaly where atomic allocations are more reliable than
> regular allocations.  The amount of memory reserved varied throughout the
> workload as reserves were created and reclaimed under memory pressure. The
> allocation failures once the workload warmed up were as follows;
> 
> 4.2-rc5-vanilla		70%
> 4.2-rc5-atomic-reserve	56%
> 
> The failure rate was also measured while building multiple kernels. The
> failure rate was 14% but is 6% with this patch applied.
> 
> Overall, this is a small reduction but the reserves are small relative
> to the number of allocation requests. In early versions of the patch,
> the failure rate reduced by a much larger amount but that required much
> larger reserves and perversely made atomic allocations seem more reliable
> than regular allocations.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

I believe I've acked this one previously. Anyway
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |   6 ++-
>  mm/page_alloc.c        | 125 +++++++++++++++++++++++++++++++++++++++++++++----
>  mm/vmstat.c            |   1 +
>  3 files changed, 122 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 40a856d28764..ad8cf52de55b 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -39,6 +39,8 @@ enum {
>  	MIGRATE_UNMOVABLE,
>  	MIGRATE_MOVABLE,
>  	MIGRATE_RECLAIMABLE,
> +	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
> +	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
>  #ifdef CONFIG_CMA
>  	/*
>  	 * MIGRATE_CMA migration type is designed to mimic the way
> @@ -61,8 +63,6 @@ enum {
>  	MIGRATE_TYPES
>  };
>  
> -#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
> -
>  #ifdef CONFIG_CMA
>  #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
>  #else
> @@ -334,6 +334,8 @@ struct zone {
>  	/* zone watermarks, access with *_wmark_pages(zone) macros */
>  	unsigned long watermark[NR_WMARK];
>  
> +	unsigned long nr_reserved_highatomic;
> +
>  	/*
>  	 * We don't know if the memory that we're going to allocate will be freeable
>  	 * or/and it will be released eventually, so to avoid totally wasting several
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ae01a2c1e863..811d6fc4ad5d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1615,6 +1615,88 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>  	return -1;
>  }
>  
> +/*
> + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> + * there are no empty page blocks that contain a page with a suitable order
> + */
> +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> +				unsigned int alloc_order)
> +{
> +	int mt;
> +	unsigned long max_managed, flags;
> +
> +	/*
> +	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> +	 * Check is race-prone but harmless.
> +	 */
> +	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> +	if (zone->nr_reserved_highatomic >= max_managed)
> +		return;
> +
> +	/* Yoink! */
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	mt = get_pageblock_migratetype(page);
> +	if (mt != MIGRATE_HIGHATOMIC &&
> +			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
> +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> +		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> +		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +/*
> + * Used when an allocation is about to fail under memory pressure. This
> + * potentially hurts the reliability of high-order allocations when under
> + * intense memory pressure but failed atomic allocations should be easier
> + * to recover from than an OOM.
> + */
> +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> +{
> +	struct zonelist *zonelist = ac->zonelist;
> +	unsigned long flags;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	struct page *page;
> +	int order;
> +
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> +								ac->nodemask) {
> +		/* Preserve at least one pageblock */
> +		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> +			continue;
> +
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (order = 0; order < MAX_ORDER; order++) {
> +			struct free_area *area = &(zone->free_area[order]);
> +
> +			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> +				continue;
> +
> +			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> +						struct page, lru);
> +
> +			zone->nr_reserved_highatomic -= pageblock_nr_pages;
> +
> +			/*
> +			 * Convert to ac->migratetype and avoid the normal
> +			 * pageblock stealing heuristics. Minimally, the caller
> +			 * is doing the work and needs the pages. More
> +			 * importantly, if the block was always converted to
> +			 * MIGRATE_UNMOVABLE or another type then the number
> +			 * of pageblocks that cannot be completely freed
> +			 * may increase.
> +			 */
> +			set_pageblock_migratetype(page, ac->migratetype);
> +			move_freepages_block(zone, page, ac->migratetype);
> +			spin_unlock_irqrestore(&zone->lock, flags);
> +			return;
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +}
> +
>  /* Remove an element from the buddy allocator from the fallback list */
>  static inline struct page *
>  __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> @@ -1670,7 +1752,7 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
>   * Call me with the zone->lock already held.
>   */
>  static struct page *__rmqueue(struct zone *zone, unsigned int order,
> -						int migratetype)
> +				int migratetype, gfp_t gfp_flags)
>  {
>  	struct page *page;
>  
> @@ -1700,7 +1782,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>  
>  	spin_lock(&zone->lock);
>  	for (i = 0; i < count; ++i) {
> -		struct page *page = __rmqueue(zone, order, migratetype);
> +		struct page *page = __rmqueue(zone, order, migratetype, 0);
>  		if (unlikely(page == NULL))
>  			break;
>  
> @@ -2072,7 +2154,7 @@ int split_free_page(struct page *page)
>  static inline
>  struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			struct zone *zone, unsigned int order,
> -			gfp_t gfp_flags, int migratetype)
> +			gfp_t gfp_flags, int alloc_flags, int migratetype)
>  {
>  	unsigned long flags;
>  	struct page *page;
> @@ -2115,7 +2197,15 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>  			WARN_ON_ONCE(order > 1);
>  		}
>  		spin_lock_irqsave(&zone->lock, flags);
> -		page = __rmqueue(zone, order, migratetype);
> +
> +		page = NULL;
> +		if (unlikely(order) && (alloc_flags & ALLOC_HARDER)) {
> +			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> +			if (page)
> +				trace_mm_page_alloc_zone_locked(page, order, migratetype);
> +		}
> +
> +		page = __rmqueue(zone, order, migratetype, gfp_flags);
>  		spin_unlock(&zone->lock);
>  		if (!page)
>  			goto failed;
> @@ -2225,15 +2315,24 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  			unsigned long mark, int classzone_idx, int alloc_flags,
>  			long free_pages)
>  {
> -	/* free_pages may go negative - that's OK */
>  	long min = mark;
>  	int o;
>  	long free_cma = 0;
>  
> +	/* free_pages may go negative - that's OK */
>  	free_pages -= (1 << order) - 1;
> +
>  	if (alloc_flags & ALLOC_HIGH)
>  		min -= min / 2;
> -	if (alloc_flags & ALLOC_HARDER)
> +
> +	/*
> +	 * If the caller does not have rights to ALLOC_HARDER then subtract
> +	 * the high-atomic reserves. This will over-estimate the size of the
> +	 * atomic reserve but it avoids a search.
> +	 */
> +	if (likely(!(alloc_flags & ALLOC_HARDER)))
> +		free_pages -= z->nr_reserved_highatomic;
> +	else
>  		min -= min / 4;
>  
>  #ifdef CONFIG_CMA
> @@ -2418,10 +2517,18 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  
>  try_this_zone:
>  		page = buffered_rmqueue(ac->preferred_zone, zone, order,
> -						gfp_mask, ac->migratetype);
> +				gfp_mask, alloc_flags, ac->migratetype);
>  		if (page) {
>  			if (prep_new_page(page, order, gfp_mask, alloc_flags))
>  				goto try_this_zone;
> +
> +			/*
> +			 * If this is a high-order atomic allocation then check
> +			 * if the pageblock should be reserved for the future
> +			 */
> +			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> +				reserve_highatomic_pageblock(page, zone, order);
> +
>  			return page;
>  		}
>  	}
> @@ -2694,9 +2801,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  
>  	/*
>  	 * If an allocation failed after direct reclaim, it could be because
> -	 * pages are pinned on the per-cpu lists. Drain them and try again
> +	 * pages are pinned on the per-cpu lists or in high alloc reserves.
> +	 * Shrink them and try again
>  	 */
>  	if (!page && !drained) {
> +		unreserve_highatomic_pageblock(ac);
>  		drain_all_pages(NULL);
>  		drained = true;
>  		goto retry;
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 49963aa2dff3..3427a155f85e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
>  	"Unmovable",
>  	"Reclaimable",
>  	"Movable",
> +	"HighAtomic",
>  #ifdef CONFIG_CMA
>  	"CMA",
>  #endif
> -- 
> 2.4.6

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-21 10:52 ` [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
@ 2015-09-24 13:51   ` Michal Hocko
  2015-09-24 20:55   ` Johannes Weiner
  1 sibling, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2015-09-24 13:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 21-09-15 11:52:37, Mel Gorman wrote:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access one of two watermarks lower than "min" which can be referred
> to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> watermark and can be called the "high priority reserve".
> 
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
> 
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.
> 
> This patch then converts a number of sites
> 
> o __GFP_ATOMIC is used by callers that are high priority and have memory
>   pools for those requests. GFP_ATOMIC uses this flag.
> 
> o Callers that have a limited mempool to guarantee forward progress clear
>   __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
>   into this category where kswapd will still be woken but atomic reserves
>   are not used as there is a one-entry mempool to guarantee progress.
> 
> o Callers that are checking if they are non-blocking should use the
>   helper gfpflags_allow_blocking() where possible. This is because
>   checking for __GFP_WAIT as was done historically now can trigger false
>   positives. Some exceptions like dm-crypt.c exist where the code intent
>   is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
>   flag manipulations.
> 
> o Callers that built their own GFP flags instead of starting with GFP_KERNEL
>   and friends now also need to specify __GFP_KSWAPD_RECLAIM.
> 
> The first key hazard to watch out for is callers that removed __GFP_WAIT
> and were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.
> 
> The second key hazard is callers that assembled their own combination of
> GFP flags instead of starting with something like GFP_KERNEL. They may
> now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
> if it's missed in most cases as other activity will wake kswapd.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

I believe I've checked this one and acked it already. Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/vm/balance                           | 14 ++++---
>  arch/arm/mm/dma-mapping.c                          |  6 +--
>  arch/arm/xen/mm.c                                  |  2 +-
>  arch/arm64/mm/dma-mapping.c                        |  4 +-
>  arch/x86/kernel/pci-dma.c                          |  2 +-
>  block/bio.c                                        | 26 ++++++------
>  block/blk-core.c                                   | 16 ++++----
>  block/blk-ioc.c                                    |  2 +-
>  block/blk-mq-tag.c                                 |  2 +-
>  block/blk-mq.c                                     |  8 ++--
>  drivers/block/drbd/drbd_receiver.c                 |  3 +-
>  drivers/block/osdblk.c                             |  2 +-
>  drivers/connector/connector.c                      |  3 +-
>  drivers/firewire/core-cdev.c                       |  2 +-
>  drivers/gpu/drm/i915/i915_gem.c                    |  2 +-
>  drivers/infiniband/core/sa_query.c                 |  2 +-
>  drivers/iommu/amd_iommu.c                          |  2 +-
>  drivers/iommu/intel-iommu.c                        |  2 +-
>  drivers/md/dm-crypt.c                              |  6 +--
>  drivers/md/dm-kcopyd.c                             |  2 +-
>  drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |  2 +-
>  drivers/media/pci/solo6x10/solo6x10-v4l2.c         |  2 +-
>  drivers/media/pci/tw68/tw68-video.c                |  2 +-
>  drivers/mtd/mtdcore.c                              |  3 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |  2 +-
>  drivers/staging/android/ion/ion_system_heap.c      |  2 +-
>  .../lustre/include/linux/libcfs/libcfs_private.h   |  2 +-
>  drivers/usb/host/u132-hcd.c                        |  2 +-
>  drivers/video/fbdev/vermilion/vermilion.c          |  2 +-
>  fs/btrfs/disk-io.c                                 |  2 +-
>  fs/btrfs/extent_io.c                               | 14 +++----
>  fs/btrfs/volumes.c                                 |  4 +-
>  fs/ext4/super.c                                    |  2 +-
>  fs/fscache/cookie.c                                |  2 +-
>  fs/fscache/page.c                                  |  6 +--
>  fs/jbd2/transaction.c                              |  4 +-
>  fs/nfs/file.c                                      |  6 +--
>  fs/xfs/xfs_qm.c                                    |  2 +-
>  include/linux/gfp.h                                | 46 ++++++++++++++++------
>  include/linux/skbuff.h                             |  6 +--
>  include/net/sock.h                                 |  2 +-
>  include/trace/events/gfpflags.h                    |  5 ++-
>  kernel/audit.c                                     |  6 +--
>  kernel/cgroup.c                                    |  2 +-
>  kernel/locking/lockdep.c                           |  2 +-
>  kernel/power/snapshot.c                            |  2 +-
>  kernel/smp.c                                       |  2 +-
>  lib/idr.c                                          |  4 +-
>  lib/radix-tree.c                                   | 10 ++---
>  mm/backing-dev.c                                   |  2 +-
>  mm/dmapool.c                                       |  2 +-
>  mm/memcontrol.c                                    |  8 ++--
>  mm/mempool.c                                       | 10 ++---
>  mm/migrate.c                                       |  2 +-
>  mm/page_alloc.c                                    | 43 ++++++++++++--------
>  mm/slab.c                                          | 18 ++++-----
>  mm/slub.c                                          | 10 ++---
>  mm/vmalloc.c                                       |  2 +-
>  mm/vmscan.c                                        |  4 +-
>  mm/zswap.c                                         |  5 ++-
>  net/core/skbuff.c                                  |  8 ++--
>  net/core/sock.c                                    |  6 ++-
>  net/netlink/af_netlink.c                           |  2 +-
>  net/rds/ib_recv.c                                  |  4 +-
>  net/rxrpc/ar-connection.c                          |  2 +-
>  net/sctp/associola.c                               |  2 +-
>  66 files changed, 212 insertions(+), 174 deletions(-)
> 
> diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> index c46e68cf9344..964595481af6 100644
> --- a/Documentation/vm/balance
> +++ b/Documentation/vm/balance
> @@ -1,12 +1,14 @@
>  Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
>  
> -Memory balancing is needed for non __GFP_WAIT as well as for non
> -__GFP_IO allocations.
> +Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +well as for non __GFP_IO allocations.
>  
> -There are two reasons to be requesting non __GFP_WAIT allocations:
> -the caller can not sleep (typically intr context), or does not want
> -to incur cost overheads of page stealing and possible swap io for
> -whatever reasons.
> +The first reason why a caller may avoid reclaim is that the caller can not
> +sleep due to holding a spinlock or is in interrupt context. The second may
> +be that the caller is willing to fail the allocation without incurring the
> +overhead of page reclaim. This may happen for opportunistic high-order
> +allocation requests that have order-0 fallback options. In such cases,
> +the caller may also wish to avoid waking kswapd.
>  
>  __GFP_IO allocation requests are made to prevent file system deadlocks.
>  
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index 1a7815e5421b..38307d8312ac 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -651,12 +651,12 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>  
>  	if (nommu())
>  		addr = __alloc_simple_buffer(dev, size, gfp, &page);
> -	else if (dev_get_cma_area(dev) && (gfp & __GFP_WAIT))
> +	else if (dev_get_cma_area(dev) && (gfp & __GFP_DIRECT_RECLAIM))
>  		addr = __alloc_from_contiguous(dev, size, prot, &page,
>  					       caller, want_vaddr);
>  	else if (is_coherent)
>  		addr = __alloc_simple_buffer(dev, size, gfp, &page);
> -	else if (!(gfp & __GFP_WAIT))
> +	else if (!gfpflags_allow_blocking(gfp))
>  		addr = __alloc_from_pool(size, &page);
>  	else
>  		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page,
> @@ -1363,7 +1363,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
>  	*handle = DMA_ERROR_CODE;
>  	size = PAGE_ALIGN(size);
>  
> -	if (!(gfp & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(gfp))
>  		return __iommu_alloc_atomic(dev, size, handle);
>  
>  	/*
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 6dd911d1f0ac..99eec9063f68 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -25,7 +25,7 @@
>  unsigned long xen_get_swiotlb_free_pages(unsigned int order)
>  {
>  	struct memblock_region *reg;
> -	gfp_t flags = __GFP_NOWARN;
> +	gfp_t flags = __GFP_NOWARN|__GFP_KSWAPD_RECLAIM;
>  
>  	for_each_memblock(memory, reg) {
>  		if (reg->base < (phys_addr_t)0xffffffff) {
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index 99224dcebdc5..478234383c2c 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
>  	if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>  	    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
>  		flags |= GFP_DMA;
> -	if (dev_get_cma_area(dev) && (flags & __GFP_WAIT)) {
> +	if (dev_get_cma_area(dev) && gfpflags_allow_blocking(flags)) {
>  		struct page *page;
>  		void *addr;
>  
> @@ -148,7 +148,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
>  
>  	size = PAGE_ALIGN(size);
>  
> -	if (!coherent && !(flags & __GFP_WAIT)) {
> +	if (!coherent && !gfpflags_allow_blocking(flags)) {
>  		struct page *page = NULL;
>  		void *addr = __alloc_from_pool(size, &page, flags);
>  
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 1b55de1267cf..a8e618b16a66 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -90,7 +90,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>  again:
>  	page = NULL;
>  	/* CMA can be used only in the context which permits sleeping */
> -	if (flag & __GFP_WAIT) {
> +	if (gfpflags_allow_blocking(flag)) {
>  		page = dma_alloc_from_contiguous(dev, count, get_order(size));
>  		if (page && page_to_phys(page) + size > dma_mask) {
>  			dma_release_from_contiguous(dev, page, count);
> diff --git a/block/bio.c b/block/bio.c
> index ad3f276d74bc..4f184d938942 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
>  		bvl = mempool_alloc(pool, gfp_mask);
>  	} else {
>  		struct biovec_slab *bvs = bvec_slabs + *idx;
> -		gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
> +		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
>  
>  		/*
>  		 * Make this allocation restricted and don't dump info on
> @@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
>  		__gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
>  
>  		/*
> -		 * Try a slab allocation. If this fails and __GFP_WAIT
> +		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
>  		 * is set, retry with the 1-entry mempool
>  		 */
>  		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
> -		if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
> +		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
>  			*idx = BIOVEC_MAX_IDX;
>  			goto fallback;
>  		}
> @@ -395,12 +395,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   *   If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
>   *   backed by the @bs's mempool.
>   *
> - *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
> - *   able to allocate a bio. This is due to the mempool guarantees. To make this
> - *   work, callers must never allocate more than 1 bio at a time from this pool.
> - *   Callers that need to allocate more than 1 bio must always submit the
> - *   previously allocated bio for IO before attempting to allocate a new one.
> - *   Failure to do so can cause deadlocks under memory pressure.
> + *   When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
> + *   always be able to allocate a bio. This is due to the mempool guarantees.
> + *   To make this work, callers must never allocate more than 1 bio at a time
> + *   from this pool. Callers that need to allocate more than 1 bio must always
> + *   submit the previously allocated bio for IO before attempting to allocate
> + *   a new one. Failure to do so can cause deadlocks under memory pressure.
>   *
>   *   Note that when running under generic_make_request() (i.e. any block
>   *   driver), bios are not submitted until after you return - see the code in
> @@ -459,13 +459,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 * We solve this, and guarantee forward progress, with a rescuer
>  		 * workqueue per bio_set. If we go to allocate and there are
>  		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_WAIT; if that fails, we punt those bios we
> -		 * would be blocking to the rescuer workqueue before we retry
> -		 * with the original gfp_flags.
> +		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> +		 * bios we would be blocking to the rescuer workqueue before
> +		 * we retry with the original gfp_flags.
>  		 */
>  
>  		if (current->bio_list && !bio_list_empty(current->bio_list))
> -			gfp_mask &= ~__GFP_WAIT;
> +			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
>  		if (!p && gfp_mask != saved_gfp) {
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 2eb722d48773..0391206868e9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1160,8 +1160,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
>   * @bio: bio to allocate request for (can be %NULL)
>   * @gfp_mask: allocation mask
>   *
> - * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
> - * function keeps retrying under memory pressure and fails iff @q is dead.
> + * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> + * this function keeps retrying under memory pressure and fails iff @q is dead.
>   *
>   * Must be called with @q->queue_lock held and,
>   * Returns ERR_PTR on failure, with @q->queue_lock held.
> @@ -1181,7 +1181,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	if (!IS_ERR(rq))
>  		return rq;
>  
> -	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> +	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
>  		blk_put_rl(rl);
>  		return rq;
>  	}
> @@ -1259,11 +1259,11 @@ EXPORT_SYMBOL(blk_get_request);
>   * BUG.
>   *
>   * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> - * completion of a bio that hasn't been submitted yet, thus resulting in a
> - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> - * of bio_alloc(), as that avoids the mempool deadlock.
> + * given to how you allocate bios. In particular, you cannot use
> + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> + * thus resulting in a deadlock. Alternatively bios should be allocated using
> + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
>   * If possible a big IO should be split into smaller parts when allocation
>   * fails. Partial allocation should not be an error, or you risk a live-lock.
>   */
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 1a27f45ec776..381cb50a673c 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
>  {
>  	struct io_context *ioc;
>  
> -	might_sleep_if(gfp_flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
>  
>  	do {
>  		task_lock(task);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9115c6d59948..f6020c624967 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
>  	if (tag != -1)
>  		return tag;
>  
> -	if (!(data->gfp & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(data->gfp))
>  		return -1;
>  
>  	bs = bt_wait_ptr(bt, hctx);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index f2d67b4047a0..7c322cea838f 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
>  		if (percpu_ref_tryget_live(&q->mq_usage_counter))
>  			return 0;
>  
> -		if (!(gfp & __GFP_WAIT))
> +		if (!gfpflags_allow_blocking(gfp))
>  			return -EBUSY;
>  
>  		ret = wait_event_interruptible(q->mq_freeze_wq,
> @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
>  
>  	ctx = blk_mq_get_ctx(q);
>  	hctx = q->mq_ops->map_queue(q, ctx->cpu);
> -	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> +	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
>  			reserved, ctx, hctx);
>  
>  	rq = __blk_mq_alloc_request(&alloc_data, rw);
> -	if (!rq && (gfp & __GFP_WAIT)) {
> +	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
>  		__blk_mq_run_hw_queue(hctx);
>  		blk_mq_put_ctx(ctx);
>  
> @@ -1207,7 +1207,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>  		ctx = blk_mq_get_ctx(q);
>  		hctx = q->mq_ops->map_queue(q, ctx->cpu);
>  		blk_mq_set_alloc_data(&alloc_data, q,
> -				__GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
> +				__GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
>  		rq = __blk_mq_alloc_request(&alloc_data, rw);
>  		ctx = alloc_data.ctx;
>  		hctx = alloc_data.hctx;
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index c097909c589c..b4b5680ac6ad 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -357,7 +357,8 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
>  	}
>  
>  	if (has_payload && data_size) {
> -		page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
> +		page = drbd_alloc_pages(peer_device, nr_pages,
> +					gfpflags_allow_blocking(gfp_mask));
>  		if (!page)
>  			goto fail;
>  	}
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index e22942596207..1b709a4e3b5e 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
>  			goto err_out;
>  
>  		tmp->bi_bdev = NULL;
> -		gfpmask &= ~__GFP_WAIT;
> +		gfpmask &= ~__GFP_DIRECT_RECLAIM;
>  		tmp->bi_next = NULL;
>  
>  		if (!new_chain)
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 30f522848c73..d7373ca69c99 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
>  	if (group)
>  		return netlink_broadcast(dev->nls, skb, portid, group,
>  					 gfp_mask);
> -	return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
> +	return netlink_unicast(dev->nls, skb, portid,
> +			!gfpflags_allow_blocking(gfp_mask));
>  }
>  EXPORT_SYMBOL_GPL(cn_netlink_send_mult);
>  
> diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> index 2a3973a7c441..36a7c2d89a01 100644
> --- a/drivers/firewire/core-cdev.c
> +++ b/drivers/firewire/core-cdev.c
> @@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
>  static int add_client_resource(struct client *client,
>  			       struct client_resource *resource, gfp_t gfp_mask)
>  {
> -	bool preload = !!(gfp_mask & __GFP_WAIT);
> +	bool preload = gfpflags_allow_blocking(gfp_mask);
>  	unsigned long flags;
>  	int ret;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 4d631a946481..d58cb9e034fe 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2215,7 +2215,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
>  	 */
>  	mapping = file_inode(obj->base.filp)->i_mapping;
>  	gfp = mapping_gfp_mask(mapping);
> -	gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
> +	gfp |= __GFP_NORETRY | __GFP_NOWARN;
>  	gfp &= ~(__GFP_IO | __GFP_WAIT);
>  	sg = st->sgl;
>  	st->nents = 0;
> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> index 8c014b33d8e0..59ab264c99c4 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -1083,7 +1083,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
>  
>  static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
>  {
> -	bool preload = !!(gfp_mask & __GFP_WAIT);
> +	bool preload = gfpflags_allow_blocking(gfp_mask);
>  	unsigned long flags;
>  	int ret, id;
>  
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index f82060e778a2..1c0006e1ba4a 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
>  
>  	page = alloc_pages(flag | __GFP_NOWARN,  get_order(size));
>  	if (!page) {
> -		if (!(flag & __GFP_WAIT))
> +		if (!gfpflags_allow_blocking(flag))
>  			return NULL;
>  
>  		page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 2d7349a3ee14..ecdafbe81a5e 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3533,7 +3533,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
>  			flags |= GFP_DMA32;
>  	}
>  
> -	if (flags & __GFP_WAIT) {
> +	if (gfpflags_allow_blocking(flags)) {
>  		unsigned int count = size >> PAGE_SHIFT;
>  
>  		page = dma_alloc_from_contiguous(dev, count, order);
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index d60c88df5234..55ec935de2b4 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>  	struct bio_vec *bvec;
>  
>  retry:
> -	if (unlikely(gfp_mask & __GFP_WAIT))
> +	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		mutex_lock(&cc->bio_alloc_lock);
>  
>  	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
> @@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>  		if (!page) {
>  			crypt_free_buffer_pages(cc, clone);
>  			bio_put(clone);
> -			gfp_mask |= __GFP_WAIT;
> +			gfp_mask |= __GFP_DIRECT_RECLAIM;
>  			goto retry;
>  		}
>  
> @@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>  	}
>  
>  return_clone:
> -	if (unlikely(gfp_mask & __GFP_WAIT))
> +	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		mutex_unlock(&cc->bio_alloc_lock);
>  
>  	return clone;
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade5e27d..1452ed9aacb4 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -244,7 +244,7 @@ static int kcopyd_get_pages(struct dm_kcopyd_client *kc,
>  	*pages = NULL;
>  
>  	do {
> -		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY);
> +		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
>  		if (unlikely(!pl)) {
>  			/* Use reserved pages */
>  			pl = kc->pages;
> diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> index 53fff5425c13..fb2cb4bdc0c1 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> @@ -1291,7 +1291,7 @@ static struct solo_enc_dev *solo_enc_alloc(struct solo_dev *solo_dev,
>  	solo_enc->vidq.ops = &solo_enc_video_qops;
>  	solo_enc->vidq.mem_ops = &vb2_dma_sg_memops;
>  	solo_enc->vidq.drv_priv = solo_enc;
> -	solo_enc->vidq.gfp_flags = __GFP_DMA32;
> +	solo_enc->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>  	solo_enc->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
>  	solo_enc->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
>  	solo_enc->vidq.lock = &solo_enc->lock;
> diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2.c b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> index 63ae8a61f603..bde77b22340c 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> @@ -675,7 +675,7 @@ int solo_v4l2_init(struct solo_dev *solo_dev, unsigned nr)
>  	solo_dev->vidq.mem_ops = &vb2_dma_contig_memops;
>  	solo_dev->vidq.drv_priv = solo_dev;
>  	solo_dev->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
> -	solo_dev->vidq.gfp_flags = __GFP_DMA32;
> +	solo_dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>  	solo_dev->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
>  	solo_dev->vidq.lock = &solo_dev->lock;
>  	ret = vb2_queue_init(&solo_dev->vidq);
> diff --git a/drivers/media/pci/tw68/tw68-video.c b/drivers/media/pci/tw68/tw68-video.c
> index 8355e55b4e8e..e556f989aaab 100644
> --- a/drivers/media/pci/tw68/tw68-video.c
> +++ b/drivers/media/pci/tw68/tw68-video.c
> @@ -975,7 +975,7 @@ int tw68_video_init2(struct tw68_dev *dev, int video_nr)
>  	dev->vidq.ops = &tw68_video_qops;
>  	dev->vidq.mem_ops = &vb2_dma_sg_memops;
>  	dev->vidq.drv_priv = dev;
> -	dev->vidq.gfp_flags = __GFP_DMA32;
> +	dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>  	dev->vidq.buf_struct_size = sizeof(struct tw68_buf);
>  	dev->vidq.lock = &dev->lock;
>  	dev->vidq.min_buffers_needed = 2;
> diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
> index 8bbbb751bf45..2dfb291a47c6 100644
> --- a/drivers/mtd/mtdcore.c
> +++ b/drivers/mtd/mtdcore.c
> @@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
>   */
>  void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
>  {
> -	gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
> -		       __GFP_NORETRY | __GFP_NO_KSWAPD;
> +	gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
>  	size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
>  	void *kbuf;
>  
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index 44173be5cbf0..f8d7a2f06950 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -691,7 +691,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
>  {
>  	if (fp->rx_frag_size) {
>  		/* GFP_KERNEL allocations are used only during initialization */
> -		if (unlikely(gfp_mask & __GFP_WAIT))
> +		if (unlikely(gfpflags_allow_blocking(gfp_mask)))
>  			return (void *)__get_free_page(gfp_mask);
>  
>  		return netdev_alloc_frag(fp->rx_frag_size);
> diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
> index 7a7a9a047230..d4cdbf28dbb6 100644
> --- a/drivers/staging/android/ion/ion_system_heap.c
> +++ b/drivers/staging/android/ion/ion_system_heap.c
> @@ -27,7 +27,7 @@
>  #include "ion_priv.h"
>  
>  static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> -				     __GFP_NORETRY) & ~__GFP_WAIT;
> +				     __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
>  static gfp_t low_order_gfp_flags  = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
>  static const unsigned int orders[] = {8, 4, 0};
>  static const int num_orders = ARRAY_SIZE(orders);
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> index 9544860e3292..78bde2c11b50 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> @@ -95,7 +95,7 @@ do {								    \
>  do {									    \
>  	LASSERT(!in_interrupt() ||					    \
>  		((size) <= LIBCFS_VMALLOC_SIZE &&			    \
> -		 ((mask) & __GFP_WAIT) == 0));				    \
> +		 !gfpflags_allow_blocking(mask)));			    \
>  } while (0)
>  
>  #define LIBCFS_ALLOC_POST(ptr, size)					    \
> diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> index a67bd5090330..67b3b9d9dfd1 100644
> --- a/drivers/usb/host/u132-hcd.c
> +++ b/drivers/usb/host/u132-hcd.c
> @@ -2244,7 +2244,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>  {
>  	struct u132 *u132 = hcd_to_u132(hcd);
>  	if (irqs_disabled()) {
> -		if (__GFP_WAIT & mem_flags) {
> +		if (gfpflags_allow_blocking(mem_flags)) {
>  			printk(KERN_ERR "invalid context for function that migh"
>  				"t sleep\n");
>  			return -EINVAL;
> diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c
> index 6b70d7f62b2f..1c1e95a0b8fa 100644
> --- a/drivers/video/fbdev/vermilion/vermilion.c
> +++ b/drivers/video/fbdev/vermilion/vermilion.c
> @@ -99,7 +99,7 @@ static int vmlfb_alloc_vram_area(struct vram_area *va, unsigned max_order,
>  		 * below the first 16MB.
>  		 */
>  
> -		flags = __GFP_DMA | __GFP_HIGH;
> +		flags = __GFP_DMA | __GFP_HIGH | __GFP_KSWAPD_RECLAIM;
>  		va->logical =
>  			 __get_free_pages(flags, --max_order);
>  	} while (va->logical == 0 && max_order > min_order);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 0d98aee34fee..5632ba60c8f5 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2572,7 +2572,7 @@ int open_ctree(struct super_block *sb,
>  	fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
>  	fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
>  	/* readahead state */
> -	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
> +	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  	spin_lock_init(&fs_info->reada_lock);
>  
>  	fs_info->thread_pool_size = min_t(unsigned long,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index f1018cfbfefa..7956b310c194 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
>  		clear = 1;
>  again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && gfpflags_allow_blocking(mask)) {
>  		/*
>  		 * Don't care for allocation failure here because we might end
>  		 * up not needing the pre-allocated extent state at all, which
> @@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (start > end)
>  		goto out;
>  	spin_unlock(&tree->lock);
> -	if (mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(mask))
>  		cond_resched();
>  	goto again;
>  }
> @@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  
>  	bits |= EXTENT_FIRST_DELALLOC;
>  again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && gfpflags_allow_blocking(mask)) {
>  		prealloc = alloc_extent_state(mask);
>  		BUG_ON(!prealloc);
>  	}
> @@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (start > end)
>  		goto out;
>  	spin_unlock(&tree->lock);
> -	if (mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(mask))
>  		cond_resched();
>  	goto again;
>  }
> @@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	btrfs_debug_check_extent_io_range(tree, start, end);
>  
>  again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && gfpflags_allow_blocking(mask)) {
>  		/*
>  		 * Best effort, don't worry if extent state allocation fails
>  		 * here for the first iteration. We might have a cached state
> @@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (start > end)
>  		goto out;
>  	spin_unlock(&tree->lock);
> -	if (mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(mask))
>  		cond_resched();
>  	first_iteration = false;
>  	goto again;
> @@ -4267,7 +4267,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
>  	u64 start = page_offset(page);
>  	u64 end = start + PAGE_CACHE_SIZE - 1;
>  
> -	if ((mask & __GFP_WAIT) &&
> +	if (gfpflags_allow_blocking(mask) &&
>  	    page->mapping->host->i_size > 16 * 1024 * 1024) {
>  		u64 len;
>  		while (start <= end) {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 6fc735869c18..e023919b4470 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
>  	spin_lock_init(&dev->reada_lock);
>  	atomic_set(&dev->reada_in_flight, 0);
>  	atomic_set(&dev->dev_stats_ccnt, 0);
> -	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
> -	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
> +	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> +	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  
>  	return dev;
>  }
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index a63c7b0a10cf..49f6c78ee3af 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1058,7 +1058,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
>  		return 0;
>  	if (journal)
>  		return jbd2_journal_try_to_free_buffers(journal, page,
> -							wait & ~__GFP_WAIT);
> +						wait & ~__GFP_DIRECT_RECLAIM);
>  	return try_to_free_buffers(page);
>  }
>  
> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
> index d403c69bee08..4304072161aa 100644
> --- a/fs/fscache/cookie.c
> +++ b/fs/fscache/cookie.c
> @@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(
>  
>  	/* radix tree insertion won't use the preallocation pool unless it's
>  	 * told it may not wait */
> -	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
> +	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  
>  	switch (cookie->def->type) {
>  	case FSCACHE_COOKIE_TYPE_INDEX:
> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> index 483bbc613bf0..79483b3d8c6f 100644
> --- a/fs/fscache/page.c
> +++ b/fs/fscache/page.c
> @@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)
>  
>  /*
>   * decide whether a page can be released, possibly by cancelling a store to it
> - * - we're allowed to sleep if __GFP_WAIT is flagged
> + * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
>   */
>  bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>  				  struct page *page,
> @@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>  	 * allocator as the work threads writing to the cache may all end up
>  	 * sleeping on memory allocation, so we may need to impose a timeout
>  	 * too. */
> -	if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
> +	if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
>  		fscache_stat(&fscache_n_store_vmscan_busy);
>  		return false;
>  	}
> @@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>  		_debug("fscache writeout timeout page: %p{%lx}",
>  			page, page->index);
>  
> -	gfp &= ~__GFP_WAIT;
> +	gfp &= ~__GFP_DIRECT_RECLAIM;
>  	goto try_again;
>  }
>  EXPORT_SYMBOL(__fscache_maybe_release_page);
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 6b8338ec2464..89463eee6791 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1937,8 +1937,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
>   * @journal: journal for operation
>   * @page: to try and free
>   * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
>   *
>   *
>   * For all the buffers on this page,
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index c0f9b1ed12b9..17d3417c8a74 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -473,8 +473,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>  
>  	/* Always try to initiate a 'commit' if relevant, but only
> -	 * wait for it if __GFP_WAIT is set.  Even then, only wait 1
> -	 * second and only if the 'bdi' is not congested.
> +	 * wait for it if the caller allows blocking.  Even then,
> +	 * only wait 1 second and only if the 'bdi' is not congested.
>  	 * Waiting indefinitely can cause deadlocks when the NFS
>  	 * server is on this machine, when a new TCP connection is
>  	 * needed and in other rare cases.  There is no particular
> @@ -484,7 +484,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  	if (mapping) {
>  		struct nfs_server *nfss = NFS_SERVER(mapping->host);
>  		nfs_commit_inode(mapping->host, 0);
> -		if ((gfp & __GFP_WAIT) &&
> +		if (gfpflags_allow_blocking(gfp) &&
>  		    !bdi_write_congested(&nfss->backing_dev_info)) {
>  			wait_on_page_bit_killable_timeout(page, PG_private,
>  							  HZ);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index eac9549efd52..587174fd4f2c 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
>  	unsigned long		freed;
>  	int			error;
>  
> -	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
> +	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
>  		return 0;
>  
>  	INIT_LIST_HEAD(&isol.buffers);
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 440fca3e7e5d..b56e811b6f7c 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -29,12 +29,13 @@ struct vm_area_struct;
>  #define ___GFP_NOMEMALLOC	0x10000u
>  #define ___GFP_HARDWALL		0x20000u
>  #define ___GFP_THISNODE		0x40000u
> -#define ___GFP_WAIT		0x80000u
> +#define ___GFP_ATOMIC		0x80000u
>  #define ___GFP_NOACCOUNT	0x100000u
>  #define ___GFP_NOTRACK		0x200000u
> -#define ___GFP_NO_KSWAPD	0x400000u
> +#define ___GFP_DIRECT_RECLAIM	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u
> +#define ___GFP_KSWAPD_RECLAIM	0x2000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> @@ -71,7 +72,7 @@ struct vm_area_struct;
>   * __GFP_MOVABLE: Flag that this page will be movable by the page migration
>   * mechanism or reclaimed
>   */
> -#define __GFP_WAIT	((__force gfp_t)___GFP_WAIT)	/* Can wait and reschedule? */
> +#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
>  #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
>  #define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
>  #define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
> @@ -94,23 +95,37 @@ struct vm_area_struct;
>  #define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
>  #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
>  
> -#define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
>  #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
>  
>  /*
> + * A caller that is willing to wait may enter direct reclaim and will
> + * wake kswapd to reclaim pages in the background until the high
> + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> + * avoid unnecessary delays when a fallback option is available but
> + * still allow kswapd to reclaim in the background. The kswapd flag
> + * can be cleared when the reclaiming of pages would cause unnecessary
> + * disruption.
> + */
> +#define __GFP_WAIT ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
> +#define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> +#define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> +
> +/*
>   * This may seem redundant, but it's a way of annotating false positives vs.
>   * allocations that simply cannot be supported (e.g. page tables).
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
> -/* This equals 0, but use constants in case they ever change */
> -#define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
> -/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
> -#define GFP_ATOMIC	(__GFP_HIGH)
> +/*
> + * GFP_ATOMIC callers can not sleep, need the allocation to succeed.
> + * A lower watermark is applied to allow access to "atomic reserves"
> + */
> +#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
>  #define GFP_NOIO	(__GFP_WAIT)
>  #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
>  #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
> @@ -119,10 +134,10 @@ struct vm_area_struct;
>  #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
>  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_IOFS	(__GFP_IO | __GFP_FS)
> -#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> -			 __GFP_NO_KSWAPD)
> +#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
> +#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> +			 ~__GFP_KSWAPD_RECLAIM)
>  
>  /* This mask makes up all the page movable related flags */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> @@ -164,6 +179,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
>  	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
>  }
>  
> +static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> +{
> +	return gfp_flags & __GFP_DIRECT_RECLAIM;
> +}
> +
>  #ifdef CONFIG_HIGHMEM
>  #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
>  #else
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 2738d355cdf9..6f1f5a813554 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1215,7 +1215,7 @@ static inline int skb_cloned(const struct sk_buff *skb)
>  
>  static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
>  {
> -	might_sleep_if(pri & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(pri));
>  
>  	if (skb_cloned(skb))
>  		return pskb_expand_head(skb, 0, 0, pri);
> @@ -1299,7 +1299,7 @@ static inline int skb_shared(const struct sk_buff *skb)
>   */
>  static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
>  {
> -	might_sleep_if(pri & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(pri));
>  	if (skb_shared(skb)) {
>  		struct sk_buff *nskb = skb_clone(skb, pri);
>  
> @@ -1335,7 +1335,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
>  static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
>  					  gfp_t pri)
>  {
> -	might_sleep_if(pri & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(pri));
>  	if (skb_cloned(skb)) {
>  		struct sk_buff *nskb = skb_copy(skb, pri);
>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 7aa78440559a..e822cdf8b855 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2020,7 +2020,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if (sk->sk_allocation & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(sk->sk_allocation))
>  		return &current->task_frag;
>  
>  	return &sk->sk_frag;
> diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
> index d6fd8e5b14b7..dde6bf092c8a 100644
> --- a/include/trace/events/gfpflags.h
> +++ b/include/trace/events/gfpflags.h
> @@ -20,7 +20,7 @@
>  	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
>  	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
>  	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
> -	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
> +	{(unsigned long)__GFP_ATOMIC,		"GFP_ATOMIC"},		\
>  	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
>  	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
>  	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
> @@ -36,7 +36,8 @@
>  	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
>  	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
>  	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
> -	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
> +	{(unsigned long)__GFP_DIRECT_RECLAIM,	"GFP_DIRECT_RECLAIM"},	\
> +	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"GFP_KSWAPD_RECLAIM"},	\
>  	{(unsigned long)__GFP_OTHER_NODE,	"GFP_OTHER_NODE"}	\
>  	) : "GFP_NOWAIT"
>  
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 662c007635fb..6ae6e2b62e3e 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
>  	if (unlikely(audit_filter_type(type)))
>  		return NULL;
>  
> -	if (gfp_mask & __GFP_WAIT) {
> +	if (gfp_mask & __GFP_DIRECT_RECLAIM) {
>  		if (audit_pid && audit_pid == current->pid)
> -			gfp_mask &= ~__GFP_WAIT;
> +			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  		else
>  			reserve = 0;
>  	}
>  
>  	while (audit_backlog_limit
>  	       && skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
> -		if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
> +		if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
>  			long sleep_time;
>  
>  			sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 2cf0f79f1fc9..e843dffa7b87 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -211,7 +211,7 @@ static int cgroup_idr_alloc(struct idr *idr, void *ptr, int start, int end,
>  
>  	idr_preload(gfp_mask);
>  	spin_lock_bh(&cgroup_idr_lock);
> -	ret = idr_alloc(idr, ptr, start, end, gfp_mask & ~__GFP_WAIT);
> +	ret = idr_alloc(idr, ptr, start, end, gfp_mask & ~__GFP_DIRECT_RECLAIM);
>  	spin_unlock_bh(&cgroup_idr_lock);
>  	idr_preload_end();
>  	return ret;
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 8acfbf773e06..9aa39f20f593 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>  		return;
>  
>  	/* no reclaim without waiting on it */
> -	if (!(gfp_mask & __GFP_WAIT))
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		return;
>  
>  	/* this guy won't enter reclaim */
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 5235dd4e1e2f..3a970604308f 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1779,7 +1779,7 @@ alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
>  	while (to_alloc-- > 0) {
>  		struct page *page;
>  
> -		page = alloc_image_page(__GFP_HIGHMEM);
> +		page = alloc_image_page(__GFP_HIGHMEM|__GFP_KSWAPD_RECLAIM);
>  		memory_bm_set_bit(bm, page_to_pfn(page));
>  	}
>  	return nr_highmem;
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 07854477c164..d903c02223af 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
>  	cpumask_var_t cpus;
>  	int cpu, ret;
>  
> -	might_sleep_if(gfp_flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
>  
>  	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
>  		preempt_disable();
> diff --git a/lib/idr.c b/lib/idr.c
> index 5335c43adf46..6098336df267 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
>  	 * allocation guarantee.  Disallow usage from those contexts.
>  	 */
>  	WARN_ON_ONCE(in_interrupt());
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>  
>  	preempt_disable();
>  
> @@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
>  	struct idr_layer *pa[MAX_IDR_LEVEL + 1];
>  	int id;
>  
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>  
>  	/* sanity checks */
>  	if (WARN_ON_ONCE(start < 0))
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..fcf5d98574ce 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>  	 * preloading in the interrupt anyway as all the allocations have to
>  	 * be atomic. So just do normal allocation when in interrupt.
>  	 */
> -	if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> +	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
>  		struct radix_tree_preload *rtp;
>  
>  		/*
> @@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  static int __radix_tree_preload(gfp_t gfp_mask)
>  {
> @@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  int radix_tree_preload(gfp_t gfp_mask)
>  {
>  	/* Warn on non-sensical use... */
> -	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> +	WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
>  	return __radix_tree_preload(gfp_mask);
>  }
>  EXPORT_SYMBOL(radix_tree_preload);
> @@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
>   */
>  int radix_tree_maybe_preload(gfp_t gfp_mask)
>  {
> -	if (gfp_mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(gfp_mask))
>  		return __radix_tree_preload(gfp_mask);
>  	/* Preloading doesn't help anything with this gfp mask, skip it */
>  	preempt_disable();
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 2df8ddcb0ca0..e7781eb35fd1 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
>  {
>  	struct bdi_writeback *wb;
>  
> -	might_sleep_if(gfp & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp));
>  
>  	if (!memcg_css->parent)
>  		return &bdi->wb;
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index 71a8998cd03a..55b53cffd9f6 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -326,7 +326,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
>  	size_t offset;
>  	void *retval;
>  
> -	might_sleep_if(mem_flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(mem_flags));
>  
>  	spin_lock_irqsave(&pool->lock, flags);
>  	list_for_each_entry(page, &pool->page_list, page_list) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6ddaeba34e09..2c65980c0a00 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2012,7 +2012,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;
>  
> -	if (!(gfp_mask & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(gfp_mask))
>  		goto nomem;
>  
>  	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> @@ -2071,7 +2071,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> -	if (!(gfp_mask & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(gfp_mask))
>  		goto done;
>  	/*
>  	 * If the hierarchy is above the normal consumption range,
> @@ -4396,8 +4396,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
>  {
>  	int ret;
>  
> -	/* Try a single bulk charge without reclaim first */
> -	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
> +	/* Try a single bulk charge without reclaim first, kswapd may wake */
> +	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
>  	if (!ret) {
>  		mc.precharge += count;
>  		return ret;
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 4c533bc51d73..004d42b1dfaf 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -320,13 +320,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	gfp_t gfp_temp;
>  
>  	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
>  	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>  
> -	gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
> +	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
>  repeat_alloc:
>  
> @@ -349,7 +349,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	}
>  
>  	/*
> -	 * We use gfp mask w/o __GFP_WAIT or IO for the first round.  If
> +	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
>  	if (gfp_temp != gfp_mask) {
> @@ -358,8 +358,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  		goto repeat_alloc;
>  	}
>  
> -	/* We must not sleep if !__GFP_WAIT */
> -	if (!(gfp_mask & __GFP_WAIT)) {
> +	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		return NULL;
>  	}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index c3cb566af3e2..a1c82b65dcad 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1565,7 +1565,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
>  					 (GFP_HIGHUSER_MOVABLE |
>  					  __GFP_THISNODE | __GFP_NOMEMALLOC |
>  					  __GFP_NORETRY | __GFP_NOWARN) &
> -					 ~GFP_IOFS, 0);
> +					 ~(__GFP_IO | __GFP_FS), 0);
>  
>  	return newpage;
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4793bddb6b2a..b32081b02c49 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -169,12 +169,12 @@ void pm_restrict_gfp_mask(void)
>  	WARN_ON(!mutex_is_locked(&pm_mutex));
>  	WARN_ON(saved_gfp_mask);
>  	saved_gfp_mask = gfp_allowed_mask;
> -	gfp_allowed_mask &= ~GFP_IOFS;
> +	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
>  }
>  
>  bool pm_suspended_storage(void)
>  {
> -	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +	if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>  		return false;
>  	return true;
>  }
> @@ -2183,7 +2183,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>  		return false;
>  	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>  		return false;
> -	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> +	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_DIRECT_RECLAIM))
>  		return false;
>  
>  	return should_fail(&fail_page_alloc.attr, 1 << order);
> @@ -2685,7 +2685,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  		if (test_thread_flag(TIF_MEMDIE) ||
>  		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
>  			filter &= ~SHOW_MEM_FILTER_NODES;
> -	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> +	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		filter &= ~SHOW_MEM_FILTER_NODES;
>  
>  	if (fmt) {
> @@ -2945,7 +2945,6 @@ static inline int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> -	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -2954,11 +2953,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
> +	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (atomic) {
> +	if (gfp_mask & __GFP_ATOMIC) {
>  		/*
>  		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
>  		 * if it can't schedule.
> @@ -2995,11 +2994,16 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
>  }
>  
> +static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> +{
> +	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
> +}
> +
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  						struct alloc_context *ac)
>  {
> -	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>  	struct page *page = NULL;
>  	int alloc_flags;
>  	unsigned long pages_reclaimed = 0;
> @@ -3020,15 +3024,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> +	 * We also sanity check to catch abuse of atomic reserves being used by
> +	 * callers that are not in atomic context.
> +	 */
> +	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> +				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> +		gfp_mask &= ~__GFP_ATOMIC;
> +
> +	/*
>  	 * If this allocation cannot block and it is for a specific node, then
>  	 * fail early.  There's no need to wakeup kswapd or retry for a
>  	 * speculative node-specific allocation.
>  	 */
> -	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
> +	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
>  		goto nopage;
>  
>  retry:
> -	if (!(gfp_mask & __GFP_NO_KSWAPD))
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
>  	/*
> @@ -3071,8 +3083,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		}
>  	}
>  
> -	/* Atomic allocations - we can't balance anything */
> -	if (!wait) {
> +	/* Caller is not willing to reclaim, we can't balance anything */
> +	if (!can_direct_reclaim) {
>  		/*
>  		 * All existing users of the deprecated __GFP_NOFAIL are
>  		 * blockable, so warn of any new users that actually allow this
> @@ -3102,7 +3114,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto got_pg;
>  
>  	/* Checks for THP-specific high-order allocations */
> -	if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
> +	if (is_thp_gfp_mask(gfp_mask)) {
>  		/*
>  		 * If compaction is deferred for high-order allocations, it is
>  		 * because sync compaction recently failed. If this is the case
> @@ -3137,8 +3149,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * fault, so use asynchronous memory compaction for THP unless it is
>  	 * khugepaged trying to collapse.
>  	 */
> -	if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
> -						(current->flags & PF_KTHREAD))
> +	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
>  		migration_mode = MIGRATE_SYNC_LIGHT;
>  
>  	/* Try direct reclaim and then allocating */
> @@ -3209,7 +3220,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  
>  	lockdep_trace_alloc(gfp_mask);
>  
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
>  	if (should_fail_alloc_page(gfp_mask, order))
>  		return NULL;
> diff --git a/mm/slab.c b/mm/slab.c
> index c77ebe6cc87c..3ff59926bf19 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
>  }
>  
>  /*
> - * Construct gfp mask to allocate from a specific node but do not invoke reclaim
> - * or warn about failures.
> + * Construct gfp mask to allocate from a specific node but do not direct reclaim
> + * or warn about failures. kswapd may still wake to reclaim in the background.
>   */
>  static inline gfp_t gfp_exact_node(gfp_t flags)
>  {
> -	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
> +	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
>  }
>  #endif
>  
> @@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  
>  	offset *= cachep->colour_off;
>  
> -	if (local_flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_enable();
>  
>  	/*
> @@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  
>  	cache_init_objs(cachep, page);
>  
> -	if (local_flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_disable();
>  	check_irq_off();
>  	spin_lock(&n->list_lock);
> @@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  opps1:
>  	kmem_freepages(cachep, page);
>  failed:
> -	if (local_flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_disable();
>  	return 0;
>  }
> @@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
>  static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
>  						gfp_t flags)
>  {
> -	might_sleep_if(flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(flags));
>  #if DEBUG
>  	kmem_flagcheck(cachep, flags);
>  #endif
> @@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
>  		 */
>  		struct page *page;
>  
> -		if (local_flags & __GFP_WAIT)
> +		if (gfpflags_allow_blocking(local_flags))
>  			local_irq_enable();
>  		kmem_flagcheck(cache, flags);
>  		page = kmem_getpages(cache, local_flags, numa_mem_id());
> -		if (local_flags & __GFP_WAIT)
> +		if (gfpflags_allow_blocking(local_flags))
>  			local_irq_disable();
>  		if (page) {
>  			/*
> diff --git a/mm/slub.c b/mm/slub.c
> index f614b5dc396b..2cdbf5db348e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>  {
>  	flags &= gfp_allowed_mask;
>  	lockdep_trace_alloc(flags);
> -	might_sleep_if(flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(flags));
>  
>  	if (should_failslab(s->object_size, flags, s->flags))
>  		return NULL;
> @@ -1352,7 +1352,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  
>  	flags &= gfp_allowed_mask;
>  
> -	if (flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(flags))
>  		local_irq_enable();
>  
>  	flags |= s->allocflags;
> @@ -1362,8 +1362,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	 * so we fall-back to the minimum order allocation.
>  	 */
>  	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
> -	if ((alloc_gfp & __GFP_WAIT) && oo_order(oo) > oo_order(s->min))
> -		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_WAIT;
> +	if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min))
> +		alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~__GFP_DIRECT_RECLAIM;
>  
>  	page = alloc_slab_page(s, alloc_gfp, node, oo);
>  	if (unlikely(!page)) {
> @@ -1423,7 +1423,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	page->frozen = 1;
>  
>  out:
> -	if (flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(flags))
>  		local_irq_disable();
>  	if (!page)
>  		return NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2faaa2976447..9ad4dcb0631c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  			goto fail;
>  		}
>  		area->pages[i] = page;
> -		if (gfp_mask & __GFP_WAIT)
> +		if (gfpflags_allow_blocking(gfp_mask))
>  			cond_resched();
>  	}
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8b2786fd42b5..30a87ac1af80 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1476,7 +1476,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  	 * won't get blocked by normal direct-reclaimers, forming a circular
>  	 * deadlock.
>  	 */
> -	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>  		inactive >>= 3;
>  
>  	return isolated > inactive;
> @@ -3794,7 +3794,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> -	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> +	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
>  		return ZONE_RECLAIM_NOSCAN;
>  
>  	/*
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 4043df7c672f..e54166d3732e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -571,7 +571,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
>  static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
>  {
>  	struct zswap_pool *pool;
> -	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
> +	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
>  
>  	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
>  	if (!pool) {
> @@ -1011,7 +1011,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>  	/* store */
>  	len = dlen + sizeof(struct zswap_header);
>  	ret = zpool_malloc(entry->pool->zpool, len,
> -			   __GFP_NORETRY | __GFP_NOWARN, &handle);
> +			   __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM,
> +			   &handle);
>  	if (ret == -ENOSPC) {
>  		zswap_reject_compress_poor++;
>  		goto put_dstmem;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index dad4dd37e2aa..905bae96a742 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
>  	len += NET_SKB_PAD;
>  
>  	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> +	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>  		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>  		if (!skb)
>  			goto skb_fail;
> @@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>  	len += NET_SKB_PAD + NET_IP_ALIGN;
>  
>  	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> +	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>  		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>  		if (!skb)
>  			goto skb_fail;
> @@ -4451,7 +4451,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>  		return NULL;
>  
>  	gfp_head = gfp_mask;
> -	if (gfp_head & __GFP_WAIT)
> +	if (gfp_head & __GFP_DIRECT_RECLAIM)
>  		gfp_head |= __GFP_REPEAT;
>  
>  	*errcode = -ENOBUFS;
> @@ -4466,7 +4466,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>  
>  		while (order) {
>  			if (npages >= 1 << order) {
> -				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
> +				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
>  						   __GFP_COMP |
>  						   __GFP_NOWARN |
>  						   __GFP_NORETRY,
> diff --git a/net/core/sock.c b/net/core/sock.c
> index ca2984afe16e..4a61a0add949 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>  
>  	pfrag->offset = 0;
>  	if (SKB_FRAG_PAGE_ORDER) {
> -		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
> -					  __GFP_NOWARN | __GFP_NORETRY,
> +		/* Avoid direct reclaim but allow kswapd to wake */
> +		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> +					  __GFP_COMP | __GFP_NOWARN |
> +					  __GFP_NORETRY,
>  					  SKB_FRAG_PAGE_ORDER);
>  		if (likely(pfrag->page)) {
>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 7f86d3b55060..173c0abe4094 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2084,7 +2084,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
>  	consume_skb(info.skb2);
>  
>  	if (info.delivered) {
> -		if (info.congested && (allocation & __GFP_WAIT))
> +		if (info.congested && gfpflags_allow_blocking(allocation))
>  			yield();
>  		return 0;
>  	}
> diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
> index f43831e4186a..dcfb59775acc 100644
> --- a/net/rds/ib_recv.c
> +++ b/net/rds/ib_recv.c
> @@ -305,7 +305,7 @@ static int rds_ib_recv_refill_one(struct rds_connection *conn,
>  	gfp_t slab_mask = GFP_NOWAIT;
>  	gfp_t page_mask = GFP_NOWAIT;
>  
> -	if (gfp & __GFP_WAIT) {
> +	if (gfp & __GFP_DIRECT_RECLAIM) {
>  		slab_mask = GFP_KERNEL;
>  		page_mask = GFP_HIGHUSER;
>  	}
> @@ -379,7 +379,7 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
>  	struct ib_recv_wr *failed_wr;
>  	unsigned int posted = 0;
>  	int ret = 0;
> -	bool can_wait = !!(gfp & __GFP_WAIT);
> +	bool can_wait = !!(gfp & __GFP_DIRECT_RECLAIM);
>  	u32 pos;
>  
>  	/* the goal here is to just make sure that someone, somewhere
> diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
> index 6631f4f1e39b..3b5de4b86058 100644
> --- a/net/rxrpc/ar-connection.c
> +++ b/net/rxrpc/ar-connection.c
> @@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
>  		if (bundle->num_conns >= 20) {
>  			_debug("too many conns");
>  
> -			if (!(gfp & __GFP_WAIT)) {
> +			if (!gfpflags_allow_blocking(gfp)) {
>  				_leave(" = -EAGAIN");
>  				return -EAGAIN;
>  			}
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 197c3f59ecbf..75369ae8de1e 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
>  /* Set an association id for a given association */
>  int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
>  {
> -	bool preload = !!(gfp & __GFP_WAIT);
> +	bool preload = gfpflags_allow_blocking(gfp);
>  	int ret;
>  
>  	/* If the id is already assigned, keep it. */
> -- 
> 2.4.6

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe
  2015-09-21 10:52 ` [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
@ 2015-09-24 20:01   ` Johannes Weiner
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-24 20:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:33AM +0100, Mel Gorman wrote:
> No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
> removes the unnecessary parameter.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Reviewed-by: Christoph Lameter <cl@linux.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 02/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing
  2015-09-21 10:52 ` [PATCH 02/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
@ 2015-09-24 20:05   ` Johannes Weiner
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-24 20:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:34AM +0100, Mel Gorman wrote:
> File-backed pages that will be immediately written are balanced between
> zones.  This heuristic tries to avoid having a single zone filled with
> recently dirtied pages but the checks are unnecessarily expensive. Move
> consider_zone_balanced into the alloc_context instead of checking bitmaps
> multiple times. The patch also gives the parameter a more meaningful name.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled
  2015-09-21 10:52 ` [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
@ 2015-09-24 20:06   ` Johannes Weiner
  2015-09-30 22:22   ` David Rientjes
  1 sibling, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-24 20:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:35AM +0100, Mel Gorman wrote:
> There is a seqcounter that protects against spurious allocation failures
> when a task is changing the allowed nodes in a cpuset. There is no need
> to check the seqcounter until a cpuset exists.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types
  2015-09-21 10:52 ` [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
@ 2015-09-24 20:34   ` Johannes Weiner
  2015-09-25 12:50     ` Mel Gorman
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2015-09-24 20:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:36AM +0100, Mel Gorman wrote:
> @@ -14,7 +14,7 @@ struct vm_area_struct;
>  #define ___GFP_HIGHMEM		0x02u
>  #define ___GFP_DMA32		0x04u
>  #define ___GFP_MOVABLE		0x08u
> -#define ___GFP_WAIT		0x10u
> +#define ___GFP_RECLAIMABLE	0x10u
>  #define ___GFP_HIGH		0x20u
>  #define ___GFP_IO		0x40u
>  #define ___GFP_FS		0x80u
> @@ -29,7 +29,7 @@ struct vm_area_struct;
>  #define ___GFP_NOMEMALLOC	0x10000u
>  #define ___GFP_HARDWALL		0x20000u
>  #define ___GFP_THISNODE		0x40000u
> -#define ___GFP_RECLAIMABLE	0x80000u
> +#define ___GFP_WAIT		0x80000u
>  #define ___GFP_NOACCOUNT	0x100000u
>  #define ___GFP_NOTRACK		0x200000u
>  #define ___GFP_NO_KSWAPD	0x400000u
> @@ -126,6 +126,7 @@ struct vm_area_struct;
>  
>  /* This mask makes up all the page movable related flags */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> +#define GFP_MOVABLE_SHIFT 3

This connects the power-of-two gfp bits to the linear migrate type
enum, so shifting back and forth between them works only with up to
two items. A hypothetical ___GFP_FOOABLE would translate to 4, not
3. I'm not expecting new migratetypes to show up anytime soon, but
this implication does not make the code exactly robust and obvious.
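
To spell out the arithmetic (a sketch only, using the bit values from this
hunk and assuming the migratetype enum lines up the way the BUILD_BUG_ONs
below require):

    /*
     *   neither bit set           -> 0x00 >> 3 = 0  (MIGRATE_UNMOVABLE)
     *   ___GFP_MOVABLE     (0x08) -> 0x08 >> 3 = 1  (MIGRATE_MOVABLE)
     *   ___GFP_RECLAIMABLE (0x10) -> 0x10 >> 3 = 2  (MIGRATE_RECLAIMABLE)
     *
     * A hypothetical ___GFP_FOOABLE at 0x20 would shift to 4, skipping 3,
     * so the coincidence only holds for the first two flags.
     */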

> @@ -152,14 +153,15 @@ struct vm_area_struct;
>  /* Convert GFP flags to their corresponding migrate type */
>  static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
>  {
> -	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> +	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> +	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
> +	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
>  
>  	if (unlikely(page_group_by_mobility_disabled))
>  		return MIGRATE_UNMOVABLE;
>  
>  	/* Group based on mobility */
> -	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
> -		((gfp_flags & __GFP_RECLAIMABLE) != 0);
> +	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;

I'm not sure the simplification of this line is worth the fragile
dependency between those two tables.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-21 10:52 ` [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
  2015-09-24 13:51   ` Michal Hocko
@ 2015-09-24 20:55   ` Johannes Weiner
  2015-09-25 12:51     ` Mel Gorman
  1 sibling, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2015-09-24 20:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:37AM +0100, Mel Gorman wrote:
> @@ -119,10 +134,10 @@ struct vm_area_struct;
>  #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
>  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_IOFS	(__GFP_IO | __GFP_FS)
> -#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> -			 __GFP_NO_KSWAPD)
> +#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)

These are some really odd semantics to be given a name like that.

GFP_IOFS was introduced as a short-hand for testing/setting/clearing
these two bits at the same time, not to be used for allocations. In
fact, the only user for allocations is lustre, and it's not at all
obvious why those sites shouldn't include __GFP_WAIT as well.

Removing this definition altogether would probably be best.
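
For illustration, the sites that only test or mask the two bits can simply
open-code them, as the pm_restrict_gfp_mask() and too_many_isolated() hunks
quoted earlier in the thread already do; a minimal sketch:

    /* before: shorthand for clearing both bits */
    gfp_allowed_mask &= ~GFP_IOFS;

    /* after: open-coded, so the GFP_IOFS definition can be dropped */
    gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);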

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types
  2015-09-24 20:34   ` Johannes Weiner
@ 2015-09-25 12:50     ` Mel Gorman
  2015-09-25 13:56       ` Johannes Weiner
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-25 12:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Thu, Sep 24, 2015 at 04:34:45PM -0400, Johannes Weiner wrote:
> On Mon, Sep 21, 2015 at 11:52:36AM +0100, Mel Gorman wrote:
> > @@ -14,7 +14,7 @@ struct vm_area_struct;
> >  #define ___GFP_HIGHMEM		0x02u
> >  #define ___GFP_DMA32		0x04u
> >  #define ___GFP_MOVABLE		0x08u
> > -#define ___GFP_WAIT		0x10u
> > +#define ___GFP_RECLAIMABLE	0x10u
> >  #define ___GFP_HIGH		0x20u
> >  #define ___GFP_IO		0x40u
> >  #define ___GFP_FS		0x80u
> > @@ -29,7 +29,7 @@ struct vm_area_struct;
> >  #define ___GFP_NOMEMALLOC	0x10000u
> >  #define ___GFP_HARDWALL		0x20000u
> >  #define ___GFP_THISNODE		0x40000u
> > -#define ___GFP_RECLAIMABLE	0x80000u
> > +#define ___GFP_WAIT		0x80000u
> >  #define ___GFP_NOACCOUNT	0x100000u
> >  #define ___GFP_NOTRACK		0x200000u
> >  #define ___GFP_NO_KSWAPD	0x400000u
> > @@ -126,6 +126,7 @@ struct vm_area_struct;
> >  
> >  /* This mask makes up all the page movable related flags */
> >  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> > +#define GFP_MOVABLE_SHIFT 3
> 
> This connects the power-of-two gfp bits to the linear migrate type
> enum, so shifting back and forth between them works only with up to
> two items. A hypothetical ___GFP_FOOABLE would translate to 4, not
> 3. I'm not expecting new migratetypes to show up anytime soon, but
> this implication does not make the code exactly robust and obvious.
> 

In the event __GFP_FOOABLE is added then this change would need to be
reverted. Adding new migrate types has other consequences as it increases
the number of free lists in struct zone and, depending on the type, new
per-cpu lists and the fallbacks have to be updated. It's fairly obvious if
a new migratetype that cares is added.

> > @@ -152,14 +153,15 @@ struct vm_area_struct;
> >  /* Convert GFP flags to their corresponding migrate type */
> >  static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
> >  {
> > -	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> > +	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> > +	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
> > +	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
> >  
> >  	if (unlikely(page_group_by_mobility_disabled))
> >  		return MIGRATE_UNMOVABLE;
> >  
> >  	/* Group based on mobility */
> > -	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
> > -		((gfp_flags & __GFP_RECLAIMABLE) != 0);
> > +	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
> 
> I'm not sure the simplification of this line is worth the fragile
> dependency between those two tables.

The BUILD_BUG_ON is there to blow up immediately if the dependency is
broken. Sure, it's complexity but it's well contained in a single
place. Do you want to insist the patch is dropped in case someone
decides to add a new migrate type that has per-cpu lists?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-24 20:55   ` Johannes Weiner
@ 2015-09-25 12:51     ` Mel Gorman
  2015-09-25 19:01       ` Johannes Weiner
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-25 12:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Thu, Sep 24, 2015 at 04:55:09PM -0400, Johannes Weiner wrote:
> On Mon, Sep 21, 2015 at 11:52:37AM +0100, Mel Gorman wrote:
> > @@ -119,10 +134,10 @@ struct vm_area_struct;
> >  #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> >  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
> >  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> > -#define GFP_IOFS	(__GFP_IO | __GFP_FS)
> > -#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> > -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> > -			 __GFP_NO_KSWAPD)
> > +#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
> 
> These are some really odd semantics to be given a name like that.
> 
> GFP_IOFS was introduced as a short-hand for testing/setting/clearing
> these two bits at the same time, not to be used for allocations. In
> fact, the only user for allocations is lustre, and it's not at all
> obvious why those sites shouldn't include __GFP_WAIT as well.
> 
> Removing this definition altogether would probably be best.

Ok, I'll add a TODO to create a patch that removes GFP_IOFS entirely. It
can be tacked on to the end of the series.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types
  2015-09-25 12:50     ` Mel Gorman
@ 2015-09-25 13:56       ` Johannes Weiner
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-25 13:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Fri, Sep 25, 2015 at 01:50:28PM +0100, Mel Gorman wrote:
> On Thu, Sep 24, 2015 at 04:34:45PM -0400, Johannes Weiner wrote:
> > On Mon, Sep 21, 2015 at 11:52:36AM +0100, Mel Gorman wrote:
> > > @@ -14,7 +14,7 @@ struct vm_area_struct;
> > >  #define ___GFP_HIGHMEM		0x02u
> > >  #define ___GFP_DMA32		0x04u
> > >  #define ___GFP_MOVABLE		0x08u
> > > -#define ___GFP_WAIT		0x10u
> > > +#define ___GFP_RECLAIMABLE	0x10u
> > >  #define ___GFP_HIGH		0x20u
> > >  #define ___GFP_IO		0x40u
> > >  #define ___GFP_FS		0x80u
> > > @@ -29,7 +29,7 @@ struct vm_area_struct;
> > >  #define ___GFP_NOMEMALLOC	0x10000u
> > >  #define ___GFP_HARDWALL		0x20000u
> > >  #define ___GFP_THISNODE		0x40000u
> > > -#define ___GFP_RECLAIMABLE	0x80000u
> > > +#define ___GFP_WAIT		0x80000u
> > >  #define ___GFP_NOACCOUNT	0x100000u
> > >  #define ___GFP_NOTRACK		0x200000u
> > >  #define ___GFP_NO_KSWAPD	0x400000u
> > > @@ -126,6 +126,7 @@ struct vm_area_struct;
> > >  
> > >  /* This mask makes up all the page movable related flags */
> > >  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> > > +#define GFP_MOVABLE_SHIFT 3
> > 
> > This connects the power-of-two gfp bits to the linear migrate type
> > enum, so shifting back and forth between them works only with up to
> > two items. A hypothetical ___GFP_FOOABLE would translate to 4, not
> > 3. I'm not expecting new migratetypes to show up anytime soon, but
> > this implication does not make the code exactly robust and obvious.
> > 
> 
> In the event __GFP_FOOABLE is added then this change would need to be
> reverted. Adding new migrate types has other consequences as it increases
> the number of free lists in struct zone and, depending on the type, new
> per-cpu lists and the fallbacks have to be updated. It's fairly obvious if
> a new migratetype that cares is added.

As I said, I'm not really worried about somebody screwing up when
actually adding a new migratetype, but rather about people having a harder
time reviewing and verifying the code later on, going through the same
"how can it translate a power of two to linear with a shift op? oh, it
relies on the sequences coinciding for the first two elements" reasoning
that I did. This instance works fine, but it's a questionable pattern.

> > > @@ -152,14 +153,15 @@ struct vm_area_struct;
> > >  /* Convert GFP flags to their corresponding migrate type */
> > >  static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
> > >  {
> > > -	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> > > +	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> > > +	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
> > > +	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
> > >  
> > >  	if (unlikely(page_group_by_mobility_disabled))
> > >  		return MIGRATE_UNMOVABLE;
> > >  
> > >  	/* Group based on mobility */
> > > -	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
> > > -		((gfp_flags & __GFP_RECLAIMABLE) != 0);
> > > +	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
> > 
> > I'm not sure the simplification of this line is worth the fragile
> > dependency between those two tables.
> 
> The BUILD_BUG_ON is there to blow up immediately if the dependency is
> broken. Sure, it's complexity but it's well contained in a single
> place. Do you want to insist the patch is dropped in case someone
> decides to add a new migrate type that has per-cpu lists?

I think the previous code is easier to comprehend and the optimization
looks minuscule. It doesn't seem worth it to me to add complications
and subtleties to a complex codebase for microoptimizations whose
effects are likely in the noise.

OTOH several others have acked this patch. If nobody else shares my
concern then I'm not insisting that the patch be dropped.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-25 12:51     ` Mel Gorman
@ 2015-09-25 19:01       ` Johannes Weiner
  2015-09-29 13:35         ` Mel Gorman
  0 siblings, 1 reply; 48+ messages in thread
From: Johannes Weiner @ 2015-09-25 19:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Fri, Sep 25, 2015 at 01:51:06PM +0100, Mel Gorman wrote:
> On Thu, Sep 24, 2015 at 04:55:09PM -0400, Johannes Weiner wrote:
> > On Mon, Sep 21, 2015 at 11:52:37AM +0100, Mel Gorman wrote:
> > > @@ -119,10 +134,10 @@ struct vm_area_struct;
> > >  #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> > >  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
> > >  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> > > -#define GFP_IOFS	(__GFP_IO | __GFP_FS)
> > > -#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> > > -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> > > -			 __GFP_NO_KSWAPD)
> > > +#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
> > 
> > These are some really odd semantics to be given a name like that.
> > 
> > GFP_IOFS was introduced as a short-hand for testing/setting/clearing
> > these two bits at the same time, not to be used for allocations. In
> > fact, the only user for allocations is lustre, and it's not at all
> > obvious why those sites shouldn't include __GFP_WAIT as well.
> > 
> > Removing this definition altogether would probably be best.
> 
> Ok, I'll add a TODO to create a patch that removes GFP_IOFS entirely. It
> can be tacked on to the end of the series.

Okay, that makes sense to me. Thanks!

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-21 10:52 ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
@ 2015-09-25 19:03   ` Johannes Weiner
  2015-09-28 23:55   ` Andrew Morton
  2015-09-30 22:25   ` David Rientjes
  2 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-25 19:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:38AM +0100, Mel Gorman wrote:
> Clearing __GFP_WAIT was how callers signalled that they were in atomic
> context and could not sleep.  Now it is possible to distinguish between
> true atomic context and callers that are not willing to sleep. The latter
> should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
> __GFP_WAIT behaves differently, there is a risk that people will clear the
> wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
> indicate what it does -- setting it allows all reclaim activity, clearing
> it prevents it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
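
For reference, a minimal sketch of where the rename lands, based on the
__GFP_WAIT definition quoted earlier in the thread (the renamed hunk is not
quoted here, so treat the exact form as assumed):

    #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
    #define GFP_KERNEL    (__GFP_RECLAIM | __GFP_IO | __GFP_FS)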

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache
  2015-09-21 10:52 ` [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache Mel Gorman
@ 2015-09-25 19:09   ` Johannes Weiner
  0 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-25 19:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:39AM +0100, Mel Gorman wrote:
> The zonelist cache (zlc) was introduced to skip over zones that were
> recently known to be full. This avoided expensive operations such as the
> cpuset checks, watermark calculations and zone_reclaim. The situation
> today is different and the complexity of zlc is harder to justify.
> 
> 1) The cpuset checks are no-ops unless a cpuset is active and in general
>    are a lot cheaper.
> 
> 2) zone_reclaim is now disabled by default and I suspect that was a large
>    source of the cost that zlc wanted to avoid. When it is enabled, it's
>    known to be a major source of stalling when nodes fill up and it's
>    unwise to hit every other user with the overhead.
> 
> 3) Watermark checks are expensive to calculate for high-order
>    allocation requests. Later patches in this series will reduce the cost
>    of the watermark checking.
> 
> 4) The most important issue is that in the current implementation it
>    is possible for a failed THP allocation to mark a zone full for order-0
>    allocations and cause a fallback to remote nodes.
> 
> The last issue could be addressed with additional complexity but as the
> benefit of zlc is questionable, it is better to remove it.  If stalls
> due to zone_reclaim are ever reported then an alternative would be to
> introduce deferring logic based on a timeout inside zone_reclaim itself
> and leave the page allocator fast paths alone.
> 
> The impact on page-allocator microbenchmarks is negligible as they don't
> hit the paths where the zlc comes into play. Most page-reclaim related
> workloads showed no noticeable difference as a result of the removal.
> 
> The impact was noticeable in a workload called "stutter". One part uses a
> lot of anonymous memory, a second measures mmap latency and a third copies
> a large file. In an ideal world the latency application would not notice
> the mmap latency.  On a 2-node machine the results of this patch are
> 
> stutter
>                              4.3.0-rc1             4.3.0-rc1
>                               baseline              nozlc-v4
> Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
> 1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
> 2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
> 3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
> Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
> Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
> Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
> Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
> Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
> Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
> Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
> Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
> Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
> Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
> Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
> Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
> Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
> 
> Note that the maximum stall latency went from 24 seconds to 12, which is still
> bad but an improvement.  The mileage varies considerably; a 2-node machine in an
> earlier test went from 494 seconds to 47 seconds and a 4-node machine that
> tested an earlier version of this patch went from a worst-case stall time of
> 6 seconds to 67ms. The nature of the benchmark is inherently unpredictable
> as it is hammering the system and the mileage will vary between machines.
> 
> There is a secondary impact with potentially more direct reclaim because
> zones are now being considered instead of being skipped by zlc. In this
> particular test run it did not occur so will not be described. However,
> in at least one test the following was observed
> 
> 1. Direct reclaim rates were higher. This was likely due to direct reclaim
>   being entered instead of the zlc disabling a zone and busy looping.
>   Busy looping may have the effect of allowing kswapd to make more
>   progress and in some cases may be better overall. If this is found then
>   the correct action is to put direct reclaimers to sleep on a waitqueue
>   and allow kswapd to make forward progress. Busy looping on the zlc is even
>   worse than when the allocator used to blindly call congestion_wait().
> 
> 2. There was higher swap activity as direct reclaim was active.
> 
> 3. Direct reclaim efficiency was lower. This is related to 1 as more
>   scanning activity also encountered more pages that could not be
>   immediately reclaimed.
> 
> In that case, the direct page scan and reclaim rates are noticeable but
> it is not considered a problem for a few reasons
> 
> 1. The test is primarily concerned with latency. The mmap attempts are also
>    faulted which means there are THP allocation requests. The ZLC could
>    cause zones to be disabled causing the process to busy loop instead
>    of reclaiming.  This looks like elevated direct reclaim activity but
>    it's the correct action to take based on what processes requested.
> 
> 2. The test hammers reclaim and compaction heavily. The number of successful
>    THP faults is highly variable but affects the reclaim stats. It's not a
>    realistic or reasonable measure of page reclaim activity.
> 
> 3. No other page-reclaim intensive workload that was tested showed a problem.
> 
> 4. If a workload is identified that benefitted from the busy looping then it
>    should be fixed by having direct reclaimers sleep on a wait queue until
>    woken by kswapd instead of busy looping. We had this class of problem before,
>    when calling congestion_wait() with a fixed timeout was a brain-damaged decision
>    that happened to benefit some workloads.
> 
> If a workload is identified that relied on the zlc to busy loop then it
> should be fixed correctly and have a direct reclaimer sleep on a waitqueue
> until woken by kswapd.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/mmzone.h |  74 -----------------
>  mm/page_alloc.c        | 212 -------------------------------------------------
>  2 files changed, 286 deletions(-)

This patch and its results look great!

And I agree, should this affect the balance between kswapd and direct
reclaim, it should be fixed explicitely and not rely on something as
unrelated as the zonelist cache.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-21 10:52 ` [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
  2015-09-24 13:50   ` Michal Hocko
@ 2015-09-25 19:22   ` Johannes Weiner
  2015-09-29 21:01   ` Andrew Morton
  2 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-25 19:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 11:52:41AM +0100, Mel Gorman wrote:
> High-order watermark checking exists for two reasons --  kswapd high-order
> awareness and protection for high-order atomic requests. Historically the
> kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
> free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations on demand and
> avoids using those blocks for order-0 allocations. This is more flexible
> and reliable than MIGRATE_RESERVE was.
> 
> A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order allocation
> request steals a pageblock but limits the total number to 1% of the zone.
> Callers that speculatively abuse atomic allocations for long-lived
> high-order allocations to access the reserve will quickly fail. Note that
> SLUB is currently not such an abuser as it reclaims at least once.  It is
> possible that the pageblock stolen has few suitable high-order pages and
> will need to steal again in the near future but there would need to be
> strong justification to search all pageblocks for an ideal candidate.
> 
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
> 
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
> 
> The reserved pageblocks can not be used for order-0 allocations. This may
> allow temporary wastage until a failed reclaim reassigns the pageblock. This
> is deliberate as the intent of the reservation is to satisfy a limited
> number of atomic high-order short-lived requests if the system requires them.
> 
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 high-order
> page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
> is much larger than the potential reserve and it does not attempt to be
> realistic.  It is intended to stress random high-order allocations from
> an unknown source and to show that there is a reduction in failures without
> introducing an anomaly where atomic allocations are more reliable than
> regular allocations.  The amount of memory reserved varied throughout the
> workload as reserves were created and reclaimed under memory pressure. The
> allocation failures once the workload warmed up were as follows:
> 
> 4.2-rc5-vanilla		70%
> 4.2-rc5-atomic-reserve	56%
> 
> The failure rate was also measured while building multiple kernels. The
> failure rate was 14% without the patch and 6% with it applied.
> 
> Overall, this is a small reduction but the reserves are small relative
> to the number of allocation requests. In early versions of the patch,
> the failure rate reduced by a much larger amount but that required much
> larger reserves and perversely made atomic allocations seem more reliable
> than regular allocations.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Cool, this is much more obvious than trusting the MIGRATE_RESERVE
mechanism for higher order atomics.
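
To make concrete the kind of request this reserve is aimed at, a hedged
sketch of a caller (the function, order and flags here are hypothetical,
not taken from the patch):

	/*
	 * Example: a driver refilling a receive ring from softirq context.
	 * A high-order GFP_ATOMIC request like this may be satisfied from a
	 * MIGRATE_HIGHATOMIC pageblock when the rest of the zone is
	 * fragmented.
	 */
	static void *rx_ring_refill(void)
	{
		struct page *page = alloc_pages(GFP_ATOMIC | __GFP_COMP, 3);

		return page ? page_address(page) : NULL;
	}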

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-21 12:03 ` [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
@ 2015-09-25 19:32   ` Johannes Weiner
  2015-09-29 21:05   ` Andrew Morton
  2015-09-30 14:11   ` Vlastimil Babka
  2 siblings, 0 replies; 48+ messages in thread
From: Johannes Weiner @ 2015-09-25 19:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 21, 2015 at 01:03:17PM +0100, Mel Gorman wrote:
> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
> 
> High-order watermarks serve a different purpose. Kswapd
> had no high-order awareness before they were introduced
> (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au).  This was
> particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
> 
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts for every order up to the
> requested order must be examined.
> 
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.
> 
> With the patch applied, there was little difference in the allocation
> failure rates as the atomic reserves are small relative to the number of
> allocation attempts. The expected impact is that there will never be an
> allocation failure report that shows suitable pages on the free lists.
> 
> The one potential side-effect of this is that in a vanilla kernel, the
> watermark checks may have kept a free page for an atomic allocation. Now,
> we rely entirely on the HighAtomic reserves and on an early allocation
> having created them.  If the first high-order atomic allocation is after
> the system is already heavily fragmented then it'll fail.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Nice. This really is a great improvement over the way we used to
ensure higher-order page availability.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-21 10:52 ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
  2015-09-25 19:03   ` Johannes Weiner
@ 2015-09-28 23:55   ` Andrew Morton
  2015-09-29 13:37     ` Mel Gorman
  2015-09-30 22:25   ` David Rientjes
  2 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2015-09-28 23:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, 21 Sep 2015 11:52:38 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> Clearing __GFP_WAIT was used to signal that the caller was in atomic context
> and could not sleep.  Now it is possible to distinguish between true atomic
> context and callers that are not willing to sleep. The latter should clear
> __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> behaves differently, there is a risk that people will clear the wrong
> flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> what it does -- setting it allows all reclaim activity while clearing it
> prevents it.

We have quite a history of remote parts of the kernel using
weird/wrong/inexplicable combinations of __GFP_ flags.  I tend to think
that this is because we didn't adequately explain the interface.

And I don't think that gfp.h really improved much in this area as a
result of this patchset.  Could you go through it some time and decide
if we've adequately documented all this stuff?

GFP_ATOMIC vs GFP_NOWAIT?

GFP_USER vs GFP_HIGHUSER?

When should I use GFP_HIGHUSER_MOVABLE instead?

Why isn't there a GFP_USER_MOVABLE?

What's GFP_IOFS?

GFP_RECLAIM_MASK through GFP_SLAB_BUG_MASK are mm-internal, but look
the same as the exported interface definitions.

__GFP_MOVABLE is documented twice, the second in an odd place.

etcetera.


It's rather unclear which symbols are part of the exported interface
and which are "mm internal only".
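
For reference, the first distinction is spelled out by the gfp.h comments
posted later in this thread: GFP_ATOMIC may dip into the atomic reserves
and must not sleep, while GFP_NOWAIT forgoes direct reclaim (and the IO/FS
it may trigger) but gets no access to reserves; both still wake kswapd. An
illustrative sketch, not taken from any real caller:

	/* Interrupt context: cannot sleep, may use the atomic reserves */
	item = kmalloc(size, GFP_ATOMIC);

	/*
	 * Process context that prefers to fail fast and fall back rather
	 * than stall in direct reclaim; kswapd is still woken.
	 */
	item = kmalloc(size, GFP_NOWAIT);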


* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-25 19:01       ` Johannes Weiner
@ 2015-09-29 13:35         ` Mel Gorman
  2015-09-30 12:26           ` Vlastimil Babka
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-29 13:35 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

> > Ok, I'll add a TODO to create a patch that removes GFP_IOFS entirely. It
> > can be tacked on to the end of the series.
> 
> Okay, that makes sense to me. Thanks!
> 

This?

---8<---
mm: page_alloc: Remove GFP_IOFS

GFP_IOFS was intended as shorthand for clearing the __GFP_IO and __GFP_FS
flags together, not as a set of allocation flags to pass to allocators.
There is now only one user of this flag combination and there appears to
be no reason why Lustre had to be protected
from reclaim stalls. As none of the sites appear to be atomic, this
patch simply deletes GFP_IOFS and converts Lustre to using GFP_KERNEL.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/staging/lustre/lnet/lnet/router.c           |  2 +-
 drivers/staging/lustre/lnet/selftest/conrpc.c       |  2 +-
 drivers/staging/lustre/lnet/selftest/rpc.c          |  2 +-
 drivers/staging/lustre/lustre/libcfs/module.c       |  2 +-
 drivers/staging/lustre/lustre/libcfs/tracefile.c    |  2 +-
 drivers/staging/lustre/lustre/llite/remote_perm.c   |  2 +-
 drivers/staging/lustre/lustre/mgc/mgc_request.c     | 10 +++++-----
 drivers/staging/lustre/lustre/obdecho/echo_client.c |  2 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c       |  2 +-
 include/linux/gfp.h                                 |  1 -
 10 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 4fbae5ef44a9..dad9816dfee7 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -1246,7 +1246,7 @@ lnet_new_rtrbuf(lnet_rtrbufpool_t *rbp, int cpt)
 	for (i = 0; i < npages; i++) {
 		page = alloc_pages_node(
 				cfs_cpt_spread_node(lnet_cpt_table(), cpt),
-				__GFP_ZERO | GFP_IOFS, 0);
+				GFP_KERNEL | __GFP_ZERO, 0);
 		if (page == NULL) {
 			while (--i >= 0)
 				__free_page(rb->rb_kiov[i].kiov_page);
diff --git a/drivers/staging/lustre/lnet/selftest/conrpc.c b/drivers/staging/lustre/lnet/selftest/conrpc.c
index a1a4e08f7391..3fc37de8d304 100644
--- a/drivers/staging/lustre/lnet/selftest/conrpc.c
+++ b/drivers/staging/lustre/lnet/selftest/conrpc.c
@@ -861,7 +861,7 @@ lstcon_testrpc_prep(lstcon_node_t *nd, int transop, unsigned feats,
 			bulk->bk_iovs[i].kiov_offset = 0;
 			bulk->bk_iovs[i].kiov_len    = len;
 			bulk->bk_iovs[i].kiov_page   =
-				alloc_page(GFP_IOFS);
+				alloc_page(GFP_KERNEL);
 
 			if (bulk->bk_iovs[i].kiov_page == NULL) {
 				lstcon_rpc_put(*crpc);
diff --git a/drivers/staging/lustre/lnet/selftest/rpc.c b/drivers/staging/lustre/lnet/selftest/rpc.c
index 6ae133138b17..aa0f88fbb221 100644
--- a/drivers/staging/lustre/lnet/selftest/rpc.c
+++ b/drivers/staging/lustre/lnet/selftest/rpc.c
@@ -146,7 +146,7 @@ srpc_alloc_bulk(int cpt, unsigned bulk_npg, unsigned bulk_len, int sink)
 		int nob;
 
 		pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(), cpt),
-				      GFP_IOFS, 0);
+				      GFP_KERNEL, 0);
 		if (pg == NULL) {
 			CERROR("Can't allocate page %d of %d\n", i, bulk_npg);
 			srpc_free_bulk(bk);
diff --git a/drivers/staging/lustre/lustre/libcfs/module.c b/drivers/staging/lustre/lustre/libcfs/module.c
index 806f9747a3a2..303143f28c06 100644
--- a/drivers/staging/lustre/lustre/libcfs/module.c
+++ b/drivers/staging/lustre/lustre/libcfs/module.c
@@ -321,7 +321,7 @@ static int libcfs_ioctl(struct cfs_psdev_file *pfile, unsigned long cmd, void *a
 	struct libcfs_ioctl_data *data;
 	int err = 0;
 
-	LIBCFS_ALLOC_GFP(buf, 1024, GFP_IOFS);
+	LIBCFS_ALLOC_GFP(buf, 1024, GFP_KERNEL);
 	if (buf == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lustre/libcfs/tracefile.c b/drivers/staging/lustre/lustre/libcfs/tracefile.c
index effa2af58c13..a7d72f69c4eb 100644
--- a/drivers/staging/lustre/lustre/libcfs/tracefile.c
+++ b/drivers/staging/lustre/lustre/libcfs/tracefile.c
@@ -810,7 +810,7 @@ int cfs_trace_allocate_string_buffer(char **str, int nob)
 	if (nob > 2 * PAGE_CACHE_SIZE)	    /* string must be "sensible" */
 		return -EINVAL;
 
-	*str = kmalloc(nob, GFP_IOFS | __GFP_ZERO);
+	*str = kmalloc(nob, GFP_KERNEL | __GFP_ZERO);
 	if (*str == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lustre/llite/remote_perm.c b/drivers/staging/lustre/lustre/llite/remote_perm.c
index 39022ea88b5f..b27f016c3dd4 100644
--- a/drivers/staging/lustre/lustre/llite/remote_perm.c
+++ b/drivers/staging/lustre/lustre/llite/remote_perm.c
@@ -84,7 +84,7 @@ static struct hlist_head *alloc_rmtperm_hash(void)
 
 	OBD_SLAB_ALLOC_GFP(hash, ll_rmtperm_hash_cachep,
 			   REMOTE_PERM_HASHSIZE * sizeof(*hash),
-			   GFP_IOFS);
+			   GFP_KERNEL);
 	if (!hash)
 		return NULL;
 
diff --git a/drivers/staging/lustre/lustre/mgc/mgc_request.c b/drivers/staging/lustre/lustre/mgc/mgc_request.c
index 019ee2f256aa..79551319d754 100644
--- a/drivers/staging/lustre/lustre/mgc/mgc_request.c
+++ b/drivers/staging/lustre/lustre/mgc/mgc_request.c
@@ -198,7 +198,7 @@ struct config_llog_data *do_config_log_add(struct obd_device *obd,
 	CDEBUG(D_MGC, "do adding config log %s:%p\n", logname,
 	       cfg ? cfg->cfg_instance : NULL);
 
-	cld = kzalloc(sizeof(*cld) + strlen(logname) + 1, GFP_NOFS);
+	cld = kzalloc(sizeof(*cld) + strlen(logname) + 1, GFP_KERNEL);
 	if (!cld)
 		return ERR_PTR(-ENOMEM);
 
@@ -1127,7 +1127,7 @@ static int mgc_apply_recover_logs(struct obd_device *mgc,
 	LASSERT(cfg->cfg_instance != NULL);
 	LASSERT(cfg->cfg_sb == cfg->cfg_instance);
 
-	inst = kzalloc(PAGE_CACHE_SIZE, GFP_NOFS);
+	inst = kzalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
 	if (!inst)
 		return -ENOMEM;
 
@@ -1334,14 +1334,14 @@ static int mgc_process_recover_log(struct obd_device *obd,
 	if (cfg->cfg_last_idx == 0) /* the first time */
 		nrpages = CONFIG_READ_NRPAGES_INIT;
 
-	pages = kcalloc(nrpages, sizeof(*pages), GFP_NOFS);
+	pages = kcalloc(nrpages, sizeof(*pages), GFP_KERNEL);
 	if (pages == NULL) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
 	for (i = 0; i < nrpages; i++) {
-		pages[i] = alloc_page(GFP_IOFS);
+		pages[i] = alloc_page(GFP_KERNEL);
 		if (pages[i] == NULL) {
 			rc = -ENOMEM;
 			goto out;
@@ -1492,7 +1492,7 @@ static int mgc_process_cfg_log(struct obd_device *mgc,
 	if (cld->cld_cfg.cfg_sb)
 		lsi = s2lsi(cld->cld_cfg.cfg_sb);
 
-	env = kzalloc(sizeof(*env), GFP_NOFS);
+	env = kzalloc(sizeof(*env), GFP_KERNEL);
 	if (!env)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lustre/obdecho/echo_client.c b/drivers/staging/lustre/lustre/obdecho/echo_client.c
index 27bd170c3a28..7c8443644300 100644
--- a/drivers/staging/lustre/lustre/obdecho/echo_client.c
+++ b/drivers/staging/lustre/lustre/obdecho/echo_client.c
@@ -1561,7 +1561,7 @@ static int echo_client_kbrw(struct echo_device *ed, int rw, struct obdo *oa,
 		  (oa->o_valid & OBD_MD_FLFLAGS) != 0 &&
 		  (oa->o_flags & OBD_FL_DEBUG_CHECK) != 0);
 
-	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_IOFS : GFP_HIGHUSER;
+	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_KERNEL : GFP_HIGHUSER;
 
 	LASSERT(rw == OBD_BRW_WRITE || rw == OBD_BRW_READ);
 	LASSERT(lsm != NULL);
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index c72035e048aa..6fa6bc6874ab 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -346,7 +346,7 @@ static struct osc_extent *osc_extent_alloc(struct osc_object *obj)
 {
 	struct osc_extent *ext;
 
-	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_IOFS);
+	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_KERNEL);
 	if (ext == NULL)
 		return NULL;
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 60b2db94d49d..369227202ac2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -134,7 +134,6 @@ struct vm_area_struct;
 #define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
 #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
 			 ~__GFP_KSWAPD_RECLAIM)


* Re: [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-28 23:55   ` Andrew Morton
@ 2015-09-29 13:37     ` Mel Gorman
  2015-10-01  8:39       ` Vlastimil Babka
  2015-10-01 14:06       ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Michal Hocko
  0 siblings, 2 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-29 13:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Sep 28, 2015 at 04:55:23PM -0700, Andrew Morton wrote:
> > Clearing __GFP_WAIT was used to signal that the caller was in atomic context
> > and could not sleep.  Now it is possible to distinguish between true atomic
> > context and callers that are not willing to sleep. The latter should clear
> > __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> > behaves differently, there is a risk that people will clear the wrong
> > flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> > what it does -- setting it allows all reclaim activity while clearing it
> > prevents it.
> 
> We have quite a history of remote parts of the kernel using
> weird/wrong/inexplicable combinations of __GFP_ flags.  I tend to think
> that this is because we didn't adequately explain the interface.
> 
> And I don't think that gfp.h really improved much in this area as a
> result of this patchset.  Could you go through it some time and decide
> if we've adequately documented all this stuff?
> 

This? It's not fully build tested yet but any breakage should only
involve adding internal.h to some mm/ files.

---8<---
mm: page_alloc: Hide some GFP internals and document the bits and flag combinations

Andrew started the following

	We have quite a history of remote parts of the kernel using
	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
	to think that this is because we didn't adequately explain the
	interface.

	And I don't think that gfp.h really improved much in this area as
	a result of this patchset.  Could you go through it some time and
	decide if we've adequately documented all this stuff?

This patch first moves some GFP flag combinations that are part of the MM
internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
bits under various headings and then documents the flag combinations. It
will not help callers that are brain damaged but the clarity might motivate
some fixes and avoid future mistakes.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h | 252 +++++++++++++++++++++++++++++++++++-----------------
 mm/internal.h       |  19 ++++
 mm/shmem.c          |   2 +
 mm/vmalloc.c        |   2 +
 4 files changed, 193 insertions(+), 82 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 369227202ac2..67654f08a28b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,9 +39,7 @@ struct vm_area_struct;
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
- * GFP bitmasks..
- *
- * Zone modifiers (see linux/mmzone.h - low three bits)
+ * Physical address zone modifiers (see linux/mmzone.h - low four bits)
  *
  * Do not put any conditional on these. If necessary modify the definitions
  * without the underscores and use them consistently. The definitions here may
@@ -50,121 +48,209 @@ struct vm_area_struct;
 #define __GFP_DMA	((__force gfp_t)___GFP_DMA)
 #define __GFP_HIGHMEM	((__force gfp_t)___GFP_HIGHMEM)
 #define __GFP_DMA32	((__force gfp_t)___GFP_DMA32)
-#define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* Page is movable */
+#define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
 #define GFP_ZONEMASK	(__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+
 /*
- * Action modifiers - doesn't change the zoning
+ * Page mobility and placement hints
  *
- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
- * _might_ fail.  This depends upon the particular VM implementation.
+ * These flags provide hints about how mobile the page is. Pages with similar
+ * mobility are placed within the same pageblocks to minimise problems due
+ * to external fragmentation.
  *
- * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- * cannot handle allocation failures. New users should be evaluated carefully
- * (and the flag should be used only when there is no reasonable failure policy)
- * but it is definitely preferable to use the flag rather than opencode endless
- * loop around allocator.
+ * __GFP_MOVABLE (also a zone modifier) indicates that the page can be
+ *   moved by page migration during memory compaction or can be reclaimed.
  *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- * return NULL when direct reclaim and memory compaction have failed to allow
- * the allocation to succeed.  The OOM killer is not called with the current
- * implementation.
+ * __GFP_RECLAIMABLE is used for slab allocations that specify
+ *   SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
+ *
+ * __GFP_WRITE indicates the caller intends to dirty the page. Where possible,
+ *   these pages will be spread between local zones to avoid all the dirty
+ *   pages being in one zone (fair zone allocation policy).
  *
- * __GFP_MOVABLE: Flag that this page will be movable by the page migration
- * mechanism or reclaimed
+ * __GFP_HARDWALL enforces the cpuset memory allocation policy.
+ *
+ * __GFP_THISNODE forces the allocation to be satisfied from the requested
+ *   node with no fallbacks or placement policy enforcements.
  */
-#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
-#define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
-#define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
-#define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
-#define __GFP_COLD	((__force gfp_t)___GFP_COLD)	/* Cache-cold page required */
-#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)	/* Suppress page allocation failure warning */
-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)	/* See above */
-#define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)	/* See above */
-#define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
-#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
-#define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
-#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
-#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
-							 * This takes precedence over the
-							 * __GFP_MEMALLOC flag if both are
-							 * set
-							 */
-#define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
-#define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
-#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
-#define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
-#define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
-
-#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
-#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
+#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
+#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
+#define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
+#define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
 
 /*
- * A caller that is willing to wait may enter direct reclaim and will
- * wake kswapd to reclaim pages in the background until the high
- * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
- * avoid unnecessary delays when a fallback option is available but
- * still allow kswapd to reclaim in the background. The kswapd flag
- * can be cleared when the reclaiming of pages would cause unnecessary
- * disruption.
+ * Watermark modifiers -- controls access to emergency reserves
+ *
+ * __GFP_HIGH indicates that the caller is high-priority and that granting
+ *   the request is necessary before the system can make forward progress.
+ *   For example, creating an IO context to clean pages.
+ *
+ * __GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
+ *   high priority. Users are typically interrupt handlers. This may be
+ *   used in conjunction with __GFP_HIGH
+ *
+ * __GFP_MEMALLOC allows access to all memory. This should only be used when
+ *   the caller guarantees the allocation will allow more memory to be freed
+ *   very shortly, e.g. the process exiting or swapping. Users should either
+ *   be the MM or be co-ordinating closely with the VM (e.g. swap over NFS).
+ *
+ * __GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
+ *   This takes precedence over the __GFP_MEMALLOC flag if both are set.
+ *
+ * __GFP_NOACCOUNT ignores the accounting for kmemcg limit enforcement.
  */
-#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
+#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
+#define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
+#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
+#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
+#define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT)
+
+/*
+ * Reclaim modifiers
+ *
+ * __GFP_IO can start physical IO.
+ *
+ * __GFP_FS can call down to the low-level FS. Avoids the allocator
+ *   recursing into the filesystem which might already be holding locks.
+ *
+ * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
+ *   This flag can be cleared to avoid unnecessary delays when a fallback
+ *   option is available.
+ *
+ * __GFP_KSWAPD_RECLAIM indicates that the caller wants kswapd woken when the
+ *   low watermark is reached so that it reclaims pages until the high watermark
+ *   is met. A caller may wish to clear this flag when fallback options
+ *   are available and the reclaim is likely to disrupt the system. The
+ *   canonical example is THP allocation where a fallback is cheap but
+ *   reclaim/compaction may cause indirect stalls.
+ *
+ * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
+ *
+ * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
+ *   _might_ fail.  This depends upon the particular VM implementation.
+ *
+ * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
+ *   cannot handle allocation failures. New users should be evaluated carefully
+ *   (and the flag should be used only when there is no reasonable failure
+ *   policy) but it is definitely preferable to use the flag rather than
+ *   opencode endless loop around allocator.
+ *
+ * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
+ *   return NULL when direct reclaim and memory compaction have failed to allow
+ *   the allocation to succeed.  The OOM killer is not called with the current
+ *   implementation.
+ */
+#define __GFP_IO	((__force gfp_t)___GFP_IO)
+#define __GFP_FS	((__force gfp_t)___GFP_FS)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
+#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
+#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
+#define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
+#define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
 
 /*
- * This may seem redundant, but it's a way of annotating false positives vs.
- * allocations that simply cannot be supported (e.g. page tables).
+ * Action modifiers
+ * 
+ * __GFP_COLD indicates that the caller does not expect the page to be used in
+ *   the near future. Where possible, a cache-cold page will be returned.
+ *
+ * __GFP_NOWARN suppresses allocation failure reports.
+ *
+ * __GFP_COMP adds compound page metadata.
+ *
+ * __GFP_ZERO returns a zeroed page on success.
+ *
+ * __GFP_NOTRACK avoids tracking with kmemcheck.
+ *
+ * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of
+ *   distinguishing in the source between false positives and allocations that
+ *   cannot be supported (e.g. page tables).
+ *
+ * __GFP_OTHER_NODE is for allocations that are on a remote node but that
+ *   should not be accounted for as a remote allocation in vmstat. A
+ *   typical user would be khugepaged collapsing a huge page on a remote
+ *   node.
  */
+#define __GFP_COLD	((__force gfp_t)___GFP_COLD)
+#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
+#define __GFP_COMP	((__force gfp_t)___GFP_COMP)
+#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
+#define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
+#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
 
-#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
+/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /*
- * GFP_ATOMIC callers can not sleep, need the allocation to succeed.
- * A lower watermark is applied to allow access to "atomic reserves"
+ * Useful GFP flag combinations that are commonly used. It is recommended
+ * that subsystems start with one of these combinations and then set/clear
+ * __GFP_FOO flags as necessary.
+ *
+ * GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
+ *   watermark is applied to allow access to "atomic reserves"
+ *
+ * GFP_KERNEL is typical for kernel-internal allocations. The caller requires
+ *   ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
+ *
+ * GFP_NOWAIT is for kernel allocations that should not stall for direct
+ *   reclaim, start physical IO or use any filesystem callback.
+ *
+ * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
+ *   that do not require the starting of any physical IO.
+ *
+ * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ *
+ * GFP_USER is for userspace allocations that also need to be directly
+ *   accessible by the kernel or hardware. It is typically used for hardware
+ *   buffers that are mapped to userspace (e.g. graphics) and that hardware
+ *   still must DMA to. cpuset limits are enforced for these allocations.
+ *
+ * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
+ *   do not need to be directly accessible by the kernel but that cannot
+ *   move once in use. An example may be a hardware allocation that maps
+ *   data directly into userspace but has no addressing limitations.
+ *
+ * GFP_DMA exists for historical reasons and should be avoided where possible.
+ *   The flag indicates that the caller requires that the lowest zone be
+ *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
+ *   it would require careful auditing as some users really require it and
+ *   others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
+ *   lowest zone as a type of emergency reserve.
+ *
+ * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
+ *   address.
+ *
+ * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
+ *   need direct access to but can use kmap() when access is required. They
+ *   are expected to be movable via page reclaim or page migration. Typically,
+ *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
+ *
+ * GFP_TRANSHUGE is used for THP allocations. They are compound allocations
+ *   that will fail quickly if memory is not available and will not wake
+ *   kswapd on failure.
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
 #define GFP_NOIO	(__GFP_RECLAIM)
 #define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
-#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
 #define GFP_TEMPORARY	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
 			 __GFP_RECLAIMABLE)
 #define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_DMA		__GFP_DMA
+#define GFP_DMA32	__GFP_DMA32
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
 			 ~__GFP_KSWAPD_RECLAIM)
 
-/* This mask makes up all the page movable related flags */
+/* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
 #define GFP_MOVABLE_SHIFT 3
-
-/* Control page allocator reclaim behavior */
-#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
-
-/* Control slab gfp mask during early boot */
-#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
-
-/* Control allocation constraints */
-#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
-
-/* Do not use these with a slab allocator */
-#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
-
-/* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
-   platforms, used as appropriate on others */
-
-#define GFP_DMA		__GFP_DMA
-
-/* 4GB DMA on some platforms */
-#define GFP_DMA32	__GFP_DMA32
-
-/* Convert GFP flags to their corresponding migrate type */
 static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 {
 	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
@@ -177,6 +263,8 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 	/* Group based on mobility */
 	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
 }
+#undef GFP_MOVABLE_MASK
+#undef GFP_MOVABLE_SHIFT
 
 static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
 {
diff --git a/mm/internal.h b/mm/internal.h
index 83fb0bfffc13..f99f0ff6935d 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -14,6 +14,25 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 
+/*
+ * The set of flags that only affect watermark checking and reclaim
+ * behaviour. This is used by the MM to obey the caller constraints
+ * about IO, FS and watermark checking while ignoring placement
+ * hints such as HIGHMEM usage.
+ */
+#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
+			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
+
+/* The GFP flags allowed during early boot */
+#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
+
+/* Control allocation cpuset and node placement constraints */
+#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
+
+/* Do not use these with a slab allocator */
+#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
+
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 48ce82926d93..469b639018b0 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -73,6 +73,8 @@ static struct vfsmount *shm_mnt;
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
 
+#include "internal.h"
+
 #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
 #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9ad4dcb0631c..af6d519aa21b 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -35,6 +35,8 @@
 #include <asm/tlbflush.h>
 #include <asm/shmparam.h>
 
+#include "internal.h"
+
 struct vfree_deferred {
 	struct llist_head list;
 	struct work_struct wq;


* Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-21 10:52 ` [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
  2015-09-24 13:50   ` Michal Hocko
  2015-09-25 19:22   ` Johannes Weiner
@ 2015-09-29 21:01   ` Andrew Morton
  2015-09-30  8:27     ` Mel Gorman
  2 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2015-09-29 21:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, 21 Sep 2015 11:52:41 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> High-order watermark checking exists for two reasons --  kswapd high-order
> awareness and protection for high-order atomic requests. Historically the
> kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
> free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations on demand and
> avoids using those blocks for order-0 allocations. This is more flexible
> and reliable than MIGRATE_RESERVE was.
> 
> A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order allocation
> request steals a pageblock but limits the total number to 1% of the zone.
> Callers that speculatively abuse atomic allocations for long-lived
> high-order allocations to access the reserve will quickly fail. Note that
> SLUB is currently not such an abuser as it reclaims at least once.  It is
> possible that the pageblock stolen has few suitable high-order pages and
> will need to steal again in the near future but there would need to be
> strong justification to search all pageblocks for an ideal candidate.
> 
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
> 
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
> 
> The reserved pageblocks can not be used for order-0 allocations. This may
> allow temporary wastage until a failed reclaim reassigns the pageblock. This
> is deliberate as the intent of the reservation is to satisfy a limited
> number of atomic high-order short-lived requests if the system requires them.
> 
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 high-order
> page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
> is much larger than the potential reserve and it does not attempt to be
> realistic.  It is intended to stress random high-order allocations from
> an unknown source and to show that there is a reduction in failures without
> introducing an anomaly where atomic allocations are more reliable than
> regular allocations.  The amount of memory reserved varied throughout the
> workload as reserves were created and reclaimed under memory pressure. The
> allocation failures once the workload warmed up were as follows:
> 
> 4.2-rc5-vanilla		70%
> 4.2-rc5-atomic-reserve	56%
> 
> The failure rate was also measured while building multiple kernels. The
> failure rate was 14% without the patch and 6% with it applied.
> 
> Overall, this is a small reduction but the reserves are small relative
> to the number of allocation requests. In early versions of the patch,
> the failure rate reduced by a much larger amount but that required much
> larger reserves and perversely made atomic allocations seem more reliable
> than regular allocations.
> 
> ...
>
> +/*
> + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> + * there are no empty page blocks that contain a page with a suitable order
> + */
> +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> +				unsigned int alloc_order)
> +{
> +	int mt;
> +	unsigned long max_managed, flags;
> +
> +	/*
> +	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> +	 * Check is race-prone but harmless.
> +	 */
> +	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> +	if (zone->nr_reserved_highatomic >= max_managed)
> +		return;
> +
> +	/* Yoink! */
> +	spin_lock_irqsave(&zone->lock, flags);
> +
> +	mt = get_pageblock_migratetype(page);
> +	if (mt != MIGRATE_HIGHATOMIC &&
> +			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {

Do the above checks really need to be inside zone->lock?  I don't think
get_pageblock_migratetype() needs zone->lock?  (Actually I suspect it
does, but we don't...)

> +		zone->nr_reserved_highatomic += pageblock_nr_pages;

And I don't think it would hurt to recheck
nr_reserved_highatomic>=max_managed after taking zone->lock, to plug
that race.  We've had VM we-dont-care races in the past which ended up
causing problems in rare circumstances...

> +		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> +		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> +	}
> +	spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +/*
> + * Used when an allocation is about to fail under memory pressure. This
> + * potentially hurts the reliability of high-order allocations when under
> + * intense memory pressure but failed atomic allocations should be easier
> + * to recover from than an OOM.
> + */
> +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> +{
> +	struct zonelist *zonelist = ac->zonelist;
> +	unsigned long flags;
> +	struct zoneref *z;
> +	struct zone *zone;
> +	struct page *page;
> +	int order;
> +
> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> +								ac->nodemask) {
> +		/* Preserve at least one pageblock */
> +		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> +			continue;
> +
> +		spin_lock_irqsave(&zone->lock, flags);
> +		for (order = 0; order < MAX_ORDER; order++) {
> +			struct free_area *area = &(zone->free_area[order]);
> +
> +			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> +				continue;
> +
> +			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> +						struct page, lru);
> +
> +			zone->nr_reserved_highatomic -= pageblock_nr_pages;

So if the race happened here, zone->nr_reserved_highatomic underflows?

> +			/*
> +			 * Convert to ac->migratetype and avoid the normal
> +			 * pageblock stealing heuristics. Minimally, the caller
> +			 * is doing the work and needs the pages. More
> +			 * importantly, if the block was always converted to
> +			 * MIGRATE_UNMOVABLE or another type then the number
> +			 * of pageblocks that cannot be completely freed
> +			 * may increase.
> +			 */
> +			set_pageblock_migratetype(page, ac->migratetype);
> +			move_freepages_block(zone, page, ac->migratetype);
> +			spin_unlock_irqrestore(&zone->lock, flags);
> +			return;
> +		}
> +		spin_unlock_irqrestore(&zone->lock, flags);
> +	}
> +}
>
> ...
>


* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-21 12:03 ` [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  2015-09-25 19:32   ` Johannes Weiner
@ 2015-09-29 21:05   ` Andrew Morton
  2015-09-30  8:46     ` Mel Gorman
  2015-09-30 14:11   ` Vlastimil Babka
  2 siblings, 1 reply; 48+ messages in thread
From: Andrew Morton @ 2015-09-29 21:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, 21 Sep 2015 13:03:17 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
> 
> High-order watermarks serve a different purpose. Kswapd
> had no high-order awareness before they were introduced
> (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au).  This was
> particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
> 
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts for every order up to the
> requested order must be examined.
> 
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.
> 
> With the patch applied, there was little difference in the allocation
> failure rates as the atomic reserves are small relative to the number of
> allocation attempts. The expected impact is that there will never be an
> allocation failure report that shows suitable pages on the free lists.
> 
> The one potential side-effect of this is that in a vanilla kernel, the
> watermark checks may have kept a free page for an atomic allocation. Now,
> we rely entirely on the HighAtomic reserves and on an early allocation
> having created them.  If the first high-order atomic allocation is after
> the system is already heavily fragmented then it'll fail.
> 
> ...
>
>  static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  			unsigned long mark, int classzone_idx, int alloc_flags,
> @@ -2317,7 +2319,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  {
>  	long min = mark;
>  	int o;
> -	long free_cma = 0;
> +	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);

hmpf.  Setting a bool to 0x10 is a bit grubby.
  
>  	/* free_pages may go negative - that's OK */
>  	free_pages -= (1 << order) - 1;
> @@ -2330,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  	 * the high-atomic reserves. This will over-estimate the size of the
>  	 * atomic reserve but it avoids a search.
>  	 */
> -	if (likely(!(alloc_flags & ALLOC_HARDER)))
> +	if (likely(!alloc_harder))
>  		free_pages -= z->nr_reserved_highatomic;
>  	else
>  		min -= min / 4;
> @@ -2338,22 +2340,43 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  #ifdef CONFIG_CMA
>  	/* If allocation can't use CMA areas don't use free CMA pages */
>  	if (!(alloc_flags & ALLOC_CMA))
> -		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> +		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
>  #endif
>  
> -	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> +	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
>  		return false;
> -	for (o = 0; o < order; o++) {
> -		/* At the next order, this order's pages become unavailable */
> -		free_pages -= z->free_area[o].nr_free << o;
>  
> -		/* Require fewer higher order pages to be free */
> -		min >>= 1;
> +	/* order-0 watermarks are ok */

because?

> +	if (!order)
> +		return true;
> +
> +	/* Check at least one high-order page is free */
> +	for (o = order; o < MAX_ORDER; o++) {
> +		struct free_area *area = &z->free_area[o];
> +		int mt;
> +
> +		if (!area->nr_free)
> +			continue;
> +
> +		if (alloc_harder) {
> +			if (area->nr_free)
> +				return true;
> +			continue;
> +		}
>  
> -		if (free_pages <= min)
> -			return false;
> +		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +			if (!list_empty(&area->free_list[mt]))
> +				return true;
> +		}
> +
> +#ifdef CONFIG_CMA
> +		if ((alloc_flags & ALLOC_CMA) &&
> +		    !list_empty(&area->free_list[MIGRATE_CMA])) {
> +			return true;
> +		}
> +#endif
>  	}
> -	return true;
> +	return false;
>  }



* Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-29 21:01   ` Andrew Morton
@ 2015-09-30  8:27     ` Mel Gorman
  2015-09-30 14:02       ` Vlastimil Babka
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-30  8:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Sep 29, 2015 at 02:01:41PM -0700, Andrew Morton wrote:
> > ...
> >
> > +/*
> > + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > + * there are no empty page blocks that contain a page with a suitable order
> > + */
> > +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> > +				unsigned int alloc_order)
> > +{
> > +	int mt;
> > +	unsigned long max_managed, flags;
> > +
> > +	/*
> > +	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> > +	 * Check is race-prone but harmless.
> > +	 */
> > +	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> > +	if (zone->nr_reserved_highatomic >= max_managed)
> > +		return;
> > +
> > +	/* Yoink! */
> > +	spin_lock_irqsave(&zone->lock, flags);
> > +
> > +	mt = get_pageblock_migratetype(page);
> > +	if (mt != MIGRATE_HIGHATOMIC &&
> > +			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
> 
> Do the above checks really need to be inside zone->lock?  I don't think
> get_pageblock_migratetype() needs zone->lock?  (Actually I suspect it
> does, but we don't...)
> 

The get_pageblock_migratetype does not require zone->lock but it's race-prone
without it and there have been cases (CMA, isolation) that cared. In this
case, without the lock two parallel allocations may try to reserve the same
block so we'd have to recheck the type under the lock to avoid corrupting
nr_reserved_highatomic. As the move between free lists absolutely requires
the zone->lock, it's best to just do the full operation under the lock.

> > +		zone->nr_reserved_highatomic += pageblock_nr_pages;
> 
> And I don't think it would hurt to recheck
> nr_reserved_highatomic>=max_managed after taking zone->lock, to plug
> that race.  We've had VM we-dont-care races in the past which ended up
> causing problems in rare circumstances...
> 

That makes sense, patch is below.

> > +		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> > +		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> > +	}
> > +	spin_unlock_irqrestore(&zone->lock, flags);
> > +}
> > +
> > +/*
> > + * Used when an allocation is about to fail under memory pressure. This
> > + * potentially hurts the reliability of high-order allocations when under
> > + * intense memory pressure but failed atomic allocations should be easier
> > + * to recover from than an OOM.
> > + */
> > +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> > +{
> > +	struct zonelist *zonelist = ac->zonelist;
> > +	unsigned long flags;
> > +	struct zoneref *z;
> > +	struct zone *zone;
> > +	struct page *page;
> > +	int order;
> > +
> > +	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> > +								ac->nodemask) {
> > +		/* Preserve at least one pageblock */
> > +		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> > +			continue;
> > +
> > +		spin_lock_irqsave(&zone->lock, flags);
> > +		for (order = 0; order < MAX_ORDER; order++) {
> > +			struct free_area *area = &(zone->free_area[order]);
> > +
> > +			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> > +				continue;
> > +
> > +			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> > +						struct page, lru);
> > +
> > +			zone->nr_reserved_highatomic -= pageblock_nr_pages;
> 
> So if the race happened here, zone->nr_reserved_highatomic underflows?
> 

It shouldn't. If there are entries on the MIGRATE_HIGHATOMIC free list then
they should be accounted for in nr_reserved_highatomic. However, I see your
point as a spill from per-cpu lists has caused us problems in the past.

---8<---
From: Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH] mm, page_alloc: Reserve pageblocks for high-order atomic
 allocations on demand -fix

nr_reserved_highatomic is checked outside the zone lock so there is a race
whereby the reserve is larger than the limit allows. This patch rechecks
the count under the zone lock.

During unreserving, there is a possibility we could underflow if there
ever was a race between per-cpu drains, reserve and unreserving. This
patch adds a comment about the potential race and protects against it.

These are two fixes to the mmotm patch
mm-page_alloc-reserve-pageblocks-for-high-order-atomic-allocations-on-demand.patch.
They are not separate patches and should be folded into it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 811d6fc4ad5d..b1892dc51b55 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1633,9 +1633,13 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 	if (zone->nr_reserved_highatomic >= max_managed)
 		return;
 
-	/* Yoink! */
 	spin_lock_irqsave(&zone->lock, flags);
 
+	/* Recheck the nr_reserved_highatomic limit under the lock */
+	if (zone->nr_reserved_highatomic >= max_managed)
+		goto out_unlock;
+
+	/* Yoink! */
 	mt = get_pageblock_migratetype(page);
 	if (mt != MIGRATE_HIGHATOMIC &&
 			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
@@ -1643,6 +1647,8 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
 		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
 		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
 	}
+
+out_unlock:
 	spin_unlock_irqrestore(&zone->lock, flags);
 }
 
@@ -1677,7 +1683,14 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
 			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
 						struct page, lru);
 
-			zone->nr_reserved_highatomic -= pageblock_nr_pages;
+			/*
+			 * It should never happen but changes to locking could
+			 * inadvertently allow a per-cpu drain to add pages
+			 * to MIGRATE_HIGHATOMIC while unreserving so be safe
+			 * and watch for underflows.
+			 */
+			zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
+				zone->nr_reserved_highatomic);
 
 			/*
 			 * Convert to ac->migratetype and avoid the normal


* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-29 21:05   ` Andrew Morton
@ 2015-09-30  8:46     ` Mel Gorman
  2015-09-30 14:17       ` Vlastimil Babka
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-30  8:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Sep 29, 2015 at 02:05:07PM -0700, Andrew Morton wrote:
> >  static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  			unsigned long mark, int classzone_idx, int alloc_flags,
> > @@ -2317,7 +2319,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  {
> >  	long min = mark;
> >  	int o;
> > -	long free_cma = 0;
> > +	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
> 
> hmpf.  Setting a bool to 0x10 is a bit grubby.
>   

Should be safe, but I see your point. For any other type it would be
truncated and look like a bug.
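
A tiny illustration of that concern (the 0x100 flag below is made up; with
ALLOC_HARDER being 0x10 the bool form is safe):

	/* bool normalises any non-zero mask result to true */
	const bool harder = (alloc_flags & ALLOC_HARDER);	/* 0x10 -> true */

	/* a narrower integer type could silently truncate a wider flag */
	const unsigned char broken = (alloc_flags & 0x100);	/* -> 0, looks like a bug */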

> >  	/* free_pages may go negative - that's OK */
> >  	free_pages -= (1 << order) - 1;
> > @@ -2330,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  	 * the high-atomic reserves. This will over-estimate the size of the
> >  	 * atomic reserve but it avoids a search.
> >  	 */
> > -	if (likely(!(alloc_flags & ALLOC_HARDER)))
> > +	if (likely(!alloc_harder))
> >  		free_pages -= z->nr_reserved_highatomic;
> >  	else
> >  		min -= min / 4;
> > @@ -2338,22 +2340,43 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  #ifdef CONFIG_CMA
> >  	/* If allocation can't use CMA areas don't use free CMA pages */
> >  	if (!(alloc_flags & ALLOC_CMA))
> > -		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> > +		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> >  #endif
> >  
> > -	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> > +	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
> >  		return false;
> > -	for (o = 0; o < order; o++) {
> > -		/* At the next order, this order's pages become unavailable */
> > -		free_pages -= z->free_area[o].nr_free << o;
> >  
> > -		/* Require fewer higher order pages to be free */
> > -		min >>= 1;
> > +	/* order-0 watermarks are ok */
> 
> because?
> 

The wizard of oz because because!

This should fix it up better than clicking my shoes three times.

---8<---
From: Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH] mm, page_alloc: only enforce watermarks for order-0
 allocations -fix

This patch updates comments for clarity and converts a bool to an int.
The code as-is is ok as the compiler is meant to cast it correctly, but
it looks odd to people who know the value would be truncated and lost
for other types.

This is a fix to the mmotm patch
mm-page_alloc-only-enforce-watermarks-for-order-0-allocations.patch

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 25731624d734..fedec98aafca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2332,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	long min = mark;
 	int o;
-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
+	const int alloc_harder = (alloc_flags & ALLOC_HARDER);
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2356,14 +2356,19 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
+	/*
+	 * Check watermarks for an order-0 allocation request. If these
+	 * are not met, then a high-order request also cannot go ahead
+	 * even if a suitable page happened to be free.
+	 */
 	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return false;
 
-	/* order-0 watermarks are ok */
+	/* If this is an order-0 request then the watermark is fine */
 	if (!order)
 		return true;
 
-	/* Check at least one high-order page is free */
+	/* For a high-order request, check at least one suitable page is free */
 	for (o = order; o < MAX_ORDER; o++) {
 		struct free_area *area = &z->free_area[o];
 		int mt;


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-29 13:35         ` Mel Gorman
@ 2015-09-30 12:26           ` Vlastimil Babka
  2015-09-30 13:17             ` Mel Gorman
  2015-10-01  3:04             ` Drokin, Oleg
  0 siblings, 2 replies; 48+ messages in thread
From: Vlastimil Babka @ 2015-09-30 12:26 UTC (permalink / raw)
  To: Mel Gorman, Johannes Weiner
  Cc: Andrew Morton, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML, Oleg Drokin, Andreas Dilger

[+CC lustre maintainers]

On 09/29/2015 03:35 PM, Mel Gorman wrote:
>>> Ok, I'll add a TODO to create a patch that removes GFP_IOFS entirely. It
>>> can be tacked on to the end of the series.
>>
>> Okay, that makes sense to me. Thanks!
>>
>
> This?

Thanks for adding this. I think I also pointed out this GFP_IOFS oddness
in earlier versions.

> ---8<---
> mm: page_alloc: Remove GFP_IOFS
>
> GFP_IOFS was intended to be shorthand for clearing two flags, not a
> set of allocation flags. There is only one user of this flag combination
> now and there appears to be no reason why Lustre had to be protected

Looks like a mistake to me. __GFP_IO | __GFP_FS have no effect without 
(former) __GFP_WAIT, so I doubt __GFP_WAIT was omitted on purpose, while 
leaving the other two. The naming of GFP_IOFS suggested it was to be 
used in allocations, leading to the mistake.

But I see you also converted several instances of GFP_NOFS to 
GFP_KERNEL. Is that correct? This is a filesystem driver after all...

> from reclaim stalls. As none of the sites appear to be atomic, this
> patch simply deletes GFP_IOFS and converts Lustre to using GFP_KERNEL.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   drivers/staging/lustre/lnet/lnet/router.c           |  2 +-
>   drivers/staging/lustre/lnet/selftest/conrpc.c       |  2 +-
>   drivers/staging/lustre/lnet/selftest/rpc.c          |  2 +-
>   drivers/staging/lustre/lustre/libcfs/module.c       |  2 +-
>   drivers/staging/lustre/lustre/libcfs/tracefile.c    |  2 +-
>   drivers/staging/lustre/lustre/llite/remote_perm.c   |  2 +-
>   drivers/staging/lustre/lustre/mgc/mgc_request.c     | 10 +++++-----
>   drivers/staging/lustre/lustre/obdecho/echo_client.c |  2 +-
>   drivers/staging/lustre/lustre/osc/osc_cache.c       |  2 +-
>   include/linux/gfp.h                                 |  1 -
>   10 files changed, 13 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
> index 4fbae5ef44a9..dad9816dfee7 100644
> --- a/drivers/staging/lustre/lnet/lnet/router.c
> +++ b/drivers/staging/lustre/lnet/lnet/router.c
> @@ -1246,7 +1246,7 @@ lnet_new_rtrbuf(lnet_rtrbufpool_t *rbp, int cpt)
>   	for (i = 0; i < npages; i++) {
>   		page = alloc_pages_node(
>   				cfs_cpt_spread_node(lnet_cpt_table(), cpt),
> -				__GFP_ZERO | GFP_IOFS, 0);
> +				GFP_KERNEL | __GFP_ZERO, 0);
>   		if (page == NULL) {
>   			while (--i >= 0)
>   				__free_page(rb->rb_kiov[i].kiov_page);
> diff --git a/drivers/staging/lustre/lnet/selftest/conrpc.c b/drivers/staging/lustre/lnet/selftest/conrpc.c
> index a1a4e08f7391..3fc37de8d304 100644
> --- a/drivers/staging/lustre/lnet/selftest/conrpc.c
> +++ b/drivers/staging/lustre/lnet/selftest/conrpc.c
> @@ -861,7 +861,7 @@ lstcon_testrpc_prep(lstcon_node_t *nd, int transop, unsigned feats,
>   			bulk->bk_iovs[i].kiov_offset = 0;
>   			bulk->bk_iovs[i].kiov_len    = len;
>   			bulk->bk_iovs[i].kiov_page   =
> -				alloc_page(GFP_IOFS);
> +				alloc_page(GFP_KERNEL);
>
>   			if (bulk->bk_iovs[i].kiov_page == NULL) {
>   				lstcon_rpc_put(*crpc);
> diff --git a/drivers/staging/lustre/lnet/selftest/rpc.c b/drivers/staging/lustre/lnet/selftest/rpc.c
> index 6ae133138b17..aa0f88fbb221 100644
> --- a/drivers/staging/lustre/lnet/selftest/rpc.c
> +++ b/drivers/staging/lustre/lnet/selftest/rpc.c
> @@ -146,7 +146,7 @@ srpc_alloc_bulk(int cpt, unsigned bulk_npg, unsigned bulk_len, int sink)
>   		int nob;
>
>   		pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(), cpt),
> -				      GFP_IOFS, 0);
> +				      GFP_KERNEL, 0);
>   		if (pg == NULL) {
>   			CERROR("Can't allocate page %d of %d\n", i, bulk_npg);
>   			srpc_free_bulk(bk);
> diff --git a/drivers/staging/lustre/lustre/libcfs/module.c b/drivers/staging/lustre/lustre/libcfs/module.c
> index 806f9747a3a2..303143f28c06 100644
> --- a/drivers/staging/lustre/lustre/libcfs/module.c
> +++ b/drivers/staging/lustre/lustre/libcfs/module.c
> @@ -321,7 +321,7 @@ static int libcfs_ioctl(struct cfs_psdev_file *pfile, unsigned long cmd, void *a
>   	struct libcfs_ioctl_data *data;
>   	int err = 0;
>
> -	LIBCFS_ALLOC_GFP(buf, 1024, GFP_IOFS);
> +	LIBCFS_ALLOC_GFP(buf, 1024, GFP_KERNEL);
>   	if (buf == NULL)
>   		return -ENOMEM;
>
> diff --git a/drivers/staging/lustre/lustre/libcfs/tracefile.c b/drivers/staging/lustre/lustre/libcfs/tracefile.c
> index effa2af58c13..a7d72f69c4eb 100644
> --- a/drivers/staging/lustre/lustre/libcfs/tracefile.c
> +++ b/drivers/staging/lustre/lustre/libcfs/tracefile.c
> @@ -810,7 +810,7 @@ int cfs_trace_allocate_string_buffer(char **str, int nob)
>   	if (nob > 2 * PAGE_CACHE_SIZE)	    /* string must be "sensible" */
>   		return -EINVAL;
>
> -	*str = kmalloc(nob, GFP_IOFS | __GFP_ZERO);
> +	*str = kmalloc(nob, GFP_KERNEL | __GFP_ZERO);

This could use kzalloc.

>   	if (*str == NULL)
>   		return -ENOMEM;
>
> diff --git a/drivers/staging/lustre/lustre/llite/remote_perm.c b/drivers/staging/lustre/lustre/llite/remote_perm.c
> index 39022ea88b5f..b27f016c3dd4 100644
> --- a/drivers/staging/lustre/lustre/llite/remote_perm.c
> +++ b/drivers/staging/lustre/lustre/llite/remote_perm.c
> @@ -84,7 +84,7 @@ static struct hlist_head *alloc_rmtperm_hash(void)
>
>   	OBD_SLAB_ALLOC_GFP(hash, ll_rmtperm_hash_cachep,
>   			   REMOTE_PERM_HASHSIZE * sizeof(*hash),
> -			   GFP_IOFS);
> +			   GFP_KERNEL);
>   	if (!hash)
>   		return NULL;
>
> diff --git a/drivers/staging/lustre/lustre/mgc/mgc_request.c b/drivers/staging/lustre/lustre/mgc/mgc_request.c
> index 019ee2f256aa..79551319d754 100644
> --- a/drivers/staging/lustre/lustre/mgc/mgc_request.c
> +++ b/drivers/staging/lustre/lustre/mgc/mgc_request.c
> @@ -198,7 +198,7 @@ struct config_llog_data *do_config_log_add(struct obd_device *obd,
>   	CDEBUG(D_MGC, "do adding config log %s:%p\n", logname,
>   	       cfg ? cfg->cfg_instance : NULL);
>
> -	cld = kzalloc(sizeof(*cld) + strlen(logname) + 1, GFP_NOFS);
> +	cld = kzalloc(sizeof(*cld) + strlen(logname) + 1, GFP_KERNEL);
>   	if (!cld)
>   		return ERR_PTR(-ENOMEM);
>
> @@ -1127,7 +1127,7 @@ static int mgc_apply_recover_logs(struct obd_device *mgc,
>   	LASSERT(cfg->cfg_instance != NULL);
>   	LASSERT(cfg->cfg_sb == cfg->cfg_instance);
>
> -	inst = kzalloc(PAGE_CACHE_SIZE, GFP_NOFS);
> +	inst = kzalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
>   	if (!inst)
>   		return -ENOMEM;
>
> @@ -1334,14 +1334,14 @@ static int mgc_process_recover_log(struct obd_device *obd,
>   	if (cfg->cfg_last_idx == 0) /* the first time */
>   		nrpages = CONFIG_READ_NRPAGES_INIT;
>
> -	pages = kcalloc(nrpages, sizeof(*pages), GFP_NOFS);
> +	pages = kcalloc(nrpages, sizeof(*pages), GFP_KERNEL);
>   	if (pages == NULL) {
>   		rc = -ENOMEM;
>   		goto out;
>   	}
>
>   	for (i = 0; i < nrpages; i++) {
> -		pages[i] = alloc_page(GFP_IOFS);
> +		pages[i] = alloc_page(GFP_KERNEL);
>   		if (pages[i] == NULL) {
>   			rc = -ENOMEM;
>   			goto out;
> @@ -1492,7 +1492,7 @@ static int mgc_process_cfg_log(struct obd_device *mgc,
>   	if (cld->cld_cfg.cfg_sb)
>   		lsi = s2lsi(cld->cld_cfg.cfg_sb);
>
> -	env = kzalloc(sizeof(*env), GFP_NOFS);
> +	env = kzalloc(sizeof(*env), GFP_KERNEL);
>   	if (!env)
>   		return -ENOMEM;
>
> diff --git a/drivers/staging/lustre/lustre/obdecho/echo_client.c b/drivers/staging/lustre/lustre/obdecho/echo_client.c
> index 27bd170c3a28..7c8443644300 100644
> --- a/drivers/staging/lustre/lustre/obdecho/echo_client.c
> +++ b/drivers/staging/lustre/lustre/obdecho/echo_client.c
> @@ -1561,7 +1561,7 @@ static int echo_client_kbrw(struct echo_device *ed, int rw, struct obdo *oa,
>   		  (oa->o_valid & OBD_MD_FLFLAGS) != 0 &&
>   		  (oa->o_flags & OBD_FL_DEBUG_CHECK) != 0);
>
> -	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_IOFS : GFP_HIGHUSER;
> +	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_KERNEL : GFP_HIGHUSER;
>
>   	LASSERT(rw == OBD_BRW_WRITE || rw == OBD_BRW_READ);
>   	LASSERT(lsm != NULL);
> diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
> index c72035e048aa..6fa6bc6874ab 100644
> --- a/drivers/staging/lustre/lustre/osc/osc_cache.c
> +++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
> @@ -346,7 +346,7 @@ static struct osc_extent *osc_extent_alloc(struct osc_object *obj)
>   {
>   	struct osc_extent *ext;
>
> -	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_IOFS);
> +	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_KERNEL);
>   	if (ext == NULL)
>   		return NULL;
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 60b2db94d49d..369227202ac2 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -134,7 +134,6 @@ struct vm_area_struct;
>   #define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
>   #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>   #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
>   #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
>   			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
>   			 ~__GFP_KSWAPD_RECLAIM)
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-30 12:26           ` Vlastimil Babka
@ 2015-09-30 13:17             ` Mel Gorman
  2015-10-01  3:04             ` Drokin, Oleg
  1 sibling, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2015-09-30 13:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Johannes Weiner, Andrew Morton, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Oleg Drokin,
	Andreas Dilger

On Wed, Sep 30, 2015 at 02:26:24PM +0200, Vlastimil Babka wrote:
> [+CC lustre maintainers]
> 
> On 09/29/2015 03:35 PM, Mel Gorman wrote:
> >>>Ok, I'll add a TODO to create a patch that removes GFP_IOFS entirely. It
> >>>can be tacked on to the end of the series.
> >>
> >>Okay, that makes sense to me. Thanks!
> >>
> >
> >This?
> 
> Thanks for adding this. I think I also pointed out this GFP_IOFS oddness
> in earlier versions.
> 
> >---8<---
> >mm: page_alloc: Remove GFP_IOFS
> >
> >GFP_IOFS was intended to be shorthand for clearing two flags, not a
> >set of allocation flags. There is only one user of this flag combination
> >now and there appears to be no reason why Lustre had to be protected
> 
> Looks like a mistake to me. __GFP_IO | __GFP_FS have no effect without
> (former) __GFP_WAIT, so I doubt __GFP_WAIT was omitted on purpose, while
> leaving the other two. The naming of GFP_IOFS suggested it was to be used in
> allocations, leading to the mistake.
> 

GFP_IOFS is shorthand for clearing bits and should not have been used as
an allocation flag. Using it as an allocation flag is almost certainly a
mistake.

At a stretch, GFP_IOFS could make sense if we supported page reclaim that
does not block (e.g. discard clean pages without buffers to release) but
we don't.
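
To make the distinction concrete, here is a hedged kernel-style sketch
(illustration only, not code from this series; the helper names and the
caller-supplied mask are hypothetical):

#include <linux/gfp.h>

static struct page *iofs_intended(gfp_t caller_gfp)
{
	/* Intended pattern: clear the bits to forbid fs/io recursion */
	return alloc_page(caller_gfp & ~(__GFP_IO | __GFP_FS));
}

static struct page *iofs_mistaken(void)
{
	/*
	 * Mistaken pattern (what Lustre did): __GFP_IO | __GFP_FS alone,
	 * with no reclaim bit set, so the two flags have nothing to modify.
	 */
	return alloc_page(__GFP_IO | __GFP_FS);
}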

> But I see you also converted several instances of GFP_NOFS to GFP_KERNEL. Is
> that correct? This is a filesystem driver after all...
> 

Only in cases where a reclaim path is reentrant and could already be
holding locks, which would result in deadlock. I didn't spot such a case
but then again, I'm not familiar with the filesystem and it's complex.

Let's see what they say, because how they are currently using GFP_IOFS is
almost certainly wrong or at least surprising.
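
For readers less familiar with the distinction being questioned above, a
minimal sketch of why GFP_NOFS matters on reclaim-reentrant paths (the
lock and helper are hypothetical, not taken from Lustre):

#include <linux/slab.h>
#include <linux/mutex.h>

static void *nofs_alloc_example(size_t size, struct mutex *fs_lock)
{
	void *buf;

	mutex_lock(fs_lock);
	/*
	 * GFP_NOFS: direct reclaim may still run, but it will not call back
	 * into filesystem code, so it cannot block on the fs_lock held here.
	 * GFP_KERNEL on such a path could recurse into the filesystem's own
	 * writeback/reclaim and deadlock on the lock we already hold.
	 */
	buf = kmalloc(size, GFP_NOFS);
	mutex_unlock(fs_lock);

	return buf;
}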

> >diff --git a/drivers/staging/lustre/lustre/libcfs/tracefile.c b/drivers/staging/lustre/lustre/libcfs/tracefile.c
> >index effa2af58c13..a7d72f69c4eb 100644
> >--- a/drivers/staging/lustre/lustre/libcfs/tracefile.c
> >+++ b/drivers/staging/lustre/lustre/libcfs/tracefile.c
> >@@ -810,7 +810,7 @@ int cfs_trace_allocate_string_buffer(char **str, int nob)
> >  	if (nob > 2 * PAGE_CACHE_SIZE)	    /* string must be "sensible" */
> >  		return -EINVAL;
> >
> >-	*str = kmalloc(nob, GFP_IOFS | __GFP_ZERO);
> >+	*str = kmalloc(nob, GFP_KERNEL | __GFP_ZERO);
> 
> This could use kzalloc.
> 

True.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-30  8:27     ` Mel Gorman
@ 2015-09-30 14:02       ` Vlastimil Babka
  0 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2015-09-30 14:02 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 09/30/2015 10:27 AM, Mel Gorman wrote:
> On Tue, Sep 29, 2015 at 02:01:41PM -0700, Andrew Morton wrote:
>>> ...
>>>
>>> +/*
>>> + * Reserve a pageblock for exclusive use of high-order atomic allocations if
>>> + * there are no empty page blocks that contain a page with a suitable order
>>> + */
>>> +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
>>> +				unsigned int alloc_order)
>>> +{
>>> +	int mt;
>>> +	unsigned long max_managed, flags;
>>> +
>>> +	/*
>>> +	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
>>> +	 * Check is race-prone but harmless.
>>> +	 */
>>> +	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
>>> +	if (zone->nr_reserved_highatomic >= max_managed)
>>> +		return;
>>> +
>>> +	/* Yoink! */
>>> +	spin_lock_irqsave(&zone->lock, flags);
>>> +
>>> +	mt = get_pageblock_migratetype(page);
>>> +	if (mt != MIGRATE_HIGHATOMIC &&
>>> +			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
>>
>> Do the above checks really need to be inside zone->lock?  I don't think
>> get_pageblock_migratetype() needs zone->lock?  (Actually I suspect it
>> does, but we don't...)
>>
>
> The get_pageblock_migratetype does not require zone->lock but it's race-prone
> without it and there have been cases (CMA, isolation) that cared. In this
> case, without the lock two parallel allocations may try to reserve the same
> block so we'd have to recheck the type under the lock to avoid corrupting
> nr_reserved_highatomic. As the move between free lists absolutely requires
> the zone->lock, it's best to just do the full operation under the lock.
>
>>> +		zone->nr_reserved_highatomic += pageblock_nr_pages;
>>
>> And I don't think it would hurt to recheck
>> nr_reserved_highatomic>=max_managed after taking zone->lock, to plug
>> that race.  We've had VM we-dont-care races in the past which ended up
>> causing problems in rare circumstances...
>>
>
> That makes sense, patch is below.
>
>>> +		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
>>> +		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
>>> +	}
>>> +	spin_unlock_irqrestore(&zone->lock, flags);
>>> +}
>>> +
>>> +/*
>>> + * Used when an allocation is about to fail under memory pressure. This
>>> + * potentially hurts the reliability of high-order allocations when under
>>> + * intense memory pressure but failed atomic allocations should be easier
>>> + * to recover from than an OOM.
>>> + */
>>> +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
>>> +{
>>> +	struct zonelist *zonelist = ac->zonelist;
>>> +	unsigned long flags;
>>> +	struct zoneref *z;
>>> +	struct zone *zone;
>>> +	struct page *page;
>>> +	int order;
>>> +
>>> +	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
>>> +								ac->nodemask) {
>>> +		/* Preserve at least one pageblock */
>>> +		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
>>> +			continue;
>>> +
>>> +		spin_lock_irqsave(&zone->lock, flags);
>>> +		for (order = 0; order < MAX_ORDER; order++) {
>>> +			struct free_area *area = &(zone->free_area[order]);
>>> +
>>> +			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
>>> +				continue;
>>> +
>>> +			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
>>> +						struct page, lru);
>>> +
>>> +			zone->nr_reserved_highatomic -= pageblock_nr_pages;
>>
>> So if the race happened here, zone->nr_reserved_highatomic underflows?
>>
>
> It shouldn't. If there are entries on the MIGRATE_HIGHATOMIC list then
> it should be accounted for in nr_reserved_highatomic. However, I see your
> point as a spill from per-cpu lists has caused us problems in the past.
>
> ---8<---
> From: Mel Gorman <mgorman@techsingularity.net>
> Subject: [PATCH] mm, page_alloc: Reserve pageblocks for high-order atomic
>   allocations on demand -fix
>
> nr_reserved_highatomic is checked outside the zone lock so there is a race
> whereby the reserve is larger than the limit allows. This patch rechecks
> the count under the zone lock.
>
> During unreserving, there is a possibility we could underflow if there
> ever was a race between per-cpu drains, reserve and unreserving. This
> patch adds a comment about the potential race and protects against it.
>
> These are two fixes to the mmotm patch
> mm-page_alloc-reserve-pageblocks-for-high-order-atomic-allocations-on-demand.patch .
> They are not separate patches and they should all be folded together.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Ack.

> ---
>   mm/page_alloc.c | 17 +++++++++++++++--
>   1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 811d6fc4ad5d..b1892dc51b55 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1633,9 +1633,13 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
>   	if (zone->nr_reserved_highatomic >= max_managed)
>   		return;
>
> -	/* Yoink! */
>   	spin_lock_irqsave(&zone->lock, flags);
>
> +	/* Recheck the nr_reserved_highatomic limit under the lock */
> +	if (zone->nr_reserved_highatomic >= max_managed)
> +		goto out_unlock;
> +
> +	/* Yoink! */
>   	mt = get_pageblock_migratetype(page);
>   	if (mt != MIGRATE_HIGHATOMIC &&
>   			!is_migrate_isolate(mt) && !is_migrate_cma(mt)) {
> @@ -1643,6 +1647,8 @@ static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
>   		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
>   		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
>   	}
> +
> +out_unlock:
>   	spin_unlock_irqrestore(&zone->lock, flags);
>   }
>
> @@ -1677,7 +1683,14 @@ static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
>   			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
>   						struct page, lru);
>
> -			zone->nr_reserved_highatomic -= pageblock_nr_pages;
> +			/*
> +			 * It should never happen but changes to locking could
> +			 * inadvertently allow a per-cpu drain to add pages
> +			 * to MIGRATE_HIGHATOMIC while unreserving so be safe
> +			 * and watch for underflows.
> +			 */
> +			zone->nr_reserved_highatomic -= min(pageblock_nr_pages,
> +				zone->nr_reserved_highatomic);
>
>   			/*
>   			 * Convert to ac->migratetype and avoid the normal
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-21 12:03 ` [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  2015-09-25 19:32   ` Johannes Weiner
  2015-09-29 21:05   ` Andrew Morton
@ 2015-09-30 14:11   ` Vlastimil Babka
  2 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2015-09-30 14:11 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 09/21/2015 02:03 PM, Mel Gorman wrote:
> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
>
> High-order watermarks serve a different purpose. Kswapd
> had no high-order awareness before they were introduced
> (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au).  This was
> particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
>
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts up to the requested order must
> be examined.
>
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.
>
> With the patch applied, there was little difference in the allocation
> failure rates as the atomic reserves are small relative to the number of
> allocation attempts. The expected impact is that there will never be an
> allocation failure report that shows suitable pages on the free lists.
>
> The one potential side-effect of this is that in a vanilla kernel, the
> watermark checks may have kept a free page for an atomic allocation. Now,
> we are 100% relying on the HighAtomic reserves and an early allocation to
> have allocated them.  If the first high-order atomic allocation is after
> the system is already heavily fragmented then it'll fail.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

(nitpick below)

> ---
>   mm/page_alloc.c | 51 +++++++++++++++++++++++++++++++++++++--------------
>   1 file changed, 37 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 811d6fc4ad5d..ee379d3b6cc2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2308,8 +2308,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>   #endif /* CONFIG_FAIL_PAGE_ALLOC */
>
>   /*
> - * Return true if free pages are above 'mark'. This takes into account the order
> - * of the allocation.
> + * Return true if free base pages are above 'mark'. For high-order checks it
> + * will return true if the order-0 watermark is reached and there is at least
> + * one free page of a suitable size. Checking now avoids taking the zone lock
> + * to check in the allocation paths if no pages are free.
>    */
>   static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   			unsigned long mark, int classzone_idx, int alloc_flags,
> @@ -2317,7 +2319,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   {
>   	long min = mark;
>   	int o;
> -	long free_cma = 0;
> +	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
>
>   	/* free_pages may go negative - that's OK */
>   	free_pages -= (1 << order) - 1;
> @@ -2330,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   	 * the high-atomic reserves. This will over-estimate the size of the
>   	 * atomic reserve but it avoids a search.
>   	 */
> -	if (likely(!(alloc_flags & ALLOC_HARDER)))
> +	if (likely(!alloc_harder))
>   		free_pages -= z->nr_reserved_highatomic;
>   	else
>   		min -= min / 4;
> @@ -2338,22 +2340,43 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   #ifdef CONFIG_CMA
>   	/* If allocation can't use CMA areas don't use free CMA pages */
>   	if (!(alloc_flags & ALLOC_CMA))
> -		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> +		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
>   #endif
>
> -	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> +	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
>   		return false;
> -	for (o = 0; o < order; o++) {
> -		/* At the next order, this order's pages become unavailable */
> -		free_pages -= z->free_area[o].nr_free << o;
>
> -		/* Require fewer higher order pages to be free */
> -		min >>= 1;
> +	/* order-0 watermarks are ok */
> +	if (!order)
> +		return true;
> +
> +	/* Check at least one high-order page is free */
> +	for (o = order; o < MAX_ORDER; o++) {
> +		struct free_area *area = &z->free_area[o];
> +		int mt;
> +
> +		if (!area->nr_free)
> +			continue;
> +
> +		if (alloc_harder) {
> +			if (area->nr_free)
> +				return true;

We already checked area->nr_free, so just return true (as Joonsoo 
suggested).

> +			continue;
> +		}
>
> -		if (free_pages <= min)
> -			return false;
> +		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +			if (!list_empty(&area->free_list[mt]))
> +				return true;
> +		}
> +
> +#ifdef CONFIG_CMA
> +		if ((alloc_flags & ALLOC_CMA) &&
> +		    !list_empty(&area->free_list[MIGRATE_CMA])) {
> +			return true;
> +		}
> +#endif
>   	}
> -	return true;
> +	return false;
>   }
>
>   bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30  8:46     ` Mel Gorman
@ 2015-09-30 14:17       ` Vlastimil Babka
  2015-09-30 15:12         ` Mel Gorman
  0 siblings, 1 reply; 48+ messages in thread
From: Vlastimil Babka @ 2015-09-30 14:17 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 09/30/2015 10:46 AM, Mel Gorman wrote:
> On Tue, Sep 29, 2015 at 02:05:07PM -0700, Andrew Morton wrote:
>
> The wizard of oz because because!
>
> This should fix it up better than clicking my shoes three times.
>
> ---8<---
> From: Mel Gorman <mgorman@techsingularity.net>
> Subject: [PATCH] mm, page_alloc: only enforce watermarks for order-0
>   allocations -fix
>
> This patch updates comments for clarity and converts a bool to an int.
> The code as-is is ok as the compiler is meant to cast it correctly, but
> it looks odd to people who know the value would be truncated and lost
> for other types.
>
> This is a fix to the mmotm patch
> mm-page_alloc-only-enforce-watermarks-for-order-0-allocations.patch
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Ack (with nitpick below)

> ---
>   mm/page_alloc.c | 11 ++++++++---
>   1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 25731624d734..fedec98aafca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2332,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   {
>   	long min = mark;
>   	int o;
> -	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
> +	const int alloc_harder = (alloc_flags & ALLOC_HARDER);

How about the !!(alloc_flags & ALLOC_HARDER) conversion to bool? Unless
it forces the compiler to do some extra work...

>
>   	/* free_pages may go negative - that's OK */
>   	free_pages -= (1 << order) - 1;
> @@ -2356,14 +2356,19 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
>   #endif
>
> +	/*
> +	 * Check watermarks for an order-0 allocation request. If these
> +	 * are not met, then a high-order request also cannot go ahead
> +	 * even if a suitable page happened to be free.
> +	 */
>   	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
>   		return false;
>
> -	/* order-0 watermarks are ok */
> +	/* If this is an order-0 request then the watermark is fine */
>   	if (!order)
>   		return true;
>
> -	/* Check at least one high-order page is free */
> +	/* For a high-order request, check at least one suitable page is free */
>   	for (o = order; o < MAX_ORDER; o++) {
>   		struct free_area *area = &z->free_area[o];
>   		int mt;
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30 14:17       ` Vlastimil Babka
@ 2015-09-30 15:12         ` Mel Gorman
  2015-09-30 20:37           ` Andrew Morton
  0 siblings, 1 reply; 48+ messages in thread
From: Mel Gorman @ 2015-09-30 15:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Wed, Sep 30, 2015 at 04:17:44PM +0200, Vlastimil Babka wrote:
> >---
> >  mm/page_alloc.c | 11 ++++++++---
> >  1 file changed, 8 insertions(+), 3 deletions(-)
> >
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 25731624d734..fedec98aafca 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2332,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  {
> >  	long min = mark;
> >  	int o;
> >-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
> >+	const int alloc_harder = (alloc_flags & ALLOC_HARDER);
> 
> How about the !!(alloc_flags & ALLOC_HARDER) conversion to bool? Unless
> it forces the compiler to do some extra work...
> 

Some people frown upon that trick as being obscure when it's not necessary
and a modern compiler is meant to get it right. The int is clear and
obvious in this context so I just went with it.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30 15:12         ` Mel Gorman
@ 2015-09-30 20:37           ` Andrew Morton
  0 siblings, 0 replies; 48+ messages in thread
From: Andrew Morton @ 2015-09-30 20:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vlastimil Babka, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Wed, 30 Sep 2015 16:12:34 +0100 Mel Gorman <mgorman@techsingularity.net> wrote:

> On Wed, Sep 30, 2015 at 04:17:44PM +0200, Vlastimil Babka wrote:
> > >---
> > >  mm/page_alloc.c | 11 ++++++++---
> > >  1 file changed, 8 insertions(+), 3 deletions(-)
> > >
> > >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > >index 25731624d734..fedec98aafca 100644
> > >--- a/mm/page_alloc.c
> > >+++ b/mm/page_alloc.c
> > >@@ -2332,7 +2332,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > >  {
> > >  	long min = mark;
> > >  	int o;
> > >-	const bool alloc_harder = (alloc_flags & ALLOC_HARDER);
> > >+	const int alloc_harder = (alloc_flags & ALLOC_HARDER);
> > 
> > How about the !!(alloc_flags & ALLOC_HARDER) conversion to bool? Unless
> > it forces the compiler to do some extra work...
> > 
> 
> Some people frown upon that trick as being obscure when it's not necessary
> and a modern compiler is meant to get it right. The int is clear and
> obvious in this context so I just went with it.

Yes, the !! does generate extra code. It doesn't seem like worthwhile
overhead for a tiny cosmetic thing.
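
As a side-by-side illustration of the three spellings discussed in this
subthread (sketch only; 0x10 is the value Andrew quoted for ALLOC_HARDER,
reused here as a standalone define):

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_HARDER 0x10

int main(void)
{
	int alloc_flags = ALLOC_HARDER;

	const int  harder_int  = alloc_flags & ALLOC_HARDER;     /* 0x10, what the -fix uses */
	const bool harder_bool = alloc_flags & ALLOC_HARDER;     /* 1, implicit conversion */
	const bool harder_bang = !!(alloc_flags & ALLOC_HARDER); /* 1, explicit, may cost an op */

	printf("%d %d %d\n", harder_int, harder_bool, harder_bang);
	return 0;
}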

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled
  2015-09-21 10:52 ` [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
  2015-09-24 20:06   ` Johannes Weiner
@ 2015-09-30 22:22   ` David Rientjes
  2015-10-01  7:35     ` Vlastimil Babka
  1 sibling, 1 reply; 48+ messages in thread
From: David Rientjes @ 2015-09-30 22:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, 21 Sep 2015, Mel Gorman wrote:

> There is a seqcounter that protects against spurious allocation failures
> when a task is changing the allowed nodes in a cpuset. There is no need
> to check the seqcounter until a cpuset exists.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/cpuset.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 1b357997cac5..6eb27cb480b7 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>   */
>  static inline unsigned int read_mems_allowed_begin(void)
>  {
> +	if (!cpusets_enabled())
> +		return 0;
> +
>  	return read_seqcount_begin(&current->mems_allowed_seq);
>  }
>  
> @@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
>   */
>  static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> +	if (!cpusets_enabled())
> +		return false;
> +
>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
>  }
>  

I thought this was going to test nr_cpusets() <= 1?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-21 10:52 ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
  2015-09-25 19:03   ` Johannes Weiner
  2015-09-28 23:55   ` Andrew Morton
@ 2015-09-30 22:25   ` David Rientjes
  2 siblings, 0 replies; 48+ messages in thread
From: David Rientjes @ 2015-09-30 22:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, 21 Sep 2015, Mel Gorman wrote:

> __GFP_WAIT was used to signal that the caller was in atomic context and
> could not sleep.  Now it is possible to distinguish between true atomic
> context and callers that are not willing to sleep. The latter should clear
> __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> behaves differently, there is a risk that people will clear the wrong
> flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> what it does -- setting it allows all reclaim activity, clearing them
> prevents it.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-30 12:26           ` Vlastimil Babka
  2015-09-30 13:17             ` Mel Gorman
@ 2015-10-01  3:04             ` Drokin, Oleg
  2015-10-02 12:30               ` Mel Gorman
  1 sibling, 1 reply; 48+ messages in thread
From: Drokin, Oleg @ 2015-10-01  3:04 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Johannes Weiner, Andrew Morton, Rik van Riel,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML,
	Dilger, Andreas

Hello!
On Sep 30, 2015, at 8:26 AM, Vlastimil Babka wrote:

>> diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
>> 
>> index 4fbae5ef44a9..dad9816dfee7 100644
>> --- a/drivers/staging/lustre/lnet/lnet/router.c
>> +++ b/drivers/staging/lustre/lnet/lnet/router.c
>> @@ -1246,7 +1246,7 @@ lnet_new_rtrbuf(lnet_rtrbufpool_t *rbp, int cpt)
>>  	for (i = 0; i < npages; i++) {
>>  		page = alloc_pages_node(
>>  				cfs_cpt_spread_node(lnet_cpt_table(), cpt),
>> -				__GFP_ZERO | GFP_IOFS, 0);
>> +				GFP_KERNEL | __GFP_ZERO, 0);
>>  		if (page == NULL) {
>>  			while (--i >= 0)
>>  				__free_page(rb->rb_kiov[i].kiov_page);

This one is ok, it's in the non-fs part, so cannot enter via an fs operation.

>> diff --git a/drivers/staging/lustre/lnet/selftest/conrpc.c b/drivers/staging/lustre/lnet/selftest/conrpc.c
>> index a1a4e08f7391..3fc37de8d304 100644
>> --- a/drivers/staging/lustre/lnet/selftest/conrpc.c
>> +++ b/drivers/staging/lustre/lnet/selftest/conrpc.c
>> @@ -861,7 +861,7 @@ lstcon_testrpc_prep(lstcon_node_t *nd, int transop, unsigned feats,
>>  			bulk->bk_iovs[i].kiov_offset = 0;
>>  			bulk->bk_iovs[i].kiov_len    = len;
>>  			bulk->bk_iovs[i].kiov_page   =
>> -				alloc_page(GFP_IOFS);
>> +				alloc_page(GFP_KERNEL);
>> 
>>  			if (bulk->bk_iovs[i].kiov_page == NULL) {
>>  				lstcon_rpc_put(*crpc);
>> diff --git a/drivers/staging/lustre/lnet/selftest/rpc.c b/drivers/staging/lustre/lnet/selftest/rpc.c
>> index 6ae133138b17..aa0f88fbb221 100644
>> --- a/drivers/staging/lustre/lnet/selftest/rpc.c
>> +++ b/drivers/staging/lustre/lnet/selftest/rpc.c
>> @@ -146,7 +146,7 @@ srpc_alloc_bulk(int cpt, unsigned bulk_npg, unsigned bulk_len, int sink)
>>  		int nob;
>> 
>>  		pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(), cpt),
>> -				      GFP_IOFS, 0);
>> +				      GFP_KERNEL, 0);
>>  		if (pg == NULL) {
>>  			CERROR("Can't allocate page %d of %d\n", i, bulk_npg);
>>  			srpc_free_bulk(bk);

These two are in "lnet-selftest", which is self-hosted, so also ok.

>> diff --git a/drivers/staging/lustre/lustre/libcfs/module.c b/drivers/staging/lustre/lustre/libcfs/module.c
>> index 806f9747a3a2..303143f28c06 100644
>> --- a/drivers/staging/lustre/lustre/libcfs/module.c
>> +++ b/drivers/staging/lustre/lustre/libcfs/module.c
>> @@ -321,7 +321,7 @@ static int libcfs_ioctl(struct cfs_psdev_file *pfile, unsigned long cmd, void *a
>>  	struct libcfs_ioctl_data *data;
>>  	int err = 0;
>> 
>> -	LIBCFS_ALLOC_GFP(buf, 1024, GFP_IOFS);
>> +	LIBCFS_ALLOC_GFP(buf, 1024, GFP_KERNEL);
>>  	if (buf == NULL)
>>  		return -ENOMEM;
>> 
>> diff --git a/drivers/staging/lustre/lustre/libcfs/tracefile.c b/drivers/staging/lustre/lustre/libcfs/tracefile.c
>> index effa2af58c13..a7d72f69c4eb 100644
>> --- a/drivers/staging/lustre/lustre/libcfs/tracefile.c
>> +++ b/drivers/staging/lustre/lustre/libcfs/tracefile.c
>> @@ -810,7 +810,7 @@ int cfs_trace_allocate_string_buffer(char **str, int nob)
>>  	if (nob > 2 * PAGE_CACHE_SIZE)	    /* string must be "sensible" */
>>  		return -EINVAL;
>> 
>> -	*str = kmalloc(nob, GFP_IOFS | __GFP_ZERO);
>> +	*str = kmalloc(nob, GFP_KERNEL | __GFP_ZERO);
> 
> This could use kzalloc.
> 
>>  	if (*str == NULL)
>>  		return -ENOMEM;
>> 
>> diff --git a/drivers/staging/lustre/lustre/llite/remote_perm.c b/drivers/staging/lustre/lustre/llite/remote_perm.c
>> index 39022ea88b5f..b27f016c3dd4 100644
>> --- a/drivers/staging/lustre/lustre/llite/remote_perm.c
>> +++ b/drivers/staging/lustre/lustre/llite/remote_perm.c
>> @@ -84,7 +84,7 @@ static struct hlist_head *alloc_rmtperm_hash(void)
>> 
>>  	OBD_SLAB_ALLOC_GFP(hash, ll_rmtperm_hash_cachep,
>>  			   REMOTE_PERM_HASHSIZE * sizeof(*hash),
>> -			   GFP_IOFS);
>> +			   GFP_KERNEL);
>>  	if (!hash)
>>  		return NULL;
>> 

This is called from ll_inode_permission (the inode ops->permission method), so I imagine this must be GFP_NOFS.

>> diff --git a/drivers/staging/lustre/lustre/mgc/mgc_request.c b/drivers/staging/lustre/lustre/mgc/mgc_request.c
>> index 019ee2f256aa..79551319d754 100644
>> --- a/drivers/staging/lustre/lustre/mgc/mgc_request.c
>> +++ b/drivers/staging/lustre/lustre/mgc/mgc_request.c
>> @@ -198,7 +198,7 @@ struct config_llog_data *do_config_log_add(struct obd_device *obd,
>>  	CDEBUG(D_MGC, "do adding config log %s:%p\n", logname,
>>  	       cfg ? cfg->cfg_instance : NULL);
>> 
>> -	cld = kzalloc(sizeof(*cld) + strlen(logname) + 1, GFP_NOFS);
>> +	cld = kzalloc(sizeof(*cld) + strlen(logname) + 1, GFP_KERNEL);
>>  	if (!cld)
>>  		return ERR_PTR(-ENOMEM);
>> 
>> @@ -1127,7 +1127,7 @@ static int mgc_apply_recover_logs(struct obd_device *mgc,
>>  	LASSERT(cfg->cfg_instance != NULL);
>>  	LASSERT(cfg->cfg_sb == cfg->cfg_instance);
>> 
>> -	inst = kzalloc(PAGE_CACHE_SIZE, GFP_NOFS);
>> +	inst = kzalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
>>  	if (!inst)
>>  		return -ENOMEM;
>> 
>> @@ -1334,14 +1334,14 @@ static int mgc_process_recover_log(struct obd_device *obd,
>>  	if (cfg->cfg_last_idx == 0) /* the first time */
>>  		nrpages = CONFIG_READ_NRPAGES_INIT;
>> 
>> -	pages = kcalloc(nrpages, sizeof(*pages), GFP_NOFS);
>> +	pages = kcalloc(nrpages, sizeof(*pages), GFP_KERNEL);
>>  	if (pages == NULL) {
>>  		rc = -ENOMEM;
>>  		goto out;
>>  	}
>> 
>>  	for (i = 0; i < nrpages; i++) {
>> -		pages[i] = alloc_page(GFP_IOFS);
>> +		pages[i] = alloc_page(GFP_KERNEL);
>>  		if (pages[i] == NULL) {
>>  			rc = -ENOMEM;
>>  			goto out;
>> @@ -1492,7 +1492,7 @@ static int mgc_process_cfg_log(struct obd_device *mgc,
>>  	if (cld->cld_cfg.cfg_sb)
>>  		lsi = s2lsi(cld->cld_cfg.cfg_sb);
>> 
>> -	env = kzalloc(sizeof(*env), GFP_NOFS);
>> +	env = kzalloc(sizeof(*env), GFP_KERNEL);
>>  	if (!env)
>>  		return -ENOMEM;

These should live in their own separate thread, so I imagine they should be fine.

>> diff --git a/drivers/staging/lustre/lustre/obdecho/echo_client.c b/drivers/staging/lustre/lustre/obdecho/echo_client.c
>> index 27bd170c3a28..7c8443644300 100644
>> --- a/drivers/staging/lustre/lustre/obdecho/echo_client.c
>> +++ b/drivers/staging/lustre/lustre/obdecho/echo_client.c
>> @@ -1561,7 +1561,7 @@ static int echo_client_kbrw(struct echo_device *ed, int rw, struct obdo *oa,
>>  		  (oa->o_valid & OBD_MD_FLFLAGS) != 0 &&
>>  		  (oa->o_flags & OBD_FL_DEBUG_CHECK) != 0);
>> 
>> -	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_IOFS : GFP_HIGHUSER;
>> +	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_KERNEL : GFP_HIGHUSER;
>> 
>>  	LASSERT(rw == OBD_BRW_WRITE || rw == OBD_BRW_READ);
>>  	LASSERT(lsm != NULL);

This is its own thing, so ok

>> diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
>> index c72035e048aa..6fa6bc6874ab 100644
>> --- a/drivers/staging/lustre/lustre/osc/osc_cache.c
>> +++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
>> @@ -346,7 +346,7 @@ static struct osc_extent *osc_extent_alloc(struct osc_object *obj)
>>  {
>>  	struct osc_extent *ext;
>> 
>> -	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_IOFS);
>> +	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_KERNEL);
>>  	if (ext == NULL)
>>  		return NULL;
>> 

These are called in the IO path, so they should be GFP_NOFS, really.

Thanks!


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled
  2015-09-30 22:22   ` David Rientjes
@ 2015-10-01  7:35     ` Vlastimil Babka
  0 siblings, 0 replies; 48+ messages in thread
From: Vlastimil Babka @ 2015-10-01  7:35 UTC (permalink / raw)
  To: David Rientjes, Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 10/01/2015 12:22 AM, David Rientjes wrote:
> On Mon, 21 Sep 2015, Mel Gorman wrote:
>> @@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
>>    */
>>   static inline bool read_mems_allowed_retry(unsigned int seq)
>>   {
>> +	if (!cpusets_enabled())
>> +		return false;
>> +
>>   	return read_seqcount_retry(&current->mems_allowed_seq, seq);
>>   }
>>
>
> I thought this was going to test nr_cpusets() <= 1?

That was another patch in a prior iteration of the series, but it turns
out it was unnecessary, because cpusets_enabled() is already only true
when nr_cpusets() > 1 - see https://lkml.org/lkml/2015/8/25/300

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-29 13:37     ` Mel Gorman
@ 2015-10-01  8:39       ` Vlastimil Babka
  2015-10-02 13:03         ` [PATCH] mm: page_alloc: Hide some GFP internals and document the bits and flag combinations -fix Mel Gorman
  2015-10-01 14:06       ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Michal Hocko
  1 sibling, 1 reply; 48+ messages in thread
From: Vlastimil Babka @ 2015-10-01  8:39 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 09/29/2015 03:37 PM, Mel Gorman wrote:
> mm: page_alloc: Hide some GFP internals and document the bits and flag combinations
>
> Andrew started the following
>
> 	We have quite a history of remote parts of the kernel using
> 	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
> 	to think that this is because we didn't adequately explain the
> 	interface.
>
> 	And I don't think that gfp.h really improved much in this area as
> 	a result of this patchset.  Could you go through it some time and
> 	decide if we've adequately documented all this stuff?
>
> This patch first moves some GFP flag combinations that are part of the MM
> internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
> bits under various headings and then documents the flag combinations. It
> will not help callers that are brain damaged but the clarity might motivate
> some fixes and avoid future mistakes.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

With some nitpicks below.

> +/*
> + * Reclaim modifiers
> + *
> + * __GFP_IO can start physical IO.
> + *
> + * __GFP_FS can call down to the low-level FS. Avoids the allocator

"Clearing the flag avoids..."? Avoids confusion.

> + *   recursing into the filesystem which might already be holding locks.
> + *
> + * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
> + *   This flag can be cleared to avoid unnecessary delays when a fallback
> + *   option is available.
> + *
> + * __GFP_KSWAPD_RECLAIM indicates that the caller wants kswapd when the low

s/wants/wakes/? or "wants kswapd woken up"?

> + * GFP_USER is for userspace allocations that also need to be directly
> + *   accessible by the kernel or hardware. It is typically used by hardware
> + *   for buffers that are mapped to userspace (e.g. graphics) that hardware
> + *   still must DMA to. cpuset limits are enforced for these allocations.
> + *
> + * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
> + *   do not need to be directly accessible by the kernel but that cannot
> + *   move once in use. An example may be a hardware allocation that maps
> + *   data directly into userspace but has no addressing limitations.
> + *
> + * GFP_DMA exists for historical reasons and should be avoided where possible.
> + *   The flags indicates that the caller requires that the lowest zone be
> + *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
> + *   it would require careful auditing as some users really require it and
> + *   others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
> + *   lowest zone as a type of emergency reserve.
> + *
> + * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
> + *   address.
> + *
> + * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
> + *   need direct access to but can use kmap() when access is required. They
> + *   are expected to be movable via page reclaim or page migration. Typically,
> + *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.

Move GFP_HIGHUSER_MOVABLE right below GFP_HIGHUSER?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-09-29 13:37     ` Mel Gorman
  2015-10-01  8:39       ` Vlastimil Babka
@ 2015-10-01 14:06       ` Michal Hocko
  1 sibling, 0 replies; 48+ messages in thread
From: Michal Hocko @ 2015-10-01 14:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Tue 29-09-15 14:37:21, Mel Gorman wrote:
[...]
> mm: page_alloc: Hide some GFP internals and document the bits and flag combinations
> 
> Andrew started the following
> 
> 	We have quite a history of remote parts of the kernel using
> 	weird/wrong/inexplicable combinations of __GFP_ flags.	I tend
> 	to think that this is because we didn't adequately explain the
> 	interface.
> 
> 	And I don't think that gfp.h really improved much in this area as
> 	a result of this patchset.  Could you go through it some time and
> 	decide if we've adequately documented all this stuff?
> 
> This patch first moves some GFP flag combinations that are part of the MM
> internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
> bits under various headings and then documents the flag combinations. It
> will not help callers that are brain damaged but the clarity might motivate
> some fixes and avoid future mistakes.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Yes this looks like a clear improvement.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/gfp.h | 252 +++++++++++++++++++++++++++++++++++-----------------
>  mm/internal.h       |  19 ++++
>  mm/shmem.c          |   2 +
>  mm/vmalloc.c        |   2 +
>  4 files changed, 193 insertions(+), 82 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 369227202ac2..67654f08a28b 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -39,9 +39,7 @@ struct vm_area_struct;
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> - * GFP bitmasks..
> - *
> - * Zone modifiers (see linux/mmzone.h - low three bits)
> + * Physical address zone modifiers (see linux/mmzone.h - low four bits)
>   *
>   * Do not put any conditional on these. If necessary modify the definitions
>   * without the underscores and use them consistently. The definitions here may
> @@ -50,121 +48,209 @@ struct vm_area_struct;
>  #define __GFP_DMA	((__force gfp_t)___GFP_DMA)
>  #define __GFP_HIGHMEM	((__force gfp_t)___GFP_HIGHMEM)
>  #define __GFP_DMA32	((__force gfp_t)___GFP_DMA32)
> -#define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* Page is movable */
> +#define __GFP_MOVABLE	((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
>  #define GFP_ZONEMASK	(__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
> +
>  /*
> - * Action modifiers - doesn't change the zoning
> + * Page mobility and placement hints
>   *
> - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> - * _might_ fail.  This depends upon the particular VM implementation.
> + * These flags provide hints about how mobile the page is. Pages with similar
> + * mobility are placed within the same pageblocks to minimise problems due
> + * to external fragmentation.
>   *
> - * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> - * cannot handle allocation failures. New users should be evaluated carefully
> - * (and the flag should be used only when there is no reasonable failure policy)
> - * but it is definitely preferable to use the flag rather than opencode endless
> - * loop around allocator.
> + * __GFP_MOVABLE (also a zone modifier) indicates that the page can be
> + *   moved by page migration during memory compaction or can be reclaimed.
>   *
> - * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> - * return NULL when direct reclaim and memory compaction have failed to allow
> - * the allocation to succeed.  The OOM killer is not called with the current
> - * implementation.
> + * __GFP_RECLAIMABLE is used for slab allocations that specify
> + *   SLAB_RECLAIM_ACCOUNT and whose pages can be freed via shrinkers.
> + *
> + * __GFP_WRITE indicates the caller intends to dirty the page. Where possible,
> + *   these pages will be spread between local zones to avoid all the dirty
> + *   pages being in one zone (fair zone allocation policy).
>   *
> - * __GFP_MOVABLE: Flag that this page will be movable by the page migration
> - * mechanism or reclaimed
> + * __GFP_HARDWALL enforces the cpuset memory allocation policy.
> + *
> + * __GFP_THISNODE forces the allocation to be satisfied from the requested
> + *   node with no fallbacks or placement policy enforcements.
>   */
> -#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
> -#define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
> -#define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
> -#define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
> -#define __GFP_COLD	((__force gfp_t)___GFP_COLD)	/* Cache-cold page required */
> -#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)	/* Suppress page allocation failure warning */
> -#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)	/* See above */
> -#define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)	/* See above */
> -#define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY) /* See above */
> -#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)/* Allow access to emergency reserves */
> -#define __GFP_COMP	((__force gfp_t)___GFP_COMP)	/* Add compound page metadata */
> -#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)	/* Return zeroed page on success */
> -#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC) /* Don't use emergency reserves.
> -							 * This takes precedence over the
> -							 * __GFP_MEMALLOC flag if both are
> -							 * set
> -							 */
> -#define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL) /* Enforce hardwall cpuset memory allocs */
> -#define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)/* No fallback, no policies */
> -#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
> -#define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
> -#define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
> -
> -#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
> -#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
> +#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
> +#define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)
> +#define __GFP_HARDWALL   ((__force gfp_t)___GFP_HARDWALL)
> +#define __GFP_THISNODE	((__force gfp_t)___GFP_THISNODE)
>  
>  /*
> - * A caller that is willing to wait may enter direct reclaim and will
> - * wake kswapd to reclaim pages in the background until the high
> - * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> - * avoid unnecessary delays when a fallback option is available but
> - * still allow kswapd to reclaim in the background. The kswapd flag
> - * can be cleared when the reclaiming of pages would cause unnecessary
> - * disruption.
> + * Watermark modifiers -- controls access to emergency reserves
> + *
> + * __GFP_HIGH indicates that the caller is high-priority and that granting
> + *   the request is necessary before the system can make forward progress.
> + *   For example, creating an IO context to clean pages.
> + *
> + * __GFP_ATOMIC indicates that the caller cannot reclaim or sleep and is
> + *   high priority. Users are typically interrupt handlers. This may be
> + *   used in conjunction with __GFP_HIGH.
> + *
> + * __GFP_MEMALLOC allows access to all memory. This should only be used when
> + *   the caller guarantees the allocation will allow more memory to be freed
> + *   very shortly, e.g. process exiting or swapping. Users should either
> + *   be the MM or co-ordinate closely with the VM (e.g. swap over NFS).
> + *
> + * __GFP_NOMEMALLOC is used to explicitly forbid access to emergency reserves.
> + *   This takes precedence over the __GFP_MEMALLOC flag if both are set.
> + *
> + * __GFP_NOACCOUNT ignores the accounting for kmemcg limit enforcement.
>   */
> -#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
> +#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)
> +#define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)
> +#define __GFP_MEMALLOC	((__force gfp_t)___GFP_MEMALLOC)
> +#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
> +#define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT)
> +
> +/*
> + * Reclaim modifiers
> + *
> + * __GFP_IO can start physical IO.
> + *
> + * __GFP_FS can call down to the low-level FS. Avoids the allocator
> + *   recursing into the filesystem which might already be holding locks.
> + *
> + * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
> + *   This flag can be cleared to avoid unnecessary delays when a fallback
> + *   option is available.
> + *
> + * __GFP_KSWAPD_RECLAIM indicates that the caller wants kswapd when the low
> + *   watermark is reached and have it reclaim pages until the high watermark
> + *   is reached. A caller may wish to clear this flag when fallback options
> + *   are available and the reclaim is likely to disrupt the system. The
> + *   canonical example is THP allocation where a fallback is cheap but
> + *   reclaim/compaction may cause indirect stalls.
> + *
> + * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> + *
> + * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> + *   _might_ fail.  This depends upon the particular VM implementation.
> + *
> + * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
> + *   cannot handle allocation failures. New users should be evaluated carefully
> + *   (and the flag should be used only when there is no reasonable failure
> + *   policy) but it is definitely preferable to use the flag rather than
> + *   opencode endless loop around allocator.
> + *
> + * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
> + *   return NULL when direct reclaim and memory compaction have failed to allow
> + *   the allocation to succeed.  The OOM killer is not called with the current
> + *   implementation.
> + */
> +#define __GFP_IO	((__force gfp_t)___GFP_IO)
> +#define __GFP_FS	((__force gfp_t)___GFP_FS)
>  #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
>  #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> +#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
> +#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
> +#define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
> +#define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
>  
>  /*
> - * This may seem redundant, but it's a way of annotating false positives vs.
> - * allocations that simply cannot be supported (e.g. page tables).
> + * Action modifiers
> + * 
> + * __GFP_COLD indicates that the caller does not expect the page to be used
> + *   in the near future. Where possible, a cache-cold page will be returned.
> + *
> + * __GFP_NOWARN suppresses allocation failure reports.
> + *
> + * __GFP_COMP adds compound page metadata.
> + *
> + * __GFP_ZERO returns a zeroed page on success.
> + *
> + * __GFP_NOTRACK avoids tracking with kmemcheck.
> + *
> + * __GFP_NOTRACK_FALSE_POSITIVE is an alias of __GFP_NOTRACK. It's a means of
> + *   distinguishing in the source between false positives and allocations that
> + *   cannot be supported (e.g. page tables).
> + *
> + * __GFP_OTHER_NODE is for allocations that are on a remote node but that
> + *   should not be accounted for as a remote allocation in vmstat. A
> + *   typical user would be khugepaged collapsing a huge page on a remote
> + *   node.
>   */
> +#define __GFP_COLD	((__force gfp_t)___GFP_COLD)
> +#define __GFP_NOWARN	((__force gfp_t)___GFP_NOWARN)
> +#define __GFP_COMP	((__force gfp_t)___GFP_COMP)
> +#define __GFP_ZERO	((__force gfp_t)___GFP_ZERO)
> +#define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
> +#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
>  
> -#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
> +/* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
>  /*
> - * GFP_ATOMIC callers can not sleep, need the allocation to succeed.
> - * A lower watermark is applied to allow access to "atomic reserves"
> + * Useful GFP flag combinations that are commonly used. It is recommended
> + * that subsystems start with one of these combinations and then set/clear
> + * __GFP_FOO flags as necessary.
> + *
> + * GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
> + *   watermark is applied to allow access to "atomic reserves"
> + *
> + * GFP_KERNEL is typical for kernel-internal allocations. The caller requires
> + *   ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
> + *
> + * GFP_NOWAIT is for kernel allocations that should not stall for direct
> + *   reclaim, start physical IO or use any filesystem callback.
> + *
> + * GFP_NOIO will use direct reclaim to discard clean pages or slab pages
> + *   that do not require the starting of any physical IO.
> + *
> + * GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
> + *
> + * GFP_USER is for userspace allocations that also need to be directly
> + *   accessible by the kernel or hardware. It is typically used by hardware
> + *   for buffers that are mapped to userspace (e.g. graphics) that hardware
> + *   still must DMA to. cpuset limits are enforced for these allocations.
> + *
> + * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
> + *   do not need to be directly accessible by the kernel but that cannot
> + *   move once in use. An example may be a hardware allocation that maps
> + *   data directly into userspace but has no addressing limitations.
> + *
> + * GFP_DMA exists for historical reasons and should be avoided where possible.
> + *   The flag indicates that the caller requires that the lowest zone be
> + *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
> + *   it would require careful auditing as some users really require it and
> + *   others use the flag to avoid lowmem reserves in ZONE_DMA and treat the
> + *   lowest zone as a type of emergency reserve.
> + *
> + * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
> + *   address.
> + *
> + * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
> + *   need direct access to but can use kmap() when access is required. They
> + *   are expected to be movable via page reclaim or page migration. Typically,
> + *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
> + *
> + * GFP_TRANSHUGE is used for THP allocations. They are compound allocations
> + *   that will fail quickly if memory is not available and will not
> + *   wake kswapd.
>   */
>  #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>  #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
>  #define GFP_NOIO	(__GFP_RECLAIM)
>  #define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
> -#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
>  #define GFP_TEMPORARY	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
>  			 __GFP_RECLAIMABLE)
>  #define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
> +#define GFP_DMA		__GFP_DMA
> +#define GFP_DMA32	__GFP_DMA32
>  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
>  #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
>  			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
>  			 ~__GFP_KSWAPD_RECLAIM)
>  
> -/* This mask makes up all the page movable related flags */
> +/* Convert GFP flags to their corresponding migrate type */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
>  #define GFP_MOVABLE_SHIFT 3
> -
> -/* Control page allocator reclaim behavior */
> -#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
> -			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
> -			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
> -
> -/* Control slab gfp mask during early boot */
> -#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
> -
> -/* Control allocation constraints */
> -#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
> -
> -/* Do not use these with a slab allocator */
> -#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> -
> -/* Flag - indicates that the buffer will be suitable for DMA.  Ignored on some
> -   platforms, used as appropriate on others */
> -
> -#define GFP_DMA		__GFP_DMA
> -
> -/* 4GB DMA on some platforms */
> -#define GFP_DMA32	__GFP_DMA32
> -
> -/* Convert GFP flags to their corresponding migrate type */
>  static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
>  {
>  	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> @@ -177,6 +263,8 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
>  	/* Group based on mobility */
>  	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
>  }
> +#undef GFP_MOVABLE_MASK
> +#undef GFP_MOVABLE_SHIFT
>  
>  static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
>  {
> diff --git a/mm/internal.h b/mm/internal.h
> index 83fb0bfffc13..f99f0ff6935d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -14,6 +14,25 @@
>  #include <linux/fs.h>
>  #include <linux/mm.h>
>  
> +/*
> + * The set of flags that only affect watermark checking and reclaim
> + * behaviour. This is used by the MM to obey the caller constraints
> + * about IO, FS and watermark checking while ignoring placement
> + * hints such as HIGHMEM usage.
> + */
> +#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
> +			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
> +			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
> +
> +/* The GFP flags allowed during early boot */
> +#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
> +
> +/* Control allocation cpuset and node placement constraints */
> +#define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
> +
> +/* Do not use these with a slab allocator */
> +#define GFP_SLAB_BUG_MASK (__GFP_DMA32|__GFP_HIGHMEM|~__GFP_BITS_MASK)
> +
>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>  		unsigned long floor, unsigned long ceiling);
>  
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 48ce82926d93..469b639018b0 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -73,6 +73,8 @@ static struct vfsmount *shm_mnt;
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
>  
> +#include "internal.h"
> +
>  #define BLOCKS_PER_PAGE  (PAGE_CACHE_SIZE/512)
>  #define VM_ACCT(size)    (PAGE_CACHE_ALIGN(size) >> PAGE_SHIFT)
>  
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 9ad4dcb0631c..af6d519aa21b 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -35,6 +35,8 @@
>  #include <asm/tlbflush.h>
>  #include <asm/shmparam.h>
>  
> +#include "internal.h"
> +
>  struct vfree_deferred {
>  	struct llist_head list;
>  	struct work_struct wq;

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 48+ messages in thread
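
The comment block in the patch above recommends that subsystems start with one
of the named combinations and then set or clear individual __GFP_FOO bits. A
minimal sketch of that convention using the new flag names; example_alloc()
and its arguments are hypothetical and only illustrate the pattern:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical helper illustrating the documented convention: start from a
 * recommended combination (GFP_KERNEL here) and set/clear individual bits
 * rather than assembling a mask from scratch.
 */
static struct page *example_alloc(bool may_sleep, bool movable)
{
	gfp_t gfp = GFP_KERNEL;

	if (!may_sleep) {
		/* Skip direct reclaim but still allow kswapd to be woken. */
		gfp &= ~__GFP_DIRECT_RECLAIM;
		gfp |= __GFP_NOWARN;
	}

	if (movable)
		/* Lands on the movable free lists via gfpflags_to_migratetype(). */
		gfp |= __GFP_MOVABLE;

	return alloc_page(gfp);
}

Clearing __GFP_DIRECT_RECLAIM while leaving __GFP_KSWAPD_RECLAIM set gives
behaviour close to GFP_NOWAIT, which is the point of the atomic/high-priority
split introduced earlier in the series.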

* Re: [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-10-01  3:04             ` Drokin, Oleg
@ 2015-10-02 12:30               ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2015-10-02 12:30 UTC (permalink / raw)
  To: Drokin, Oleg
  Cc: Vlastimil Babka, Johannes Weiner, Andrew Morton, Rik van Riel,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML,
	Dilger, Andreas

On Thu, Oct 01, 2015 at 03:04:37AM +0000, Drokin, Oleg wrote:
> Hello!

Thanks Oleg for taking a look! Can you review the following please?

---8<---
mm: page_alloc: Remove GFP_IOFS

GFP_IOFS was intended to be shorthand for clearing two flags, not a set of
allocation flags. There is only one user of this flag combination now and
there appears to be no reason why Lustre had to be protected from reclaim
stalls. As none of the sites appear to be atomic, this patch simply deletes
GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or GFP_NOIO
as appropriate.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 drivers/staging/lustre/lnet/lnet/router.c           | 2 +-
 drivers/staging/lustre/lnet/selftest/conrpc.c       | 2 +-
 drivers/staging/lustre/lnet/selftest/rpc.c          | 2 +-
 drivers/staging/lustre/lustre/libcfs/module.c       | 2 +-
 drivers/staging/lustre/lustre/libcfs/tracefile.c    | 2 +-
 drivers/staging/lustre/lustre/llite/remote_perm.c   | 2 +-
 drivers/staging/lustre/lustre/mgc/mgc_request.c     | 8 ++++----
 drivers/staging/lustre/lustre/obdecho/echo_client.c | 2 +-
 drivers/staging/lustre/lustre/osc/osc_cache.c       | 2 +-
 include/linux/gfp.h                                 | 1 -
 10 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/staging/lustre/lnet/lnet/router.c b/drivers/staging/lustre/lnet/lnet/router.c
index 4fbae5ef44a9..dad9816dfee7 100644
--- a/drivers/staging/lustre/lnet/lnet/router.c
+++ b/drivers/staging/lustre/lnet/lnet/router.c
@@ -1246,7 +1246,7 @@ lnet_new_rtrbuf(lnet_rtrbufpool_t *rbp, int cpt)
 	for (i = 0; i < npages; i++) {
 		page = alloc_pages_node(
 				cfs_cpt_spread_node(lnet_cpt_table(), cpt),
-				__GFP_ZERO | GFP_IOFS, 0);
+				GFP_KERNEL | __GFP_ZERO, 0);
 		if (page == NULL) {
 			while (--i >= 0)
 				__free_page(rb->rb_kiov[i].kiov_page);
diff --git a/drivers/staging/lustre/lnet/selftest/conrpc.c b/drivers/staging/lustre/lnet/selftest/conrpc.c
index a1a4e08f7391..3fc37de8d304 100644
--- a/drivers/staging/lustre/lnet/selftest/conrpc.c
+++ b/drivers/staging/lustre/lnet/selftest/conrpc.c
@@ -861,7 +861,7 @@ lstcon_testrpc_prep(lstcon_node_t *nd, int transop, unsigned feats,
 			bulk->bk_iovs[i].kiov_offset = 0;
 			bulk->bk_iovs[i].kiov_len    = len;
 			bulk->bk_iovs[i].kiov_page   =
-				alloc_page(GFP_IOFS);
+				alloc_page(GFP_KERNEL);
 
 			if (bulk->bk_iovs[i].kiov_page == NULL) {
 				lstcon_rpc_put(*crpc);
diff --git a/drivers/staging/lustre/lnet/selftest/rpc.c b/drivers/staging/lustre/lnet/selftest/rpc.c
index 6ae133138b17..aa0f88fbb221 100644
--- a/drivers/staging/lustre/lnet/selftest/rpc.c
+++ b/drivers/staging/lustre/lnet/selftest/rpc.c
@@ -146,7 +146,7 @@ srpc_alloc_bulk(int cpt, unsigned bulk_npg, unsigned bulk_len, int sink)
 		int nob;
 
 		pg = alloc_pages_node(cfs_cpt_spread_node(lnet_cpt_table(), cpt),
-				      GFP_IOFS, 0);
+				      GFP_KERNEL, 0);
 		if (pg == NULL) {
 			CERROR("Can't allocate page %d of %d\n", i, bulk_npg);
 			srpc_free_bulk(bk);
diff --git a/drivers/staging/lustre/lustre/libcfs/module.c b/drivers/staging/lustre/lustre/libcfs/module.c
index 806f9747a3a2..303143f28c06 100644
--- a/drivers/staging/lustre/lustre/libcfs/module.c
+++ b/drivers/staging/lustre/lustre/libcfs/module.c
@@ -321,7 +321,7 @@ static int libcfs_ioctl(struct cfs_psdev_file *pfile, unsigned long cmd, void *a
 	struct libcfs_ioctl_data *data;
 	int err = 0;
 
-	LIBCFS_ALLOC_GFP(buf, 1024, GFP_IOFS);
+	LIBCFS_ALLOC_GFP(buf, 1024, GFP_KERNEL);
 	if (buf == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lustre/libcfs/tracefile.c b/drivers/staging/lustre/lustre/libcfs/tracefile.c
index effa2af58c13..a7d72f69c4eb 100644
--- a/drivers/staging/lustre/lustre/libcfs/tracefile.c
+++ b/drivers/staging/lustre/lustre/libcfs/tracefile.c
@@ -810,7 +810,7 @@ int cfs_trace_allocate_string_buffer(char **str, int nob)
 	if (nob > 2 * PAGE_CACHE_SIZE)	    /* string must be "sensible" */
 		return -EINVAL;
 
-	*str = kmalloc(nob, GFP_IOFS | __GFP_ZERO);
+	*str = kmalloc(nob, GFP_KERNEL | __GFP_ZERO);
 	if (*str == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lustre/llite/remote_perm.c b/drivers/staging/lustre/lustre/llite/remote_perm.c
index 39022ea88b5f..b2830035faba 100644
--- a/drivers/staging/lustre/lustre/llite/remote_perm.c
+++ b/drivers/staging/lustre/lustre/llite/remote_perm.c
@@ -84,7 +84,7 @@ static struct hlist_head *alloc_rmtperm_hash(void)
 
 	OBD_SLAB_ALLOC_GFP(hash, ll_rmtperm_hash_cachep,
 			   REMOTE_PERM_HASHSIZE * sizeof(*hash),
-			   GFP_IOFS);
+			   GFP_NOFS);
 	if (!hash)
 		return NULL;
 
diff --git a/drivers/staging/lustre/lustre/mgc/mgc_request.c b/drivers/staging/lustre/lustre/mgc/mgc_request.c
index 019ee2f256aa..67ac6945a8b9 100644
--- a/drivers/staging/lustre/lustre/mgc/mgc_request.c
+++ b/drivers/staging/lustre/lustre/mgc/mgc_request.c
@@ -1127,7 +1127,7 @@ static int mgc_apply_recover_logs(struct obd_device *mgc,
 	LASSERT(cfg->cfg_instance != NULL);
 	LASSERT(cfg->cfg_sb == cfg->cfg_instance);
 
-	inst = kzalloc(PAGE_CACHE_SIZE, GFP_NOFS);
+	inst = kzalloc(PAGE_CACHE_SIZE, GFP_KERNEL);
 	if (!inst)
 		return -ENOMEM;
 
@@ -1334,14 +1334,14 @@ static int mgc_process_recover_log(struct obd_device *obd,
 	if (cfg->cfg_last_idx == 0) /* the first time */
 		nrpages = CONFIG_READ_NRPAGES_INIT;
 
-	pages = kcalloc(nrpages, sizeof(*pages), GFP_NOFS);
+	pages = kcalloc(nrpages, sizeof(*pages), GFP_KERNEL);
 	if (pages == NULL) {
 		rc = -ENOMEM;
 		goto out;
 	}
 
 	for (i = 0; i < nrpages; i++) {
-		pages[i] = alloc_page(GFP_IOFS);
+		pages[i] = alloc_page(GFP_KERNEL);
 		if (pages[i] == NULL) {
 			rc = -ENOMEM;
 			goto out;
@@ -1492,7 +1492,7 @@ static int mgc_process_cfg_log(struct obd_device *mgc,
 	if (cld->cld_cfg.cfg_sb)
 		lsi = s2lsi(cld->cld_cfg.cfg_sb);
 
-	env = kzalloc(sizeof(*env), GFP_NOFS);
+	env = kzalloc(sizeof(*env), GFP_KERNEL);
 	if (!env)
 		return -ENOMEM;
 
diff --git a/drivers/staging/lustre/lustre/obdecho/echo_client.c b/drivers/staging/lustre/lustre/obdecho/echo_client.c
index 27bd170c3a28..7c8443644300 100644
--- a/drivers/staging/lustre/lustre/obdecho/echo_client.c
+++ b/drivers/staging/lustre/lustre/obdecho/echo_client.c
@@ -1561,7 +1561,7 @@ static int echo_client_kbrw(struct echo_device *ed, int rw, struct obdo *oa,
 		  (oa->o_valid & OBD_MD_FLFLAGS) != 0 &&
 		  (oa->o_flags & OBD_FL_DEBUG_CHECK) != 0);
 
-	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_IOFS : GFP_HIGHUSER;
+	gfp_mask = ((ostid_id(&oa->o_oi) & 2) == 0) ? GFP_KERNEL : GFP_HIGHUSER;
 
 	LASSERT(rw == OBD_BRW_WRITE || rw == OBD_BRW_READ);
 	LASSERT(lsm != NULL);
diff --git a/drivers/staging/lustre/lustre/osc/osc_cache.c b/drivers/staging/lustre/lustre/osc/osc_cache.c
index c72035e048aa..d05a4632f1fe 100644
--- a/drivers/staging/lustre/lustre/osc/osc_cache.c
+++ b/drivers/staging/lustre/lustre/osc/osc_cache.c
@@ -346,7 +346,7 @@ static struct osc_extent *osc_extent_alloc(struct osc_object *obj)
 {
 	struct osc_extent *ext;
 
-	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_IOFS);
+	OBD_SLAB_ALLOC_PTR_GFP(ext, osc_extent_kmem, GFP_NOIO);
 	if (ext == NULL)
 		return NULL;
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 666498d2f6a8..67654f08a28b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -244,7 +244,6 @@ struct vm_area_struct;
 #define GFP_DMA32	__GFP_DMA32
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
 #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
 			 ~__GFP_KSWAPD_RECLAIM)

^ permalink raw reply related	[flat|nested] 48+ messages in thread
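
The conversions above follow a simple rule: a call site with no
reclaim-relevant locks held becomes GFP_KERNEL, a site that may be entered
with filesystem locks held becomes GFP_NOFS, and a site on an IO submission
path becomes GFP_NOIO. A rough sketch of that distinction with hypothetical
call sites (none of these functions exist in Lustre):

#include <linux/slab.h>

/* Process context with no reclaim-relevant locks held: full reclaim,
 * including filesystem callbacks and physical IO, is safe.
 */
static void *alloc_in_process_context(void)
{
	return kzalloc(1024, GFP_KERNEL);
}

/* Called with a filesystem lock held: reclaim must not call back into
 * filesystem code, although it may still start physical IO such as swap.
 */
static void *alloc_under_fs_lock(void)
{
	return kmalloc(256, GFP_NOFS);
}

/* Called on an IO submission path: reclaim may drop clean pages but must
 * not issue new physical IO of its own.
 */
static void *alloc_on_io_path(void)
{
	return kmalloc(256, GFP_NOIO);
}

That appears to be why the osc_extent allocation above becomes GFP_NOIO while
the configuration and selftest buffers become GFP_KERNEL.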

* [PATCH] mm: page_alloc: Hide some GFP internals and document the bits and flag combinations -fix
  2015-10-01  8:39       ` Vlastimil Babka
@ 2015-10-02 13:03         ` Mel Gorman
  0 siblings, 0 replies; 48+ messages in thread
From: Mel Gorman @ 2015-10-02 13:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

This patch addresses minor comment nitpicks from Vlastimil. It is a fix for the
mmotm patch
mm-page_alloc-hide-some-GFP-internals-and-document-the-bit-and-flag-combinations.patch

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 67654f08a28b..4ab8cfa0aa9f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -110,17 +110,18 @@ struct vm_area_struct;
  *
  * __GFP_IO can start physical IO.
  *
- * __GFP_FS can call down to the low-level FS. Avoids the allocator
- *   recursing into the filesystem which might already be holding locks.
+ * __GFP_FS can call down to the low-level FS. Clearing the flag avoids the
+ *   allocator recursing into the filesystem which might already be holding
+ *   locks.
  *
  * __GFP_DIRECT_RECLAIM indicates that the caller may enter direct reclaim.
  *   This flag can be cleared to avoid unnecessary delays when a fallback
  *   option is available.
  *
- * __GFP_KSWAPD_RECLAIM indicates that the caller wants kswapd when the low
- *   watermark is reached and have it reclaim pages until the high watermark
- *   is reached. A caller may wish to clear this flag when fallback options
- *   are available and the reclaim is likely to disrupt the system. The
+ * __GFP_KSWAPD_RECLAIM indicates that the caller wants to wake kswapd when
+ *   the low watermark is reached and have it reclaim pages until the high
+ *   watermark is reached. A caller may wish to clear this flag when fallback
+ *   options are available and the reclaim is likely to disrupt the system. The
  *   canonical example is THP allocation where a fallback is cheap but
  *   reclaim/compaction may cause indirect stalls.
  *
@@ -208,11 +209,6 @@ struct vm_area_struct;
  *   for buffers that are mapped to userspace (e.g. graphics) that hardware
  *   still must DMA to. cpuset limits are enforced for these allocations.
  *
- * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
- *   do not need to be directly accessible by the kernel but that cannot
- *   move once in use. An example may be a hardware allocation that maps
- *   data directly into userspace but has no addressing limitations.
- *
  * GFP_DMA exists for historical reasons and should be avoided where possible.
 *   The flag indicates that the caller requires that the lowest zone be
  *   used (ZONE_DMA or 16M on x86-64). Ideally, this would be removed but
@@ -223,6 +219,11 @@ struct vm_area_struct;
  * GFP_DMA32 is similar to GFP_DMA except that the caller requires a 32-bit
  *   address.
  *
+ * GFP_HIGHUSER is for userspace allocations that may be mapped to userspace,
+ *   do not need to be directly accessible by the kernel but that cannot
+ *   move once in use. An example may be a hardware allocation that maps
+ *   data directly into userspace but has no addressing limitations.
+ *
  * GFP_HIGHUSER_MOVABLE is for userspace allocations that the kernel does not
  *   need direct access to but can use kmap() when access is required. They
  *   are expected to be movable via page reclaim or page migration. Typically,

^ permalink raw reply related	[flat|nested] 48+ messages in thread
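
The reworded __GFP_KSWAPD_RECLAIM text is the operative guidance for
opportunistic allocations: when a cheap fallback exists, waking kswapd only
causes background disruption. A sketch of a caller applying that advice,
mirroring how GFP_TRANSHUGE masks the flag out; alloc_large_or_fallback() is
a hypothetical helper:

#include <linux/gfp.h>

/*
 * Hypothetical opportunistic allocation: attempt a high-order page without
 * waking kswapd, then fall back to an order-0 page. The gfp mask mirrors the
 * GFP_TRANSHUGE pattern of clearing the kswapd bit.
 */
static struct page *alloc_large_or_fallback(unsigned int order)
{
	gfp_t gfp = (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN) &
		    ~__GFP_KSWAPD_RECLAIM;
	struct page *page = alloc_pages(gfp, order);

	if (page)
		return page;

	/* The fallback is cheap, so full GFP_KERNEL behaviour is fine here. */
	return alloc_pages(GFP_KERNEL, 0);
}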

end of thread, other threads:[~2015-10-02 13:03 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-21 10:52 [PATCH 00/10] Remove zonelist cache and high-order watermark checking v4 Mel Gorman
2015-09-21 10:52 ` [PATCH 01/10] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
2015-09-24 20:01   ` Johannes Weiner
2015-09-21 10:52 ` [PATCH 02/10] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
2015-09-24 20:05   ` Johannes Weiner
2015-09-21 10:52 ` [PATCH 03/10] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
2015-09-24 20:06   ` Johannes Weiner
2015-09-30 22:22   ` David Rientjes
2015-10-01  7:35     ` Vlastimil Babka
2015-09-21 10:52 ` [PATCH 04/10] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
2015-09-24 20:34   ` Johannes Weiner
2015-09-25 12:50     ` Mel Gorman
2015-09-25 13:56       ` Johannes Weiner
2015-09-21 10:52 ` [PATCH 05/10] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
2015-09-24 13:51   ` Michal Hocko
2015-09-24 20:55   ` Johannes Weiner
2015-09-25 12:51     ` Mel Gorman
2015-09-25 19:01       ` Johannes Weiner
2015-09-29 13:35         ` Mel Gorman
2015-09-30 12:26           ` Vlastimil Babka
2015-09-30 13:17             ` Mel Gorman
2015-10-01  3:04             ` Drokin, Oleg
2015-10-02 12:30               ` Mel Gorman
2015-09-21 10:52 ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
2015-09-25 19:03   ` Johannes Weiner
2015-09-28 23:55   ` Andrew Morton
2015-09-29 13:37     ` Mel Gorman
2015-10-01  8:39       ` Vlastimil Babka
2015-10-02 13:03         ` [PATCH] mm: page_alloc: Hide some GFP internals and document the bits and flag combinations -fix Mel Gorman
2015-10-01 14:06       ` [PATCH 06/10] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Michal Hocko
2015-09-30 22:25   ` David Rientjes
2015-09-21 10:52 ` [PATCH 07/10] mm, page_alloc: Delete the zonelist_cache Mel Gorman
2015-09-25 19:09   ` Johannes Weiner
2015-09-21 10:52 ` [PATCH 08/10] mm, page_alloc: Remove MIGRATE_RESERVE Mel Gorman
2015-09-21 10:52 ` [PATCH 09/10] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
2015-09-24 13:50   ` Michal Hocko
2015-09-25 19:22   ` Johannes Weiner
2015-09-29 21:01   ` Andrew Morton
2015-09-30  8:27     ` Mel Gorman
2015-09-30 14:02       ` Vlastimil Babka
2015-09-21 12:03 ` [PATCH 10/10] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
2015-09-25 19:32   ` Johannes Weiner
2015-09-29 21:05   ` Andrew Morton
2015-09-30  8:46     ` Mel Gorman
2015-09-30 14:17       ` Vlastimil Babka
2015-09-30 15:12         ` Mel Gorman
2015-09-30 20:37           ` Andrew Morton
2015-09-30 14:11   ` Vlastimil Babka
