* [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3
@ 2015-08-24 12:09 Mel Gorman
  2015-08-24 12:09 ` [PATCH 01/12] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
                   ` (11 more replies)
  0 siblings, 12 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

Changelog since V3
o Covered cases where __GFP_KSWAPD_RECLAIM is needed		(vbabka)
o Cleaned up trailing references to zlc				(vbabka)
o Fixed a subtle problem with GFP_TRANSHUGE checks		(vbabka)
o Split out an unrelated change to its own patch		(vbabka)
o Reordered series to put GFP flag modifications at start	(mhocko)
o Added a number of clarifications on reclaim modifications	(mhocko)
o Only check cpusets when one exists that can limit memory	(rientjes)
o Applied acks

Changelog since V2
o Improve cpusets checks as suggested				(rientjes)
o Add various acks and reviewed-bys
o Rebase to 4.2-rc6

Changelog since V1
o Rebase to 4.2-rc5
o Distinguish between high priority callers and callers that avoid sleep
o Remove jump label related damage patches

Overall, the intent of this series is to remove the zonelist cache which
was introduced to avoid high overhead in the page allocator. Once this is
done, it is necessary to reduce the cost of watermark checks.

The series starts with minor micro-optimisations.

Next it notes that the GFP flags affecting watermark checks are abused.
The absence of __GFP_WAIT historically identified callers that could not
sleep and could access reserves. This was later abused by callers that
simply prefer to avoid sleeping and have other options. A patch
distinguishes between atomic callers, high-priority callers and those that
simply wish to avoid sleep.
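
As a rough illustration of the intended call-site convention (the struct,
function and exact flag choices below are made up for this example and are
not taken from the series):

	#include <linux/slab.h>

	struct foo_buf { unsigned long data[16]; };	/* made-up type for the example */

	static struct foo_buf *foo_alloc(bool atomic_ctx)
	{
		if (atomic_ctx)
			/* truly atomic: cannot sleep, may dip into the atomic reserves */
			return kmalloc(sizeof(struct foo_buf), GFP_ATOMIC);

		/* normal context: may enter direct reclaim and wake kswapd */
		return kmalloc(sizeof(struct foo_buf), GFP_KERNEL);
	}

	/*
	 * An opportunistic allocation with a fallback should avoid direct
	 * reclaim and reserve access instead of masquerading as atomic,
	 * e.g. by using GFP_NOWAIT | __GFP_NOWARN.
	 */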

The zonelist cache has been around for a long time but it is of dubious
merit, carrying a lot of complexity and a number of issues that are
explained in the series. The most important issue is that a failed THP
allocation can cause a zone to be treated as "full", potentially causing
unnecessary stalls, reclaim activity or remote fallbacks. The issues could
be fixed but it's not worth it. The series places a small number of other
micro-optimisations on top before examining how the GFP flags and
watermark checks interact.

High-order watermark enforcement can cause high-order allocations to fail
even though pages are free. The watermark checks both protect high-order
atomic allocations and make kswapd aware of high-order pages, but both
roles can be handled better using migrate types. This series uses page
grouping by mobility to reserve pageblocks for high-order allocations,
with the size of the reservation depending on demand. kswapd awareness
is maintained by examining the free lists. By patch 12 in this series,
there are no high-order watermark checks, while the properties that
motivated their introduction are preserved.
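
As a rough sketch of what "kswapd awareness via the free lists" means (the
helper below is illustrative only, not the code added by the series;
zone->lock is needed for a stable answer and is omitted here):

	#include <linux/mmzone.h>

	/* Is there any free page of at least @order in @zone? kswapd can use
	 * this kind of free-list walk instead of a high-order watermark
	 * comparison.
	 */
	static bool zone_has_free_high_order(struct zone *zone, unsigned int order)
	{
		unsigned int o, mt;

		for (o = order; o < MAX_ORDER; o++) {
			struct free_area *area = &zone->free_area[o];

			for (mt = 0; mt < MIGRATE_TYPES; mt++)
				if (!list_empty(&area->free_list[mt]))
					return true;
		}

		return false;
	}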

 Documentation/vm/balance                           |  14 +-
 arch/arm/mm/dma-mapping.c                          |   4 +-
 arch/arm/xen/mm.c                                  |   2 +-
 arch/arm64/mm/dma-mapping.c                        |   4 +-
 arch/x86/kernel/pci-dma.c                          |   2 +-
 block/bio.c                                        |  26 +-
 block/blk-core.c                                   |  16 +-
 block/blk-ioc.c                                    |   2 +-
 block/blk-mq-tag.c                                 |   2 +-
 block/blk-mq.c                                     |   8 +-
 block/cfq-iosched.c                                |   4 +-
 block/scsi_ioctl.c                                 |   6 +-
 drivers/block/drbd/drbd_bitmap.c                   |   2 +-
 drivers/block/drbd/drbd_receiver.c                 |   3 +-
 drivers/block/mtip32xx/mtip32xx.c                  |   2 +-
 drivers/block/nvme-core.c                          |   4 +-
 drivers/block/osdblk.c                             |   2 +-
 drivers/block/paride/pd.c                          |   2 +-
 drivers/block/pktcdvd.c                            |   4 +-
 drivers/connector/connector.c                      |   3 +-
 drivers/firewire/core-cdev.c                       |   2 +-
 drivers/gpu/drm/i915/i915_gem.c                    |   4 +-
 drivers/ide/ide-atapi.c                            |   2 +-
 drivers/ide/ide-cd.c                               |   2 +-
 drivers/ide/ide-cd_ioctl.c                         |   2 +-
 drivers/ide/ide-devsets.c                          |   2 +-
 drivers/ide/ide-disk.c                             |   2 +-
 drivers/ide/ide-ioctls.c                           |   4 +-
 drivers/ide/ide-park.c                             |   2 +-
 drivers/ide/ide-pm.c                               |   4 +-
 drivers/ide/ide-tape.c                             |   4 +-
 drivers/ide/ide-taskfile.c                         |   4 +-
 drivers/infiniband/core/sa_query.c                 |   2 +-
 drivers/infiniband/hw/ipath/ipath_file_ops.c       |   2 +-
 drivers/infiniband/hw/qib/qib_init.c               |   2 +-
 drivers/iommu/amd_iommu.c                          |   2 +-
 drivers/iommu/intel-iommu.c                        |   2 +-
 drivers/md/dm-crypt.c                              |   6 +-
 drivers/md/dm-kcopyd.c                             |   2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |   2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2.c         |   2 +-
 drivers/media/pci/tw68/tw68-video.c                |   2 +-
 drivers/misc/vmw_balloon.c                         |   2 +-
 drivers/mtd/mtdcore.c                              |   3 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |   2 +-
 drivers/scsi/scsi_error.c                          |   2 +-
 drivers/scsi/scsi_lib.c                            |   4 +-
 drivers/staging/android/ion/ion_system_heap.c      |   2 +-
 .../lustre/include/linux/libcfs/libcfs_private.h   |   2 +-
 drivers/usb/host/u132-hcd.c                        |   2 +-
 drivers/video/fbdev/vermilion/vermilion.c          |   2 +-
 fs/btrfs/disk-io.c                                 |   2 +-
 fs/btrfs/extent_io.c                               |  14 +-
 fs/btrfs/volumes.c                                 |   4 +-
 fs/cachefiles/internal.h                           |   2 +-
 fs/direct-io.c                                     |   2 +-
 fs/ext3/super.c                                    |   2 +-
 fs/ext4/super.c                                    |   2 +-
 fs/fscache/cookie.c                                |   2 +-
 fs/fscache/page.c                                  |   6 +-
 fs/jbd/transaction.c                               |   4 +-
 fs/jbd2/transaction.c                              |   4 +-
 fs/nfs/file.c                                      |   6 +-
 fs/nilfs2/mdt.h                                    |   2 +-
 fs/xfs/xfs_qm.c                                    |   2 +-
 include/linux/cpuset.h                             |  18 +-
 include/linux/gfp.h                                |  70 ++-
 include/linux/mmzone.h                             |  88 +---
 include/linux/skbuff.h                             |   6 +-
 include/net/sock.h                                 |   2 +-
 include/trace/events/gfpflags.h                    |   5 +-
 kernel/audit.c                                     |   6 +-
 kernel/locking/lockdep.c                           |   2 +-
 kernel/power/snapshot.c                            |   2 +-
 kernel/power/swap.c                                |  14 +-
 kernel/smp.c                                       |   2 +-
 lib/idr.c                                          |   4 +-
 lib/percpu_ida.c                                   |   2 +-
 lib/radix-tree.c                                   |  10 +-
 mm/backing-dev.c                                   |   2 +-
 mm/dmapool.c                                       |   2 +-
 mm/failslab.c                                      |   8 +-
 mm/filemap.c                                       |   2 +-
 mm/huge_memory.c                                   |   4 +-
 mm/internal.h                                      |   1 +
 mm/memcontrol.c                                    |   8 +-
 mm/mempool.c                                       |  10 +-
 mm/migrate.c                                       |   4 +-
 mm/page_alloc.c                                    | 585 +++++++--------------
 mm/slab.c                                          |  18 +-
 mm/slub.c                                          |   6 +-
 mm/vmalloc.c                                       |   2 +-
 mm/vmscan.c                                        |   8 +-
 mm/vmstat.c                                        |   2 +-
 mm/zswap.c                                         |   5 +-
 net/core/skbuff.c                                  |   8 +-
 net/core/sock.c                                    |   6 +-
 net/netlink/af_netlink.c                           |   2 +-
 net/rxrpc/ar-connection.c                          |   2 +-
 net/sctp/associola.c                               |   2 +-
 security/integrity/ima/ima_crypto.c                |   2 +-
 101 files changed, 455 insertions(+), 710 deletions(-)

-- 
2.4.6



* [PATCH 01/12] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-24 12:09 ` [PATCH 02/12] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
removes the unnecessary parameter.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Christoph Lameter <cl@linux.com>
---
 include/linux/mmzone.h | 2 +-
 mm/page_alloc.c        | 5 +++--
 mm/vmscan.c            | 4 ++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 754c25966a0a..99cf4209cd45 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -802,7 +802,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx);
 bool zone_watermark_ok(struct zone *z, unsigned int order,
 		unsigned long mark, int classzone_idx, int alloc_flags);
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
-		unsigned long mark, int classzone_idx, int alloc_flags);
+		unsigned long mark, int classzone_idx);
 enum memmap_context {
 	MEMMAP_EARLY,
 	MEMMAP_HOTPLUG,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index df959b7d6085..9b6bae688db8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2224,6 +2224,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 		min -= min / 2;
 	if (alloc_flags & ALLOC_HARDER)
 		min -= min / 4;
+
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
@@ -2253,14 +2254,14 @@ bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
 }
 
 bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
-			unsigned long mark, int classzone_idx, int alloc_flags)
+			unsigned long mark, int classzone_idx)
 {
 	long free_pages = zone_page_state(z, NR_FREE_PAGES);
 
 	if (z->percpu_drift_mark && free_pages < z->percpu_drift_mark)
 		free_pages = zone_page_state_snapshot(z, NR_FREE_PAGES);
 
-	return __zone_watermark_ok(z, order, mark, classzone_idx, alloc_flags,
+	return __zone_watermark_ok(z, order, mark, classzone_idx, 0,
 								free_pages);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8286938c70de..e950134c4b9a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2450,7 +2450,7 @@ static inline bool compaction_ready(struct zone *zone, int order)
 	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
 			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
 	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
-	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0, 0);
+	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, 0);
 
 	/*
 	 * If compaction is deferred, reclaim up to a point where
@@ -2933,7 +2933,7 @@ static bool zone_balanced(struct zone *zone, int order,
 			  unsigned long balance_gap, int classzone_idx)
 {
 	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx, 0))
+				    balance_gap, classzone_idx))
 		return false;
 
 	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
-- 
2.4.6



* [PATCH 02/12] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
  2015-08-24 12:09 ` [PATCH 01/12] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-24 12:09 ` [PATCH 03/12] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

File-backed pages that will be immediately written are balanced between
zones. This heuristic tries to avoid having a single zone filled with
recently dirtied pages, but the checks are unnecessarily expensive. Move
the consider_zone_dirty check into the alloc_context instead of checking
bitmaps multiple times. The patch also gives the field a more meaningful
name (spread_dirty_pages).

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/internal.h   |  1 +
 mm/page_alloc.c | 11 +++++++----
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 36b23f1e2ca6..9331f802a067 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,6 +129,7 @@ struct alloc_context {
 	int classzone_idx;
 	int migratetype;
 	enum zone_type high_zoneidx;
+	bool spread_dirty_pages;
 };
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9b6bae688db8..62ae28d8ae8d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2453,8 +2453,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
 	int zlc_active = 0;		/* set if using zonelist_cache */
 	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
-	bool consider_zone_dirty = (alloc_flags & ALLOC_WMARK_LOW) &&
-				(gfp_mask & __GFP_WRITE);
 	int nr_fair_skipped = 0;
 	bool zonelist_rescan;
 
@@ -2509,14 +2507,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		 *
 		 * XXX: For now, allow allocations to potentially
 		 * exceed the per-zone dirty limit in the slowpath
-		 * (ALLOC_WMARK_LOW unset) before going into reclaim,
+		 * (spread_dirty_pages unset) before going into reclaim,
 		 * which is important when on a NUMA setup the allowed
 		 * zones are together not big enough to reach the
 		 * global limit.  The proper fix for these situations
 		 * will require awareness of zones in the
 		 * dirty-throttling and the flusher threads.
 		 */
-		if (consider_zone_dirty && !zone_dirty_ok(zone))
+		if (ac->spread_dirty_pages && !zone_dirty_ok(zone))
 			continue;
 
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
@@ -3202,6 +3200,10 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	/* We set it here, as __alloc_pages_slowpath might have changed it */
 	ac.zonelist = zonelist;
+
+	/* Dirty zone balancing only done in the fast path */
+	ac.spread_dirty_pages = (gfp_mask & __GFP_WRITE);
+
 	/* The preferred zone is used for statistics later */
 	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
 				ac.nodemask ? : &cpuset_current_mems_allowed,
@@ -3220,6 +3222,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 		 * complete.
 		 */
 		alloc_mask = memalloc_noio_flags(gfp_mask);
+		ac.spread_dirty_pages = false;
 
 		page = __alloc_pages_slowpath(alloc_mask, order, &ac);
 	}
-- 
2.4.6



* [PATCH 03/12] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
  2015-08-24 12:09 ` [PATCH 01/12] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
  2015-08-24 12:09 ` [PATCH 02/12] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-26 10:25   ` Michal Hocko
  2015-08-24 12:09 ` [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled Mel Gorman
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

There is a seqcounter that protects against spurious allocation failures
when a task is changing the allowed nodes in a cpuset. There is no need
to check the seqcounter until a cpuset exists.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/cpuset.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 1b357997cac5..6eb27cb480b7 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
  */
 static inline unsigned int read_mems_allowed_begin(void)
 {
+	if (!cpusets_enabled())
+		return 0;
+
 	return read_seqcount_begin(&current->mems_allowed_seq);
 }
 
@@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
  */
 static inline bool read_mems_allowed_retry(unsigned int seq)
 {
+	if (!cpusets_enabled())
+		return false;
+
 	return read_seqcount_retry(&current->mems_allowed_seq, seq);
 }
 
-- 
2.4.6



* [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (2 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 03/12] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-24 12:37   ` Vlastimil Babka
  2015-08-26 10:46   ` Michal Hocko
  2015-08-24 12:09 ` [PATCH 05/12] mm, page_alloc: Remove unnecessary recheck of nodemask Mel Gorman
                   ` (7 subsequent siblings)
  11 siblings, 2 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

David Rientjes correctly pointed out that the "root cpuset may not exclude
mems on the system so, even if mounted, there's no need to check or be
worried about concurrent change when there is only one cpuset".

The three checks for cpusets_enabled() care whether a cpuset exists that
can limit memory, not whether cpusets are enabled as such. This patch
replaces cpusets_enabled() with cpusets_mems_enabled(), which checks if at
least one cpuset exists that can limit memory, and updates the appropriate
call sites.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/cpuset.h | 16 +++++++++-------
 mm/page_alloc.c        |  2 +-
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 6eb27cb480b7..1e823870987e 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -17,10 +17,6 @@
 #ifdef CONFIG_CPUSETS
 
 extern struct static_key cpusets_enabled_key;
-static inline bool cpusets_enabled(void)
-{
-	return static_key_false(&cpusets_enabled_key);
-}
 
 static inline int nr_cpusets(void)
 {
@@ -28,6 +24,12 @@ static inline int nr_cpusets(void)
 	return static_key_count(&cpusets_enabled_key) + 1;
 }
 
+/* Returns true if a cpuset exists that can set cpuset.mems */
+static inline bool cpusets_mems_enabled(void)
+{
+	return nr_cpusets() > 1;
+}
+
 static inline void cpuset_inc(void)
 {
 	static_key_slow_inc(&cpusets_enabled_key);
@@ -104,7 +106,7 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
  */
 static inline unsigned int read_mems_allowed_begin(void)
 {
-	if (!cpusets_enabled())
+	if (!cpusets_mems_enabled())
 		return 0;
 
 	return read_seqcount_begin(&current->mems_allowed_seq);
@@ -118,7 +120,7 @@ static inline unsigned int read_mems_allowed_begin(void)
  */
 static inline bool read_mems_allowed_retry(unsigned int seq)
 {
-	if (!cpusets_enabled())
+	if (!cpusets_mems_enabled())
 		return false;
 
 	return read_seqcount_retry(&current->mems_allowed_seq, seq);
@@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 
 #else /* !CONFIG_CPUSETS */
 
-static inline bool cpusets_enabled(void) { return false; }
+static inline bool cpusets_mems_enabled(void) { return false; }
 
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62ae28d8ae8d..2c1c3bf54d15 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
 			!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
-		if (cpusets_enabled() &&
+		if (cpusets_mems_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed(zone, gfp_mask))
 				continue;
-- 
2.4.6



* [PATCH 05/12] mm, page_alloc: Remove unnecessary recheck of nodemask
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (3 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-25 14:32   ` Vlastimil Babka
  2015-08-24 12:09 ` [PATCH 06/12] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

An allocation request will use either the given nodemask or the current
task's cpuset mems_allowed. A cpuset retry will recheck the caller's
nodemask; while that is trivial overhead during an extremely rare
operation, it is also unnecessary. This patch removes the redundant
recheck.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c1c3bf54d15..32d1cec124bc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3171,7 +3171,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
 	struct alloc_context ac = {
 		.high_zoneidx = gfp_zone(gfp_mask),
-		.nodemask = nodemask,
+		.nodemask = nodemask ? : &cpuset_current_mems_allowed,
 		.migratetype = gfpflags_to_migratetype(gfp_mask),
 	};
 
@@ -3206,8 +3206,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	/* The preferred zone is used for statistics later */
 	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
-				ac.nodemask ? : &cpuset_current_mems_allowed,
-				&ac.preferred_zone);
+				ac.nodemask, &ac.preferred_zone);
 	if (!ac.preferred_zone)
 		goto out;
 	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
-- 
2.4.6



* [PATCH 06/12] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (4 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 05/12] mm, page_alloc: Remove unnecessary recheck of nodemask Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-25 14:36   ` Vlastimil Babka
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

This patch redefines which GFP bits are used for specifying mobility and
the order of the migrate types. Once redefined, it's possible to convert
GFP flags to a migrate type with a simple mask and shift. The only downside
is that readers of OOM-kill messages and allocation failures may be used to
the existing values, but scripts/gfp-translate will help.
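
As a worked example of the new encoding, using the bit values and
migrate-type order defined in this patch:

	/*
	 * GFP_MOVABLE_MASK is __GFP_RECLAIMABLE|__GFP_MOVABLE (0x18) and
	 * GFP_MOVABLE_SHIFT is 3, so:
	 *
	 *   GFP_KERNEL                     -> (0x00 >> 3) == 0 == MIGRATE_UNMOVABLE
	 *   GFP_KERNEL | __GFP_MOVABLE     -> (0x08 >> 3) == 1 == MIGRATE_MOVABLE
	 *   GFP_KERNEL | __GFP_RECLAIMABLE -> (0x10 >> 3) == 2 == MIGRATE_RECLAIMABLE
	 */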

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/gfp.h    | 12 +++++++-----
 include/linux/mmzone.h |  2 +-
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index ad35f300b9a4..a10347ca5053 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -14,7 +14,7 @@ struct vm_area_struct;
 #define ___GFP_HIGHMEM		0x02u
 #define ___GFP_DMA32		0x04u
 #define ___GFP_MOVABLE		0x08u
-#define ___GFP_WAIT		0x10u
+#define ___GFP_RECLAIMABLE	0x10u
 #define ___GFP_HIGH		0x20u
 #define ___GFP_IO		0x40u
 #define ___GFP_FS		0x80u
@@ -29,7 +29,7 @@ struct vm_area_struct;
 #define ___GFP_NOMEMALLOC	0x10000u
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
-#define ___GFP_RECLAIMABLE	0x80000u
+#define ___GFP_WAIT		0x80000u
 #define ___GFP_NOACCOUNT	0x100000u
 #define ___GFP_NOTRACK		0x200000u
 #define ___GFP_NO_KSWAPD	0x400000u
@@ -123,6 +123,7 @@ struct vm_area_struct;
 
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
+#define GFP_MOVABLE_SHIFT 3
 
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
@@ -149,14 +150,15 @@ struct vm_area_struct;
 /* Convert GFP flags to their corresponding migrate type */
 static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 {
-	WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+	VM_WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
+	BUILD_BUG_ON((1UL << GFP_MOVABLE_SHIFT) != ___GFP_MOVABLE);
+	BUILD_BUG_ON((___GFP_MOVABLE >> GFP_MOVABLE_SHIFT) != MIGRATE_MOVABLE);
 
 	if (unlikely(page_group_by_mobility_disabled))
 		return MIGRATE_UNMOVABLE;
 
 	/* Group based on mobility */
-	return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
-		((gfp_flags & __GFP_RECLAIMABLE) != 0);
+	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
 }
 
 #ifdef CONFIG_HIGHMEM
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 99cf4209cd45..fc0457d005f8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -37,8 +37,8 @@
 
 enum {
 	MIGRATE_UNMOVABLE,
-	MIGRATE_RECLAIMABLE,
 	MIGRATE_MOVABLE,
+	MIGRATE_RECLAIMABLE,
 	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
 	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
-- 
2.4.6



* [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (5 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 06/12] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-24 18:29   ` Mel Gorman
                     ` (4 more replies)
  2015-08-24 12:09 ` [PATCH 08/12] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
                   ` (4 subsequent siblings)
  11 siblings, 5 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

The absence of __GFP_WAIT has been used to identify atomic context in
callers that hold spinlocks or are in interrupts. They are expected to be
high priority and have access to one of two watermarks lower than "min",
which can be referred to as the "atomic reserve". __GFP_HIGH users get
access to the first lower watermark and can be called the "high priority
reserve".

Over time, callers had a requirement to not block when fallback options
were available. Some abused this by clearing __GFP_WAIT, leading to a
situation where an optimistic allocation with a fallback option can access
atomic reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative. High priority users continue to use
__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
to mean that the caller is willing both to enter direct reclaim and to
wake kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress use
  __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
  kswapd will still be woken but atomic reserves are not used as there
  is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT, as was done historically, can now trigger false
  positives. Some exceptions like dm-crypt.c exist, where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
but were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL. They may
now wish to specify __GFP_KSWAPD_RECLAIM. In most cases it is almost
certainly harmless if this is missed, as other activity will wake kswapd.
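
For reference while reading the conversions, a minimal sketch of the
intended semantics is below; the real definitions live in the
include/linux/gfp.h hunk of this patch and may differ in detail:

	/*
	 * Sketch only: __GFP_WAIT becomes the union of the two reclaim flags,
	 * and "may block" is equivalent to "may enter direct reclaim".
	 */
	#define __GFP_WAIT	(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

	static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
	{
		return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
	}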

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 Documentation/vm/balance                           | 14 ++++---
 arch/arm/mm/dma-mapping.c                          |  4 +-
 arch/arm/xen/mm.c                                  |  2 +-
 arch/arm64/mm/dma-mapping.c                        |  4 +-
 arch/x86/kernel/pci-dma.c                          |  2 +-
 block/bio.c                                        | 26 ++++++------
 block/blk-core.c                                   | 16 ++++----
 block/blk-ioc.c                                    |  2 +-
 block/blk-mq-tag.c                                 |  2 +-
 block/blk-mq.c                                     |  8 ++--
 block/cfq-iosched.c                                |  4 +-
 drivers/block/drbd/drbd_receiver.c                 |  3 +-
 drivers/block/osdblk.c                             |  2 +-
 drivers/connector/connector.c                      |  3 +-
 drivers/firewire/core-cdev.c                       |  2 +-
 drivers/gpu/drm/i915/i915_gem.c                    |  2 +-
 drivers/infiniband/core/sa_query.c                 |  2 +-
 drivers/iommu/amd_iommu.c                          |  2 +-
 drivers/iommu/intel-iommu.c                        |  2 +-
 drivers/md/dm-crypt.c                              |  6 +--
 drivers/md/dm-kcopyd.c                             |  2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |  2 +-
 drivers/media/pci/solo6x10/solo6x10-v4l2.c         |  2 +-
 drivers/media/pci/tw68/tw68-video.c                |  2 +-
 drivers/mtd/mtdcore.c                              |  3 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |  2 +-
 drivers/staging/android/ion/ion_system_heap.c      |  2 +-
 .../lustre/include/linux/libcfs/libcfs_private.h   |  2 +-
 drivers/usb/host/u132-hcd.c                        |  2 +-
 drivers/video/fbdev/vermilion/vermilion.c          |  2 +-
 fs/btrfs/disk-io.c                                 |  2 +-
 fs/btrfs/extent_io.c                               | 14 +++----
 fs/btrfs/volumes.c                                 |  4 +-
 fs/ext3/super.c                                    |  2 +-
 fs/ext4/super.c                                    |  2 +-
 fs/fscache/cookie.c                                |  2 +-
 fs/fscache/page.c                                  |  6 +--
 fs/jbd/transaction.c                               |  4 +-
 fs/jbd2/transaction.c                              |  4 +-
 fs/nfs/file.c                                      |  6 +--
 fs/xfs/xfs_qm.c                                    |  2 +-
 include/linux/gfp.h                                | 46 ++++++++++++++++------
 include/linux/skbuff.h                             |  6 +--
 include/net/sock.h                                 |  2 +-
 include/trace/events/gfpflags.h                    |  5 ++-
 kernel/audit.c                                     |  6 +--
 kernel/locking/lockdep.c                           |  2 +-
 kernel/power/snapshot.c                            |  2 +-
 kernel/smp.c                                       |  2 +-
 lib/idr.c                                          |  4 +-
 lib/radix-tree.c                                   | 10 ++---
 mm/backing-dev.c                                   |  2 +-
 mm/dmapool.c                                       |  2 +-
 mm/memcontrol.c                                    |  8 ++--
 mm/mempool.c                                       | 10 ++---
 mm/migrate.c                                       |  2 +-
 mm/page_alloc.c                                    | 43 ++++++++++++--------
 mm/slab.c                                          | 18 ++++-----
 mm/slub.c                                          |  6 +--
 mm/vmalloc.c                                       |  2 +-
 mm/vmscan.c                                        |  4 +-
 mm/zswap.c                                         |  5 ++-
 net/core/skbuff.c                                  |  8 ++--
 net/core/sock.c                                    |  6 ++-
 net/netlink/af_netlink.c                           |  2 +-
 net/rxrpc/ar-connection.c                          |  2 +-
 net/sctp/associola.c                               |  2 +-
 67 files changed, 211 insertions(+), 173 deletions(-)

diff --git a/Documentation/vm/balance b/Documentation/vm/balance
index c46e68cf9344..964595481af6 100644
--- a/Documentation/vm/balance
+++ b/Documentation/vm/balance
@@ -1,12 +1,14 @@
 Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
 
-Memory balancing is needed for non __GFP_WAIT as well as for non
-__GFP_IO allocations.
+Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
+well as for non __GFP_IO allocations.
 
-There are two reasons to be requesting non __GFP_WAIT allocations:
-the caller can not sleep (typically intr context), or does not want
-to incur cost overheads of page stealing and possible swap io for
-whatever reasons.
+The first reason why a caller may avoid reclaim is that the caller can not
+sleep due to holding a spinlock or is in interrupt context. The second may
+be that the caller is willing to fail the allocation without incurring the
+overhead of page reclaim. This may happen for opportunistic high-order
+allocation requests that have order-0 fallback options. In such cases,
+the caller may also wish to avoid waking kswapd.
 
 __GFP_IO allocation requests are made to prevent file system deadlocks.
 
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cba12f34ff77..f999f0987a3e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
 
 	if (is_coherent || nommu())
 		addr = __alloc_simple_buffer(dev, size, gfp, &page);
-	else if (!(gfp & __GFP_WAIT))
+	else if (!gfpflags_allow_blocking(gfp))
 		addr = __alloc_from_pool(size, &page);
 	else if (!dev_get_cma_area(dev))
 		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
@@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
 	*handle = DMA_ERROR_CODE;
 	size = PAGE_ALIGN(size);
 
-	if (!(gfp & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(gfp))
 		return __iommu_alloc_atomic(dev, size, handle);
 
 	/*
diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index 03e75fef15b8..86809bd2026d 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -25,7 +25,7 @@
 unsigned long xen_get_swiotlb_free_pages(unsigned int order)
 {
 	struct memblock_region *reg;
-	gfp_t flags = __GFP_NOWARN;
+	gfp_t flags = __GFP_NOWARN|___GFP_KSWAPD_RECLAIM;
 
 	for_each_memblock(memory, reg) {
 		if (reg->base < (phys_addr_t)0xffffffff) {
diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
index d16a1cead23f..1f10b2503af8 100644
--- a/arch/arm64/mm/dma-mapping.c
+++ b/arch/arm64/mm/dma-mapping.c
@@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
 	if (IS_ENABLED(CONFIG_ZONE_DMA) &&
 	    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
 		flags |= GFP_DMA;
-	if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
+	if (IS_ENABLED(CONFIG_DMA_CMA) && gfpflags_allow_blocking(flags)) {
 		struct page *page;
 		void *addr;
 
@@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
 
 	size = PAGE_ALIGN(size);
 
-	if (!coherent && !(flags & __GFP_WAIT)) {
+	if (!coherent && !gfpflags_allow_blocking(flags)) {
 		struct page *page = NULL;
 		void *addr = __alloc_from_pool(size, &page, flags);
 
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 353972c1946c..0310e73e6b57 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -101,7 +101,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
 again:
 	page = NULL;
 	/* CMA can be used only in the context which permits sleeping */
-	if (flag & __GFP_WAIT) {
+	if (gfpflags_allow_blocking(flag)) {
 		page = dma_alloc_from_contiguous(dev, count, get_order(size));
 		if (page && page_to_phys(page) + size > dma_mask) {
 			dma_release_from_contiguous(dev, page, count);
diff --git a/block/bio.c b/block/bio.c
index d6e5ba3399f0..fbc558b50e67 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 		bvl = mempool_alloc(pool, gfp_mask);
 	} else {
 		struct biovec_slab *bvs = bvec_slabs + *idx;
-		gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
+		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
 
 		/*
 		 * Make this allocation restricted and don't dump info on
@@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
 		__gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
 
 		/*
-		 * Try a slab allocation. If this fails and __GFP_WAIT
+		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
 		 * is set, retry with the 1-entry mempool
 		 */
 		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
-		if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
+		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
 			*idx = BIOVEC_MAX_IDX;
 			goto fallback;
 		}
@@ -393,12 +393,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
  *   If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
  *   backed by the @bs's mempool.
  *
- *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
- *   able to allocate a bio. This is due to the mempool guarantees. To make this
- *   work, callers must never allocate more than 1 bio at a time from this pool.
- *   Callers that need to allocate more than 1 bio must always submit the
- *   previously allocated bio for IO before attempting to allocate a new one.
- *   Failure to do so can cause deadlocks under memory pressure.
+ *   When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
+ *   always be able to allocate a bio. This is due to the mempool guarantees.
+ *   To make this work, callers must never allocate more than 1 bio at a time
+ *   from this pool. Callers that need to allocate more than 1 bio must always
+ *   submit the previously allocated bio for IO before attempting to allocate
+ *   a new one. Failure to do so can cause deadlocks under memory pressure.
  *
  *   Note that when running under generic_make_request() (i.e. any block
  *   driver), bios are not submitted until after you return - see the code in
@@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
 		 * We solve this, and guarantee forward progress, with a rescuer
 		 * workqueue per bio_set. If we go to allocate and there are
 		 * bios on current->bio_list, we first try the allocation
-		 * without __GFP_WAIT; if that fails, we punt those bios we
-		 * would be blocking to the rescuer workqueue before we retry
-		 * with the original gfp_flags.
+		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
+		 * bios we would be blocking to the rescuer workqueue before
+		 * we retry with the original gfp_flags.
 		 */
 
 		if (current->bio_list && !bio_list_empty(current->bio_list))
-			gfp_mask &= ~__GFP_WAIT;
+			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
 		p = mempool_alloc(bs->bio_pool, gfp_mask);
 		if (!p && gfp_mask != saved_gfp) {
diff --git a/block/blk-core.c b/block/blk-core.c
index 627ed0c593fb..e3605acaaffc 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
  * @bio: bio to allocate request for (can be %NULL)
  * @gfp_mask: allocation mask
  *
- * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
- * function keeps retrying under memory pressure and fails iff @q is dead.
+ * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
+ * this function keeps retrying under memory pressure and fails iff @q is dead.
  *
  * Must be called with @q->queue_lock held and,
  * Returns ERR_PTR on failure, with @q->queue_lock held.
@@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
 	if (!IS_ERR(rq))
 		return rq;
 
-	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
+	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
 		blk_put_rl(rl);
 		return rq;
 	}
@@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
  * BUG.
  *
  * WARNING: When allocating/cloning a bio-chain, careful consideration should be
- * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
- * anything but the first bio in the chain. Otherwise you risk waiting for IO
- * completion of a bio that hasn't been submitted yet, thus resulting in a
- * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
- * of bio_alloc(), as that avoids the mempool deadlock.
+ * given to how you allocate bios. In particular, you cannot use
+ * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
+ * you risk waiting for IO completion of a bio that hasn't been submitted yet,
+ * thus resulting in a deadlock. Alternatively bios should be allocated using
+ * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
  * If possible a big IO should be split into smaller parts when allocation
  * fails. Partial allocation should not be an error, or you risk a live-lock.
  */
diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 1a27f45ec776..381cb50a673c 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
 {
 	struct io_context *ioc;
 
-	might_sleep_if(gfp_flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
 
 	do {
 		task_lock(task);
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 9b6e28830b82..a8b46659ce4e 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
 	if (tag != -1)
 		return tag;
 
-	if (!(data->gfp & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(data->gfp))
 		return -1;
 
 	bs = bt_wait_ptr(bt, hctx);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d842db59699..7d80379d7a38 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
 		if (percpu_ref_tryget_live(&q->mq_usage_counter))
 			return 0;
 
-		if (!(gfp & __GFP_WAIT))
+		if (!gfpflags_allow_blocking(gfp))
 			return -EBUSY;
 
 		ret = wait_event_interruptible(q->mq_freeze_wq,
@@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
 
 	ctx = blk_mq_get_ctx(q);
 	hctx = q->mq_ops->map_queue(q, ctx->cpu);
-	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
+	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
 			reserved, ctx, hctx);
 
 	rq = __blk_mq_alloc_request(&alloc_data, rw);
-	if (!rq && (gfp & __GFP_WAIT)) {
+	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
 		__blk_mq_run_hw_queue(hctx);
 		blk_mq_put_ctx(ctx);
 
@@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q,
-				__GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
+				__GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 		hctx = alloc_data.hctx;
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index c62bb2e650b8..ecd1d1b61382 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -3674,7 +3674,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
 		if (new_cfqq) {
 			cfqq = new_cfqq;
 			new_cfqq = NULL;
-		} else if (gfp_mask & __GFP_WAIT) {
+		} else if (gfpflags_allow_blocking(gfp_mask)) {
 			rcu_read_unlock();
 			spin_unlock_irq(cfqd->queue->queue_lock);
 			new_cfqq = kmem_cache_alloc_node(cfq_pool,
@@ -4289,7 +4289,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
 	const bool is_sync = rq_is_sync(rq);
 	struct cfq_queue *cfqq;
 
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
 
 	spin_lock_irq(q->queue_lock);
 
diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index c097909c589c..b4b5680ac6ad 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -357,7 +357,8 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
 	}
 
 	if (has_payload && data_size) {
-		page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
+		page = drbd_alloc_pages(peer_device, nr_pages,
+					gfpflags_allow_blocking(gfp_mask));
 		if (!page)
 			goto fail;
 	}
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index e22942596207..1b709a4e3b5e 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
 			goto err_out;
 
 		tmp->bi_bdev = NULL;
-		gfpmask &= ~__GFP_WAIT;
+		gfpmask &= ~__GFP_DIRECT_RECLAIM;
 		tmp->bi_next = NULL;
 
 		if (!new_chain)
diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
index 30f522848c73..d7373ca69c99 100644
--- a/drivers/connector/connector.c
+++ b/drivers/connector/connector.c
@@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
 	if (group)
 		return netlink_broadcast(dev->nls, skb, portid, group,
 					 gfp_mask);
-	return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
+	return netlink_unicast(dev->nls, skb, portid,
+			!gfpflags_allow_blocking(gfp_mask));
 }
 EXPORT_SYMBOL_GPL(cn_netlink_send_mult);
 
diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
index 2a3973a7c441..36a7c2d89a01 100644
--- a/drivers/firewire/core-cdev.c
+++ b/drivers/firewire/core-cdev.c
@@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
 static int add_client_resource(struct client *client,
 			       struct client_resource *resource, gfp_t gfp_mask)
 {
-	bool preload = !!(gfp_mask & __GFP_WAIT);
+	bool preload = gfpflags_allow_blocking(gfp_mask);
 	unsigned long flags;
 	int ret;
 
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 52b446b27b4d..c2b45081c5ab 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2225,7 +2225,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	 */
 	mapping = file_inode(obj->base.filp)->i_mapping;
 	gfp = mapping_gfp_mask(mapping);
-	gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
+	gfp |= __GFP_NORETRY | __GFP_NOWARN;
 	gfp &= ~(__GFP_IO | __GFP_WAIT);
 	sg = st->sgl;
 	st->nents = 0;
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index ca919f429666..7474d79ffac0 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
 
 static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 {
-	bool preload = !!(gfp_mask & __GFP_WAIT);
+	bool preload = gfpflags_allow_blocking(gfp_mask);
 	unsigned long flags;
 	int ret, id;
 
diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 658ee39e6569..95d4c70dc7b1 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
 
 	page = alloc_pages(flag | __GFP_NOWARN,  get_order(size));
 	if (!page) {
-		if (!(flag & __GFP_WAIT))
+		if (!gfpflags_allow_blocking(flag))
 			return NULL;
 
 		page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 0649b94f5958..f77becf3d8d8 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3566,7 +3566,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
 			flags |= GFP_DMA32;
 	}
 
-	if (flags & __GFP_WAIT) {
+	if (gfpflags_allow_blocking(flags)) {
 		unsigned int count = size >> PAGE_SHIFT;
 
 		page = dma_alloc_from_contiguous(dev, count, order);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 0f48fed44a17..6dda08385309 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 	struct bio_vec *bvec;
 
 retry:
-	if (unlikely(gfp_mask & __GFP_WAIT))
+	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
 		mutex_lock(&cc->bio_alloc_lock);
 
 	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
@@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 		if (!page) {
 			crypt_free_buffer_pages(cc, clone);
 			bio_put(clone);
-			gfp_mask |= __GFP_WAIT;
+			gfp_mask |= __GFP_DIRECT_RECLAIM;
 			goto retry;
 		}
 
@@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
 	}
 
 return_clone:
-	if (unlikely(gfp_mask & __GFP_WAIT))
+	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
 		mutex_unlock(&cc->bio_alloc_lock);
 
 	return clone;
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 3a7cade5e27d..1452ed9aacb4 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -244,7 +244,7 @@ static int kcopyd_get_pages(struct dm_kcopyd_client *kc,
 	*pages = NULL;
 
 	do {
-		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY);
+		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
 		if (unlikely(!pl)) {
 			/* Use reserved pages */
 			pl = kc->pages;
diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
index 53fff5425c13..fb2cb4bdc0c1 100644
--- a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
+++ b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
@@ -1291,7 +1291,7 @@ static struct solo_enc_dev *solo_enc_alloc(struct solo_dev *solo_dev,
 	solo_enc->vidq.ops = &solo_enc_video_qops;
 	solo_enc->vidq.mem_ops = &vb2_dma_sg_memops;
 	solo_enc->vidq.drv_priv = solo_enc;
-	solo_enc->vidq.gfp_flags = __GFP_DMA32;
+	solo_enc->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
 	solo_enc->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
 	solo_enc->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
 	solo_enc->vidq.lock = &solo_enc->lock;
diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2.c b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
index 63ae8a61f603..bde77b22340c 100644
--- a/drivers/media/pci/solo6x10/solo6x10-v4l2.c
+++ b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
@@ -675,7 +675,7 @@ int solo_v4l2_init(struct solo_dev *solo_dev, unsigned nr)
 	solo_dev->vidq.mem_ops = &vb2_dma_contig_memops;
 	solo_dev->vidq.drv_priv = solo_dev;
 	solo_dev->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
-	solo_dev->vidq.gfp_flags = __GFP_DMA32;
+	solo_dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
 	solo_dev->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
 	solo_dev->vidq.lock = &solo_dev->lock;
 	ret = vb2_queue_init(&solo_dev->vidq);
diff --git a/drivers/media/pci/tw68/tw68-video.c b/drivers/media/pci/tw68/tw68-video.c
index 8355e55b4e8e..e556f989aaab 100644
--- a/drivers/media/pci/tw68/tw68-video.c
+++ b/drivers/media/pci/tw68/tw68-video.c
@@ -975,7 +975,7 @@ int tw68_video_init2(struct tw68_dev *dev, int video_nr)
 	dev->vidq.ops = &tw68_video_qops;
 	dev->vidq.mem_ops = &vb2_dma_sg_memops;
 	dev->vidq.drv_priv = dev;
-	dev->vidq.gfp_flags = __GFP_DMA32;
+	dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
 	dev->vidq.buf_struct_size = sizeof(struct tw68_buf);
 	dev->vidq.lock = &dev->lock;
 	dev->vidq.min_buffers_needed = 2;
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 8bbbb751bf45..2dfb291a47c6 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
  */
 void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
 {
-	gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
-		       __GFP_NORETRY | __GFP_NO_KSWAPD;
+	gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
 	size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
 	void *kbuf;
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index f7fbdc9d1325..3a407e59acab 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -689,7 +689,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
 {
 	if (fp->rx_frag_size) {
 		/* GFP_KERNEL allocations are used only during initialization */
-		if (unlikely(gfp_mask & __GFP_WAIT))
+		if (unlikely(gfpflags_allow_blocking(gfp_mask)))
 			return (void *)__get_free_page(gfp_mask);
 
 		return netdev_alloc_frag(fp->rx_frag_size);
diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
index da2a63c0a9ba..2615e0ae4f0a 100644
--- a/drivers/staging/android/ion/ion_system_heap.c
+++ b/drivers/staging/android/ion/ion_system_heap.c
@@ -27,7 +27,7 @@
 #include "ion_priv.h"
 
 static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
-				     __GFP_NORETRY) & ~__GFP_WAIT;
+				     __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
 static gfp_t low_order_gfp_flags  = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
 static const unsigned int orders[] = {8, 4, 0};
 static const int num_orders = ARRAY_SIZE(orders);
diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
index ed37d26eb20d..5b0756cb6576 100644
--- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
+++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
@@ -113,7 +113,7 @@ do {						\
 do {									    \
 	LASSERT(!in_interrupt() ||					    \
 		((size) <= LIBCFS_VMALLOC_SIZE &&			    \
-		 ((mask) & __GFP_WAIT) == 0));				    \
+		 !gfpflags_allow_blocking(mask)));			    \
 } while (0)
 
 #define LIBCFS_ALLOC_POST(ptr, size)					    \
diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
index d51687780b61..8d4c1806e32f 100644
--- a/drivers/usb/host/u132-hcd.c
+++ b/drivers/usb/host/u132-hcd.c
@@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
 {
 	struct u132 *u132 = hcd_to_u132(hcd);
 	if (irqs_disabled()) {
-		if (__GFP_WAIT & mem_flags) {
+		if (gfpflags_allow_blocking(mem_flags)) {
 			printk(KERN_ERR "invalid context for function that migh"
 				"t sleep\n");
 			return -EINVAL;
diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c
index 6b70d7f62b2f..1c1e95a0b8fa 100644
--- a/drivers/video/fbdev/vermilion/vermilion.c
+++ b/drivers/video/fbdev/vermilion/vermilion.c
@@ -99,7 +99,7 @@ static int vmlfb_alloc_vram_area(struct vram_area *va, unsigned max_order,
 		 * below the first 16MB.
 		 */
 
-		flags = __GFP_DMA | __GFP_HIGH;
+		flags = __GFP_DMA | __GFP_HIGH | __GFP_KSWAPD_RECLAIM;
 		va->logical =
 			 __get_free_pages(flags, --max_order);
 	} while (va->logical == 0 && max_order > min_order);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index f556c3732c2c..3dd4792b8099 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2566,7 +2566,7 @@ int open_ctree(struct super_block *sb,
 	fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
 	fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
 	/* readahead state */
-	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
+	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 	spin_lock_init(&fs_info->reada_lock);
 
 	fs_info->thread_pool_size = min_t(unsigned long,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 02d05817cbdf..c8a6cdcbef2b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
 		clear = 1;
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
+	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		/*
 		 * Don't care for allocation failure here because we might end
 		 * up not needing the pre-allocated extent state at all, which
@@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(mask))
 		cond_resched();
 	goto again;
 }
@@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 
 	bits |= EXTENT_FIRST_DELALLOC;
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
+	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		prealloc = alloc_extent_state(mask);
 		BUG_ON(!prealloc);
 	}
@@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(mask))
 		cond_resched();
 	goto again;
 }
@@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	btrfs_debug_check_extent_io_range(tree, start, end);
 
 again:
-	if (!prealloc && (mask & __GFP_WAIT)) {
+	if (!prealloc && gfpflags_allow_blocking(mask)) {
 		/*
 		 * Best effort, don't worry if extent state allocation fails
 		 * here for the first iteration. We might have a cached state
@@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 	if (start > end)
 		goto out;
 	spin_unlock(&tree->lock);
-	if (mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(mask))
 		cond_resched();
 	first_iteration = false;
 	goto again;
@@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_CACHE_SIZE - 1;
 
-	if ((mask & __GFP_WAIT) &&
+	if (gfpflags_allow_blocking(mask) &&
 	    page->mapping->host->i_size > 16 * 1024 * 1024) {
 		u64 len;
 		while (start <= end) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index fbe7c104531c..b1968f36a39b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
 	spin_lock_init(&dev->reada_lock);
 	atomic_set(&dev->reada_in_flight, 0);
 	atomic_set(&dev->dev_stats_ccnt, 0);
-	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
-	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
+	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
+	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 
 	return dev;
 }
diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5ed0044fbb37..9004c786716f 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -750,7 +750,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 		return 0;
 	if (journal)
 		return journal_try_to_free_buffers(journal, page, 
-						   wait & ~__GFP_WAIT);
+						wait & ~__GFP_DIRECT_RECLAIM);
 	return try_to_free_buffers(page);
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 58987b5c514b..abe76d41ef1e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1045,7 +1045,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
 		return 0;
 	if (journal)
 		return jbd2_journal_try_to_free_buffers(journal, page,
-							wait & ~__GFP_WAIT);
+						wait & ~__GFP_DIRECT_RECLAIM);
 	return try_to_free_buffers(page);
 }
 
diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
index d403c69bee08..4304072161aa 100644
--- a/fs/fscache/cookie.c
+++ b/fs/fscache/cookie.c
@@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(
 
 	/* radix tree insertion won't use the preallocation pool unless it's
 	 * told it may not wait */
-	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
+	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
 
 	switch (cookie->def->type) {
 	case FSCACHE_COOKIE_TYPE_INDEX:
diff --git a/fs/fscache/page.c b/fs/fscache/page.c
index 483bbc613bf0..79483b3d8c6f 100644
--- a/fs/fscache/page.c
+++ b/fs/fscache/page.c
@@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)
 
 /*
  * decide whether a page can be released, possibly by cancelling a store to it
- * - we're allowed to sleep if __GFP_WAIT is flagged
+ * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
  */
 bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
 				  struct page *page,
@@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
 	 * allocator as the work threads writing to the cache may all end up
 	 * sleeping on memory allocation, so we may need to impose a timeout
 	 * too. */
-	if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
+	if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
 		fscache_stat(&fscache_n_store_vmscan_busy);
 		return false;
 	}
@@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
 		_debug("fscache writeout timeout page: %p{%lx}",
 			page, page->index);
 
-	gfp &= ~__GFP_WAIT;
+	gfp &= ~__GFP_DIRECT_RECLAIM;
 	goto try_again;
 }
 EXPORT_SYMBOL(__fscache_maybe_release_page);
diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
index 1695ba8334a2..f45b90ba7c5c 100644
--- a/fs/jbd/transaction.c
+++ b/fs/jbd/transaction.c
@@ -1690,8 +1690,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
  * @journal: journal for operation
  * @page: to try and free
  * @gfp_mask: we use the mask to detect how hard should we try to release
- * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
- * release the buffers.
+ * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
+ * code to release the buffers.
  *
  *
  * For all the buffers on this page,
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index f3d06174b051..06e18bcdb888 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1893,8 +1893,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
  * @journal: journal for operation
  * @page: to try and free
  * @gfp_mask: we use the mask to detect how hard should we try to release
- * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
- * release the buffers.
+ * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
+ * code to release the buffers.
  *
  *
  * For all the buffers on this page,
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index cc4fa1ed61fc..be6821967ec6 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -480,8 +480,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
 
 	/* Always try to initiate a 'commit' if relevant, but only
-	 * wait for it if __GFP_WAIT is set.  Even then, only wait 1
-	 * second and only if the 'bdi' is not congested.
+	 * wait for it if the caller allows blocking.  Even then,
+	 * only wait 1 second and only if the 'bdi' is not congested.
 	 * Waiting indefinitely can cause deadlocks when the NFS
 	 * server is on this machine, when a new TCP connection is
 	 * needed and in other rare cases.  There is no particular
@@ -491,7 +491,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
 	if (mapping) {
 		struct nfs_server *nfss = NFS_SERVER(mapping->host);
 		nfs_commit_inode(mapping->host, 0);
-		if ((gfp & __GFP_WAIT) &&
+		if (gfpflags_allow_blocking(gfp) &&
 		    !bdi_write_congested(&nfss->backing_dev_info)) {
 			wait_on_page_bit_killable_timeout(page, PG_private,
 							  HZ);
diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
index eac9549efd52..587174fd4f2c 100644
--- a/fs/xfs/xfs_qm.c
+++ b/fs/xfs/xfs_qm.c
@@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
 	unsigned long		freed;
 	int			error;
 
-	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
+	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
 		return 0;
 
 	INIT_LIST_HEAD(&isol.buffers);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index a10347ca5053..bd1937977d84 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -29,12 +29,13 @@ struct vm_area_struct;
 #define ___GFP_NOMEMALLOC	0x10000u
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
-#define ___GFP_WAIT		0x80000u
+#define ___GFP_ATOMIC		0x80000u
 #define ___GFP_NOACCOUNT	0x100000u
 #define ___GFP_NOTRACK		0x200000u
-#define ___GFP_NO_KSWAPD	0x400000u
+#define ___GFP_DIRECT_RECLAIM	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
 #define ___GFP_WRITE		0x1000000u
+#define ___GFP_KSWAPD_RECLAIM	0x2000000u
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -68,7 +69,7 @@ struct vm_area_struct;
  * __GFP_MOVABLE: Flag that this page will be movable by the page migration
  * mechanism or reclaimed
  */
-#define __GFP_WAIT	((__force gfp_t)___GFP_WAIT)	/* Can wait and reschedule? */
+#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
 #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
 #define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
@@ -91,23 +92,37 @@ struct vm_area_struct;
 #define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
-#define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 /*
+ * A caller that is willing to wait may enter direct reclaim and will
+ * wake kswapd to reclaim pages in the background until the high
+ * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
+ * avoid unnecessary delays when a fallback option is available but
+ * still allow kswapd to reclaim in the background. The kswapd flag
+ * can be cleared when the reclaiming of pages would cause unnecessary
+ * disruption.
+ */
+#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
+#define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
+#define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
+
+/*
  * This may seem redundant, but it's a way of annotating false positives vs.
  * allocations that simply cannot be supported (e.g. page tables).
  */
 #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
 
-#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
+#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
-/* This equals 0, but use constants in case they ever change */
-#define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
-/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
-#define GFP_ATOMIC	(__GFP_HIGH)
+/*
+ * GFP_ATOMIC callers can not sleep, need the allocation to succeed.
+ * A lower watermark is applied to allow access to "atomic reserves"
+ */
+#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
+#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
 #define GFP_NOIO	(__GFP_WAIT)
 #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
 #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
@@ -116,10 +131,10 @@ struct vm_area_struct;
 #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_IOFS	(__GFP_IO | __GFP_FS)
-#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
-			 __GFP_NO_KSWAPD)
+#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
+#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
+			 ~__GFP_KSWAPD_RECLAIM)
 
 /* This mask makes up all the page movable related flags */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
@@ -161,6 +176,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
 	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
 }
 
+static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
+{
+	return gfp_flags & __GFP_DIRECT_RECLAIM;
+}
+
 #ifdef CONFIG_HIGHMEM
 #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
 #else
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 22b6d9ca1654..55c4a9175801 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1109,7 +1109,7 @@ static inline int skb_cloned(const struct sk_buff *skb)
 
 static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
 {
-	might_sleep_if(pri & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(pri));
 
 	if (skb_cloned(skb))
 		return pskb_expand_head(skb, 0, 0, pri);
@@ -1193,7 +1193,7 @@ static inline int skb_shared(const struct sk_buff *skb)
  */
 static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
 {
-	might_sleep_if(pri & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(pri));
 	if (skb_shared(skb)) {
 		struct sk_buff *nskb = skb_clone(skb, pri);
 
@@ -1229,7 +1229,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
 static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
 					  gfp_t pri)
 {
-	might_sleep_if(pri & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(pri));
 	if (skb_cloned(skb)) {
 		struct sk_buff *nskb = skb_copy(skb, pri);
 
diff --git a/include/net/sock.h b/include/net/sock.h
index f21f0708ec59..cec0c4b634dc 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2035,7 +2035,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  */
 static inline struct page_frag *sk_page_frag(struct sock *sk)
 {
-	if (sk->sk_allocation & __GFP_WAIT)
+	if (gfpflags_allow_blocking(sk->sk_allocation))
 		return &current->task_frag;
 
 	return &sk->sk_frag;
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index d6fd8e5b14b7..dde6bf092c8a 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -20,7 +20,7 @@
 	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
 	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
 	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
-	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
+	{(unsigned long)__GFP_ATOMIC,		"GFP_ATOMIC"},		\
 	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
 	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
 	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
@@ -36,7 +36,8 @@
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
 	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
 	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
-	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
+	{(unsigned long)__GFP_DIRECT_RECLAIM,	"GFP_DIRECT_RECLAIM"},	\
+	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"GFP_KSWAPD_RECLAIM"},	\
 	{(unsigned long)__GFP_OTHER_NODE,	"GFP_OTHER_NODE"}	\
 	) : "GFP_NOWAIT"
 
diff --git a/kernel/audit.c b/kernel/audit.c
index f9e6065346db..6ab7a55dbdff 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
 	if (unlikely(audit_filter_type(type)))
 		return NULL;
 
-	if (gfp_mask & __GFP_WAIT) {
+	if (gfp_mask & __GFP_DIRECT_RECLAIM) {
 		if (audit_pid && audit_pid == current->pid)
-			gfp_mask &= ~__GFP_WAIT;
+			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 		else
 			reserve = 0;
 	}
 
 	while (audit_backlog_limit
 	       && skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
-		if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
+		if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
 			long sleep_time;
 
 			sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 8acfbf773e06..9aa39f20f593 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* no reclaim without waiting on it */
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;
 
 	/* this guy won't enter reclaim */
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 5235dd4e1e2f..3a970604308f 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -1779,7 +1779,7 @@ alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
 	while (to_alloc-- > 0) {
 		struct page *page;
 
-		page = alloc_image_page(__GFP_HIGHMEM);
+		page = alloc_image_page(__GFP_HIGHMEM|__GFP_KSWAPD_RECLAIM);
 		memory_bm_set_bit(bm, page_to_pfn(page));
 	}
 	return nr_highmem;
diff --git a/kernel/smp.c b/kernel/smp.c
index 07854477c164..d903c02223af 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	cpumask_var_t cpus;
 	int cpu, ret;
 
-	might_sleep_if(gfp_flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
 
 	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
 		preempt_disable();
diff --git a/lib/idr.c b/lib/idr.c
index 5335c43adf46..6098336df267 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
 	 * allocation guarantee.  Disallow usage from those contexts.
 	 */
 	WARN_ON_ONCE(in_interrupt());
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
 
 	preempt_disable();
 
@@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
 	struct idr_layer *pa[MAX_IDR_LEVEL + 1];
 	int id;
 
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
 
 	/* sanity checks */
 	if (WARN_ON_ONCE(start < 0))
diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index f9ebe1c82060..c3775ee46cd6 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
 	 * preloading in the interrupt anyway as all the allocations have to
 	 * be atomic. So just do normal allocation when in interrupt.
 	 */
-	if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
+	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
 		struct radix_tree_preload *rtp;
 
 		/*
@@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
- * __GFP_WAIT being passed to INIT_RADIX_TREE().
+ * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
 static int __radix_tree_preload(gfp_t gfp_mask)
 {
@@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
  * with preemption not disabled.
  *
  * To make use of this facility, the radix tree must be initialised without
- * __GFP_WAIT being passed to INIT_RADIX_TREE().
+ * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
  */
 int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
-	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
+	WARN_ON_ONCE(gfpflags_allow_blocking(gfp_mask));
 	return __radix_tree_preload(gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);
@@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
  */
 int radix_tree_maybe_preload(gfp_t gfp_mask)
 {
-	if (gfp_mask & __GFP_WAIT)
+	if (gfpflags_allow_blocking(gfp_mask))
 		return __radix_tree_preload(gfp_mask);
 	/* Preloading doesn't help anything with this gfp mask, skip it */
 	preempt_disable();
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index dac5bf59309d..805ce70b72f3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
 {
 	struct bdi_writeback *wb;
 
-	might_sleep_if(gfp & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(gfp));
 
 	if (!memcg_css->parent)
 		return &bdi->wb;
diff --git a/mm/dmapool.c b/mm/dmapool.c
index fd5fe4342e93..84dac666fc0c 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -323,7 +323,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 	size_t offset;
 	void *retval;
 
-	might_sleep_if(mem_flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(mem_flags));
 
 	spin_lock_irqsave(&pool->lock, flags);
 	list_for_each_entry(page, &pool->page_list, page_list) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c554f6e..e34f6411da8c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2268,7 +2268,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (unlikely(task_in_memcg_oom(current)))
 		goto nomem;
 
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(gfp_mask))
 		goto nomem;
 
 	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
@@ -2327,7 +2327,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	css_get_many(&memcg->css, batch);
 	if (batch > nr_pages)
 		refill_stock(memcg, batch - nr_pages);
-	if (!(gfp_mask & __GFP_WAIT))
+	if (!gfpflags_allow_blocking(gfp_mask))
 		goto done;
 	/*
 	 * If the hierarchy is above the normal consumption range,
@@ -4696,8 +4696,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
 {
 	int ret;
 
-	/* Try a single bulk charge without reclaim first */
-	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
+	/* Try a single bulk charge without reclaim first, kswapd may wake */
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
 	if (!ret) {
 		mc.precharge += count;
 		return ret;
diff --git a/mm/mempool.c b/mm/mempool.c
index 2cc08de8b1db..bfd2a0dd0e18 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -317,13 +317,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	gfp_t gfp_temp;
 
 	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
-	gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
+	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
 
@@ -346,7 +346,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	}
 
 	/*
-	 * We use gfp mask w/o __GFP_WAIT or IO for the first round.  If
+	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
 	if (gfp_temp != gfp_mask) {
@@ -355,8 +355,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 		goto repeat_alloc;
 	}
 
-	/* We must not sleep if !__GFP_WAIT */
-	if (!(gfp_mask & __GFP_WAIT)) {
+	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
+	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		return NULL;
 	}
diff --git a/mm/migrate.c b/mm/migrate.c
index eb4267107d1f..0e16c4047638 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1564,7 +1564,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					 (GFP_HIGHUSER_MOVABLE |
 					  __GFP_THISNODE | __GFP_NOMEMALLOC |
 					  __GFP_NORETRY | __GFP_NOWARN) &
-					 ~GFP_IOFS, 0);
+					 ~(__GFP_IO | __GFP_FS), 0);
 
 	return newpage;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32d1cec124bc..68f961bdfdf8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -151,12 +151,12 @@ void pm_restrict_gfp_mask(void)
 	WARN_ON(!mutex_is_locked(&pm_mutex));
 	WARN_ON(saved_gfp_mask);
 	saved_gfp_mask = gfp_allowed_mask;
-	gfp_allowed_mask &= ~GFP_IOFS;
+	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
 }
 
 bool pm_suspended_storage(void)
 {
-	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
+	if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		return false;
 	return true;
 }
@@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 		return false;
 	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
 		return false;
-	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
+	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		return false;
 
 	return should_fail(&fail_page_alloc.attr, 1 << order);
@@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		if (test_thread_flag(TIF_MEMDIE) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
-	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
+	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
 		filter &= ~SHOW_MEM_FILTER_NODES;
 
 	if (fmt) {
@@ -2915,7 +2915,6 @@ static inline int
 gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
-	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -2924,11 +2923,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 * The caller may dip into page reserves a bit more if the caller
 	 * cannot run direct reclaim, or if the caller has realtime scheduling
 	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
-	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
+	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (atomic) {
+	if (gfp_mask & __GFP_ATOMIC) {
 		/*
 		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
 		 * if it can't schedule.
@@ -2965,11 +2964,16 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
+{
+	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
-	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
 	struct page *page = NULL;
 	int alloc_flags;
 	unsigned long pages_reclaimed = 0;
@@ -2990,15 +2994,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	}
 
 	/*
+	 * We also sanity check to catch abuse of atomic reserves being used by
+	 * callers that are not in atomic context.
+	 */
+	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
+				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
+		gfp_mask &= ~__GFP_ATOMIC;
+
+	/*
 	 * If this allocation cannot block and it is for a specific node, then
 	 * fail early.  There's no need to wakeup kswapd or retry for a
 	 * speculative node-specific allocation.
 	 */
-	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
+	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
 		goto nopage;
 
 retry:
-	if (!(gfp_mask & __GFP_NO_KSWAPD))
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
 	/*
@@ -3041,8 +3053,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		}
 	}
 
-	/* Atomic allocations - we can't balance anything */
-	if (!wait) {
+	/* Caller is not willing to reclaim, we can't balance anything */
+	if (!can_direct_reclaim) {
 		/*
 		 * All existing users of the deprecated __GFP_NOFAIL are
 		 * blockable, so warn of any new users that actually allow this
@@ -3072,7 +3084,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto got_pg;
 
 	/* Checks for THP-specific high-order allocations */
-	if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
+	if (is_thp_gfp_mask(gfp_mask)) {
 		/*
 		 * If compaction is deferred for high-order allocations, it is
 		 * because sync compaction recently failed. If this is the case
@@ -3107,8 +3119,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * fault, so use asynchronous memory compaction for THP unless it is
 	 * khugepaged trying to collapse.
 	 */
-	if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
-						(current->flags & PF_KTHREAD))
+	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
 		migration_mode = MIGRATE_SYNC_LIGHT;
 
 	/* Try direct reclaim and then allocating */
@@ -3179,7 +3190,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	lockdep_trace_alloc(gfp_mask);
 
-	might_sleep_if(gfp_mask & __GFP_WAIT);
+	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
 	if (should_fail_alloc_page(gfp_mask, order))
 		return NULL;
diff --git a/mm/slab.c b/mm/slab.c
index 200e22412a16..f82bdb3eb1fc 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
 }
 
 /*
- * Construct gfp mask to allocate from a specific node but do not invoke reclaim
- * or warn about failures.
+ * Construct gfp mask to allocate from a specific node but do not direct reclaim
+ * or warn about failures. kswapd may still wake to reclaim in the background.
  */
 static inline gfp_t gfp_exact_node(gfp_t flags)
 {
-	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
+	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
 }
 #endif
 
@@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	offset *= cachep->colour_off;
 
-	if (local_flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(local_flags))
 		local_irq_enable();
 
 	/*
@@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,
 
 	cache_init_objs(cachep, page);
 
-	if (local_flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 	check_irq_off();
 	spin_lock(&n->list_lock);
@@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
 opps1:
 	kmem_freepages(cachep, page);
 failed:
-	if (local_flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(local_flags))
 		local_irq_disable();
 	return 0;
 }
@@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
 static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
 						gfp_t flags)
 {
-	might_sleep_if(flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(flags));
 #if DEBUG
 	kmem_flagcheck(cachep, flags);
 #endif
@@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
 		 */
 		struct page *page;
 
-		if (local_flags & __GFP_WAIT)
+		if (gfpflags_allow_blocking(local_flags))
 			local_irq_enable();
 		kmem_flagcheck(cache, flags);
 		page = kmem_getpages(cache, local_flags, numa_mem_id());
-		if (local_flags & __GFP_WAIT)
+		if (gfpflags_allow_blocking(local_flags))
 			local_irq_disable();
 		if (page) {
 			/*
diff --git a/mm/slub.c b/mm/slub.c
index 816df0016555..a4661c59ff54 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
 {
 	flags &= gfp_allowed_mask;
 	lockdep_trace_alloc(flags);
-	might_sleep_if(flags & __GFP_WAIT);
+	might_sleep_if(gfpflags_allow_blocking(flags));
 
 	if (should_failslab(s->object_size, flags, s->flags))
 		return NULL;
@@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 
 	flags &= gfp_allowed_mask;
 
-	if (flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(flags))
 		local_irq_enable();
 
 	flags |= s->allocflags;
@@ -1380,7 +1380,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 			kmemcheck_mark_unallocated_pages(page, pages);
 	}
 
-	if (flags & __GFP_WAIT)
+	if (gfpflags_allow_blocking(flags))
 		local_irq_disable();
 	if (!page)
 		return NULL;
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 2faaa2976447..9ad4dcb0631c 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			goto fail;
 		}
 		area->pages[i] = page;
-		if (gfp_mask & __GFP_WAIT)
+		if (gfpflags_allow_blocking(gfp_mask))
 			cond_resched();
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e950134c4b9a..837c440d60a9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1465,7 +1465,7 @@ static int too_many_isolated(struct zone *zone, int file,
 	 * won't get blocked by normal direct-reclaimers, forming a circular
 	 * deadlock.
 	 */
-	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
+	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
 	return isolated > inactive;
@@ -3764,7 +3764,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	/*
 	 * Do not scan if the allocation should not be delayed.
 	 */
-	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
+	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
 		return ZONE_RECLAIM_NOSCAN;
 
 	/*
diff --git a/mm/zswap.c b/mm/zswap.c
index 2d5727baed59..26104a68c972 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -684,7 +684,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
 
 	/* store */
 	len = dlen + sizeof(struct zswap_header);
-	ret = zpool_malloc(zswap_pool, len, __GFP_NORETRY | __GFP_NOWARN,
+	ret = zpool_malloc(zswap_pool, len,
+		__GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM,
 		&handle);
 	if (ret == -ENOSPC) {
 		zswap_reject_compress_poor++;
@@ -900,7 +901,7 @@ static void __exit zswap_debugfs_exit(void) { }
 **********************************/
 static int __init init_zswap(void)
 {
-	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
+	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
 
 	pr_info("loading zswap\n");
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6a19ca0f99e..6f025e2544de 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
 	len += NET_SKB_PAD;
 
 	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
 	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
+	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -4452,7 +4452,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 		return NULL;
 
 	gfp_head = gfp_mask;
-	if (gfp_head & __GFP_WAIT)
+	if (gfp_head & __GFP_DIRECT_RECLAIM)
 		gfp_head |= __GFP_REPEAT;
 
 	*errcode = -ENOBUFS;
@@ -4467,7 +4467,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 		while (order) {
 			if (npages >= 1 << order) {
-				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
+				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
 						   __GFP_COMP |
 						   __GFP_NOWARN |
 						   __GFP_NORETRY,
diff --git a/net/core/sock.c b/net/core/sock.c
index 193901d09757..02b705cc9eb3 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 
 	pfrag->offset = 0;
 	if (SKB_FRAG_PAGE_ORDER) {
-		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
-					  __GFP_NOWARN | __GFP_NORETRY,
+		/* Avoid direct reclaim but allow kswapd to wake */
+		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
+					  __GFP_COMP | __GFP_NOWARN |
+					  __GFP_NORETRY,
 					  SKB_FRAG_PAGE_ORDER);
 		if (likely(pfrag->page)) {
 			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 67d210477863..8283d90dde74 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2066,7 +2066,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
 	consume_skb(info.skb2);
 
 	if (info.delivered) {
-		if (info.congested && (allocation & __GFP_WAIT))
+		if (info.congested && gfpflags_allow_blocking(allocation))
 			yield();
 		return 0;
 	}
diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
index 6631f4f1e39b..3b5de4b86058 100644
--- a/net/rxrpc/ar-connection.c
+++ b/net/rxrpc/ar-connection.c
@@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
 		if (bundle->num_conns >= 20) {
 			_debug("too many conns");
 
-			if (!(gfp & __GFP_WAIT)) {
+			if (!gfpflags_allow_blocking(gfp)) {
 				_leave(" = -EAGAIN");
 				return -EAGAIN;
 			}
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 197c3f59ecbf..75369ae8de1e 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
 /* Set an association id for a given association */
 int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
 {
-	bool preload = !!(gfp & __GFP_WAIT);
+	bool preload = gfpflags_allow_blocking(gfp);
 	int ret;
 
 	/* If the id is already assigned, keep it. */
-- 
2.4.6



* [PATCH 08/12] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (6 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-26 12:19   ` Vlastimil Babka
  2015-08-24 12:09 ` [PATCH 09/12] mm, page_alloc: Delete the zonelist_cache Mel Gorman
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

Historically, clearing __GFP_WAIT was how a caller signalled that it was in
atomic context and could not sleep.  Now it is possible to distinguish between
true atomic context and callers that are not willing to sleep. The latter
should clear only __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
__GFP_WAIT behaves differently, there is a risk that people will clear the
wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
indicate what it does -- setting it allows all reclaim activity, clearing it
prevents it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Michal Hocko <mhocko@suse.com>
---
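
Illustrative note (below the separator, so it is not part of the commit): a
minimal stand-alone sketch of the flag relationships this series sets up. The
__GFP_DIRECT_RECLAIM/__GFP_KSWAPD_RECLAIM bit values, the __GFP_RECLAIM
composite and the gfpflags_allow_blocking() helper mirror definitions from
this series; the __GFP_IO/__GFP_FS values, main(), and the
"opportunistic"/"no_reclaim" names are purely hypothetical and only show the
difference between clearing __GFP_DIRECT_RECLAIM alone and clearing the whole
__GFP_RECLAIM mask.

/* Stand-alone user-space sketch; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int gfp_t;

#define __GFP_IO		0x40u		/* illustrative value */
#define __GFP_FS		0x80u		/* illustrative value */
#define __GFP_DIRECT_RECLAIM	0x400000u
#define __GFP_KSWAPD_RECLAIM	0x2000000u

/* After this patch, __GFP_RECLAIM names both reclaim mechanisms at once. */
#define __GFP_RECLAIM		(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)

#define GFP_KERNEL		(__GFP_RECLAIM | __GFP_IO | __GFP_FS)

static bool gfpflags_allow_blocking(gfp_t gfp_flags)
{
	return gfp_flags & __GFP_DIRECT_RECLAIM;
}

int main(void)
{
	/* A sleepable caller that prefers a fast fallback drops direct
	 * reclaim only; kswapd is still woken in the background. */
	gfp_t opportunistic = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;

	/* Clearing __GFP_RECLAIM switches off all reclaim activity,
	 * which is what the old "clear __GFP_WAIT" idiom suggested. */
	gfp_t no_reclaim = GFP_KERNEL & ~__GFP_RECLAIM;

	printf("GFP_KERNEL   : blocks=%d kswapd=%d\n",
	       gfpflags_allow_blocking(GFP_KERNEL),
	       !!(GFP_KERNEL & __GFP_KSWAPD_RECLAIM));
	printf("opportunistic: blocks=%d kswapd=%d\n",
	       gfpflags_allow_blocking(opportunistic),
	       !!(opportunistic & __GFP_KSWAPD_RECLAIM));
	printf("no_reclaim   : blocks=%d kswapd=%d\n",
	       gfpflags_allow_blocking(no_reclaim),
	       !!(no_reclaim & __GFP_KSWAPD_RECLAIM));
	return 0;
}

Built with any C compiler, the sketch reports that only masks retaining
__GFP_DIRECT_RECLAIM may block, while clearing __GFP_RECLAIM also stops the
kswapd wakeup -- the distinction the rename is meant to keep visible.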
 block/blk-mq.c                               |  2 +-
 block/scsi_ioctl.c                           |  6 +++---
 drivers/block/drbd/drbd_bitmap.c             |  2 +-
 drivers/block/mtip32xx/mtip32xx.c            |  2 +-
 drivers/block/nvme-core.c                    |  4 ++--
 drivers/block/paride/pd.c                    |  2 +-
 drivers/block/pktcdvd.c                      |  4 ++--
 drivers/gpu/drm/i915/i915_gem.c              |  2 +-
 drivers/ide/ide-atapi.c                      |  2 +-
 drivers/ide/ide-cd.c                         |  2 +-
 drivers/ide/ide-cd_ioctl.c                   |  2 +-
 drivers/ide/ide-devsets.c                    |  2 +-
 drivers/ide/ide-disk.c                       |  2 +-
 drivers/ide/ide-ioctls.c                     |  4 ++--
 drivers/ide/ide-park.c                       |  2 +-
 drivers/ide/ide-pm.c                         |  4 ++--
 drivers/ide/ide-tape.c                       |  4 ++--
 drivers/ide/ide-taskfile.c                   |  4 ++--
 drivers/infiniband/hw/ipath/ipath_file_ops.c |  2 +-
 drivers/infiniband/hw/qib/qib_init.c         |  2 +-
 drivers/misc/vmw_balloon.c                   |  2 +-
 drivers/scsi/scsi_error.c                    |  2 +-
 drivers/scsi/scsi_lib.c                      |  4 ++--
 fs/cachefiles/internal.h                     |  2 +-
 fs/direct-io.c                               |  2 +-
 fs/nilfs2/mdt.h                              |  2 +-
 include/linux/gfp.h                          | 16 ++++++++--------
 kernel/power/swap.c                          | 14 +++++++-------
 lib/percpu_ida.c                             |  2 +-
 mm/failslab.c                                |  8 ++++----
 mm/filemap.c                                 |  2 +-
 mm/huge_memory.c                             |  2 +-
 mm/migrate.c                                 |  2 +-
 mm/page_alloc.c                              | 10 +++++-----
 security/integrity/ima/ima_crypto.c          |  2 +-
 35 files changed, 64 insertions(+), 64 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 7d80379d7a38..16feffc93489 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
 		ctx = blk_mq_get_ctx(q);
 		hctx = q->mq_ops->map_queue(q, ctx->cpu);
 		blk_mq_set_alloc_data(&alloc_data, q,
-				__GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
+				__GFP_RECLAIM|__GFP_HIGH, false, ctx, hctx);
 		rq = __blk_mq_alloc_request(&alloc_data, rw);
 		ctx = alloc_data.ctx;
 		hctx = alloc_data.hctx;
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index dda653ce7b24..0774799942e0 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -444,7 +444,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
 
 	}
 
-	rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_WAIT);
+	rq = blk_get_request(q, in_len ? WRITE : READ, __GFP_RECLAIM);
 	if (IS_ERR(rq)) {
 		err = PTR_ERR(rq);
 		goto error_free_buffer;
@@ -495,7 +495,7 @@ int sg_scsi_ioctl(struct request_queue *q, struct gendisk *disk, fmode_t mode,
 		break;
 	}
 
-	if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_WAIT)) {
+	if (bytes && blk_rq_map_kern(q, rq, buffer, bytes, __GFP_RECLAIM)) {
 		err = DRIVER_ERROR << 24;
 		goto error;
 	}
@@ -536,7 +536,7 @@ static int __blk_send_generic(struct request_queue *q, struct gendisk *bd_disk,
 	struct request *rq;
 	int err;
 
-	rq = blk_get_request(q, WRITE, __GFP_WAIT);
+	rq = blk_get_request(q, WRITE, __GFP_RECLAIM);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 	blk_rq_set_block_pc(rq);
diff --git a/drivers/block/drbd/drbd_bitmap.c b/drivers/block/drbd/drbd_bitmap.c
index 434c77dcc99e..2940da0011e0 100644
--- a/drivers/block/drbd/drbd_bitmap.c
+++ b/drivers/block/drbd/drbd_bitmap.c
@@ -1016,7 +1016,7 @@ static void bm_page_io_async(struct drbd_bm_aio_ctx *ctx, int page_nr) __must_ho
 	bm_set_page_unchanged(b->bm_pages[page_nr]);
 
 	if (ctx->flags & BM_AIO_COPY_PAGES) {
-		page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_WAIT);
+		page = mempool_alloc(drbd_md_io_page_pool, __GFP_HIGHMEM|__GFP_RECLAIM);
 		copy_highpage(page, b->bm_pages[page_nr]);
 		bm_store_page_idx(page, page_nr);
 	} else
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 4a2ef09e6704..a694b23cb8f9 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -173,7 +173,7 @@ static struct mtip_cmd *mtip_get_int_command(struct driver_data *dd)
 {
 	struct request *rq;
 
-	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_WAIT, true);
+	rq = blk_mq_alloc_request(dd->queue, 0, __GFP_RECLAIM, true);
 	return blk_mq_rq_to_pdu(rq);
 }
 
diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 7920c2741b47..0a8b1682305f 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1033,11 +1033,11 @@ int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 	req->special = (void *)0;
 
 	if (buffer && bufflen) {
-		ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_WAIT);
+		ret = blk_rq_map_kern(q, req, buffer, bufflen, __GFP_RECLAIM);
 		if (ret)
 			goto out;
 	} else if (ubuffer && bufflen) {
-		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_WAIT);
+		ret = blk_rq_map_user(q, req, NULL, ubuffer, bufflen, __GFP_RECLAIM);
 		if (ret)
 			goto out;
 		bio = req->bio;
diff --git a/drivers/block/paride/pd.c b/drivers/block/paride/pd.c
index b9242d78283d..562b5a4ca7b7 100644
--- a/drivers/block/paride/pd.c
+++ b/drivers/block/paride/pd.c
@@ -723,7 +723,7 @@ static int pd_special_command(struct pd_unit *disk,
 	struct request *rq;
 	int err = 0;
 
-	rq = blk_get_request(disk->gd->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(disk->gd->queue, READ, __GFP_RECLAIM);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 4c20c228184c..e372a5f08847 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -704,14 +704,14 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
 	int ret = 0;
 
 	rq = blk_get_request(q, (cgc->data_direction == CGC_DATA_WRITE) ?
-			     WRITE : READ, __GFP_WAIT);
+			     WRITE : READ, __GFP_RECLAIM);
 	if (IS_ERR(rq))
 		return PTR_ERR(rq);
 	blk_rq_set_block_pc(rq);
 
 	if (cgc->buflen) {
 		ret = blk_rq_map_kern(q, rq, cgc->buffer, cgc->buflen,
-				      __GFP_WAIT);
+				      __GFP_RECLAIM);
 		if (ret)
 			goto out;
 	}
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index c2b45081c5ab..2ca8638c5b81 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2226,7 +2226,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 	mapping = file_inode(obj->base.filp)->i_mapping;
 	gfp = mapping_gfp_mask(mapping);
 	gfp |= __GFP_NORETRY | __GFP_NOWARN;
-	gfp &= ~(__GFP_IO | __GFP_WAIT);
+	gfp &= ~(__GFP_IO | __GFP_RECLAIM);
 	sg = st->sgl;
 	st->nents = 0;
 	for (i = 0; i < page_count; i++) {
diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 1362ad80a76c..05352f490d60 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -92,7 +92,7 @@ int ide_queue_pc_tail(ide_drive_t *drive, struct gendisk *disk,
 	struct request *rq;
 	int error;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->special = (char *)pc;
 
diff --git a/drivers/ide/ide-cd.c b/drivers/ide/ide-cd.c
index 64a6b827b3dd..ef907fd5ba98 100644
--- a/drivers/ide/ide-cd.c
+++ b/drivers/ide/ide-cd.c
@@ -441,7 +441,7 @@ int ide_cd_queue_pc(ide_drive_t *drive, const unsigned char *cmd,
 		struct request *rq;
 		int error;
 
-		rq = blk_get_request(drive->queue, write, __GFP_WAIT);
+		rq = blk_get_request(drive->queue, write, __GFP_RECLAIM);
 
 		memcpy(rq->cmd, cmd, BLK_MAX_CDB);
 		rq->cmd_type = REQ_TYPE_ATA_PC;
diff --git a/drivers/ide/ide-cd_ioctl.c b/drivers/ide/ide-cd_ioctl.c
index 066e39036518..474173eb31bb 100644
--- a/drivers/ide/ide-cd_ioctl.c
+++ b/drivers/ide/ide-cd_ioctl.c
@@ -303,7 +303,7 @@ int ide_cdrom_reset(struct cdrom_device_info *cdi)
 	struct request *rq;
 	int ret;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd_flags = REQ_QUIET;
 	ret = blk_execute_rq(drive->queue, cd->disk, rq, 0);
diff --git a/drivers/ide/ide-devsets.c b/drivers/ide/ide-devsets.c
index b05a74d78ef5..0dd43b4fcec6 100644
--- a/drivers/ide/ide-devsets.c
+++ b/drivers/ide/ide-devsets.c
@@ -165,7 +165,7 @@ int ide_devset_execute(ide_drive_t *drive, const struct ide_devset *setting,
 	if (!(setting->flags & DS_SYNC))
 		return setting->set(drive, arg);
 
-	rq = blk_get_request(q, READ, __GFP_WAIT);
+	rq = blk_get_request(q, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd_len = 5;
 	rq->cmd[0] = REQ_DEVSET_EXEC;
diff --git a/drivers/ide/ide-disk.c b/drivers/ide/ide-disk.c
index 56b9708894a5..37a8a907febe 100644
--- a/drivers/ide/ide-disk.c
+++ b/drivers/ide/ide-disk.c
@@ -477,7 +477,7 @@ static int set_multcount(ide_drive_t *drive, int arg)
 	if (drive->special_flags & IDE_SFLAG_SET_MULTMODE)
 		return -EBUSY;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
 
 	drive->mult_req = arg;
diff --git a/drivers/ide/ide-ioctls.c b/drivers/ide/ide-ioctls.c
index aa2e9b77b20d..d05db2469209 100644
--- a/drivers/ide/ide-ioctls.c
+++ b/drivers/ide/ide-ioctls.c
@@ -125,7 +125,7 @@ static int ide_cmd_ioctl(ide_drive_t *drive, unsigned long arg)
 	if (NULL == (void *) arg) {
 		struct request *rq;
 
-		rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+		rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 		rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
 		err = blk_execute_rq(drive->queue, NULL, rq, 0);
 		blk_put_request(rq);
@@ -221,7 +221,7 @@ static int generic_drive_reset(ide_drive_t *drive)
 	struct request *rq;
 	int ret = 0;
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd_len = 1;
 	rq->cmd[0] = REQ_DRIVE_RESET;
diff --git a/drivers/ide/ide-park.c b/drivers/ide/ide-park.c
index c80868520488..2d7dca56dd24 100644
--- a/drivers/ide/ide-park.c
+++ b/drivers/ide/ide-park.c
@@ -31,7 +31,7 @@ static void issue_park_cmd(ide_drive_t *drive, unsigned long timeout)
 	}
 	spin_unlock_irq(&hwif->lock);
 
-	rq = blk_get_request(q, READ, __GFP_WAIT);
+	rq = blk_get_request(q, READ, __GFP_RECLAIM);
 	rq->cmd[0] = REQ_PARK_HEADS;
 	rq->cmd_len = 1;
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 081e43458d50..e34af488693a 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -18,7 +18,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_PM_SUSPEND;
 	rq->special = &rqpm;
 	rqpm.pm_step = IDE_PM_START_SUSPEND;
@@ -88,7 +88,7 @@ int generic_ide_resume(struct device *dev)
 	}
 
 	memset(&rqpm, 0, sizeof(rqpm));
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_PM_RESUME;
 	rq->cmd_flags |= REQ_PREEMPT;
 	rq->special = &rqpm;
diff --git a/drivers/ide/ide-tape.c b/drivers/ide/ide-tape.c
index f5d51d1d09ee..12fa04997dcc 100644
--- a/drivers/ide/ide-tape.c
+++ b/drivers/ide/ide-tape.c
@@ -852,7 +852,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
 	BUG_ON(cmd != REQ_IDETAPE_READ && cmd != REQ_IDETAPE_WRITE);
 	BUG_ON(size < 0 || size % tape->blk_size);
 
-	rq = blk_get_request(drive->queue, READ, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, READ, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_DRV_PRIV;
 	rq->cmd[13] = cmd;
 	rq->rq_disk = tape->disk;
@@ -860,7 +860,7 @@ static int idetape_queue_rw_tail(ide_drive_t *drive, int cmd, int size)
 
 	if (size) {
 		ret = blk_rq_map_kern(drive->queue, rq, tape->buf, size,
-				      __GFP_WAIT);
+				      __GFP_RECLAIM);
 		if (ret)
 			goto out_put;
 	}
diff --git a/drivers/ide/ide-taskfile.c b/drivers/ide/ide-taskfile.c
index 0979e126fff1..a716693417a3 100644
--- a/drivers/ide/ide-taskfile.c
+++ b/drivers/ide/ide-taskfile.c
@@ -430,7 +430,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
 	int error;
 	int rw = !(cmd->tf_flags & IDE_TFLAG_WRITE) ? READ : WRITE;
 
-	rq = blk_get_request(drive->queue, rw, __GFP_WAIT);
+	rq = blk_get_request(drive->queue, rw, __GFP_RECLAIM);
 	rq->cmd_type = REQ_TYPE_ATA_TASKFILE;
 
 	/*
@@ -441,7 +441,7 @@ int ide_raw_taskfile(ide_drive_t *drive, struct ide_cmd *cmd, u8 *buf,
 	 */
 	if (nsect) {
 		error = blk_rq_map_kern(drive->queue, rq, buf,
-					nsect * SECTOR_SIZE, __GFP_WAIT);
+					nsect * SECTOR_SIZE, __GFP_RECLAIM);
 		if (error)
 			goto put_req;
 	}
diff --git a/drivers/infiniband/hw/ipath/ipath_file_ops.c b/drivers/infiniband/hw/ipath/ipath_file_ops.c
index 450d15965005..c11f6c58ce53 100644
--- a/drivers/infiniband/hw/ipath/ipath_file_ops.c
+++ b/drivers/infiniband/hw/ipath/ipath_file_ops.c
@@ -905,7 +905,7 @@ static int ipath_create_user_egr(struct ipath_portdata *pd)
 	 * heavy filesystem activity makes these fail, and we can
 	 * use compound pages.
 	 */
-	gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+	gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
 
 	egrcnt = dd->ipath_rcvegrcnt;
 	/* TID number offset for this port */
diff --git a/drivers/infiniband/hw/qib/qib_init.c b/drivers/infiniband/hw/qib/qib_init.c
index 7e00470adc30..4ff340fe904f 100644
--- a/drivers/infiniband/hw/qib/qib_init.c
+++ b/drivers/infiniband/hw/qib/qib_init.c
@@ -1680,7 +1680,7 @@ int qib_setup_eagerbufs(struct qib_ctxtdata *rcd)
 	 * heavy filesystem activity makes these fail, and we can
 	 * use compound pages.
 	 */
-	gfp_flags = __GFP_WAIT | __GFP_IO | __GFP_COMP;
+	gfp_flags = __GFP_RECLAIM | __GFP_IO | __GFP_COMP;
 
 	egrcnt = rcd->rcvegrcnt;
 	egroff = rcd->rcvegr_tid_base;
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 191617492181..5a312958c094 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -85,7 +85,7 @@ MODULE_LICENSE("GPL");
 
 /*
  * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't
- * allow wait (__GFP_WAIT) for NOSLEEP page allocations. Use
+ * allow wait (__GFP_RECLAIM) for NOSLEEP page allocations. Use
  * __GFP_NOWARN, to suppress page allocation failure warnings.
  */
 #define VMW_PAGE_ALLOC_NOSLEEP		(__GFP_HIGHMEM|__GFP_NOWARN)
diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index cfadccef045c..26416e21295d 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1961,7 +1961,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
 	struct request *req;
 
 	/*
-	 * blk_get_request with GFP_KERNEL (__GFP_WAIT) sleeps until a
+	 * blk_get_request with GFP_KERNEL (__GFP_RECLAIM) sleeps until a
 	 * request becomes available
 	 */
 	req = blk_get_request(sdev->request_queue, READ, GFP_KERNEL);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 448ebdaa3d69..2396259b682b 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -221,13 +221,13 @@ int scsi_execute(struct scsi_device *sdev, const unsigned char *cmd,
 	int write = (data_direction == DMA_TO_DEVICE);
 	int ret = DRIVER_ERROR << 24;
 
-	req = blk_get_request(sdev->request_queue, write, __GFP_WAIT);
+	req = blk_get_request(sdev->request_queue, write, __GFP_RECLAIM);
 	if (IS_ERR(req))
 		return ret;
 	blk_rq_set_block_pc(req);
 
 	if (bufflen &&	blk_rq_map_kern(sdev->request_queue, req,
-					buffer, bufflen, __GFP_WAIT))
+					buffer, bufflen, __GFP_RECLAIM))
 		goto out;
 
 	req->cmd_len = COMMAND_SIZE(cmd[0]);
diff --git a/fs/cachefiles/internal.h b/fs/cachefiles/internal.h
index aecd0859eacb..9c4b737a54df 100644
--- a/fs/cachefiles/internal.h
+++ b/fs/cachefiles/internal.h
@@ -30,7 +30,7 @@ extern unsigned cachefiles_debug;
 #define CACHEFILES_DEBUG_KLEAVE	2
 #define CACHEFILES_DEBUG_KDEBUG	4
 
-#define cachefiles_gfp (__GFP_WAIT | __GFP_NORETRY | __GFP_NOMEMALLOC)
+#define cachefiles_gfp (__GFP_RECLAIM | __GFP_NORETRY | __GFP_NOMEMALLOC)
 
 /*
  * node records
diff --git a/fs/direct-io.c b/fs/direct-io.c
index 745d2342651a..b97cf506a20e 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -360,7 +360,7 @@ dio_bio_alloc(struct dio *dio, struct dio_submit *sdio,
 
 	/*
 	 * bio_alloc() is guaranteed to return a bio when called with
-	 * __GFP_WAIT and we request a valid number of vectors.
+	 * __GFP_RECLAIM and we request a valid number of vectors.
 	 */
 	bio = bio_alloc(GFP_KERNEL, nr_vecs);
 
diff --git a/fs/nilfs2/mdt.h b/fs/nilfs2/mdt.h
index fe529a87a208..03246cac3338 100644
--- a/fs/nilfs2/mdt.h
+++ b/fs/nilfs2/mdt.h
@@ -72,7 +72,7 @@ static inline struct nilfs_mdt_info *NILFS_MDT(const struct inode *inode)
 }
 
 /* Default GFP flags using highmem */
-#define NILFS_MDT_GFP      (__GFP_WAIT | __GFP_IO | __GFP_HIGHMEM)
+#define NILFS_MDT_GFP      (__GFP_RECLAIM | __GFP_IO | __GFP_HIGHMEM)
 
 int nilfs_mdt_get_block(struct inode *, unsigned long, int,
 			void (*init_block)(struct inode *,
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index bd1937977d84..f928bdee2b96 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -104,7 +104,7 @@ struct vm_area_struct;
  * can be cleared when the reclaiming of pages would cause unnecessary
  * disruption.
  */
-#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
+#define __GFP_RECLAIM (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 
@@ -123,12 +123,12 @@ struct vm_area_struct;
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
-#define GFP_NOIO	(__GFP_WAIT)
-#define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
-#define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
-#define GFP_TEMPORARY	(__GFP_WAIT | __GFP_IO | __GFP_FS | \
+#define GFP_NOIO	(__GFP_RECLAIM)
+#define GFP_NOFS	(__GFP_RECLAIM | __GFP_IO)
+#define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
+#define GFP_TEMPORARY	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
 			 __GFP_RECLAIMABLE)
-#define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
+#define GFP_USER	(__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
@@ -141,12 +141,12 @@ struct vm_area_struct;
 #define GFP_MOVABLE_SHIFT 3
 
 /* Control page allocator reclaim behavior */
-#define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
+#define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC)
 
 /* Control slab gfp mask during early boot */
-#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
+#define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_RECLAIM|__GFP_IO|__GFP_FS))
 
 /* Control allocation constraints */
 #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL|__GFP_THISNODE)
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 2f30ca91e4fa..3841af470cf9 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -261,7 +261,7 @@ static int hib_submit_io(int rw, pgoff_t page_off, void *addr,
 	struct bio *bio;
 	int error = 0;
 
-	bio = bio_alloc(__GFP_WAIT | __GFP_HIGH, 1);
+	bio = bio_alloc(__GFP_RECLAIM | __GFP_HIGH, 1);
 	bio->bi_iter.bi_sector = page_off * (PAGE_SIZE >> 9);
 	bio->bi_bdev = hib_resume_bdev;
 
@@ -360,7 +360,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
 		return -ENOSPC;
 
 	if (hb) {
-		src = (void *)__get_free_page(__GFP_WAIT | __GFP_NOWARN |
+		src = (void *)__get_free_page(__GFP_RECLAIM | __GFP_NOWARN |
 		                              __GFP_NORETRY);
 		if (src) {
 			copy_page(src, buf);
@@ -368,7 +368,7 @@ static int write_page(void *buf, sector_t offset, struct hib_bio_batch *hb)
 			ret = hib_wait_io(hb); /* Free pages */
 			if (ret)
 				return ret;
-			src = (void *)__get_free_page(__GFP_WAIT |
+			src = (void *)__get_free_page(__GFP_RECLAIM |
 			                              __GFP_NOWARN |
 			                              __GFP_NORETRY);
 			if (src) {
@@ -676,7 +676,7 @@ static int save_image_lzo(struct swap_map_handle *handle,
 	nr_threads = num_online_cpus() - 1;
 	nr_threads = clamp_val(nr_threads, 1, LZO_THREADS);
 
-	page = (void *)__get_free_page(__GFP_WAIT | __GFP_HIGH);
+	page = (void *)__get_free_page(__GFP_RECLAIM | __GFP_HIGH);
 	if (!page) {
 		printk(KERN_ERR "PM: Failed to allocate LZO page\n");
 		ret = -ENOMEM;
@@ -979,7 +979,7 @@ static int get_swap_reader(struct swap_map_handle *handle,
 		last = tmp;
 
 		tmp->map = (struct swap_map_page *)
-		           __get_free_page(__GFP_WAIT | __GFP_HIGH);
+		           __get_free_page(__GFP_RECLAIM | __GFP_HIGH);
 		if (!tmp->map) {
 			release_swap_reader(handle);
 			return -ENOMEM;
@@ -1246,8 +1246,8 @@ static int load_image_lzo(struct swap_map_handle *handle,
 
 	for (i = 0; i < read_pages; i++) {
 		page[i] = (void *)__get_free_page(i < LZO_CMP_PAGES ?
-		                                  __GFP_WAIT | __GFP_HIGH :
-		                                  __GFP_WAIT | __GFP_NOWARN |
+		                                  __GFP_RECLAIM | __GFP_HIGH :
+		                                  __GFP_RECLAIM | __GFP_NOWARN |
 		                                  __GFP_NORETRY);
 
 		if (!page[i]) {
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
index f75715131f20..6d40944960de 100644
--- a/lib/percpu_ida.c
+++ b/lib/percpu_ida.c
@@ -135,7 +135,7 @@ static inline unsigned alloc_local_tag(struct percpu_ida_cpu *tags)
  * TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
  *
  * @gfp indicates whether or not to wait until a free id is available (it's not
- * used for internal memory allocations); thus if passed __GFP_WAIT we may sleep
+ * used for internal memory allocations); thus if passed __GFP_RECLAIM we may sleep
  * however long it takes until another thread frees an id (same semantics as a
  * mempool).
  *
diff --git a/mm/failslab.c b/mm/failslab.c
index fefaabaab76d..69f083146a37 100644
--- a/mm/failslab.c
+++ b/mm/failslab.c
@@ -3,11 +3,11 @@
 
 static struct {
 	struct fault_attr attr;
-	u32 ignore_gfp_wait;
+	u32 ignore_gfp_reclaim;
 	int cache_filter;
 } failslab = {
 	.attr = FAULT_ATTR_INITIALIZER,
-	.ignore_gfp_wait = 1,
+	.ignore_gfp_reclaim = 1,
 	.cache_filter = 0,
 };
 
@@ -16,7 +16,7 @@ bool should_failslab(size_t size, gfp_t gfpflags, unsigned long cache_flags)
 	if (gfpflags & __GFP_NOFAIL)
 		return false;
 
-        if (failslab.ignore_gfp_wait && (gfpflags & __GFP_WAIT))
+        if (failslab.ignore_gfp_reclaim && (gfpflags & __GFP_RECLAIM))
 		return false;
 
 	if (failslab.cache_filter && !(cache_flags & SLAB_FAILSLAB))
@@ -42,7 +42,7 @@ static int __init failslab_debugfs_init(void)
 		return PTR_ERR(dir);
 
 	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
-				&failslab.ignore_gfp_wait))
+				&failslab.ignore_gfp_reclaim))
 		goto fail;
 	if (!debugfs_create_bool("cache-filter", mode, dir,
 				&failslab.cache_filter))
diff --git a/mm/filemap.c b/mm/filemap.c
index 1283fc825458..986fe45a5d27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2673,7 +2673,7 @@ EXPORT_SYMBOL(generic_file_write_iter);
  * page is known to the local caching routines.
  *
  * The @gfp_mask argument specifies whether I/O may be performed to release
- * this page (__GFP_IO), and whether the call may block (__GFP_WAIT & __GFP_FS).
+ * this page (__GFP_IO), and whether the call may block (__GFP_RECLAIM & __GFP_FS).
  *
  */
 int try_to_release_page(struct page *page, gfp_t gfp_mask)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 097c7a4bfbd9..36efda9ff8f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -767,7 +767,7 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 
 static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
 {
-	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
+	return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_RECLAIM)) | extra_gfp;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/migrate.c b/mm/migrate.c
index 0e16c4047638..c6aa1ef906b4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1738,7 +1738,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_WAIT,
+		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 68f961bdfdf8..d176be999b26 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2135,11 +2135,11 @@ static struct {
 	struct fault_attr attr;
 
 	u32 ignore_gfp_highmem;
-	u32 ignore_gfp_wait;
+	u32 ignore_gfp_reclaim;
 	u32 min_order;
 } fail_page_alloc = {
 	.attr = FAULT_ATTR_INITIALIZER,
-	.ignore_gfp_wait = 1,
+	.ignore_gfp_reclaim = 1,
 	.ignore_gfp_highmem = 1,
 	.min_order = 1,
 };
@@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 		return false;
 	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
 		return false;
-	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
+	if (fail_page_alloc.ignore_gfp_reclaim && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		return false;
 
 	return should_fail(&fail_page_alloc.attr, 1 << order);
@@ -2177,7 +2177,7 @@ static int __init fail_page_alloc_debugfs(void)
 		return PTR_ERR(dir);
 
 	if (!debugfs_create_bool("ignore-gfp-wait", mode, dir,
-				&fail_page_alloc.ignore_gfp_wait))
+				&fail_page_alloc.ignore_gfp_reclaim))
 		goto fail;
 	if (!debugfs_create_bool("ignore-gfp-highmem", mode, dir,
 				&fail_page_alloc.ignore_gfp_highmem))
@@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
 		if (test_thread_flag(TIF_MEMDIE) ||
 		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
 			filter &= ~SHOW_MEM_FILTER_NODES;
-	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
+	if (in_interrupt() || !(gfp_mask & __GFP_RECLAIM) || (gfp_mask & __GFP_ATOMIC))
 		filter &= ~SHOW_MEM_FILTER_NODES;
 
 	if (fmt) {
diff --git a/security/integrity/ima/ima_crypto.c b/security/integrity/ima/ima_crypto.c
index e24121afb2f2..6eb62936c672 100644
--- a/security/integrity/ima/ima_crypto.c
+++ b/security/integrity/ima/ima_crypto.c
@@ -126,7 +126,7 @@ static void *ima_alloc_pages(loff_t max_size, size_t *allocated_size,
 {
 	void *ptr;
 	int order = ima_maxorder;
-	gfp_t gfp_mask = __GFP_WAIT | __GFP_NOWARN | __GFP_NORETRY;
+	gfp_t gfp_mask = __GFP_RECLAIM | __GFP_NOWARN | __GFP_NORETRY;
 
 	if (order)
 		order = min(get_order(max_size), order);
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 09/12] mm, page_alloc: Delete the zonelist_cache
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (7 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 08/12] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
@ 2015-08-24 12:09 ` Mel Gorman
  2015-08-24 12:29 ` [PATCH 10/12] mm, page_alloc: Remove MIGRATE_RESERVE Mel Gorman
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML, Mel Gorman

The zonelist cache (zlc) was introduced to skip over zones that were
recently known to be full. This avoided expensive operations such as the
cpuset checks, watermark calculations and zone_reclaim. The situation
today is different and the complexity of zlc is harder to justify.

1) The cpuset checks are no-ops unless a cpuset is active and in general are
   a lot cheaper.

2) zone_reclaim is now disabled by default and I suspect that was a large
   source of the cost that zlc wanted to avoid. When it is enabled, it's
   known to be a major source of stalling when nodes fill up and it's
   unwise to hit every other user with the overhead.

3) Watermark checks are expensive to calculate for high-order
   allocation requests. Later patches in this series will reduce the cost
   of the watermark checking.

4) The most important issue is that in the current implementation it
   is possible for a failed THP allocation to mark a zone full for order-0
   allocations and cause a fallback to remote nodes.

The last issue could be addressed with additional complexity but as the
benefit of zlc is questionable, it is better to remove it.  If stalls
due to zone_reclaim are ever reported then an alternative would be to
introduce deferring logic based on a timeout inside zone_reclaim itself
and leave the page allocator fast paths alone.
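
As a purely illustrative sketch of that alternative (nothing below is part of
this series), zone_reclaim() could remember when it last failed for a zone and
refuse to scan again for a second; the reclaim_failed_jiffies field and both
function names are invented for illustration only:

  /* Sketch only: defer zone_reclaim() for HZ after a failed attempt, */
  /* instead of caching "full" zones in the allocator fast path.      */
  static bool zone_reclaim_recently_failed(struct zone *zone)
  {
          return time_before(jiffies, zone->reclaim_failed_jiffies + HZ);
  }

  static int zone_reclaim_deferred(struct zone *zone, gfp_t gfp_mask,
                                   unsigned int order)
  {
          int ret;

          if (zone_reclaim_recently_failed(zone))
                  return ZONE_RECLAIM_NOSCAN;

          ret = __zone_reclaim(zone, gfp_mask, order);
          if (ret != ZONE_RECLAIM_SUCCESS)
                  zone->reclaim_failed_jiffies = jiffies;
          return ret;
  }

This keeps the deferral state and cost in the zone_reclaim slow path rather
than in get_page_from_freelist().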

The impact on page-allocator microbenchmarks is negligible as they don't
hit the paths where the zlc comes into play. Most page-reclaim related
workloads showed no noticeable difference as a result of the removal.

The impact was noticeable in a workload called "stutter". One part uses a
lot of anonymous memory, a second measures mmap latency and a third copies
a large file. In an ideal world the latency-measuring application would not
notice the mmap latency.  On a 2-node machine the results of this patch are

2-node machine stutter
                             4.2.0-rc7             4.2.0-rc7
                               vanilla            nozlc-v3r8
Min         mmap     21.2266 (  0.00%)     21.0431 (  0.86%)
1st-qrtle   mmap     24.3096 (  0.00%)     28.6013 (-17.65%)
2nd-qrtle   mmap     28.5789 (  0.00%)     29.2909 ( -2.49%)
3rd-qrtle   mmap     29.1970 (  0.00%)     29.9426 ( -2.55%)
Max-90%     mmap     33.6982 (  0.00%)     34.0195 ( -0.95%)
Max-93%     mmap     34.1751 (  0.00%)     34.3753 ( -0.59%)
Max-95%     mmap     38.7323 (  0.00%)     35.0257 (  9.57%)
Max-99%     mmap     93.7964 (  0.00%)     89.8644 (  4.19%)
Max         mmap 494274.6891 (  0.00%)  47520.4050 ( 90.39%)
Mean        mmap     48.2708 (  0.00%)     46.5963 (  3.47%)
Best99%Mean mmap     29.0087 (  0.00%)     30.2834 ( -4.39%)
Best95%Mean mmap     27.3926 (  0.00%)     29.2582 ( -6.81%)
Best90%Mean mmap     27.0060 (  0.00%)     28.9746 ( -7.29%)
Best50%Mean mmap     24.9743 (  0.00%)     27.8092 (-11.35%)
Best10%Mean mmap     22.9546 (  0.00%)     24.7627 ( -7.88%)
Best5%Mean  mmap     21.9682 (  0.00%)     23.3681 ( -6.37%)
Best1%Mean  mmap     21.3082 (  0.00%)     21.2479 (  0.28%)

Note that the maximum stall latency went from 494 seconds to 47 seconds
which is still awful but an improvement. The mileage here varies considerably
as a 4-node machine that tested an earlier version of this patch went from
a worst-case stall time of 6 seconds to 67ms. The nature of the benchmark
is inherently unpredictable as it is hammering the system and the mileage
will vary between machines.

There is a secondary impact with potentially more direct reclaim because
zones are now being considered instead of being skipped by zlc. In this
particular test run it did not occur so it is not described here. However,
in at least one test the following was observed:

1. Direct reclaim rates were higher. This was likely due to direct reclaim
  being entered instead of the zlc disabling a zone and busy looping.
  Busy looping may have the effect of allowing kswapd to make more
  progress and in some cases may be better overall. If this is found then
  the correct action is to put direct reclaimers to sleep on a waitqueue
  and allow kswapd to make forward progress. Busy looping on the zlc is even
  worse than when the allocator used to blindly call congestion_wait().

2. There was higher swap activity as direct reclaim was active.

3. Direct reclaim efficiency was lower. This is related to 1 as more
  scanning activity also encountered more pages that could not be
  immediately reclaimed.

In that case, the direct page scan and reclaim rates are noticeable but
it is not considered a problem for a few reasons

1. The test is primarily concerned with latency. The mmap attempts are also
   faulted which means there are THP allocation requests. The zlc could
   cause zones to be disabled, making the process busy loop instead of
   reclaiming. Without the zlc the process reclaims, which looks like
   elevated direct reclaim activity but is the correct action to take
   based on what the processes requested.

2. The test hammers reclaim and compaction heavily. The number of successful
   THP faults is highly variable but affects the reclaim stats. It's not a
   realistic or reasonable measure of page reclaim activity.

3. No other page-reclaim intensive workload that was tested showed a problem.

4. If a workload is identified that benefitted from the busy looping then it
   should be fixed by having direct reclaimers sleep on a wait queue until
   woken by kswapd instead of busy looping. We had this class of problem before
   when congestion_wait() with a fixed timeout was a brain-damaged decision
   but happened to benefit some workloads.

If a workload is identified that relied on the zlc to busy loop then it
should be fixed correctly and have a direct reclaimer sleep on a waitqueue
until woken by kswapd.
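
Purely as a sketch of that fix (none of it is in this series), direct
reclaimers could wait on a per-node waitqueue and be woken by kswapd once it
has made progress, much like the existing pfmemalloc_wait throttling in
vmscan.c. The reclaim_wait field and pgdat_watermarks_ok() helper below are
invented names used only for illustration:

  /* Called by a direct reclaimer instead of busy looping. */
  static void wait_for_kswapd_progress(pg_data_t *pgdat)
  {
          /* Sleep until kswapd reports progress, or give up after 100ms. */
          wait_event_timeout(pgdat->reclaim_wait,
                             pgdat_watermarks_ok(pgdat), HZ / 10);
  }

  /* Called by kswapd after it has reclaimed pages for this node. */
  static void wake_direct_reclaimers(pg_data_t *pgdat)
  {
          if (waitqueue_active(&pgdat->reclaim_wait))
                  wake_up_all(&pgdat->reclaim_wait);
  }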

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/mmzone.h |  74 -----------------
 mm/page_alloc.c        | 212 -------------------------------------------------
 2 files changed, 286 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fc0457d005f8..aef62cc11c80 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -585,75 +585,8 @@ static inline bool zone_is_empty(struct zone *zone)
  * [1]	: No fallback (__GFP_THISNODE)
  */
 #define MAX_ZONELISTS 2
-
-
-/*
- * We cache key information from each zonelist for smaller cache
- * footprint when scanning for free pages in get_page_from_freelist().
- *
- * 1) The BITMAP fullzones tracks which zones in a zonelist have come
- *    up short of free memory since the last time (last_fullzone_zap)
- *    we zero'd fullzones.
- * 2) The array z_to_n[] maps each zone in the zonelist to its node
- *    id, so that we can efficiently evaluate whether that node is
- *    set in the current tasks mems_allowed.
- *
- * Both fullzones and z_to_n[] are one-to-one with the zonelist,
- * indexed by a zones offset in the zonelist zones[] array.
- *
- * The get_page_from_freelist() routine does two scans.  During the
- * first scan, we skip zones whose corresponding bit in 'fullzones'
- * is set or whose corresponding node in current->mems_allowed (which
- * comes from cpusets) is not set.  During the second scan, we bypass
- * this zonelist_cache, to ensure we look methodically at each zone.
- *
- * Once per second, we zero out (zap) fullzones, forcing us to
- * reconsider nodes that might have regained more free memory.
- * The field last_full_zap is the time we last zapped fullzones.
- *
- * This mechanism reduces the amount of time we waste repeatedly
- * reexaming zones for free memory when they just came up low on
- * memory momentarilly ago.
- *
- * The zonelist_cache struct members logically belong in struct
- * zonelist.  However, the mempolicy zonelists constructed for
- * MPOL_BIND are intentionally variable length (and usually much
- * shorter).  A general purpose mechanism for handling structs with
- * multiple variable length members is more mechanism than we want
- * here.  We resort to some special case hackery instead.
- *
- * The MPOL_BIND zonelists don't need this zonelist_cache (in good
- * part because they are shorter), so we put the fixed length stuff
- * at the front of the zonelist struct, ending in a variable length
- * zones[], as is needed by MPOL_BIND.
- *
- * Then we put the optional zonelist cache on the end of the zonelist
- * struct.  This optional stuff is found by a 'zlcache_ptr' pointer in
- * the fixed length portion at the front of the struct.  This pointer
- * both enables us to find the zonelist cache, and in the case of
- * MPOL_BIND zonelists, (which will just set the zlcache_ptr to NULL)
- * to know that the zonelist cache is not there.
- *
- * The end result is that struct zonelists come in two flavors:
- *  1) The full, fixed length version, shown below, and
- *  2) The custom zonelists for MPOL_BIND.
- * The custom MPOL_BIND zonelists have a NULL zlcache_ptr and no zlcache.
- *
- * Even though there may be multiple CPU cores on a node modifying
- * fullzones or last_full_zap in the same zonelist_cache at the same
- * time, we don't lock it.  This is just hint data - if it is wrong now
- * and then, the allocator will still function, perhaps a bit slower.
- */
-
-
-struct zonelist_cache {
-	unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];		/* zone->nid */
-	DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);	/* zone full? */
-	unsigned long last_full_zap;		/* when last zap'd (jiffies) */
-};
 #else
 #define MAX_ZONELISTS 1
-struct zonelist_cache;
 #endif
 
 /*
@@ -671,9 +604,6 @@ struct zoneref {
  * allocation, the other zones are fallback zones, in decreasing
  * priority.
  *
- * If zlcache_ptr is not NULL, then it is just the address of zlcache,
- * as explained above.  If zlcache_ptr is NULL, there is no zlcache.
- * *
  * To speed the reading of the zonelist, the zonerefs contain the zone index
  * of the entry being read. Helper functions to access information given
  * a struct zoneref are
@@ -683,11 +613,7 @@ struct zoneref {
  * zonelist_node_idx()	- Return the index of the node for an entry
  */
 struct zonelist {
-	struct zonelist_cache *zlcache_ptr;		     // NULL or &zlcache
 	struct zoneref _zonerefs[MAX_ZONES_PER_ZONELIST + 1];
-#ifdef CONFIG_NUMA
-	struct zonelist_cache zlcache;			     // optional ...
-#endif
 };
 
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d176be999b26..aa52a91a7d44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2266,122 +2266,6 @@ bool zone_watermark_ok_safe(struct zone *z, unsigned int order,
 }
 
 #ifdef CONFIG_NUMA
-/*
- * zlc_setup - Setup for "zonelist cache".  Uses cached zone data to
- * skip over zones that are not allowed by the cpuset, or that have
- * been recently (in last second) found to be nearly full.  See further
- * comments in mmzone.h.  Reduces cache footprint of zonelist scans
- * that have to skip over a lot of full or unallowed zones.
- *
- * If the zonelist cache is present in the passed zonelist, then
- * returns a pointer to the allowed node mask (either the current
- * tasks mems_allowed, or node_states[N_MEMORY].)
- *
- * If the zonelist cache is not available for this zonelist, does
- * nothing and returns NULL.
- *
- * If the fullzones BITMAP in the zonelist cache is stale (more than
- * a second since last zap'd) then we zap it out (clear its bits.)
- *
- * We hold off even calling zlc_setup, until after we've checked the
- * first zone in the zonelist, on the theory that most allocations will
- * be satisfied from that first zone, so best to examine that zone as
- * quickly as we can.
- */
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-	nodemask_t *allowednodes;	/* zonelist_cache approximation */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return NULL;
-
-	if (time_after(jiffies, zlc->last_full_zap + HZ)) {
-		bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-		zlc->last_full_zap = jiffies;
-	}
-
-	allowednodes = !in_interrupt() && (alloc_flags & ALLOC_CPUSET) ?
-					&cpuset_current_mems_allowed :
-					&node_states[N_MEMORY];
-	return allowednodes;
-}
-
-/*
- * Given 'z' scanning a zonelist, run a couple of quick checks to see
- * if it is worth looking at further for free memory:
- *  1) Check that the zone isn't thought to be full (doesn't have its
- *     bit set in the zonelist_cache fullzones BITMAP).
- *  2) Check that the zones node (obtained from the zonelist_cache
- *     z_to_n[] mapping) is allowed in the passed in allowednodes mask.
- * Return true (non-zero) if zone is worth looking at further, or
- * else return false (zero) if it is not.
- *
- * This check -ignores- the distinction between various watermarks,
- * such as GFP_HIGH, GFP_ATOMIC, PF_MEMALLOC, ...  If a zone is
- * found to be full for any variation of these watermarks, it will
- * be considered full for up to one second by all requests, unless
- * we are so low on memory on all allowed nodes that we are forced
- * into the second scan of the zonelist.
- *
- * In the second scan we ignore this zonelist cache and exactly
- * apply the watermarks to all zones, even it is slower to do so.
- * We are low on memory in the second scan, and should leave no stone
- * unturned looking for a free page.
- */
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
-						nodemask_t *allowednodes)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-	int i;				/* index of *z in zonelist zones */
-	int n;				/* node that zone *z is on */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return 1;
-
-	i = z - zonelist->_zonerefs;
-	n = zlc->z_to_n[i];
-
-	/* This zone is worth trying if it is allowed but not full */
-	return node_isset(n, *allowednodes) && !test_bit(i, zlc->fullzones);
-}
-
-/*
- * Given 'z' scanning a zonelist, set the corresponding bit in
- * zlc->fullzones, so that subsequent attempts to allocate a page
- * from that zone don't waste time re-examining it.
- */
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-	int i;				/* index of *z in zonelist zones */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return;
-
-	i = z - zonelist->_zonerefs;
-
-	set_bit(i, zlc->fullzones);
-}
-
-/*
- * clear all zones full, called after direct reclaim makes progress so that
- * a zone that was recently full is not skipped over for up to a second
- */
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-	struct zonelist_cache *zlc;	/* cached zonelist speedup info */
-
-	zlc = zonelist->zlcache_ptr;
-	if (!zlc)
-		return;
-
-	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-}
-
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
 	return local_zone->node == zone->node;
@@ -2392,28 +2276,7 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <
 				RECLAIM_DISTANCE;
 }
-
 #else	/* CONFIG_NUMA */
-
-static nodemask_t *zlc_setup(struct zonelist *zonelist, int alloc_flags)
-{
-	return NULL;
-}
-
-static int zlc_zone_worth_trying(struct zonelist *zonelist, struct zoneref *z,
-				nodemask_t *allowednodes)
-{
-	return 1;
-}
-
-static void zlc_mark_zone_full(struct zonelist *zonelist, struct zoneref *z)
-{
-}
-
-static void zlc_clear_zones_full(struct zonelist *zonelist)
-{
-}
-
 static bool zone_local(struct zone *local_zone, struct zone *zone)
 {
 	return true;
@@ -2423,7 +2286,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return true;
 }
-
 #endif	/* CONFIG_NUMA */
 
 static void reset_alloc_batches(struct zone *preferred_zone)
@@ -2450,9 +2312,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 	struct zoneref *z;
 	struct page *page = NULL;
 	struct zone *zone;
-	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
-	int zlc_active = 0;		/* set if using zonelist_cache */
-	int did_zlc_setup = 0;		/* just call zlc_setup() one time */
 	int nr_fair_skipped = 0;
 	bool zonelist_rescan;
 
@@ -2467,9 +2326,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 								ac->nodemask) {
 		unsigned long mark;
 
-		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
-			!zlc_zone_worth_trying(zonelist, z, allowednodes))
-				continue;
 		if (cpusets_mems_enabled() &&
 			(alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed(zone, gfp_mask))
@@ -2527,28 +2383,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			if (alloc_flags & ALLOC_NO_WATERMARKS)
 				goto try_this_zone;
 
-			if (IS_ENABLED(CONFIG_NUMA) &&
-					!did_zlc_setup && nr_online_nodes > 1) {
-				/*
-				 * we do zlc_setup if there are multiple nodes
-				 * and before considering the first zone allowed
-				 * by the cpuset.
-				 */
-				allowednodes = zlc_setup(zonelist, alloc_flags);
-				zlc_active = 1;
-				did_zlc_setup = 1;
-			}
-
 			if (zone_reclaim_mode == 0 ||
 			    !zone_allows_reclaim(ac->preferred_zone, zone))
-				goto this_zone_full;
-
-			/*
-			 * As we may have just activated ZLC, check if the first
-			 * eligible zone has failed zone_reclaim recently.
-			 */
-			if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
-				!zlc_zone_worth_trying(zonelist, z, allowednodes))
 				continue;
 
 			ret = zone_reclaim(zone, gfp_mask, order);
@@ -2565,19 +2401,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						ac->classzone_idx, alloc_flags))
 					goto try_this_zone;
 
-				/*
-				 * Failed to reclaim enough to meet watermark.
-				 * Only mark the zone full if checking the min
-				 * watermark or if we failed to reclaim just
-				 * 1<<order pages or else the page allocator
-				 * fastpath will prematurely mark zones full
-				 * when the watermark is between the low and
-				 * min watermarks.
-				 */
-				if (((alloc_flags & ALLOC_WMARK_MASK) == ALLOC_WMARK_MIN) ||
-				    ret == ZONE_RECLAIM_SOME)
-					goto this_zone_full;
-
 				continue;
 			}
 		}
@@ -2590,9 +2413,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 				goto try_this_zone;
 			return page;
 		}
-this_zone_full:
-		if (IS_ENABLED(CONFIG_NUMA) && zlc_active)
-			zlc_mark_zone_full(zonelist, z);
 	}
 
 	/*
@@ -2613,12 +2433,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			zonelist_rescan = true;
 	}
 
-	if (unlikely(IS_ENABLED(CONFIG_NUMA) && zlc_active)) {
-		/* Disable zlc cache for second zonelist scan */
-		zlc_active = 0;
-		zonelist_rescan = true;
-	}
-
 	if (zonelist_rescan)
 		goto zonelist_scan;
 
@@ -2858,10 +2672,6 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!(*did_some_progress)))
 		return NULL;
 
-	/* After successful reclaim, reconsider all zones for allocation */
-	if (IS_ENABLED(CONFIG_NUMA))
-		zlc_clear_zones_full(ac->zonelist);
-
 retry:
 	page = get_page_from_freelist(gfp_mask, order,
 					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
@@ -4198,20 +4008,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	build_thisnode_zonelists(pgdat);
 }
 
-/* Construct the zonelist performance cache - see further mmzone.h */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
-	struct zonelist *zonelist;
-	struct zonelist_cache *zlc;
-	struct zoneref *z;
-
-	zonelist = &pgdat->node_zonelists[0];
-	zonelist->zlcache_ptr = zlc = &zonelist->zlcache;
-	bitmap_zero(zlc->fullzones, MAX_ZONES_PER_ZONELIST);
-	for (z = zonelist->_zonerefs; z->zone; z++)
-		zlc->z_to_n[z - zonelist->_zonerefs] = zonelist_node_idx(z);
-}
-
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
 /*
  * Return node id of node used for "local" allocations.
@@ -4272,12 +4068,6 @@ static void build_zonelists(pg_data_t *pgdat)
 	zonelist->_zonerefs[j].zone_idx = 0;
 }
 
-/* non-NUMA variant of zonelist performance cache - just NULL zlcache_ptr */
-static void build_zonelist_cache(pg_data_t *pgdat)
-{
-	pgdat->node_zonelists[0].zlcache_ptr = NULL;
-}
-
 #endif	/* CONFIG_NUMA */
 
 /*
@@ -4318,14 +4108,12 @@ static int __build_all_zonelists(void *data)
 
 	if (self && !node_online(self->node_id)) {
 		build_zonelists(self);
-		build_zonelist_cache(self);
 	}
 
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 
 		build_zonelists(pgdat);
-		build_zonelist_cache(pgdat);
 	}
 
 	/*
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 10/12] mm, page_alloc: Remove MIGRATE_RESERVE
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (8 preceding siblings ...)
  2015-08-24 12:09 ` [PATCH 09/12] mm, page_alloc: Delete the zonelist_cache Mel Gorman
@ 2015-08-24 12:29 ` Mel Gorman
  2015-08-24 12:29 ` [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
  2015-08-24 12:30 ` [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  11 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

MIGRATE_RESERVE preserves an old property of the buddy allocator that existed
prior to fragmentation avoidance -- min_free_kbytes worth of pages tended to
remain contiguous until the only alternative was to fail the allocation. At the
time it was discovered that high-order atomic allocations relied on this
property so MIGRATE_RESERVE was introduced. A later patch will introduce
an alternative MIGRATE_HIGHATOMIC so this patch deletes MIGRATE_RESERVE
and supporting code so it'll be easier to review. Note that in isolation
this patch may appear to be a regression to anyone bisecting high-order
atomic allocation failures.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mmzone.h |  10 +---
 mm/huge_memory.c       |   2 +-
 mm/page_alloc.c        | 148 +++----------------------------------------------
 mm/vmstat.c            |   1 -
 4 files changed, 11 insertions(+), 150 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aef62cc11c80..cf643539d640 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,8 +39,6 @@ enum {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
 	MIGRATE_RECLAIMABLE,
-	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
-	MIGRATE_RESERVE = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
 	/*
 	 * MIGRATE_CMA migration type is designed to mimic the way
@@ -63,6 +61,8 @@ enum {
 	MIGRATE_TYPES
 };
 
+#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
+
 #ifdef CONFIG_CMA
 #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
 #else
@@ -425,12 +425,6 @@ struct zone {
 
 	const char		*name;
 
-	/*
-	 * Number of MIGRATE_RESERVE page block. To maintain for just
-	 * optimization. Protected by zone->lock.
-	 */
-	int			nr_migrate_reserve_block;
-
 #ifdef CONFIG_MEMORY_ISOLATION
 	/*
 	 * Number of isolated pageblock. It is used to solve incorrect
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 36efda9ff8f1..56cfb17169d2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -113,7 +113,7 @@ static int set_recommended_min_free_kbytes(void)
 	for_each_populated_zone(zone)
 		nr_zones++;
 
-	/* Make sure at least 2 hugepages are free for MIGRATE_RESERVE */
+	/* Ensure 2 pageblocks are free to assist fragmentation avoidance */
 	recommended_min = pageblock_nr_pages * nr_zones * 2;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa52a91a7d44..d5ce050ebe4f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -792,7 +792,6 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 			if (unlikely(has_isolate_pageblock(zone)))
 				mt = get_pageblock_migratetype(page);
 
-			/* MIGRATE_MOVABLE list may include MIGRATE_RESERVEs */
 			__free_one_page(page, page_to_pfn(page), zone, 0, mt);
 			trace_mm_page_pcpu_drain(page, 0, mt);
 		} while (--to_free && --batch_free && !list_empty(list));
@@ -1390,15 +1389,14 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
  * the free lists for the desirable migrate type are depleted
  */
 static int fallbacks[MIGRATE_TYPES][4] = {
-	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,     MIGRATE_RESERVE },
-	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,     MIGRATE_RESERVE },
-	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE,   MIGRATE_RESERVE },
+	[MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
+	[MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
+	[MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
 #ifdef CONFIG_CMA
-	[MIGRATE_CMA]         = { MIGRATE_RESERVE }, /* Never used */
+	[MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
 #endif
-	[MIGRATE_RESERVE]     = { MIGRATE_RESERVE }, /* Never used */
 #ifdef CONFIG_MEMORY_ISOLATION
-	[MIGRATE_ISOLATE]     = { MIGRATE_RESERVE }, /* Never used */
+	[MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
 #endif
 };
 
@@ -1572,7 +1570,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 	*can_steal = false;
 	for (i = 0;; i++) {
 		fallback_mt = fallbacks[migratetype][i];
-		if (fallback_mt == MIGRATE_RESERVE)
+		if (fallback_mt == MIGRATE_TYPES)
 			break;
 
 		if (list_empty(&area->free_list[fallback_mt]))
@@ -1651,25 +1649,13 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 {
 	struct page *page;
 
-retry_reserve:
 	page = __rmqueue_smallest(zone, order, migratetype);
-
-	if (unlikely(!page) && migratetype != MIGRATE_RESERVE) {
+	if (unlikely(!page)) {
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);
 
 		if (!page)
 			page = __rmqueue_fallback(zone, order, migratetype);
-
-		/*
-		 * Use MIGRATE_RESERVE rather than fail an allocation. goto
-		 * is used because __rmqueue_smallest is an inline function
-		 * and we want just one call site
-		 */
-		if (!page) {
-			migratetype = MIGRATE_RESERVE;
-			goto retry_reserve;
-		}
 	}
 
 	trace_mm_page_alloc_zone_locked(page, order, migratetype);
@@ -3462,7 +3448,6 @@ static void show_migration_types(unsigned char type)
 		[MIGRATE_UNMOVABLE]	= 'U',
 		[MIGRATE_RECLAIMABLE]	= 'E',
 		[MIGRATE_MOVABLE]	= 'M',
-		[MIGRATE_RESERVE]	= 'R',
 #ifdef CONFIG_CMA
 		[MIGRATE_CMA]		= 'C',
 #endif
@@ -4273,120 +4258,6 @@ static inline unsigned long wait_table_bits(unsigned long size)
 }
 
 /*
- * Check if a pageblock contains reserved pages
- */
-static int pageblock_is_reserved(unsigned long start_pfn, unsigned long end_pfn)
-{
-	unsigned long pfn;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
-		if (!pfn_valid_within(pfn) || PageReserved(pfn_to_page(pfn)))
-			return 1;
-	}
-	return 0;
-}
-
-/*
- * Mark a number of pageblocks as MIGRATE_RESERVE. The number
- * of blocks reserved is based on min_wmark_pages(zone). The memory within
- * the reserve will tend to store contiguous free pages. Setting min_free_kbytes
- * higher will lead to a bigger reserve which will get freed as contiguous
- * blocks as reclaim kicks in
- */
-static void setup_zone_migrate_reserve(struct zone *zone)
-{
-	unsigned long start_pfn, pfn, end_pfn, block_end_pfn;
-	struct page *page;
-	unsigned long block_migratetype;
-	int reserve;
-	int old_reserve;
-
-	/*
-	 * Get the start pfn, end pfn and the number of blocks to reserve
-	 * We have to be careful to be aligned to pageblock_nr_pages to
-	 * make sure that we always check pfn_valid for the first page in
-	 * the block.
-	 */
-	start_pfn = zone->zone_start_pfn;
-	end_pfn = zone_end_pfn(zone);
-	start_pfn = roundup(start_pfn, pageblock_nr_pages);
-	reserve = roundup(min_wmark_pages(zone), pageblock_nr_pages) >>
-							pageblock_order;
-
-	/*
-	 * Reserve blocks are generally in place to help high-order atomic
-	 * allocations that are short-lived. A min_free_kbytes value that
-	 * would result in more than 2 reserve blocks for atomic allocations
-	 * is assumed to be in place to help anti-fragmentation for the
-	 * future allocation of hugepages at runtime.
-	 */
-	reserve = min(2, reserve);
-	old_reserve = zone->nr_migrate_reserve_block;
-
-	/* When memory hot-add, we almost always need to do nothing */
-	if (reserve == old_reserve)
-		return;
-	zone->nr_migrate_reserve_block = reserve;
-
-	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
-		if (!early_page_nid_uninitialised(pfn, zone_to_nid(zone)))
-			return;
-
-		if (!pfn_valid(pfn))
-			continue;
-		page = pfn_to_page(pfn);
-
-		/* Watch out for overlapping nodes */
-		if (page_to_nid(page) != zone_to_nid(zone))
-			continue;
-
-		block_migratetype = get_pageblock_migratetype(page);
-
-		/* Only test what is necessary when the reserves are not met */
-		if (reserve > 0) {
-			/*
-			 * Blocks with reserved pages will never free, skip
-			 * them.
-			 */
-			block_end_pfn = min(pfn + pageblock_nr_pages, end_pfn);
-			if (pageblock_is_reserved(pfn, block_end_pfn))
-				continue;
-
-			/* If this block is reserved, account for it */
-			if (block_migratetype == MIGRATE_RESERVE) {
-				reserve--;
-				continue;
-			}
-
-			/* Suitable for reserving if this block is movable */
-			if (block_migratetype == MIGRATE_MOVABLE) {
-				set_pageblock_migratetype(page,
-							MIGRATE_RESERVE);
-				move_freepages_block(zone, page,
-							MIGRATE_RESERVE);
-				reserve--;
-				continue;
-			}
-		} else if (!old_reserve) {
-			/*
-			 * At boot time we don't need to scan the whole zone
-			 * for turning off MIGRATE_RESERVE.
-			 */
-			break;
-		}
-
-		/*
-		 * If the reserve is met and this is a previous reserved block,
-		 * take it back
-		 */
-		if (block_migratetype == MIGRATE_RESERVE) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			move_freepages_block(zone, page, MIGRATE_MOVABLE);
-		}
-	}
-}
-
-/*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
  * done. Non-atomic initialization, single-pass.
@@ -4425,9 +4296,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		 * movable at startup. This will force kernel allocations
 		 * to reserve their blocks rather than leaking throughout
 		 * the address space during boot when many long-lived
-		 * kernel allocations are made. Later some blocks near
-		 * the start are marked MIGRATE_RESERVE by
-		 * setup_zone_migrate_reserve()
+		 * kernel allocations are made.
 		 *
 		 * bitmap is created for zone's valid pfn range. but memmap
 		 * can be created for invalid pages (for alignment)
@@ -5985,7 +5854,6 @@ static void __setup_per_zone_wmarks(void)
 			high_wmark_pages(zone) - low_wmark_pages(zone) -
 			atomic_long_read(&zone->vm_stat[NR_ALLOC_BATCH]));
 
-		setup_zone_migrate_reserve(zone);
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
 
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 4f5cd974e11a..49963aa2dff3 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,7 +901,6 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
 	"Unmovable",
 	"Reclaimable",
 	"Movable",
-	"Reserve",
 #ifdef CONFIG_CMA
 	"CMA",
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (9 preceding siblings ...)
  2015-08-24 12:29 ` [PATCH 10/12] mm, page_alloc: Remove MIGRATE_RESERVE Mel Gorman
@ 2015-08-24 12:29 ` Mel Gorman
  2015-08-26 12:44   ` Vlastimil Babka
                     ` (2 more replies)
  2015-08-24 12:30 ` [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  11 siblings, 3 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

High-order watermark checking exists for two reasons --  kswapd high-order
awareness and protection for high-order atomic requests. Historically the
kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
that reserves pageblocks for high-order atomic allocations on demand and
avoids using those blocks for order-0 allocations. This is more flexible
and reliable than MIGRATE_RESERVE was.

A MIGRATE_HIGHATOMIC pageblock is created when a high-order allocation
request steals a pageblock, with the total number limited to 1% of the zone.
Callers that speculatively abuse atomic allocations to access the reserve
for long-lived high-order allocations will quickly fail. Note that
SLUB is currently not such an abuser as it reclaims at least once.  It is
possible that the pageblock stolen has few suitable high-order pages and
will need to steal again in the near future but there would need to be
strong justification to search all pageblocks for an ideal candidate.
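
As a rough worked example of the 1% cap (the zone size below is assumed, not
measured), the limit used by reserve_highatomic_pageblock() later in this
patch works out as follows:

  /* Standalone sketch of the reservation cap; numbers are assumed. */
  static unsigned long highatomic_cap_sketch(unsigned long managed_pages,
                                             unsigned long pageblock_pages)
  {
          /* 1% of the zone's managed pages plus one pageblock */
          return managed_pages / 100 + pageblock_pages;
  }

  /*
   * A 4GB zone of 4K pages (1048576 pages) with 2MB pageblocks (512 pages)
   * gives a cap of 10997 pages, so at most about 22 pageblocks (~44MB,
   * roughly 1% of the zone) end up reserved as MIGRATE_HIGHATOMIC.
   */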

The pageblocks are unreserved if an allocation fails after a direct
reclaim attempt.

The watermark checks account for the reserved pageblocks when the allocation
request is not a high-order atomic allocation.
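
A simplified sketch of what that accounting in __zone_watermark_ok() amounts
to (ignoring the CMA, __GFP_HIGH and lowmem-reserve adjustments, and with
made-up parameter names):

  /* Simplified sketch: ordinary requests cannot see the HighAtomic pages. */
  static bool watermark_ok_sketch(long free_pages, long min_wmark,
                                  long nr_reserved_highatomic, bool harder)
  {
          if (!harder)
                  free_pages -= nr_reserved_highatomic;  /* reserve off limits */
          else
                  min_wmark -= min_wmark / 4;   /* atomic may dip below min */

          return free_pages > min_wmark;
  }

A GFP_KERNEL allocation therefore sees the zone as if the reserved pages did
not exist, while a high-order atomic request both sees them and is allowed to
dip a quarter below the minimum watermark.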

The reserved pageblocks cannot be used for order-0 allocations. This may
allow temporary wastage until a failed reclaim reassigns the pageblock. This
is deliberate as the intent of the reservation is to satisfy a limited
number of atomic high-order short-lived requests if the system requires them.

The stutter benchmark was used to evaluate this but while it was running
there was a systemtap script that randomly allocated between 1 high-order
page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
is much larger than the potential reserve and it does not attempt to be
realistic.  It is intended to stress random high-order allocations from
an unknown source and to show that there is a reduction in failures without
introducing an anomaly where atomic allocations are more reliable than
regular allocations.  The amount of memory reserved varied throughout the
workload as reserves were created and reclaimed under memory pressure. The
allocation failure rates once the workload warmed up were as follows:

4.2-rc5-vanilla		70%
4.2-rc5-atomic-reserve	56%

The failure rate was also measured while building multiple kernels: it was
14% without the patch and 6% with it applied.

Overall, this is a small reduction but the reserves are small relative to the
number of allocation requests. In early versions of the patch, the failure
rate dropped by a much larger amount but that required much larger reserves
and perversely made atomic allocations seem more reliable than regular allocations.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/mmzone.h |   6 ++-
 mm/page_alloc.c        | 117 ++++++++++++++++++++++++++++++++++++++++++++++---
 mm/vmstat.c            |   1 +
 3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf643539d640..a9805a85940a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -39,6 +39,8 @@ enum {
 	MIGRATE_UNMOVABLE,
 	MIGRATE_MOVABLE,
 	MIGRATE_RECLAIMABLE,
+	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
+	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
 #ifdef CONFIG_CMA
 	/*
 	 * MIGRATE_CMA migration type is designed to mimic the way
@@ -61,8 +63,6 @@ enum {
 	MIGRATE_TYPES
 };
 
-#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
-
 #ifdef CONFIG_CMA
 #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
 #else
@@ -330,6 +330,8 @@ struct zone {
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long watermark[NR_WMARK];
 
+	unsigned long nr_reserved_highatomic;
+
 	/*
 	 * We don't know if the memory that we're going to allocate will be freeable
 	 * or/and it will be released eventually, so to avoid totally wasting several
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d5ce050ebe4f..2415f882b89c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1589,6 +1589,86 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
 	return -1;
 }
 
+/*
+ * Reserve a pageblock for exclusive use of high-order atomic allocations if
+ * there are no empty page blocks that contain a page with a suitable order
+ */
+static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
+				unsigned int alloc_order)
+{
+	int mt = get_pageblock_migratetype(page);
+	unsigned long max_managed, flags;
+
+	if (mt == MIGRATE_HIGHATOMIC)
+		return;
+
+	/*
+	 * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
+	 * Check is race-prone but harmless.
+	 */
+	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
+	if (zone->nr_reserved_highatomic >= max_managed)
+		return;
+
+	/* Yoink! */
+	spin_lock_irqsave(&zone->lock, flags);
+	zone->nr_reserved_highatomic += pageblock_nr_pages;
+	set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
+	move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
+
+/*
+ * Used when an allocation is about to fail under memory pressure. This
+ * potentially hurts the reliability of high-order allocations when under
+ * intense memory pressure but failed atomic allocations should be easier
+ * to recover from than an OOM.
+ */
+static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
+{
+	struct zonelist *zonelist = ac->zonelist;
+	unsigned long flags;
+	struct zoneref *z;
+	struct zone *zone;
+	struct page *page;
+	int order;
+
+	for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
+								ac->nodemask) {
+		/* Preserve at least one pageblock */
+		if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
+			continue;
+
+		spin_lock_irqsave(&zone->lock, flags);
+		for (order = 0; order < MAX_ORDER; order++) {
+			struct free_area *area = &(zone->free_area[order]);
+
+			if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
+				continue;
+
+			page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
+						struct page, lru);
+
+			zone->nr_reserved_highatomic -= pageblock_nr_pages;
+
+			/*
+			 * Convert to ac->migratetype and avoid the normal
+			 * pageblock stealing heuristics. Minimally, the caller
+			 * is doing the work and needs the pages. More
+			 * importantly, if the block was always converted to
+			 * MIGRATE_UNMOVABLE or another type then the number
+			 * of pageblocks that cannot be completely freed
+			 * may increase.
+			 */
+			set_pageblock_migratetype(page, ac->migratetype);
+			move_freepages_block(zone, page, ac->migratetype);
+			spin_unlock_irqrestore(&zone->lock, flags);
+			return;
+		}
+		spin_unlock_irqrestore(&zone->lock, flags);
+	}
+}
+
 /* Remove an element from the buddy allocator from the fallback list */
 static inline struct page *
 __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
@@ -1645,10 +1725,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
  * Call me with the zone->lock already held.
  */
 static struct page *__rmqueue(struct zone *zone, unsigned int order,
-						int migratetype)
+				int migratetype, gfp_t gfp_flags)
 {
 	struct page *page;
 
+	if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
+		page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+		if (page)
+			goto out;
+	}
+
 	page = __rmqueue_smallest(zone, order, migratetype);
 	if (unlikely(!page)) {
 		if (migratetype == MIGRATE_MOVABLE)
@@ -1658,6 +1744,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
 			page = __rmqueue_fallback(zone, order, migratetype);
 	}
 
+out:
 	trace_mm_page_alloc_zone_locked(page, order, migratetype);
 	return page;
 }
@@ -1675,7 +1762,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype, 0);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2090,7 +2177,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
 			WARN_ON_ONCE(order > 1);
 		}
 		spin_lock_irqsave(&zone->lock, flags);
-		page = __rmqueue(zone, order, migratetype);
+		page = __rmqueue(zone, order, migratetype, gfp_flags);
 		spin_unlock(&zone->lock);
 		if (!page)
 			goto failed;
@@ -2200,15 +2287,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
 			long free_pages)
 {
-	/* free_pages may go negative - that's OK */
 	long min = mark;
 	int o;
 	long free_cma = 0;
 
+	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
+
 	if (alloc_flags & ALLOC_HIGH)
 		min -= min / 2;
-	if (alloc_flags & ALLOC_HARDER)
+
+	/*
+	 * If the caller is not atomic then discount the reserves. This will
+	 * over-estimate how the atomic reserve but it avoids a search
+	 */
+	if (likely(!(alloc_flags & ALLOC_HARDER)))
+		free_pages -= z->nr_reserved_highatomic;
+	else
 		min -= min / 4;
 
 #ifdef CONFIG_CMA
@@ -2397,6 +2492,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 		if (page) {
 			if (prep_new_page(page, order, gfp_mask, alloc_flags))
 				goto try_this_zone;
+
+			/*
+			 * If this is a high-order atomic allocation then check
+			 * if the pageblock should be reserved for the future
+			 */
+			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
+				reserve_highatomic_pageblock(page, zone, order);
+
 			return page;
 		}
 	}
@@ -2664,9 +2767,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
-	 * pages are pinned on the per-cpu lists. Drain them and try again
+	 * pages are pinned on the per-cpu lists or in high alloc reserves.
+	 * Shrink them them and try again
 	 */
 	if (!page && !drained) {
+		unreserve_highatomic_pageblock(ac);
 		drain_all_pages(NULL);
 		drained = true;
 		goto retry;
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 49963aa2dff3..3427a155f85e 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
 	"Unmovable",
 	"Reclaimable",
 	"Movable",
+	"HighAtomic",
 #ifdef CONFIG_CMA
 	"CMA",
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
                   ` (10 preceding siblings ...)
  2015-08-24 12:29 ` [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
@ 2015-08-24 12:30 ` Mel Gorman
  2015-08-26 13:42   ` Vlastimil Babka
                     ` (2 more replies)
  11 siblings, 3 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 12:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

The primary purpose of watermarks is to ensure that reclaim can always
make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
These assume that order-0 allocations are all that is necessary for
forward progress.

High-order watermarks serve a different purpose. Kswapd had no high-order
awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
This was particularly important when there were high-order atomic requests.
The watermarks both gave kswapd awareness and made a reserve for those
atomic requests.

There are two important side-effects of this. The most important is that
a non-atomic high-order request can fail even though free pages are available
and the order-0 watermarks are ok. The second is that high-order watermark
checks are expensive because the free list counts for every order up to the
requested order must be examined.

With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
have high-order watermarks. Kswapd and compaction still need high-order
awareness which is handled by checking that at least one suitable high-order
page is free.

With the patch applied, there was little difference in the allocation
failure rates as the atomic reserves are small relative to the number of
allocation attempts. The expected impact is that there will never be an
allocation failure report that shows suitable pages on the free lists.

The one potential side-effect of this is that, in a vanilla kernel, the
watermark checks may have kept a free page around for an atomic allocation.
Now we rely entirely on the HighAtomic reserves and on an early allocation
having created them. If the first high-order atomic allocation arrives after
the system is already heavily fragmented then it will fail.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 mm/page_alloc.c | 38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2415f882b89c..35dc578730d1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2280,8 +2280,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
 #endif /* CONFIG_FAIL_PAGE_ALLOC */
 
 /*
- * Return true if free pages are above 'mark'. This takes into account the order
- * of the allocation.
+ * Return true if free base pages are above 'mark'. For high-order checks it
+ * will return true if the order-0 watermark is reached and there is at least
+ * one free page of a suitable size. Checking now avoids taking the zone lock
+ * to check in the allocation paths if no pages are free.
  */
 static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			unsigned long mark, int classzone_idx, int alloc_flags,
@@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 {
 	long min = mark;
 	int o;
-	long free_cma = 0;
+	const bool atomic = (alloc_flags & ALLOC_HARDER);
 
 	/* free_pages may go negative - that's OK */
 	free_pages -= (1 << order) - 1;
@@ -2301,7 +2303,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 	 * If the caller is not atomic then discount the reserves. This will
 	 * over-estimate the size of the atomic reserve but it avoids a search
 	 */
-	if (likely(!(alloc_flags & ALLOC_HARDER)))
+	if (likely(!atomic))
 		free_pages -= z->nr_reserved_highatomic;
 	else
 		min -= min / 4;
@@ -2309,22 +2311,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 #ifdef CONFIG_CMA
 	/* If allocation can't use CMA areas don't use free CMA pages */
 	if (!(alloc_flags & ALLOC_CMA))
-		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
+		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
 #endif
 
-	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
+	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
 		return false;
-	for (o = 0; o < order; o++) {
-		/* At the next order, this order's pages become unavailable */
-		free_pages -= z->free_area[o].nr_free << o;
 
-		/* Require fewer higher order pages to be free */
-		min >>= 1;
+	/* order-0 watermarks are ok */
+	if (!order)
+		return true;
+
+	/* Check at least one high-order page is free */
+	for (o = order; o < MAX_ORDER; o++) {
+		struct free_area *area = &z->free_area[o];
+		int mt;
+
+		if (atomic && area->nr_free)
+			return true;
 
-		if (free_pages <= min)
-			return false;
+		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
+			if (!list_empty(&area->free_list[mt]))
+				return true;
+		}
 	}
-	return true;
+	return false;
 }
 
 bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-24 12:09 ` [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled Mel Gorman
@ 2015-08-24 12:37   ` Vlastimil Babka
  2015-08-24 13:16     ` Mel Gorman
  2015-08-26 10:46   ` Michal Hocko
  1 sibling, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-24 12:37 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:09 PM, Mel Gorman wrote:
> David Rientjes correctly pointed out that the "root cpuset may not exclude
> mems on the system so, even if mounted, there's no need to check or be
> worried about concurrent change when there is only one cpuset".
>
> The three checks for cpusets_enabled() care whether a cpuset exists that
> can limit memory, not that cpuset is enabled as such. This patch replaces
> cpusets_enabled() with cpusets_mems_enabled() which checks if at least one
> cpuset exists that can limit memory and updates the appropriate call sites.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>   include/linux/cpuset.h | 16 +++++++++-------
>   mm/page_alloc.c        |  2 +-
>   2 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 6eb27cb480b7..1e823870987e 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -17,10 +17,6 @@
>   #ifdef CONFIG_CPUSETS
>
>   extern struct static_key cpusets_enabled_key;
> -static inline bool cpusets_enabled(void)
> -{
> -	return static_key_false(&cpusets_enabled_key);
> -}
>
>   static inline int nr_cpusets(void)
>   {
> @@ -28,6 +24,12 @@ static inline int nr_cpusets(void)
>   	return static_key_count(&cpusets_enabled_key) + 1;
>   }
>
> +/* Returns true if a cpuset exists that can set cpuset.mems */
> +static inline bool cpusets_mems_enabled(void)
> +{
> +	return nr_cpusets() > 1;
> +}
> +

Hm, but this loses the benefits of static key branches?
How about something like:

   if (static_key_false(&cpusets_enabled_key))
	return nr_cpusets() > 1;
   else
	return false;



>   static inline void cpuset_inc(void)
>   {
>   	static_key_slow_inc(&cpusets_enabled_key);
> @@ -104,7 +106,7 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>    */
>   static inline unsigned int read_mems_allowed_begin(void)
>   {
> -	if (!cpusets_enabled())
> +	if (!cpusets_mems_enabled())
>   		return 0;
>
>   	return read_seqcount_begin(&current->mems_allowed_seq);
> @@ -118,7 +120,7 @@ static inline unsigned int read_mems_allowed_begin(void)
>    */
>   static inline bool read_mems_allowed_retry(unsigned int seq)
>   {
> -	if (!cpusets_enabled())
> +	if (!cpusets_mems_enabled())
>   		return false;

Actually I doubt it's much of a benefit for these usages, even if the 
static key benefits are restored. If there's a single root cpuset, we 
would check the seqlock prior to this patch, now we'll check static key 
value (which should have the same cost?). With >1 cpusets, we would 
check seqlock prior to this patch, now we'll check static key value 
*and* the seqlock...
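
For reference, the pattern these helpers guard is roughly the following 
(paraphrasing __alloc_pages_nodemask from memory, not the exact code):

retry_cpuset:
	cpuset_mems_cookie = read_mems_allowed_begin();

	/* ... pick the zonelist/preferred zone, try get_page_from_freelist() ... */

	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
		goto retry_cpuset;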

>
>   	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> @@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>
>   #else /* !CONFIG_CPUSETS */
>
> -static inline bool cpusets_enabled(void) { return false; }
> +static inline bool cpusets_mems_enabled(void) { return false; }
>
>   static inline int cpuset_init(void) { return 0; }
>   static inline void cpuset_init_smp(void) {}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 62ae28d8ae8d..2c1c3bf54d15 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>   		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
>   			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>   				continue;
> -		if (cpusets_enabled() &&
> +		if (cpusets_mems_enabled() &&
>   			(alloc_flags & ALLOC_CPUSET) &&
>   			!cpuset_zone_allowed(zone, gfp_mask))
>   				continue;

Here the benefits are less clear. I guess cpuset_zone_allowed() is 
potentially costly...

Heck, shouldn't we just start the static key on -1 (if possible), so 
that it's enabled only when there's 2+ cpusets?

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-24 12:37   ` Vlastimil Babka
@ 2015-08-24 13:16     ` Mel Gorman
  2015-08-24 20:53       ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 13:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Aug 24, 2015 at 02:37:41PM +0200, Vlastimil Babka wrote:
> >
> >+/* Returns true if a cpuset exists that can set cpuset.mems */
> >+static inline bool cpusets_mems_enabled(void)
> >+{
> >+	return nr_cpusets() > 1;
> >+}
> >+
> 
> Hm, but this loses the benefits of static key branches?
> How about something like:
> 
>   if (static_key_false(&cpusets_enabled_key))
> 	return nr_cpusets() > 1;
>   else
> 	return false;
> 

Will do.
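
i.e. something along these lines (untested sketch):

	/* Returns true if a cpuset exists that can set cpuset.mems */
	static inline bool cpusets_mems_enabled(void)
	{
		if (static_key_false(&cpusets_enabled_key))
			return nr_cpusets() > 1;
		return false;
	}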

> 
> 
> >  static inline void cpuset_inc(void)
> >  {
> >  	static_key_slow_inc(&cpusets_enabled_key);
> >@@ -104,7 +106,7 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
> >   */
> >  static inline unsigned int read_mems_allowed_begin(void)
> >  {
> >-	if (!cpusets_enabled())
> >+	if (!cpusets_mems_enabled())
> >  		return 0;
> >
> >  	return read_seqcount_begin(&current->mems_allowed_seq);
> >@@ -118,7 +120,7 @@ static inline unsigned int read_mems_allowed_begin(void)
> >   */
> >  static inline bool read_mems_allowed_retry(unsigned int seq)
> >  {
> >-	if (!cpusets_enabled())
> >+	if (!cpusets_mems_enabled())
> >  		return false;
> 
> Actually I doubt it's much of a benefit for these usages, even if the static
> key benefits are restored. If there's a single root cpuset, we would check
> the seqlock prior to this patch, now we'll check static key value (which
> should have the same cost?). With >1 cpusets, we would check seqlock prior
> to this patch, now we'll check static key value *and* the seqlock...
> 

If a cpuset is enabled between the checks, it should still retry.
Anyway, special-casing this is overkill. It's a small
micro-optimisation.

> >
> >  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> >@@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >
> >  #else /* !CONFIG_CPUSETS */
> >
> >-static inline bool cpusets_enabled(void) { return false; }
> >+static inline bool cpusets_mems_enabled(void) { return false; }
> >
> >  static inline int cpuset_init(void) { return 0; }
> >  static inline void cpuset_init_smp(void) {}
> >diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >index 62ae28d8ae8d..2c1c3bf54d15 100644
> >--- a/mm/page_alloc.c
> >+++ b/mm/page_alloc.c
> >@@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
> >  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> >  				continue;
> >-		if (cpusets_enabled() &&
> >+		if (cpusets_mems_enabled() &&
> >  			(alloc_flags & ALLOC_CPUSET) &&
> >  			!cpuset_zone_allowed(zone, gfp_mask))
> >  				continue;
> 
> Here the benefits are less clear. I guess cpuset_zone_allowed() is
> potentially costly...
> 
> Heck, shouldn't we just start the static key on -1 (if possible), so that
> it's enabled only when there's 2+ cpusets?

It's overkill for the amount of benefit.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
@ 2015-08-24 18:29   ` Mel Gorman
  2015-08-25 15:37   ` Vlastimil Babka
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-24 18:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, Vlastimil Babka, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Aug 24, 2015 at 01:09:46PM +0100, Mel Gorman wrote:
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..c3775ee46cd6 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>  	 * preloading in the interrupt anyway as all the allocations have to
>  	 * be atomic. So just do normal allocation when in interrupt.
>  	 */
> -	if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> +	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
>  		struct radix_tree_preload *rtp;
>  
>  		/*
> @@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  static int __radix_tree_preload(gfp_t gfp_mask)
>  {
> @@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  int radix_tree_preload(gfp_t gfp_mask)
>  {
>  	/* Warn on non-sensical use... */
> -	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> +	WARN_ON_ONCE(gfpflags_allow_blocking(gfp_mask));
>  	return __radix_tree_preload(gfp_mask);
>  }
>  EXPORT_SYMBOL(radix_tree_preload);

This was a last minute conversion related to fixing up direct usages of
__GFP_DIRECT_RECLAIM that is obviously wrong. It needs a

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index c3775ee46cd6..fcf5d98574ce 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -291,7 +291,7 @@ static int __radix_tree_preload(gfp_t gfp_mask)
 int radix_tree_preload(gfp_t gfp_mask)
 {
 	/* Warn on non-sensical use... */
-	WARN_ON_ONCE(gfpflags_allow_blocking(gfp_mask));
+	WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
 	return __radix_tree_preload(gfp_mask);
 }
 EXPORT_SYMBOL(radix_tree_preload);

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-24 13:16     ` Mel Gorman
@ 2015-08-24 20:53       ` Vlastimil Babka
  2015-08-25 10:33         ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-24 20:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On 24.8.2015 15:16, Mel Gorman wrote:
>>>
>>>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
>>> @@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>>>
>>>  #else /* !CONFIG_CPUSETS */
>>>
>>> -static inline bool cpusets_enabled(void) { return false; }
>>> +static inline bool cpusets_mems_enabled(void) { return false; }
>>>
>>>  static inline int cpuset_init(void) { return 0; }
>>>  static inline void cpuset_init_smp(void) {}
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index 62ae28d8ae8d..2c1c3bf54d15 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>>  		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
>>>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>>>  				continue;
>>> -		if (cpusets_enabled() &&
>>> +		if (cpusets_mems_enabled() &&
>>>  			(alloc_flags & ALLOC_CPUSET) &&
>>>  			!cpuset_zone_allowed(zone, gfp_mask))
>>>  				continue;
>>
>> Here the benefits are less clear. I guess cpuset_zone_allowed() is
>> potentially costly...
>>
>> Heck, shouldn't we just start the static key on -1 (if possible), so that
>> it's enabled only when there's 2+ cpusets?

Hm wait a minute, that's what already happens:

static inline int nr_cpusets(void)
{
        /* jump label reference count + the top-level cpuset */
        return static_key_count(&cpusets_enabled_key) + 1;
}

I.e. if there's only the root cpuset, static key is disabled, so I think this
patch is moot after all?

> It's overkill for the amount of benefit.
> 


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-24 20:53       ` Vlastimil Babka
@ 2015-08-25 10:33         ` Mel Gorman
  2015-08-25 11:09           ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-25 10:33 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Mon, Aug 24, 2015 at 10:53:37PM +0200, Vlastimil Babka wrote:
> On 24.8.2015 15:16, Mel Gorman wrote:
> >>>
> >>>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> >>> @@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >>>
> >>>  #else /* !CONFIG_CPUSETS */
> >>>
> >>> -static inline bool cpusets_enabled(void) { return false; }
> >>> +static inline bool cpusets_mems_enabled(void) { return false; }
> >>>
> >>>  static inline int cpuset_init(void) { return 0; }
> >>>  static inline void cpuset_init_smp(void) {}
> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>> index 62ae28d8ae8d..2c1c3bf54d15 100644
> >>> --- a/mm/page_alloc.c
> >>> +++ b/mm/page_alloc.c
> >>> @@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >>>  		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
> >>>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> >>>  				continue;
> >>> -		if (cpusets_enabled() &&
> >>> +		if (cpusets_mems_enabled() &&
> >>>  			(alloc_flags & ALLOC_CPUSET) &&
> >>>  			!cpuset_zone_allowed(zone, gfp_mask))
> >>>  				continue;
> >>
> >> Here the benefits are less clear. I guess cpuset_zone_allowed() is
> >> potentially costly...
> >>
> >> Heck, shouldn't we just start the static key on -1 (if possible), so that
> >> it's enabled only when there's 2+ cpusets?
> 
> Hm wait a minute, that's what already happens:
> 
> static inline int nr_cpusets(void)
> {
>         /* jump label reference count + the top-level cpuset */
>         return static_key_count(&cpusets_enabled_key) + 1;
> }
> 
> I.e. if there's only the root cpuset, static key is disabled, so I think this
> patch is moot after all?
> 

static_key_count is an atomic read of a field in struct static_key whereas
static_key_false is an arch_static_branch which can be eliminated. The
patch eliminates an atomic read so I didn't think it was moot.
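
Roughly, from include/linux/jump_label.h (paraphrased, HAVE_JUMP_LABEL case):

	static __always_inline bool static_key_false(struct static_key *key)
	{
		return arch_static_branch(key);	/* patched nop/jmp, no load */
	}

	static inline int static_key_count(struct static_key *key)
	{
		return atomic_read(&key->enabled);	/* always a memory read */
	}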

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-25 10:33         ` Mel Gorman
@ 2015-08-25 11:09           ` Vlastimil Babka
  2015-08-26 13:41             ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-25 11:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On 08/25/2015 12:33 PM, Mel Gorman wrote:
> On Mon, Aug 24, 2015 at 10:53:37PM +0200, Vlastimil Babka wrote:
>> On 24.8.2015 15:16, Mel Gorman wrote:
>>>>>
>>>>>   	return read_seqcount_retry(&current->mems_allowed_seq, seq);
>>>>> @@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>>>>>
>>>>>   #else /* !CONFIG_CPUSETS */
>>>>>
>>>>> -static inline bool cpusets_enabled(void) { return false; }
>>>>> +static inline bool cpusets_mems_enabled(void) { return false; }
>>>>>
>>>>>   static inline int cpuset_init(void) { return 0; }
>>>>>   static inline void cpuset_init_smp(void) {}
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 62ae28d8ae8d..2c1c3bf54d15 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>>>>>   		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
>>>>>   			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>>>>>   				continue;
>>>>> -		if (cpusets_enabled() &&
>>>>> +		if (cpusets_mems_enabled() &&
>>>>>   			(alloc_flags & ALLOC_CPUSET) &&
>>>>>   			!cpuset_zone_allowed(zone, gfp_mask))
>>>>>   				continue;
>>>>
>>>> Here the benefits are less clear. I guess cpuset_zone_allowed() is
>>>> potentially costly...
>>>>
>>>> Heck, shouldn't we just start the static key on -1 (if possible), so that
>>>> it's enabled only when there's 2+ cpusets?
>>
>> Hm wait a minute, that's what already happens:
>>
>> static inline int nr_cpusets(void)
>> {
>>          /* jump label reference count + the top-level cpuset */
>>          return static_key_count(&cpusets_enabled_key) + 1;
>> }
>>
>> I.e. if there's only the root cpuset, static key is disabled, so I think this
>> patch is moot after all?
>>
>
> static_key_count is an atomic read on a field in struct static_key where
> as static_key_false is a arch_static_branch which can be eliminated. The
> patch eliminates an atomic read so I didn't think it was moot.

Sorry I wasn't clear enough. My point is that AFAICS cpusets_enabled() 
will only return true if there are more cpusets than the root 
(top-level) one.
So the current cpusets_enabled() checks should be enough. Checking that 
"nr_cpusets() > 1" only duplicates what is already covered by 
cpusets_enabled() - see the nr_cpusets() listing above. I.e. David's 
premise was wrong.



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 05/12] mm, page_alloc: Remove unecessary recheck of nodemask
  2015-08-24 12:09 ` [PATCH 05/12] mm, page_alloc: Remove unecessary recheck of nodemask Mel Gorman
@ 2015-08-25 14:32   ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-25 14:32 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:09 PM, Mel Gorman wrote:
> An allocation request will use either the given nodemask or the current
> task's cpuset mems_allowed. A cpuset retry will recheck the caller's nodemask
> which, while trivial overhead during an extremely rare operation, is also
> unnecessary. This patch fixes it.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   mm/page_alloc.c | 5 ++---
>   1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2c1c3bf54d15..32d1cec124bc 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3171,7 +3171,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>   	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
>   	struct alloc_context ac = {
>   		.high_zoneidx = gfp_zone(gfp_mask),
> -		.nodemask = nodemask,
> +		.nodemask = nodemask ? : &cpuset_current_mems_allowed,

Hm this is a functional change for atomic allocations with NULL 
nodemask. ac.nodemask is passed down to __alloc_pages_slowpath() which 
might determine that ALLOC_CPUSET is not to be used (because it's 
atomic). Yet it would use the restricted ac.nodemask in 
get_page_from_freelist() and elsewhere.
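
To make that concrete, the zone walk in get_page_from_freelist() is roughly:

	for_each_zone_zonelist_nodemask(zone, z, zonelist,
					ac->high_zoneidx, ac->nodemask) {
		/* ... watermark and cpuset checks ... */
	}

so even when the slowpath decides to drop ALLOC_CPUSET for an atomic
allocation, a NULL nodemask would now be replaced by
cpuset_current_mems_allowed and still restrict which zones are considered.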

>   		.migratetype = gfpflags_to_migratetype(gfp_mask),
>   	};
>
> @@ -3206,8 +3206,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>
>   	/* The preferred zone is used for statistics later */
>   	preferred_zoneref = first_zones_zonelist(ac.zonelist, ac.high_zoneidx,
> -				ac.nodemask ? : &cpuset_current_mems_allowed,
> -				&ac.preferred_zone);
> +				ac.nodemask, &ac.preferred_zone);
>   	if (!ac.preferred_zone)
>   		goto out;
>   	ac.classzone_idx = zonelist_zone_idx(preferred_zoneref);
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 06/12] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types
  2015-08-24 12:09 ` [PATCH 06/12] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
@ 2015-08-25 14:36   ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-25 14:36 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:09 PM, Mel Gorman wrote:
> This patch redefines which GFP bits are used for specifying mobility and
> the order of the migrate types. Once redefined it's possible to convert
> GFP flags to a migrate type with a simple mask and shift. The only downside
> is that readers of OOM kill messages and allocation failures may be
> accustomed to the existing values, but scripts/gfp-translate will help.

Yeah after patches 7 and 8, this is not much of a concern :)

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
  2015-08-24 18:29   ` Mel Gorman
@ 2015-08-25 15:37   ` Vlastimil Babka
  2015-08-26 14:45     ` Mel Gorman
  2015-08-25 15:48   ` Vlastimil Babka
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-25 15:37 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:09 PM, Mel Gorman wrote:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access to one of two watermarks lower than "min" which can be referred
> to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> watermark and can be called the "high priority reserve".
>
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
>
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.
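
(Restating the net effect as I understand it, not quoting the exact hunk:

	#define __GFP_WAIT	(__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
	#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)

so atomic allocations still wake kswapd via __GFP_KSWAPD_RECLAIM but are now
identified by __GFP_ATOMIC rather than by the absence of __GFP_WAIT.)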
>
> This patch then converts a number of sites
>
> o __GFP_ATOMIC is used by callers that are high priority and have memory
>    pools for those requests. GFP_ATOMIC uses this flag.
>
> o Callers that have a limited mempool to guarantee forward progress use
>    __GFP_DIRECT_RECLAIM. bio allocations fall into this category where

      ^ __GFP_KSWAPD_RECLAIM ? (missed it previously)

>    kswapd will still be woken but atomic reserves are not used as there
>    is a one-entry mempool to guarantee progress.
>
> o Callers that are checking if they are non-blocking should use the
>    helper gfpflags_allow_blocking() where possible. This is because
>    checking for __GFP_WAIT as was done historically now can trigger false
>    positives. Some exceptions like dm-crypt.c exist where the code intent
>    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
>    flag manipulations.
>
> o Callers that built their own GFP flags instead of starting with GFP_KERNEL
>    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
>
> The first key hazard to watch out for is callers that removed __GFP_WAIT
> and were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.
>
> The second key hazard is callers that assembled their own combination of
> GFP flags instead of starting with something like GFP_KERNEL. They may
> now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
> if it's missed in most cases as other activity will wake kswapd.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Thanks for the effort!

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Just last few bits:

> @@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>   		return false;
>   	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>   		return false;
> -	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> +	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>   		return false;
>
>   	return should_fail(&fail_page_alloc.attr, 1 << order);

IIUC ignore_gfp_wait tells it to assume that reclaimers will eventually 
succeed (for some reason?), so they shouldn't fail. Probably to focus 
the testing on atomic allocations. But with your change, fault injection 
will no longer fail atomic allocations either, which goes against the knob IMHO?

> @@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>   		if (test_thread_flag(TIF_MEMDIE) ||
>   		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
>   			filter &= ~SHOW_MEM_FILTER_NODES;
> -	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> +	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
>   		filter &= ~SHOW_MEM_FILTER_NODES;
>
>   	if (fmt) {

This caught me previously and I convinced myself that it's OK, but now 
I'm not so sure anymore. IIUC this is to not filter nodes by mems_allowed during 
printing, if the allocation itself wasn't limited? In that case it 
should probably only look at __GFP_ATOMIC after this patch? As that's 
the only thing that determines ALLOC_CPUSET.
I don't know where in_interrupt() comes from, but it was probably 
considered in the past, as can be seen in zlc_setup()?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
  2015-08-24 18:29   ` Mel Gorman
  2015-08-25 15:37   ` Vlastimil Babka
@ 2015-08-25 15:48   ` Vlastimil Babka
  2015-08-26 13:05   ` Michal Hocko
  2015-09-08  6:49   ` Joonsoo Kim
  4 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-25 15:48 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:09 PM, Mel Gorman wrote:
> The first key hazard to watch out for is callers that removed __GFP_WAIT
> and were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.

Hm so I think this hazard should be expanded. If such a caller comes from 
interrupt context and doesn't use __GFP_ATOMIC, won't ALLOC_CPUSET with 
restrictions taken from the interrupted process also apply to it?


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 03/12] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled
  2015-08-24 12:09 ` [PATCH 03/12] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
@ 2015-08-26 10:25   ` Michal Hocko
  0 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2015-08-26 10:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 24-08-15 13:09:42, Mel Gorman wrote:
> There is a seqcounter that protects against spurious allocation failures
> when a task is changing the allowed nodes in a cpuset. There is no need
> to check the seqcounter until a cpuset exists.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Christoph Lameter <cl@linux.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/cpuset.h | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 1b357997cac5..6eb27cb480b7 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -104,6 +104,9 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>   */
>  static inline unsigned int read_mems_allowed_begin(void)
>  {
> +	if (!cpusets_enabled())
> +		return 0;
> +
>  	return read_seqcount_begin(&current->mems_allowed_seq);
>  }
>  
> @@ -115,6 +118,9 @@ static inline unsigned int read_mems_allowed_begin(void)
>   */
>  static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> +	if (!cpusets_enabled())
> +		return false;
> +
>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
>  }
>  
> -- 
> 2.4.6

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-24 12:09 ` [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled Mel Gorman
  2015-08-24 12:37   ` Vlastimil Babka
@ 2015-08-26 10:46   ` Michal Hocko
  1 sibling, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2015-08-26 10:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 24-08-15 13:09:43, Mel Gorman wrote:
> David Rientjes correctly pointed out that the "root cpuset may not exclude
> mems on the system so, even if mounted, there's no need to check or be
> worried about concurrent change when there is only one cpuset".

Hmm, but cpuset_inc() is called only from cpuset_css_online and only
when it is called with non-NULL css->parent AFAICS. This means that the
static key should still be false after the root cpuset is created.
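
i.e. (paraphrasing kernel/cpuset.c from memory):

	static int cpuset_css_online(struct cgroup_subsys_state *css)
	{
		struct cpuset *cs = css_cs(css);
		struct cpuset *parent = parent_cs(cs);

		if (!parent)
			return 0;	/* root cpuset: key never incremented */

		/* ... */
		cpuset_inc();
		/* ... */
	}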

> The three checks for cpusets_enabled() care whether a cpuset exists that
> can limit memory, not that cpuset is enabled as such. This patch replaces
> cpusets_enabled() with cpusets_mems_enabled() which checks if at least one
> cpuset exists that can limit memory and updates the appropriate call sites.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  include/linux/cpuset.h | 16 +++++++++-------
>  mm/page_alloc.c        |  2 +-
>  2 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 6eb27cb480b7..1e823870987e 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -17,10 +17,6 @@
>  #ifdef CONFIG_CPUSETS
>  
>  extern struct static_key cpusets_enabled_key;
> -static inline bool cpusets_enabled(void)
> -{
> -	return static_key_false(&cpusets_enabled_key);
> -}
>  
>  static inline int nr_cpusets(void)
>  {
> @@ -28,6 +24,12 @@ static inline int nr_cpusets(void)
>  	return static_key_count(&cpusets_enabled_key) + 1;
>  }
>  
> +/* Returns true if a cpuset exists that can set cpuset.mems */
> +static inline bool cpusets_mems_enabled(void)
> +{
> +	return nr_cpusets() > 1;
> +}
> +
>  static inline void cpuset_inc(void)
>  {
>  	static_key_slow_inc(&cpusets_enabled_key);
> @@ -104,7 +106,7 @@ extern void cpuset_print_task_mems_allowed(struct task_struct *p);
>   */
>  static inline unsigned int read_mems_allowed_begin(void)
>  {
> -	if (!cpusets_enabled())
> +	if (!cpusets_mems_enabled())
>  		return 0;
>  
>  	return read_seqcount_begin(&current->mems_allowed_seq);
> @@ -118,7 +120,7 @@ static inline unsigned int read_mems_allowed_begin(void)
>   */
>  static inline bool read_mems_allowed_retry(unsigned int seq)
>  {
> -	if (!cpusets_enabled())
> +	if (!cpusets_mems_enabled())
>  		return false;
>  
>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> @@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
>  
>  #else /* !CONFIG_CPUSETS */
>  
> -static inline bool cpusets_enabled(void) { return false; }
> +static inline bool cpusets_mems_enabled(void) { return false; }
>  
>  static inline int cpuset_init(void) { return 0; }
>  static inline void cpuset_init_smp(void) {}
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 62ae28d8ae8d..2c1c3bf54d15 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>  		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
>  				continue;
> -		if (cpusets_enabled() &&
> +		if (cpusets_mems_enabled() &&
>  			(alloc_flags & ALLOC_CPUSET) &&
>  			!cpuset_zone_allowed(zone, gfp_mask))
>  				continue;
> -- 
> 2.4.6

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 08/12] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM
  2015-08-24 12:09 ` [PATCH 08/12] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
@ 2015-08-26 12:19   ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-26 12:19 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:09 PM, Mel Gorman wrote:
> __GFP_WAIT was used to signal that the caller was in atomic context and
> could not sleep.  Now it is possible to distinguish between true atomic
> context and callers that are not willing to sleep. The latter should clear
> __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing __GFP_WAIT
> behaves differently, there is a risk that people will clear the wrong
> flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly indicate
> what it does -- setting it allows all reclaim activity, clearing it
> prevents it.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-08-24 12:29 ` [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
@ 2015-08-26 12:44   ` Vlastimil Babka
  2015-08-26 14:53   ` Michal Hocko
  2015-09-08  8:01   ` Joonsoo Kim
  2 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-26 12:44 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:29 PM, Mel Gorman wrote:
> High-order watermark checking exists for two reasons --  kswapd high-order
> awareness and protection for high-order atomic requests. Historically the
> kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
> free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations on demand and
> avoids using those blocks for order-0 allocations. This is more flexible
> and reliable than MIGRATE_RESERVE was.
>
> A MIGRATE_HIGHORDER pageblock is created when a high-order allocation

                                                  ^ atomic ...

> request steals a pageblock but limits the total number to 1% of the zone.
> Callers that speculatively abuse atomic allocations for long-lived
> high-order allocations to access the reserve will quickly fail. Note that
> SLUB is currently not such an abuser as it reclaims at least once.  It is
> possible that the pageblock stolen has few suitable high-order pages and
> will need to steal again in the near future but there would need to be
> strong justification to search all pageblocks for an ideal candidate.
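
(FWIW, if I read the patch right the 1% cap is implemented roughly as

	/* at most ~1% of the zone, with a floor of one pageblock */
	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
	if (zone->nr_reserved_highatomic >= max_managed)
		return;

in reserve_highatomic_pageblock(), so repeated atomic bursts cannot grow the
reserve without bound.)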
>
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
>
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
>
> The reserved pageblocks can not be used for order-0 allocations. This may
> allow temporary wastage until a failed reclaim reassigns the pageblock. This
> is deliberate as the intent of the reservation is to satisfy a limited
> number of atomic high-order short-lived requests if the system requires them.
>
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 high-order
> page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
> is much larger than the potential reserve and it does not attempt to be
> realistic.  It is intended to stress random high-order allocations from
> an unknown source, and to show that there is a reduction in failures without
> introducing an anomaly where atomic allocations are more reliable than
> regular allocations.  The amount of memory reserved varied throughout the
> workload as reserves were created and reclaimed under memory pressure. The
> allocation failures once the workload warmed up were as follows;
>
> 4.2-rc5-vanilla		70%
> 4.2-rc5-atomic-reserve	56%
>
> The failure rate was also measured while building multiple kernels. The
> failure rate was 14% but is 6% with this patch applied.
>
> Overall, this is a small reduction but the reserves are small relative to the
> number of allocation requests. In early versions of the patch, the failure
> rate reduced by a much larger amount but that required much larger reserves
> and perversely made atomic allocations seem more reliable than regular allocations.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Vlastimil Babka <vbabka@suse.cz>



^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
                     ` (2 preceding siblings ...)
  2015-08-25 15:48   ` Vlastimil Babka
@ 2015-08-26 13:05   ` Michal Hocko
  2015-09-08  6:49   ` Joonsoo Kim
  4 siblings, 0 replies; 55+ messages in thread
From: Michal Hocko @ 2015-08-26 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 24-08-15 13:09:46, Mel Gorman wrote:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access to one of two watermarks lower than "min" which can be referred
> to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> watermark and can be called the "high priority reserve".
> 
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
> 
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.
> 
> This patch then converts a number of sites
> 
> o __GFP_ATOMIC is used by callers that are high priority and have memory
>   pools for those requests. GFP_ATOMIC uses this flag.
> 
> o Callers that have a limited mempool to guarantee forward progress use
>   __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
>   kswapd will still be woken but atomic reserves are not used as there
>   is a one-entry mempool to guarantee progress.
> 
> o Callers that are checking if they are non-blocking should use the
>   helper gfpflags_allow_blocking() where possible. This is because
>   checking for __GFP_WAIT as was done historically now can trigger false
>   positives. Some exceptions like dm-crypt.c exist where the code intent
>   is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
>   flag manipulations.
> 
> o Callers that built their own GFP flags instead of starting with GFP_KERNEL
>   and friends now also need to specify __GFP_KSWAPD_RECLAIM.
> 
> The first key hazard to watch out for is callers that removed __GFP_WAIT
> and were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.
> 
> The second key hazard is callers that assembled their own combination of
> GFP flags instead of starting with something like GFP_KERNEL. They may
> now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
> if it's missed in most cases as other activity will wake kswapd.

JFYI mmotm tree has
https://git.kernel.org/cgit/linux/kernel/git/mhocko/mm.git/commit/?h=since-4.1&id=5f467d4fb5fc32ecf209c10f93aae81cebfe69c3
which falls into the ~__GFP_DIRECT_RECLAIM category.

I have tried to look at all the changed places as well and haven't
spotted anything problematic.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  Documentation/vm/balance                           | 14 ++++---
>  arch/arm/mm/dma-mapping.c                          |  4 +-
>  arch/arm/xen/mm.c                                  |  2 +-
>  arch/arm64/mm/dma-mapping.c                        |  4 +-
>  arch/x86/kernel/pci-dma.c                          |  2 +-
>  block/bio.c                                        | 26 ++++++------
>  block/blk-core.c                                   | 16 ++++----
>  block/blk-ioc.c                                    |  2 +-
>  block/blk-mq-tag.c                                 |  2 +-
>  block/blk-mq.c                                     |  8 ++--
>  block/cfq-iosched.c                                |  4 +-
>  drivers/block/drbd/drbd_receiver.c                 |  3 +-
>  drivers/block/osdblk.c                             |  2 +-
>  drivers/connector/connector.c                      |  3 +-
>  drivers/firewire/core-cdev.c                       |  2 +-
>  drivers/gpu/drm/i915/i915_gem.c                    |  2 +-
>  drivers/infiniband/core/sa_query.c                 |  2 +-
>  drivers/iommu/amd_iommu.c                          |  2 +-
>  drivers/iommu/intel-iommu.c                        |  2 +-
>  drivers/md/dm-crypt.c                              |  6 +--
>  drivers/md/dm-kcopyd.c                             |  2 +-
>  drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |  2 +-
>  drivers/media/pci/solo6x10/solo6x10-v4l2.c         |  2 +-
>  drivers/media/pci/tw68/tw68-video.c                |  2 +-
>  drivers/mtd/mtdcore.c                              |  3 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |  2 +-
>  drivers/staging/android/ion/ion_system_heap.c      |  2 +-
>  .../lustre/include/linux/libcfs/libcfs_private.h   |  2 +-
>  drivers/usb/host/u132-hcd.c                        |  2 +-
>  drivers/video/fbdev/vermilion/vermilion.c          |  2 +-
>  fs/btrfs/disk-io.c                                 |  2 +-
>  fs/btrfs/extent_io.c                               | 14 +++----
>  fs/btrfs/volumes.c                                 |  4 +-
>  fs/ext3/super.c                                    |  2 +-
>  fs/ext4/super.c                                    |  2 +-
>  fs/fscache/cookie.c                                |  2 +-
>  fs/fscache/page.c                                  |  6 +--
>  fs/jbd/transaction.c                               |  4 +-
>  fs/jbd2/transaction.c                              |  4 +-
>  fs/nfs/file.c                                      |  6 +--
>  fs/xfs/xfs_qm.c                                    |  2 +-
>  include/linux/gfp.h                                | 46 ++++++++++++++++------
>  include/linux/skbuff.h                             |  6 +--
>  include/net/sock.h                                 |  2 +-
>  include/trace/events/gfpflags.h                    |  5 ++-
>  kernel/audit.c                                     |  6 +--
>  kernel/locking/lockdep.c                           |  2 +-
>  kernel/power/snapshot.c                            |  2 +-
>  kernel/smp.c                                       |  2 +-
>  lib/idr.c                                          |  4 +-
>  lib/radix-tree.c                                   | 10 ++---
>  mm/backing-dev.c                                   |  2 +-
>  mm/dmapool.c                                       |  2 +-
>  mm/memcontrol.c                                    |  8 ++--
>  mm/mempool.c                                       | 10 ++---
>  mm/migrate.c                                       |  2 +-
>  mm/page_alloc.c                                    | 43 ++++++++++++--------
>  mm/slab.c                                          | 18 ++++-----
>  mm/slub.c                                          |  6 +--
>  mm/vmalloc.c                                       |  2 +-
>  mm/vmscan.c                                        |  4 +-
>  mm/zswap.c                                         |  5 ++-
>  net/core/skbuff.c                                  |  8 ++--
>  net/core/sock.c                                    |  6 ++-
>  net/netlink/af_netlink.c                           |  2 +-
>  net/rxrpc/ar-connection.c                          |  2 +-
>  net/sctp/associola.c                               |  2 +-
>  67 files changed, 211 insertions(+), 173 deletions(-)
> 
> diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> index c46e68cf9344..964595481af6 100644
> --- a/Documentation/vm/balance
> +++ b/Documentation/vm/balance
> @@ -1,12 +1,14 @@
>  Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
>  
> -Memory balancing is needed for non __GFP_WAIT as well as for non
> -__GFP_IO allocations.
> +Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +well as for non __GFP_IO allocations.
>  
> -There are two reasons to be requesting non __GFP_WAIT allocations:
> -the caller can not sleep (typically intr context), or does not want
> -to incur cost overheads of page stealing and possible swap io for
> -whatever reasons.
> +The first reason why a caller may avoid reclaim is that the caller can not
> +sleep due to holding a spinlock or is in interrupt context. The second may
> +be that the caller is willing to fail the allocation without incurring the
> +overhead of page reclaim. This may happen for opportunistic high-order
> +allocation requests that have order-0 fallback options. In such cases,
> +the caller may also wish to avoid waking kswapd.
>  
>  __GFP_IO allocation requests are made to prevent file system deadlocks.
>  
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index cba12f34ff77..f999f0987a3e 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>  
>  	if (is_coherent || nommu())
>  		addr = __alloc_simple_buffer(dev, size, gfp, &page);
> -	else if (!(gfp & __GFP_WAIT))
> +	else if (!gfpflags_allow_blocking(gfp))
>  		addr = __alloc_from_pool(size, &page);
>  	else if (!dev_get_cma_area(dev))
>  		addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> @@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
>  	*handle = DMA_ERROR_CODE;
>  	size = PAGE_ALIGN(size);
>  
> -	if (!(gfp & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(gfp))
>  		return __iommu_alloc_atomic(dev, size, handle);
>  
>  	/*
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 03e75fef15b8..86809bd2026d 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -25,7 +25,7 @@
>  unsigned long xen_get_swiotlb_free_pages(unsigned int order)
>  {
>  	struct memblock_region *reg;
> -	gfp_t flags = __GFP_NOWARN;
> +	gfp_t flags = __GFP_NOWARN|___GFP_KSWAPD_RECLAIM;
>  
>  	for_each_memblock(memory, reg) {
>  		if (reg->base < (phys_addr_t)0xffffffff) {
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index d16a1cead23f..1f10b2503af8 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
>  	if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>  	    dev->coherent_dma_mask <= DMA_BIT_MASK(32))
>  		flags |= GFP_DMA;
> -	if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> +	if (IS_ENABLED(CONFIG_DMA_CMA) && gfpflags_allow_blocking(flags)) {
>  		struct page *page;
>  		void *addr;
>  
> @@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
>  
>  	size = PAGE_ALIGN(size);
>  
> -	if (!coherent && !(flags & __GFP_WAIT)) {
> +	if (!coherent && !gfpflags_allow_blocking(flags)) {
>  		struct page *page = NULL;
>  		void *addr = __alloc_from_pool(size, &page, flags);
>  
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 353972c1946c..0310e73e6b57 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -101,7 +101,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>  again:
>  	page = NULL;
>  	/* CMA can be used only in the context which permits sleeping */
> -	if (flag & __GFP_WAIT) {
> +	if (gfpflags_allow_blocking(flag)) {
>  		page = dma_alloc_from_contiguous(dev, count, get_order(size));
>  		if (page && page_to_phys(page) + size > dma_mask) {
>  			dma_release_from_contiguous(dev, page, count);
> diff --git a/block/bio.c b/block/bio.c
> index d6e5ba3399f0..fbc558b50e67 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
>  		bvl = mempool_alloc(pool, gfp_mask);
>  	} else {
>  		struct biovec_slab *bvs = bvec_slabs + *idx;
> -		gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
> +		gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
>  
>  		/*
>  		 * Make this allocation restricted and don't dump info on
> @@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
>  		__gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
>  
>  		/*
> -		 * Try a slab allocation. If this fails and __GFP_WAIT
> +		 * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
>  		 * is set, retry with the 1-entry mempool
>  		 */
>  		bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
> -		if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
> +		if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
>  			*idx = BIOVEC_MAX_IDX;
>  			goto fallback;
>  		}
> @@ -393,12 +393,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   *   If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
>   *   backed by the @bs's mempool.
>   *
> - *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
> - *   able to allocate a bio. This is due to the mempool guarantees. To make this
> - *   work, callers must never allocate more than 1 bio at a time from this pool.
> - *   Callers that need to allocate more than 1 bio must always submit the
> - *   previously allocated bio for IO before attempting to allocate a new one.
> - *   Failure to do so can cause deadlocks under memory pressure.
> + *   When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
> + *   always be able to allocate a bio. This is due to the mempool guarantees.
> + *   To make this work, callers must never allocate more than 1 bio at a time
> + *   from this pool. Callers that need to allocate more than 1 bio must always
> + *   submit the previously allocated bio for IO before attempting to allocate
> + *   a new one. Failure to do so can cause deadlocks under memory pressure.
>   *
>   *   Note that when running under generic_make_request() (i.e. any block
>   *   driver), bios are not submitted until after you return - see the code in
> @@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>  		 * We solve this, and guarantee forward progress, with a rescuer
>  		 * workqueue per bio_set. If we go to allocate and there are
>  		 * bios on current->bio_list, we first try the allocation
> -		 * without __GFP_WAIT; if that fails, we punt those bios we
> -		 * would be blocking to the rescuer workqueue before we retry
> -		 * with the original gfp_flags.
> +		 * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> +		 * bios we would be blocking to the rescuer workqueue before
> +		 * we retry with the original gfp_flags.
>  		 */
>  
>  		if (current->bio_list && !bio_list_empty(current->bio_list))
> -			gfp_mask &= ~__GFP_WAIT;
> +			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  
>  		p = mempool_alloc(bs->bio_pool, gfp_mask);
>  		if (!p && gfp_mask != saved_gfp) {
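
(Illustrative aside, not part of the quoted patch: the comment above is why
callers relying on the bio_set mempool guarantee must submit each bio before
allocating the next. A hedged sketch of that one-at-a-time pattern against a
4.2-era block API; the names and the missing error handling are for
illustration only.)

  #include <linux/bio.h>
  #include <linux/fs.h>

  /* Illustrative only: keep at most one bio from @bs in flight so the
   * mempool guarantee holds even under memory pressure. */
  static void write_pages_one_by_one(struct bio_set *bs,
  				   struct block_device *bdev,
  				   struct page **pages, int nr,
  				   sector_t sector)
  {
  	int i;

  	for (i = 0; i < nr; i++) {
  		/* blocking mask: the mempool guarantees this succeeds */
  		struct bio *bio = bio_alloc_bioset(GFP_NOIO, 1, bs);

  		bio->bi_bdev = bdev;
  		bio->bi_iter.bi_sector = sector + (i << (PAGE_SHIFT - 9));
  		bio_add_page(bio, pages[i], PAGE_SIZE, 0);

  		/* submit before the next allocation from the same bio_set */
  		submit_bio(WRITE, bio);
  	}
  }
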
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 627ed0c593fb..e3605acaaffc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
>   * @bio: bio to allocate request for (can be %NULL)
>   * @gfp_mask: allocation mask
>   *
> - * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
> - * function keeps retrying under memory pressure and fails iff @q is dead.
> + * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> + * this function keeps retrying under memory pressure and fails iff @q is dead.
>   *
>   * Must be called with @q->queue_lock held and,
>   * Returns ERR_PTR on failure, with @q->queue_lock held.
> @@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>  	if (!IS_ERR(rq))
>  		return rq;
>  
> -	if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> +	if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
>  		blk_put_rl(rl);
>  		return rq;
>  	}
> @@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
>   * BUG.
>   *
>   * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> - * completion of a bio that hasn't been submitted yet, thus resulting in a
> - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> - * of bio_alloc(), as that avoids the mempool deadlock.
> + * given to how you allocate bios. In particular, you cannot use
> + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> + * thus resulting in a deadlock. Alternatively bios should be allocated using
> + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
>   * If possible a big IO should be split into smaller parts when allocation
>   * fails. Partial allocation should not be an error, or you risk a live-lock.
>   */
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 1a27f45ec776..381cb50a673c 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
>  {
>  	struct io_context *ioc;
>  
> -	might_sleep_if(gfp_flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
>  
>  	do {
>  		task_lock(task);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9b6e28830b82..a8b46659ce4e 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
>  	if (tag != -1)
>  		return tag;
>  
> -	if (!(data->gfp & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(data->gfp))
>  		return -1;
>  
>  	bs = bt_wait_ptr(bt, hctx);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db59699..7d80379d7a38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
>  		if (percpu_ref_tryget_live(&q->mq_usage_counter))
>  			return 0;
>  
> -		if (!(gfp & __GFP_WAIT))
> +		if (!gfpflags_allow_blocking(gfp))
>  			return -EBUSY;
>  
>  		ret = wait_event_interruptible(q->mq_freeze_wq,
> @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
>  
>  	ctx = blk_mq_get_ctx(q);
>  	hctx = q->mq_ops->map_queue(q, ctx->cpu);
> -	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> +	blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
>  			reserved, ctx, hctx);
>  
>  	rq = __blk_mq_alloc_request(&alloc_data, rw);
> -	if (!rq && (gfp & __GFP_WAIT)) {
> +	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
>  		__blk_mq_run_hw_queue(hctx);
>  		blk_mq_put_ctx(ctx);
>  
> @@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>  		ctx = blk_mq_get_ctx(q);
>  		hctx = q->mq_ops->map_queue(q, ctx->cpu);
>  		blk_mq_set_alloc_data(&alloc_data, q,
> -				__GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
> +				__GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
>  		rq = __blk_mq_alloc_request(&alloc_data, rw);
>  		ctx = alloc_data.ctx;
>  		hctx = alloc_data.hctx;
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c62bb2e650b8..ecd1d1b61382 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3674,7 +3674,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
>  		if (new_cfqq) {
>  			cfqq = new_cfqq;
>  			new_cfqq = NULL;
> -		} else if (gfp_mask & __GFP_WAIT) {
> +		} else if (gfpflags_allow_blocking(gfp_mask)) {
>  			rcu_read_unlock();
>  			spin_unlock_irq(cfqd->queue->queue_lock);
>  			new_cfqq = kmem_cache_alloc_node(cfq_pool,
> @@ -4289,7 +4289,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>  	const bool is_sync = rq_is_sync(rq);
>  	struct cfq_queue *cfqq;
>  
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>  
>  	spin_lock_irq(q->queue_lock);
>  
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index c097909c589c..b4b5680ac6ad 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -357,7 +357,8 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
>  	}
>  
>  	if (has_payload && data_size) {
> -		page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
> +		page = drbd_alloc_pages(peer_device, nr_pages,
> +					gfpflags_allow_blocking(gfp_mask));
>  		if (!page)
>  			goto fail;
>  	}
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index e22942596207..1b709a4e3b5e 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
>  			goto err_out;
>  
>  		tmp->bi_bdev = NULL;
> -		gfpmask &= ~__GFP_WAIT;
> +		gfpmask &= ~__GFP_DIRECT_RECLAIM;
>  		tmp->bi_next = NULL;
>  
>  		if (!new_chain)
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 30f522848c73..d7373ca69c99 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
>  	if (group)
>  		return netlink_broadcast(dev->nls, skb, portid, group,
>  					 gfp_mask);
> -	return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
> +	return netlink_unicast(dev->nls, skb, portid,
> +			!gfpflags_allow_blocking(gfp_mask));
>  }
>  EXPORT_SYMBOL_GPL(cn_netlink_send_mult);
>  
> diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> index 2a3973a7c441..36a7c2d89a01 100644
> --- a/drivers/firewire/core-cdev.c
> +++ b/drivers/firewire/core-cdev.c
> @@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
>  static int add_client_resource(struct client *client,
>  			       struct client_resource *resource, gfp_t gfp_mask)
>  {
> -	bool preload = !!(gfp_mask & __GFP_WAIT);
> +	bool preload = gfpflags_allow_blocking(gfp_mask);
>  	unsigned long flags;
>  	int ret;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 52b446b27b4d..c2b45081c5ab 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2225,7 +2225,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
>  	 */
>  	mapping = file_inode(obj->base.filp)->i_mapping;
>  	gfp = mapping_gfp_mask(mapping);
> -	gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
> +	gfp |= __GFP_NORETRY | __GFP_NOWARN;
>  	gfp &= ~(__GFP_IO | __GFP_WAIT);
>  	sg = st->sgl;
>  	st->nents = 0;
> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> index ca919f429666..7474d79ffac0 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
>  
>  static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
>  {
> -	bool preload = !!(gfp_mask & __GFP_WAIT);
> +	bool preload = gfpflags_allow_blocking(gfp_mask);
>  	unsigned long flags;
>  	int ret, id;
>  
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 658ee39e6569..95d4c70dc7b1 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
>  
>  	page = alloc_pages(flag | __GFP_NOWARN,  get_order(size));
>  	if (!page) {
> -		if (!(flag & __GFP_WAIT))
> +		if (!gfpflags_allow_blocking(flag))
>  			return NULL;
>  
>  		page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 0649b94f5958..f77becf3d8d8 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3566,7 +3566,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
>  			flags |= GFP_DMA32;
>  	}
>  
> -	if (flags & __GFP_WAIT) {
> +	if (gfpflags_allow_blocking(flags)) {
>  		unsigned int count = size >> PAGE_SHIFT;
>  
>  		page = dma_alloc_from_contiguous(dev, count, order);
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 0f48fed44a17..6dda08385309 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>  	struct bio_vec *bvec;
>  
>  retry:
> -	if (unlikely(gfp_mask & __GFP_WAIT))
> +	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		mutex_lock(&cc->bio_alloc_lock);
>  
>  	clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
> @@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>  		if (!page) {
>  			crypt_free_buffer_pages(cc, clone);
>  			bio_put(clone);
> -			gfp_mask |= __GFP_WAIT;
> +			gfp_mask |= __GFP_DIRECT_RECLAIM;
>  			goto retry;
>  		}
>  
> @@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>  	}
>  
>  return_clone:
> -	if (unlikely(gfp_mask & __GFP_WAIT))
> +	if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		mutex_unlock(&cc->bio_alloc_lock);
>  
>  	return clone;
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade5e27d..1452ed9aacb4 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -244,7 +244,7 @@ static int kcopyd_get_pages(struct dm_kcopyd_client *kc,
>  	*pages = NULL;
>  
>  	do {
> -		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY);
> +		pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
>  		if (unlikely(!pl)) {
>  			/* Use reserved pages */
>  			pl = kc->pages;
> diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> index 53fff5425c13..fb2cb4bdc0c1 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> @@ -1291,7 +1291,7 @@ static struct solo_enc_dev *solo_enc_alloc(struct solo_dev *solo_dev,
>  	solo_enc->vidq.ops = &solo_enc_video_qops;
>  	solo_enc->vidq.mem_ops = &vb2_dma_sg_memops;
>  	solo_enc->vidq.drv_priv = solo_enc;
> -	solo_enc->vidq.gfp_flags = __GFP_DMA32;
> +	solo_enc->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>  	solo_enc->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
>  	solo_enc->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
>  	solo_enc->vidq.lock = &solo_enc->lock;
> diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2.c b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> index 63ae8a61f603..bde77b22340c 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> @@ -675,7 +675,7 @@ int solo_v4l2_init(struct solo_dev *solo_dev, unsigned nr)
>  	solo_dev->vidq.mem_ops = &vb2_dma_contig_memops;
>  	solo_dev->vidq.drv_priv = solo_dev;
>  	solo_dev->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
> -	solo_dev->vidq.gfp_flags = __GFP_DMA32;
> +	solo_dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>  	solo_dev->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
>  	solo_dev->vidq.lock = &solo_dev->lock;
>  	ret = vb2_queue_init(&solo_dev->vidq);
> diff --git a/drivers/media/pci/tw68/tw68-video.c b/drivers/media/pci/tw68/tw68-video.c
> index 8355e55b4e8e..e556f989aaab 100644
> --- a/drivers/media/pci/tw68/tw68-video.c
> +++ b/drivers/media/pci/tw68/tw68-video.c
> @@ -975,7 +975,7 @@ int tw68_video_init2(struct tw68_dev *dev, int video_nr)
>  	dev->vidq.ops = &tw68_video_qops;
>  	dev->vidq.mem_ops = &vb2_dma_sg_memops;
>  	dev->vidq.drv_priv = dev;
> -	dev->vidq.gfp_flags = __GFP_DMA32;
> +	dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>  	dev->vidq.buf_struct_size = sizeof(struct tw68_buf);
>  	dev->vidq.lock = &dev->lock;
>  	dev->vidq.min_buffers_needed = 2;
> diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
> index 8bbbb751bf45..2dfb291a47c6 100644
> --- a/drivers/mtd/mtdcore.c
> +++ b/drivers/mtd/mtdcore.c
> @@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
>   */
>  void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
>  {
> -	gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
> -		       __GFP_NORETRY | __GFP_NO_KSWAPD;
> +	gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
>  	size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
>  	void *kbuf;
>  
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index f7fbdc9d1325..3a407e59acab 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -689,7 +689,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
>  {
>  	if (fp->rx_frag_size) {
>  		/* GFP_KERNEL allocations are used only during initialization */
> -		if (unlikely(gfp_mask & __GFP_WAIT))
> +		if (unlikely(gfpflags_allow_blocking(gfp_mask)))
>  			return (void *)__get_free_page(gfp_mask);
>  
>  		return netdev_alloc_frag(fp->rx_frag_size);
> diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
> index da2a63c0a9ba..2615e0ae4f0a 100644
> --- a/drivers/staging/android/ion/ion_system_heap.c
> +++ b/drivers/staging/android/ion/ion_system_heap.c
> @@ -27,7 +27,7 @@
>  #include "ion_priv.h"
>  
>  static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> -				     __GFP_NORETRY) & ~__GFP_WAIT;
> +				     __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
>  static gfp_t low_order_gfp_flags  = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
>  static const unsigned int orders[] = {8, 4, 0};
>  static const int num_orders = ARRAY_SIZE(orders);
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> index ed37d26eb20d..5b0756cb6576 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> @@ -113,7 +113,7 @@ do {						\
>  do {									    \
>  	LASSERT(!in_interrupt() ||					    \
>  		((size) <= LIBCFS_VMALLOC_SIZE &&			    \
> -		 ((mask) & __GFP_WAIT) == 0));				    \
> +		 !gfpflags_allow_blocking(mask)));			    \
>  } while (0)
>  
>  #define LIBCFS_ALLOC_POST(ptr, size)					    \
> diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> index d51687780b61..8d4c1806e32f 100644
> --- a/drivers/usb/host/u132-hcd.c
> +++ b/drivers/usb/host/u132-hcd.c
> @@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>  {
>  	struct u132 *u132 = hcd_to_u132(hcd);
>  	if (irqs_disabled()) {
> -		if (__GFP_WAIT & mem_flags) {
> +		if (gfpflags_allow_blocking(mem_flags)) {
>  			printk(KERN_ERR "invalid context for function that migh"
>  				"t sleep\n");
>  			return -EINVAL;
> diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c
> index 6b70d7f62b2f..1c1e95a0b8fa 100644
> --- a/drivers/video/fbdev/vermilion/vermilion.c
> +++ b/drivers/video/fbdev/vermilion/vermilion.c
> @@ -99,7 +99,7 @@ static int vmlfb_alloc_vram_area(struct vram_area *va, unsigned max_order,
>  		 * below the first 16MB.
>  		 */
>  
> -		flags = __GFP_DMA | __GFP_HIGH;
> +		flags = __GFP_DMA | __GFP_HIGH | __GFP_KSWAPD_RECLAIM;
>  		va->logical =
>  			 __get_free_pages(flags, --max_order);
>  	} while (va->logical == 0 && max_order > min_order);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f556c3732c2c..3dd4792b8099 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2566,7 +2566,7 @@ int open_ctree(struct super_block *sb,
>  	fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
>  	fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
>  	/* readahead state */
> -	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
> +	INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  	spin_lock_init(&fs_info->reada_lock);
>  
>  	fs_info->thread_pool_size = min_t(unsigned long,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 02d05817cbdf..c8a6cdcbef2b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
>  		clear = 1;
>  again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && gfpflags_allow_blocking(mask)) {
>  		/*
>  		 * Don't care for allocation failure here because we might end
>  		 * up not needing the pre-allocated extent state at all, which
> @@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (start > end)
>  		goto out;
>  	spin_unlock(&tree->lock);
> -	if (mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(mask))
>  		cond_resched();
>  	goto again;
>  }
> @@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  
>  	bits |= EXTENT_FIRST_DELALLOC;
>  again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && gfpflags_allow_blocking(mask)) {
>  		prealloc = alloc_extent_state(mask);
>  		BUG_ON(!prealloc);
>  	}
> @@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (start > end)
>  		goto out;
>  	spin_unlock(&tree->lock);
> -	if (mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(mask))
>  		cond_resched();
>  	goto again;
>  }
> @@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	btrfs_debug_check_extent_io_range(tree, start, end);
>  
>  again:
> -	if (!prealloc && (mask & __GFP_WAIT)) {
> +	if (!prealloc && gfpflags_allow_blocking(mask)) {
>  		/*
>  		 * Best effort, don't worry if extent state allocation fails
>  		 * here for the first iteration. We might have a cached state
> @@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>  	if (start > end)
>  		goto out;
>  	spin_unlock(&tree->lock);
> -	if (mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(mask))
>  		cond_resched();
>  	first_iteration = false;
>  	goto again;
> @@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
>  	u64 start = page_offset(page);
>  	u64 end = start + PAGE_CACHE_SIZE - 1;
>  
> -	if ((mask & __GFP_WAIT) &&
> +	if (gfpflags_allow_blocking(mask) &&
>  	    page->mapping->host->i_size > 16 * 1024 * 1024) {
>  		u64 len;
>  		while (start <= end) {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fbe7c104531c..b1968f36a39b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
>  	spin_lock_init(&dev->reada_lock);
>  	atomic_set(&dev->reada_in_flight, 0);
>  	atomic_set(&dev->dev_stats_ccnt, 0);
> -	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
> -	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
> +	INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> +	INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  
>  	return dev;
>  }
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 5ed0044fbb37..9004c786716f 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -750,7 +750,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
>  		return 0;
>  	if (journal)
>  		return journal_try_to_free_buffers(journal, page, 
> -						   wait & ~__GFP_WAIT);
> +						wait & ~__GFP_DIRECT_RECLAIM);
>  	return try_to_free_buffers(page);
>  }
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 58987b5c514b..abe76d41ef1e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1045,7 +1045,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
>  		return 0;
>  	if (journal)
>  		return jbd2_journal_try_to_free_buffers(journal, page,
> -							wait & ~__GFP_WAIT);
> +						wait & ~__GFP_DIRECT_RECLAIM);
>  	return try_to_free_buffers(page);
>  }
>  
> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
> index d403c69bee08..4304072161aa 100644
> --- a/fs/fscache/cookie.c
> +++ b/fs/fscache/cookie.c
> @@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(
>  
>  	/* radix tree insertion won't use the preallocation pool unless it's
>  	 * told it may not wait */
> -	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
> +	INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>  
>  	switch (cookie->def->type) {
>  	case FSCACHE_COOKIE_TYPE_INDEX:
> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> index 483bbc613bf0..79483b3d8c6f 100644
> --- a/fs/fscache/page.c
> +++ b/fs/fscache/page.c
> @@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)
>  
>  /*
>   * decide whether a page can be released, possibly by cancelling a store to it
> - * - we're allowed to sleep if __GFP_WAIT is flagged
> + * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
>   */
>  bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>  				  struct page *page,
> @@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>  	 * allocator as the work threads writing to the cache may all end up
>  	 * sleeping on memory allocation, so we may need to impose a timeout
>  	 * too. */
> -	if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
> +	if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
>  		fscache_stat(&fscache_n_store_vmscan_busy);
>  		return false;
>  	}
> @@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>  		_debug("fscache writeout timeout page: %p{%lx}",
>  			page, page->index);
>  
> -	gfp &= ~__GFP_WAIT;
> +	gfp &= ~__GFP_DIRECT_RECLAIM;
>  	goto try_again;
>  }
>  EXPORT_SYMBOL(__fscache_maybe_release_page);
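
(Illustrative aside, not part of the quoted patch: the fscache hunk above shows
the common idiom for reclaim callbacks, which is to give up early unless the
mask allows both blocking and FS recursion. A minimal, hypothetical sketch of a
releasepage-style hook using the renamed flag:)

  #include <linux/gfp.h>
  #include <linux/mm.h>

  /* Illustrative only: do the expensive cleanup only when the reclaim
   * context can tolerate sleeping and FS recursion. */
  static int example_release_page(struct page *page, gfp_t gfp)
  {
  	if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS))
  		return 0;	/* busy: try again on a later reclaim pass */

  	/* ... wait for or cancel whatever pins the page ... */
  	return 1;		/* page can be released */
  }
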
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..f45b90ba7c5c 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -1690,8 +1690,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
>   * @journal: journal for operation
>   * @page: to try and free
>   * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
>   *
>   *
>   * For all the buffers on this page,
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index f3d06174b051..06e18bcdb888 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1893,8 +1893,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
>   * @journal: journal for operation
>   * @page: to try and free
>   * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
>   *
>   *
>   * For all the buffers on this page,
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index cc4fa1ed61fc..be6821967ec6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -480,8 +480,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  	dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>  
>  	/* Always try to initiate a 'commit' if relevant, but only
> -	 * wait for it if __GFP_WAIT is set.  Even then, only wait 1
> -	 * second and only if the 'bdi' is not congested.
> +	 * wait for it if the caller allows blocking.  Even then,
> +	 * only wait 1 second and only if the 'bdi' is not congested.
>  	 * Waiting indefinitely can cause deadlocks when the NFS
>  	 * server is on this machine, when a new TCP connection is
>  	 * needed and in other rare cases.  There is no particular
> @@ -491,7 +491,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>  	if (mapping) {
>  		struct nfs_server *nfss = NFS_SERVER(mapping->host);
>  		nfs_commit_inode(mapping->host, 0);
> -		if ((gfp & __GFP_WAIT) &&
> +		if (gfpflags_allow_blocking(gfp) &&
>  		    !bdi_write_congested(&nfss->backing_dev_info)) {
>  			wait_on_page_bit_killable_timeout(page, PG_private,
>  							  HZ);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index eac9549efd52..587174fd4f2c 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
>  	unsigned long		freed;
>  	int			error;
>  
> -	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
> +	if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
>  		return 0;
>  
>  	INIT_LIST_HEAD(&isol.buffers);
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index a10347ca5053..bd1937977d84 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -29,12 +29,13 @@ struct vm_area_struct;
>  #define ___GFP_NOMEMALLOC	0x10000u
>  #define ___GFP_HARDWALL		0x20000u
>  #define ___GFP_THISNODE		0x40000u
> -#define ___GFP_WAIT		0x80000u
> +#define ___GFP_ATOMIC		0x80000u
>  #define ___GFP_NOACCOUNT	0x100000u
>  #define ___GFP_NOTRACK		0x200000u
> -#define ___GFP_NO_KSWAPD	0x400000u
> +#define ___GFP_DIRECT_RECLAIM	0x400000u
>  #define ___GFP_OTHER_NODE	0x800000u
>  #define ___GFP_WRITE		0x1000000u
> +#define ___GFP_KSWAPD_RECLAIM	0x2000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>  
>  /*
> @@ -68,7 +69,7 @@ struct vm_area_struct;
>   * __GFP_MOVABLE: Flag that this page will be movable by the page migration
>   * mechanism or reclaimed
>   */
> -#define __GFP_WAIT	((__force gfp_t)___GFP_WAIT)	/* Can wait and reschedule? */
> +#define __GFP_ATOMIC	((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
>  #define __GFP_HIGH	((__force gfp_t)___GFP_HIGH)	/* Should access emergency pools? */
>  #define __GFP_IO	((__force gfp_t)___GFP_IO)	/* Can start physical IO? */
>  #define __GFP_FS	((__force gfp_t)___GFP_FS)	/* Can call down to low-level FS? */
> @@ -91,23 +92,37 @@ struct vm_area_struct;
>  #define __GFP_NOACCOUNT	((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
>  #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
>  
> -#define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
>  #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
>  
>  /*
> + * A caller that is willing to wait may enter direct reclaim and will
> + * wake kswapd to reclaim pages in the background until the high
> + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> + * avoid unnecessary delays when a fallback option is available but
> + * still allow kswapd to reclaim in the background. The kswapd flag
> + * can be cleared when the reclaiming of pages would cause unnecessary
> + * disruption.
> + */
> +#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> +#define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> +#define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> +
> +/*
>   * This may seem redundant, but it's a way of annotating false positives vs.
>   * allocations that simply cannot be supported (e.g. page tables).
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>  
> -#define __GFP_BITS_SHIFT 25	/* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26	/* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>  
> -/* This equals 0, but use constants in case they ever change */
> -#define GFP_NOWAIT	(GFP_ATOMIC & ~__GFP_HIGH)
> -/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
> -#define GFP_ATOMIC	(__GFP_HIGH)
> +/*
> + * GFP_ATOMIC callers can not sleep, need the allocation to succeed.
> + * A lower watermark is applied to allow access to "atomic reserves"
> + */
> +#define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_NOWAIT	(__GFP_KSWAPD_RECLAIM)
>  #define GFP_NOIO	(__GFP_WAIT)
>  #define GFP_NOFS	(__GFP_WAIT | __GFP_IO)
>  #define GFP_KERNEL	(__GFP_WAIT | __GFP_IO | __GFP_FS)
> @@ -116,10 +131,10 @@ struct vm_area_struct;
>  #define GFP_USER	(__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
>  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_IOFS	(__GFP_IO | __GFP_FS)
> -#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> -			 __GFP_NO_KSWAPD)
> +#define GFP_IOFS	(__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
> +#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> +			 ~__GFP_KSWAPD_RECLAIM)
>  
>  /* This mask makes up all the page movable related flags */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> @@ -161,6 +176,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
>  	return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
>  }
>  
> +static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> +{
> +	return gfp_flags & __GFP_DIRECT_RECLAIM;
> +}
> +
>  #ifdef CONFIG_HIGHMEM
>  #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
>  #else
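
(Illustrative aside, not part of the quoted patch: the new definitions above
split the old "can this allocation wait?" question into three separate ones.
A small sketch of how the common masks compose; the flag arithmetic in the
comments follows directly from the #defines in this hunk.)

  #include <linux/gfp.h>
  #include <linux/slab.h>
  #include <linux/bug.h>

  static void gfp_flag_examples(void)
  {
  	void *a, *b, *c;

  	/* cannot sleep at all, may dip into atomic reserves, kicks kswapd */
  	a = kmalloc(64, GFP_ATOMIC);	/* __GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM */

  	/* prefers not to sleep but has a fallback; still wakes kswapd */
  	b = kmalloc(64, GFP_NOWAIT);	/* __GFP_KSWAPD_RECLAIM only */

  	/* ordinary sleeping allocation: direct reclaim plus kswapd */
  	c = kmalloc(64, GFP_KERNEL);	/* __GFP_WAIT == both reclaim bits */

  	/* open-coded "& __GFP_WAIT" tests become the helper below */
  	WARN_ON(gfpflags_allow_blocking(GFP_ATOMIC));	/* never fires */

  	kfree(c);
  	kfree(b);
  	kfree(a);
  }
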
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 22b6d9ca1654..55c4a9175801 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1109,7 +1109,7 @@ static inline int skb_cloned(const struct sk_buff *skb)
>  
>  static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
>  {
> -	might_sleep_if(pri & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(pri));
>  
>  	if (skb_cloned(skb))
>  		return pskb_expand_head(skb, 0, 0, pri);
> @@ -1193,7 +1193,7 @@ static inline int skb_shared(const struct sk_buff *skb)
>   */
>  static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
>  {
> -	might_sleep_if(pri & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(pri));
>  	if (skb_shared(skb)) {
>  		struct sk_buff *nskb = skb_clone(skb, pri);
>  
> @@ -1229,7 +1229,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
>  static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
>  					  gfp_t pri)
>  {
> -	might_sleep_if(pri & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(pri));
>  	if (skb_cloned(skb)) {
>  		struct sk_buff *nskb = skb_copy(skb, pri);
>  
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f21f0708ec59..cec0c4b634dc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2035,7 +2035,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -	if (sk->sk_allocation & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(sk->sk_allocation))
>  		return &current->task_frag;
>  
>  	return &sk->sk_frag;
> diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
> index d6fd8e5b14b7..dde6bf092c8a 100644
> --- a/include/trace/events/gfpflags.h
> +++ b/include/trace/events/gfpflags.h
> @@ -20,7 +20,7 @@
>  	{(unsigned long)GFP_ATOMIC,		"GFP_ATOMIC"},		\
>  	{(unsigned long)GFP_NOIO,		"GFP_NOIO"},		\
>  	{(unsigned long)__GFP_HIGH,		"GFP_HIGH"},		\
> -	{(unsigned long)__GFP_WAIT,		"GFP_WAIT"},		\
> +	{(unsigned long)__GFP_ATOMIC,		"GFP_ATOMIC"},		\
>  	{(unsigned long)__GFP_IO,		"GFP_IO"},		\
>  	{(unsigned long)__GFP_COLD,		"GFP_COLD"},		\
>  	{(unsigned long)__GFP_NOWARN,		"GFP_NOWARN"},		\
> @@ -36,7 +36,8 @@
>  	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
>  	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
>  	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
> -	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
> +	{(unsigned long)__GFP_DIRECT_RECLAIM,	"GFP_DIRECT_RECLAIM"},	\
> +	{(unsigned long)__GFP_KSWAPD_RECLAIM,	"GFP_KSWAPD_RECLAIM"},	\
>  	{(unsigned long)__GFP_OTHER_NODE,	"GFP_OTHER_NODE"}	\
>  	) : "GFP_NOWAIT"
>  
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f9e6065346db..6ab7a55dbdff 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
>  	if (unlikely(audit_filter_type(type)))
>  		return NULL;
>  
> -	if (gfp_mask & __GFP_WAIT) {
> +	if (gfp_mask & __GFP_DIRECT_RECLAIM) {
>  		if (audit_pid && audit_pid == current->pid)
> -			gfp_mask &= ~__GFP_WAIT;
> +			gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>  		else
>  			reserve = 0;
>  	}
>  
>  	while (audit_backlog_limit
>  	       && skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
> -		if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
> +		if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
>  			long sleep_time;
>  
>  			sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 8acfbf773e06..9aa39f20f593 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>  		return;
>  
>  	/* no reclaim without waiting on it */
> -	if (!(gfp_mask & __GFP_WAIT))
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
>  		return;
>  
>  	/* this guy won't enter reclaim */
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 5235dd4e1e2f..3a970604308f 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1779,7 +1779,7 @@ alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
>  	while (to_alloc-- > 0) {
>  		struct page *page;
>  
> -		page = alloc_image_page(__GFP_HIGHMEM);
> +		page = alloc_image_page(__GFP_HIGHMEM|__GFP_KSWAPD_RECLAIM);
>  		memory_bm_set_bit(bm, page_to_pfn(page));
>  	}
>  	return nr_highmem;
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 07854477c164..d903c02223af 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
>  	cpumask_var_t cpus;
>  	int cpu, ret;
>  
> -	might_sleep_if(gfp_flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_flags));
>  
>  	if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
>  		preempt_disable();
> diff --git a/lib/idr.c b/lib/idr.c
> index 5335c43adf46..6098336df267 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
>  	 * allocation guarantee.  Disallow usage from those contexts.
>  	 */
>  	WARN_ON_ONCE(in_interrupt());
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>  
>  	preempt_disable();
>  
> @@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
>  	struct idr_layer *pa[MAX_IDR_LEVEL + 1];
>  	int id;
>  
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>  
>  	/* sanity checks */
>  	if (WARN_ON_ONCE(start < 0))
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..c3775ee46cd6 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>  	 * preloading in the interrupt anyway as all the allocations have to
>  	 * be atomic. So just do normal allocation when in interrupt.
>  	 */
> -	if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> +	if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
>  		struct radix_tree_preload *rtp;
>  
>  		/*
> @@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  static int __radix_tree_preload(gfp_t gfp_mask)
>  {
> @@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  int radix_tree_preload(gfp_t gfp_mask)
>  {
>  	/* Warn on non-sensical use... */
> -	WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> +	WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
>  	return __radix_tree_preload(gfp_mask);
>  }
>  EXPORT_SYMBOL(radix_tree_preload);
> @@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
>   */
>  int radix_tree_maybe_preload(gfp_t gfp_mask)
>  {
> -	if (gfp_mask & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(gfp_mask))
>  		return __radix_tree_preload(gfp_mask);
>  	/* Preloading doesn't help anything with this gfp mask, skip it */
>  	preempt_disable();
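
(Illustrative aside, not part of the quoted patch: the comments above spell out
the preload contract. A sketch of the usual pattern, matching how fscache and
btrfs initialise their trees earlier in this patch; the tree and function names
are made up.)

  #include <linux/radix-tree.h>
  #include <linux/gfp.h>

  /* The tree's own mask must not allow direct reclaim, otherwise
   * insertions would bypass the per-cpu preload pool. */
  static RADIX_TREE(example_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);

  static int example_insert(unsigned long index, void *item)
  {
  	int ret;

  	/* preload with a blocking mask while sleeping is still allowed */
  	ret = radix_tree_preload(GFP_NOFS);
  	if (ret)
  		return ret;

  	/* preemption stays disabled until preload_end(), so the nodes
  	 * reserved above cannot be consumed by another task on this CPU */
  	ret = radix_tree_insert(&example_tree, index, item);
  	radix_tree_preload_end();
  	return ret;
  }
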
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index dac5bf59309d..805ce70b72f3 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
>  {
>  	struct bdi_writeback *wb;
>  
> -	might_sleep_if(gfp & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(gfp));
>  
>  	if (!memcg_css->parent)
>  		return &bdi->wb;
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index fd5fe4342e93..84dac666fc0c 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -323,7 +323,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
>  	size_t offset;
>  	void *retval;
>  
> -	might_sleep_if(mem_flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(mem_flags));
>  
>  	spin_lock_irqsave(&pool->lock, flags);
>  	list_for_each_entry(page, &pool->page_list, page_list) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index acb93c554f6e..e34f6411da8c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2268,7 +2268,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	if (unlikely(task_in_memcg_oom(current)))
>  		goto nomem;
>  
> -	if (!(gfp_mask & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(gfp_mask))
>  		goto nomem;
>  
>  	mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> @@ -2327,7 +2327,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	css_get_many(&memcg->css, batch);
>  	if (batch > nr_pages)
>  		refill_stock(memcg, batch - nr_pages);
> -	if (!(gfp_mask & __GFP_WAIT))
> +	if (!gfpflags_allow_blocking(gfp_mask))
>  		goto done;
>  	/*
>  	 * If the hierarchy is above the normal consumption range,
> @@ -4696,8 +4696,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
>  {
>  	int ret;
>  
> -	/* Try a single bulk charge without reclaim first */
> -	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
> +	/* Try a single bulk charge without reclaim first, kswapd may wake */
> +	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
>  	if (!ret) {
>  		mc.precharge += count;
>  		return ret;
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 2cc08de8b1db..bfd2a0dd0e18 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -317,13 +317,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	gfp_t gfp_temp;
>  
>  	VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
>  	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>  
> -	gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
> +	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
>  repeat_alloc:
>  
> @@ -346,7 +346,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	}
>  
>  	/*
> -	 * We use gfp mask w/o __GFP_WAIT or IO for the first round.  If
> +	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
>  	if (gfp_temp != gfp_mask) {
> @@ -355,8 +355,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  		goto repeat_alloc;
>  	}
>  
> -	/* We must not sleep if !__GFP_WAIT */
> -	if (!(gfp_mask & __GFP_WAIT)) {
> +	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
> +	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		return NULL;
>  	}
> diff --git a/mm/migrate.c b/mm/migrate.c
> index eb4267107d1f..0e16c4047638 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1564,7 +1564,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
>  					 (GFP_HIGHUSER_MOVABLE |
>  					  __GFP_THISNODE | __GFP_NOMEMALLOC |
>  					  __GFP_NORETRY | __GFP_NOWARN) &
> -					 ~GFP_IOFS, 0);
> +					 ~(__GFP_IO | __GFP_FS), 0);
>  
>  	return newpage;
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 32d1cec124bc..68f961bdfdf8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -151,12 +151,12 @@ void pm_restrict_gfp_mask(void)
>  	WARN_ON(!mutex_is_locked(&pm_mutex));
>  	WARN_ON(saved_gfp_mask);
>  	saved_gfp_mask = gfp_allowed_mask;
> -	gfp_allowed_mask &= ~GFP_IOFS;
> +	gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
>  }
>  
>  bool pm_suspended_storage(void)
>  {
> -	if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +	if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>  		return false;
>  	return true;
>  }
> @@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>  		return false;
>  	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>  		return false;
> -	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> +	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		return false;
>  
>  	return should_fail(&fail_page_alloc.attr, 1 << order);
> @@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>  		if (test_thread_flag(TIF_MEMDIE) ||
>  		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
>  			filter &= ~SHOW_MEM_FILTER_NODES;
> -	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> +	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
>  		filter &= ~SHOW_MEM_FILTER_NODES;
>  
>  	if (fmt) {
> @@ -2915,7 +2915,6 @@ static inline int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> -	const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -2924,11 +2923,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 * The caller may dip into page reserves a bit more if the caller
>  	 * cannot run direct reclaim, or if the caller has realtime scheduling
>  	 * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -	 * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
> +	 * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (atomic) {
> +	if (gfp_mask & __GFP_ATOMIC) {
>  		/*
>  		 * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
>  		 * if it can't schedule.
> @@ -2965,11 +2964,16 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
>  }
>  
> +static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> +{
> +	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
> +}
> +
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  						struct alloc_context *ac)
>  {
> -	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>  	struct page *page = NULL;
>  	int alloc_flags;
>  	unsigned long pages_reclaimed = 0;
> @@ -2990,15 +2994,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	}
>  
>  	/*
> +	 * We also sanity check to catch abuse of atomic reserves being used by
> +	 * callers that are not in atomic context.
> +	 */
> +	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> +				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> +		gfp_mask &= ~__GFP_ATOMIC;
> +
> +	/*
>  	 * If this allocation cannot block and it is for a specific node, then
>  	 * fail early.  There's no need to wakeup kswapd or retry for a
>  	 * speculative node-specific allocation.
>  	 */
> -	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
> +	if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
>  		goto nopage;
>  
>  retry:
> -	if (!(gfp_mask & __GFP_NO_KSWAPD))
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
>  	/*
> @@ -3041,8 +3053,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		}
>  	}
>  
> -	/* Atomic allocations - we can't balance anything */
> -	if (!wait) {
> +	/* Caller is not willing to reclaim, we can't balance anything */
> +	if (!can_direct_reclaim) {
>  		/*
>  		 * All existing users of the deprecated __GFP_NOFAIL are
>  		 * blockable, so warn of any new users that actually allow this
> @@ -3072,7 +3084,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto got_pg;
>  
>  	/* Checks for THP-specific high-order allocations */
> -	if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
> +	if (is_thp_gfp_mask(gfp_mask)) {
>  		/*
>  		 * If compaction is deferred for high-order allocations, it is
>  		 * because sync compaction recently failed. If this is the case
> @@ -3107,8 +3119,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 * fault, so use asynchronous memory compaction for THP unless it is
>  	 * khugepaged trying to collapse.
>  	 */
> -	if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
> -						(current->flags & PF_KTHREAD))
> +	if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
>  		migration_mode = MIGRATE_SYNC_LIGHT;
>  
>  	/* Try direct reclaim and then allocating */
> @@ -3179,7 +3190,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  
>  	lockdep_trace_alloc(gfp_mask);
>  
> -	might_sleep_if(gfp_mask & __GFP_WAIT);
> +	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
>  	if (should_fail_alloc_page(gfp_mask, order))
>  		return NULL;
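
(Illustrative aside, not part of the quoted patch: two of the page_alloc.c
changes above can be sanity-checked from the definitions alone. A tiny,
hypothetical self-test; example_is_thp_gfp_mask() just mirrors
is_thp_gfp_mask() from the hunk above.)

  #include <linux/gfp.h>
  #include <linux/bug.h>

  static bool example_is_thp_gfp_mask(gfp_t gfp_mask)
  {
  	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
  }

  static void example_checks(void)
  {
  	/* GFP_TRANSHUGE now clears the kswapd bit by definition ... */
  	BUILD_BUG_ON(GFP_TRANSHUGE & __GFP_KSWAPD_RECLAIM);

  	/* ... so a plain page-fault THP mask matches ... */
  	WARN_ON(!example_is_thp_gfp_mask(GFP_TRANSHUGE));

  	/* ... while adding __GFP_KSWAPD_RECLAIM back no longer does */
  	WARN_ON(example_is_thp_gfp_mask(GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM));

  	/* GFP_ATOMIC never carries __GFP_DIRECT_RECLAIM; combining the two
  	 * by hand is exactly what the new WARN_ON_ONCE in the slowpath
  	 * strips and complains about */
  	BUILD_BUG_ON(GFP_ATOMIC & __GFP_DIRECT_RECLAIM);
  }
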
> diff --git a/mm/slab.c b/mm/slab.c
> index 200e22412a16..f82bdb3eb1fc 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
>  }
>  
>  /*
> - * Construct gfp mask to allocate from a specific node but do not invoke reclaim
> - * or warn about failures.
> + * Construct gfp mask to allocate from a specific node but do not direct reclaim
> + * or warn about failures. kswapd may still wake to reclaim in the background.
>   */
>  static inline gfp_t gfp_exact_node(gfp_t flags)
>  {
> -	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
> +	return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
>  }
>  #endif
>  
> @@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  
>  	offset *= cachep->colour_off;
>  
> -	if (local_flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_enable();
>  
>  	/*
> @@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  
>  	cache_init_objs(cachep, page);
>  
> -	if (local_flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_disable();
>  	check_irq_off();
>  	spin_lock(&n->list_lock);
> @@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  opps1:
>  	kmem_freepages(cachep, page);
>  failed:
> -	if (local_flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(local_flags))
>  		local_irq_disable();
>  	return 0;
>  }
> @@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
>  static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
>  						gfp_t flags)
>  {
> -	might_sleep_if(flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(flags));
>  #if DEBUG
>  	kmem_flagcheck(cachep, flags);
>  #endif
> @@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
>  		 */
>  		struct page *page;
>  
> -		if (local_flags & __GFP_WAIT)
> +		if (gfpflags_allow_blocking(local_flags))
>  			local_irq_enable();
>  		kmem_flagcheck(cache, flags);
>  		page = kmem_getpages(cache, local_flags, numa_mem_id());
> -		if (local_flags & __GFP_WAIT)
> +		if (gfpflags_allow_blocking(local_flags))
>  			local_irq_disable();
>  		if (page) {
>  			/*
> diff --git a/mm/slub.c b/mm/slub.c
> index 816df0016555..a4661c59ff54 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>  {
>  	flags &= gfp_allowed_mask;
>  	lockdep_trace_alloc(flags);
> -	might_sleep_if(flags & __GFP_WAIT);
> +	might_sleep_if(gfpflags_allow_blocking(flags));
>  
>  	if (should_failslab(s->object_size, flags, s->flags))
>  		return NULL;
> @@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  
>  	flags &= gfp_allowed_mask;
>  
> -	if (flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(flags))
>  		local_irq_enable();
>  
>  	flags |= s->allocflags;
> @@ -1380,7 +1380,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  			kmemcheck_mark_unallocated_pages(page, pages);
>  	}
>  
> -	if (flags & __GFP_WAIT)
> +	if (gfpflags_allow_blocking(flags))
>  		local_irq_disable();
>  	if (!page)
>  		return NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2faaa2976447..9ad4dcb0631c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  			goto fail;
>  		}
>  		area->pages[i] = page;
> -		if (gfp_mask & __GFP_WAIT)
> +		if (gfpflags_allow_blocking(gfp_mask))
>  			cond_resched();
>  	}
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e950134c4b9a..837c440d60a9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1465,7 +1465,7 @@ static int too_many_isolated(struct zone *zone, int file,
>  	 * won't get blocked by normal direct-reclaimers, forming a circular
>  	 * deadlock.
>  	 */
> -	if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>  		inactive >>= 3;
>  
>  	return isolated > inactive;
> @@ -3764,7 +3764,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>  	/*
>  	 * Do not scan if the allocation should not be delayed.
>  	 */
> -	if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> +	if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
>  		return ZONE_RECLAIM_NOSCAN;
>  
>  	/*
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2d5727baed59..26104a68c972 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -684,7 +684,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>  
>  	/* store */
>  	len = dlen + sizeof(struct zswap_header);
> -	ret = zpool_malloc(zswap_pool, len, __GFP_NORETRY | __GFP_NOWARN,
> +	ret = zpool_malloc(zswap_pool, len,
> +		__GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM,
>  		&handle);
>  	if (ret == -ENOSPC) {
>  		zswap_reject_compress_poor++;
> @@ -900,7 +901,7 @@ static void __exit zswap_debugfs_exit(void) { }
>  **********************************/
>  static int __init init_zswap(void)
>  {
> -	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
> +	gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
>  
>  	pr_info("loading zswap\n");
>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b6a19ca0f99e..6f025e2544de 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
>  	len += NET_SKB_PAD;
>  
>  	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> +	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>  		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>  		if (!skb)
>  			goto skb_fail;
> @@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>  	len += NET_SKB_PAD + NET_IP_ALIGN;
>  
>  	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -	    (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> +	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>  		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>  		if (!skb)
>  			goto skb_fail;
> @@ -4452,7 +4452,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>  		return NULL;
>  
>  	gfp_head = gfp_mask;
> -	if (gfp_head & __GFP_WAIT)
> +	if (gfp_head & __GFP_DIRECT_RECLAIM)
>  		gfp_head |= __GFP_REPEAT;
>  
>  	*errcode = -ENOBUFS;
> @@ -4467,7 +4467,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>  
>  		while (order) {
>  			if (npages >= 1 << order) {
> -				page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
> +				page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
>  						   __GFP_COMP |
>  						   __GFP_NOWARN |
>  						   __GFP_NORETRY,
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 193901d09757..02b705cc9eb3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>  
>  	pfrag->offset = 0;
>  	if (SKB_FRAG_PAGE_ORDER) {
> -		pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
> -					  __GFP_NOWARN | __GFP_NORETRY,
> +		/* Avoid direct reclaim but allow kswapd to wake */
> +		pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> +					  __GFP_COMP | __GFP_NOWARN |
> +					  __GFP_NORETRY,
>  					  SKB_FRAG_PAGE_ORDER);
>  		if (likely(pfrag->page)) {
>  			pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 67d210477863..8283d90dde74 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2066,7 +2066,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
>  	consume_skb(info.skb2);
>  
>  	if (info.delivered) {
> -		if (info.congested && (allocation & __GFP_WAIT))
> +		if (info.congested && gfpflags_allow_blocking(allocation))
>  			yield();
>  		return 0;
>  	}
> diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
> index 6631f4f1e39b..3b5de4b86058 100644
> --- a/net/rxrpc/ar-connection.c
> +++ b/net/rxrpc/ar-connection.c
> @@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
>  		if (bundle->num_conns >= 20) {
>  			_debug("too many conns");
>  
> -			if (!(gfp & __GFP_WAIT)) {
> +			if (!gfpflags_allow_blocking(gfp)) {
>  				_leave(" = -EAGAIN");
>  				return -EAGAIN;
>  			}
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 197c3f59ecbf..75369ae8de1e 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
>  /* Set an association id for a given association */
>  int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
>  {
> -	bool preload = !!(gfp & __GFP_WAIT);
> +	bool preload = gfpflags_allow_blocking(gfp);
>  	int ret;
>  
>  	/* If the id is already assigned, keep it. */
> -- 
> 2.4.6
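
For reference, the helper used throughout these conversions is introduced by
this series and looks roughly like this (a sketch; the real definition is in
include/linux/gfp.h):

static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
{
	return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
}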

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled
  2015-08-25 11:09           ` Vlastimil Babka
@ 2015-08-26 13:41             ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-26 13:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Aug 25, 2015 at 01:09:35PM +0200, Vlastimil Babka wrote:
> On 08/25/2015 12:33 PM, Mel Gorman wrote:
> >On Mon, Aug 24, 2015 at 10:53:37PM +0200, Vlastimil Babka wrote:
> >>On 24.8.2015 15:16, Mel Gorman wrote:
> >>>>>
> >>>>>  	return read_seqcount_retry(&current->mems_allowed_seq, seq);
> >>>>>@@ -139,7 +141,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> >>>>>
> >>>>>  #else /* !CONFIG_CPUSETS */
> >>>>>
> >>>>>-static inline bool cpusets_enabled(void) { return false; }
> >>>>>+static inline bool cpusets_mems_enabled(void) { return false; }
> >>>>>
> >>>>>  static inline int cpuset_init(void) { return 0; }
> >>>>>  static inline void cpuset_init_smp(void) {}
> >>>>>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >>>>>index 62ae28d8ae8d..2c1c3bf54d15 100644
> >>>>>--- a/mm/page_alloc.c
> >>>>>+++ b/mm/page_alloc.c
> >>>>>@@ -2470,7 +2470,7 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >>>>>  		if (IS_ENABLED(CONFIG_NUMA) && zlc_active &&
> >>>>>  			!zlc_zone_worth_trying(zonelist, z, allowednodes))
> >>>>>  				continue;
> >>>>>-		if (cpusets_enabled() &&
> >>>>>+		if (cpusets_mems_enabled() &&
> >>>>>  			(alloc_flags & ALLOC_CPUSET) &&
> >>>>>  			!cpuset_zone_allowed(zone, gfp_mask))
> >>>>>  				continue;
> >>>>
> >>>>Here the benefits are less clear. I guess cpuset_zone_allowed() is
> >>>>potentially costly...
> >>>>
> >>>>Heck, shouldn't we just start the static key on -1 (if possible), so that
> >>>>it's enabled only when there's 2+ cpusets?
> >>
> >>Hm wait a minute, that's what already happens:
> >>
> >>static inline int nr_cpusets(void)
> >>{
> >>         /* jump label reference count + the top-level cpuset */
> >>         return static_key_count(&cpusets_enabled_key) + 1;
> >>}
> >>
> >>I.e. if there's only the root cpuset, static key is disabled, so I think this
> >>patch is moot after all?
> >>
> >
> >static_key_count is an atomic read on a field in struct static_key whereas
> >static_key_false is an arch_static_branch which can be eliminated. The
> >patch eliminates an atomic read so I didn't think it was moot.
> 
> Sorry I wasn't clear enough. My point is that AFAICS cpusets_enabled() will
> only return true if there are more cpusets than the root (top-level) one.
> So the current cpusets_enabled() checks should be enough. Checking that
> "nr_cpusets() > 1" only duplicates what is already covered by
> cpusets_enabled() - see the nr_cpusets() listing above. I.e. David's premise
> was wrong.
> 

/me slaps self

I should have spotted that. Thanks.
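
For the record, the two helpers being contrasted are roughly the following
(a sketch of the 4.2-era jump label code with jump labels enabled, so the
exact definitions may differ slightly):

static inline int static_key_count(struct static_key *key)
{
	/* always an atomic read of the key's counter */
	return atomic_read(&key->enabled);
}

static __always_inline bool static_key_false(struct static_key *key)
{
	/* compiles down to a patchable no-op branch, no memory access */
	return arch_static_branch(key);
}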

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-08-24 12:30 ` [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
@ 2015-08-26 13:42   ` Vlastimil Babka
  2015-08-26 14:53     ` Mel Gorman
  2015-08-28 12:10   ` Michal Hocko
  2015-09-08  8:26   ` Joonsoo Kim
  2 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-26 13:42 UTC (permalink / raw)
  To: Mel Gorman, Andrew Morton
  Cc: Johannes Weiner, Rik van Riel, David Rientjes, Joonsoo Kim,
	Michal Hocko, Linux-MM, LKML

On 08/24/2015 02:30 PM, Mel Gorman wrote:
> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
>
> High-order watermarks serve a different purpose. Kswapd had no high-order
> awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> This was particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
>
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts up to the requested order must
> be examined.
>
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.
>
> With the patch applied, there was little difference in the allocation
> failure rates as the atomic reserves are small relative to the number of
> allocation attempts. The expected impact is that there will never be an
> allocation failure report that shows suitable pages on the free lists.
>
> The one potential side-effect of this is that in a vanilla kernel, the
> watermark checks may have kept a free page for an atomic allocation. Now,
> we are 100% relying on the HighAtomic reserves and an early allocation to
> have allocated them.  If the first high-order atomic allocation is after
> the system is already heavily fragmented then it'll fail.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>   mm/page_alloc.c | 38 ++++++++++++++++++++++++--------------
>   1 file changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2415f882b89c..35dc578730d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2280,8 +2280,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>   #endif /* CONFIG_FAIL_PAGE_ALLOC */
>
>   /*
> - * Return true if free pages are above 'mark'. This takes into account the order
> - * of the allocation.
> + * Return true if free base pages are above 'mark'. For high-order checks it
> + * will return true if the order-0 watermark is reached and there is at least
> + * one free page of a suitable size. Checking now avoids taking the zone lock
> + * to check in the allocation paths if no pages are free.
>    */
>   static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   			unsigned long mark, int classzone_idx, int alloc_flags,
> @@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   {
>   	long min = mark;
>   	int o;
> -	long free_cma = 0;
> +	const bool atomic = (alloc_flags & ALLOC_HARDER);
>
>   	/* free_pages may go negative - that's OK */
>   	free_pages -= (1 << order) - 1;
> @@ -2301,7 +2303,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   	 * If the caller is not atomic then discount the reserves. This will
>   	 * over-estimate how the atomic reserve but it avoids a search
>   	 */
> -	if (likely(!(alloc_flags & ALLOC_HARDER)))
> +	if (likely(!atomic))
>   		free_pages -= z->nr_reserved_highatomic;
>   	else
>   		min -= min / 4;
> @@ -2309,22 +2311,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>   #ifdef CONFIG_CMA
>   	/* If allocation can't use CMA areas don't use free CMA pages */
>   	if (!(alloc_flags & ALLOC_CMA))
> -		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> +		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
>   #endif
>
> -	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> +	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
>   		return false;
> -	for (o = 0; o < order; o++) {
> -		/* At the next order, this order's pages become unavailable */
> -		free_pages -= z->free_area[o].nr_free << o;
>
> -		/* Require fewer higher order pages to be free */
> -		min >>= 1;
> +	/* order-0 watermarks are ok */
> +	if (!order)
> +		return true;
> +
> +	/* Check at least one high-order page is free */
> +	for (o = order; o < MAX_ORDER; o++) {
> +		struct free_area *area = &z->free_area[o];
> +		int mt;
> +
> +		if (atomic && area->nr_free)
> +			return true;
>
> -		if (free_pages <= min)
> -			return false;
> +		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +			if (!list_empty(&area->free_list[mt]))
> +				return true;
> +		}

I think we really need something like this here:

#ifdef CONFIG_CMA
if ((alloc_flags & ALLOC_CMA) &&
	!list_empty(&area->free_list[MIGRATE_CMA]))
		return true;
#endif

This is not about CMA and high-order atomic allocations being used at 
the same time. This is about high-order MIGRATE_MOVABLE allocations 
(that set ALLOC_CMA) failing to use MIGRATE_CMA pageblocks, which they 
should be allowed to use. It's complementary to the existing free_pages 
adjustment above.

Maybe there's not many high-order MIGRATE_MOVABLE allocations today, but 
they might increase with the driver migration framework. So why set up 
us a bomb.

>   	}
> -	return true;
> +	return false;
>   }
>
>   bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-25 15:37   ` Vlastimil Babka
@ 2015-08-26 14:45     ` Mel Gorman
  2015-08-26 16:24       ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-26 14:45 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Aug 25, 2015 at 05:37:59PM +0200, Vlastimil Babka wrote:
> On 08/24/2015 02:09 PM, Mel Gorman wrote:
> >__GFP_WAIT has been used to identify atomic context in callers that hold
> >spinlocks or are in interrupts. They are expected to be high priority and
> >have access to one of two watermarks lower than "min" which can be referred
> >to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> >watermark and can be called the "high priority reserve".
> >
> >Over time, callers had a requirement to not block when fallback options
> >were available. Some have abused __GFP_WAIT leading to a situation where
> >an optimistic allocation with a fallback option can access atomic reserves.
> >
> >This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> >cannot sleep and have no alternative. High priority users continue to use
> >__GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> >willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> >that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> >as a caller that is willing to enter direct reclaim and wake kswapd for
> >background reclaim.
> >
> >This patch then converts a number of sites
> >
> >o __GFP_ATOMIC is used by callers that are high priority and have memory
> >   pools for those requests. GFP_ATOMIC uses this flag.
> >
> >o Callers that have a limited mempool to guarantee forward progress use
> >   __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
> 
>      ^ __GFP_KSWAPD_RECLAIM ? (missed it previously)
> 

I updated the changelog to make this clearer.

> >   kswapd will still be woken but atomic reserves are not used as there
> >   is a one-entry mempool to guarantee progress.
> >
> >o Callers that are checking if they are non-blocking should use the
> >   helper gfpflags_allow_blocking() where possible. This is because
> >   checking for __GFP_WAIT as was done historically now can trigger false
> >   positives. Some exceptions like dm-crypt.c exist where the code intent
> >   is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
> >   flag manipulations.
> >
> >o Callers that built their own GFP flags instead of starting with GFP_KERNEL
> >   and friends now also need to specify __GFP_KSWAPD_RECLAIM.
> >
> >The first key hazard to watch out for is callers that removed __GFP_WAIT
> >and were depending on access to atomic reserves for inconspicuous reasons.
> >In some cases it may be appropriate for them to use __GFP_HIGH.
> >
> >The second key hazard is callers that assembled their own combination of
> >GFP flags instead of starting with something like GFP_KERNEL. They may
> >now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
> >if it's missed in most cases as other activity will wake kswapd.
> >
> >Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Thanks for the effort!
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Just last few bits:
> 
> >@@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> >  		return false;
> >  	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> >  		return false;
> >-	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> >+	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> >  		return false;
> >
> >  	return should_fail(&fail_page_alloc.attr, 1 << order);
> 
> IIUC ignore_gfp_wait tells it to assume that reclaimers will eventually
> succeed (for some reason?), so they shouldn't fail. Probably to focus the
> testing on atomic allocations. But your change makes atomic allocation never
> fail, so that goes against the knob IMHO?
> 

Fair point, I'll remove the __GFP_ATOMIC check. I felt this was sensible,
but then again deliberately failing allocations makes my brain twitch a
bit. In retrospect, someone who cared should add an ignore_gfp_atomic knob.
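
Something like the following is what that would mean, purely as a sketch
(ignore_gfp_atomic does not exist today):

	if (fail_page_alloc.ignore_gfp_atomic && (gfp_mask & __GFP_ATOMIC))
		return false;
	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_DIRECT_RECLAIM))
		return false;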

> >@@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> >  		if (test_thread_flag(TIF_MEMDIE) ||
> >  		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> >  			filter &= ~SHOW_MEM_FILTER_NODES;
> >-	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> >+	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
> >  		filter &= ~SHOW_MEM_FILTER_NODES;
> >
> >  	if (fmt) {
> 
> This caught me previously and I convinced myself that it's OK, but now I'm
> not anymore. IIUC this is to not filter nodes by mems_allowed during
> printing, if the allocation itself wasn't limited? In that case it should
> probably only look at __GFP_ATOMIC after this patch? As that's the only
> thing that determines ALLOC_CPUSET.
> I don't know where in_interrupt() comes from, but it was probably considered
> in the past, as can be seen in zlc_setup()?
> 

I assumed the in_interrupt() thing was simply because cpusets were the
primary means of limiting allocations of interest to the author at the
time.

I guess, now that I think about it more, a more sensible check would
be against __GFP_DIRECT_RECLAIM because that covers the interesting
cases.
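
i.e. the check above would become something like this (sketch):

	if (in_interrupt() || !(gfp_mask & __GFP_DIRECT_RECLAIM))
		filter &= ~SHOW_MEM_FILTER_NODES;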

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-08-26 13:42   ` Vlastimil Babka
@ 2015-08-26 14:53     ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-26 14:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Wed, Aug 26, 2015 at 03:42:23PM +0200, Vlastimil Babka wrote:
> >@@ -2309,22 +2311,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  #ifdef CONFIG_CMA
> >  	/* If allocation can't use CMA areas don't use free CMA pages */
> >  	if (!(alloc_flags & ALLOC_CMA))
> >-		free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> >+		free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> >  #endif
> >
> >-	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> >+	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
> >  		return false;
> >-	for (o = 0; o < order; o++) {
> >-		/* At the next order, this order's pages become unavailable */
> >-		free_pages -= z->free_area[o].nr_free << o;
> >
> >-		/* Require fewer higher order pages to be free */
> >-		min >>= 1;
> >+	/* order-0 watermarks are ok */
> >+	if (!order)
> >+		return true;
> >+
> >+	/* Check at least one high-order page is free */
> >+	for (o = order; o < MAX_ORDER; o++) {
> >+		struct free_area *area = &z->free_area[o];
> >+		int mt;
> >+
> >+		if (atomic && area->nr_free)
> >+			return true;
> >
> >-		if (free_pages <= min)
> >-			return false;
> >+		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> >+			if (!list_empty(&area->free_list[mt]))
> >+				return true;
> >+		}
> 
> I think we really need something like this here:
> 
> #ifdef CONFIG_CMA
> if ((alloc_flags & ALLOC_CMA) &&
> 	!list_empty(&area->free_list[MIGRATE_CMA]))
> 		return true;
> #endif
> 
> This is not about CMA and high-order atomic allocations being used at the
> same time. This is about high-order MIGRATE_MOVABLE allocations (that set
> ALLOC_CMA) failing to use MIGRATE_CMA pageblocks, which they should be
> allowed to use. It's complementary to the existing free_pages adjustment
> above.
> 
> Maybe there's not many high-order MIGRATE_MOVABLE allocations today, but
> they might increase with the driver migration framework. So why set up us a
> bomb.
> 

Ok, that seems sensible. Will apply this hunk on top

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a4169be1498..10f25bf18665 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2337,6 +2337,13 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
 			if (!list_empty(&area->free_list[mt]))
 				return true;
 		}
+
+#ifdef CONFIG_CMA
+		if ((alloc_flags & ALLOC_CMA) &&
+		    !list_empty(&area->free_list[MIGRATE_CMA])) {
+			return true;
+		}
+#endif
 	}
 	return false;
 }

^ permalink raw reply related	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-08-24 12:29 ` [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
  2015-08-26 12:44   ` Vlastimil Babka
@ 2015-08-26 14:53   ` Michal Hocko
  2015-08-26 15:38     ` Mel Gorman
  2015-09-08  8:01   ` Joonsoo Kim
  2 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2015-08-26 14:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 24-08-15 13:29:57, Mel Gorman wrote:
> High-order watermark checking exists for two reasons --  kswapd high-order
> awareness and protection for high-order atomic requests. Historically the
> kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
> free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations on demand and
> avoids using those blocks for order-0 allocations. This is more flexible
> and reliable than MIGRATE_RESERVE was.
> 
> A MIGRATE_HIGHATOMIC pageblock is created when a high-order allocation
> request steals a pageblock but limits the total number to 1% of the zone.
> Callers that speculatively abuse atomic allocations for long-lived
> high-order allocations to access the reserve will quickly fail. Note that
> SLUB is currently not such an abuser as it reclaims at least once.  It is
> possible that the pageblock stolen has few suitable high-order pages and
> will need to steal again in the near future but there would need to be
> strong justification to search all pageblocks for an ideal candidate.
> 
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
> 
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
> 
> The reserved pageblocks can not be used for order-0 allocations. This may
> allow temporary wastage until a failed reclaim reassigns the pageblock. This
> is deliberate as the intent of the reservation is to satisfy a limited
> number of atomic high-order short-lived requests if the system requires them.
> 
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 high-order
> page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
> is much larger than the potential reserve and it does not attempt to be
> realistic.  It is intended to stress random high-order allocations from
> an unknown source and show that there is a reduction in failures without
> introducing an anomaly where atomic allocations are more reliable than
> regular allocations.  The amount of memory reserved varied throughout the
> workload as reserves were created and reclaimed under memory pressure. The
> allocation failures once the workload warmed up were as follows;
> 
> 4.2-rc5-vanilla		70%
> 4.2-rc5-atomic-reserve	56%
> 
> The failure rate was also measured while building multiple kernels. The
> failure rate was 14% but is 6% with this patch applied.
> 
> Overall, this is a small reduction but the reserves are small relative to the
> number of allocation requests. In early versions of the patch, the failure
> rate reduced by a much larger amount but that required much larger reserves
> and perversely made atomic allocations seem more reliable than regular allocations.

Have you considered a counter for vmstat/zoneinfo so that we have an overview
about the memory consumed for this reserve?

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5ce050ebe4f..2415f882b89c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
[...]
> @@ -1645,10 +1725,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
>   * Call me with the zone->lock already held.
>   */
>  static struct page *__rmqueue(struct zone *zone, unsigned int order,
> -						int migratetype)
> +				int migratetype, gfp_t gfp_flags)
>  {
>  	struct page *page;
>  
> +	if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
> +		page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> +		if (page)
> +			goto out;

I guess you want to change migratetype to MIGRATE_HIGHATOMIC in the
successful case so the tracepoint reports this properly.

> +	}
> +
>  	page = __rmqueue_smallest(zone, order, migratetype);
>  	if (unlikely(!page)) {
>  		if (migratetype == MIGRATE_MOVABLE)
> @@ -1658,6 +1744,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
>  			page = __rmqueue_fallback(zone, order, migratetype);
>  	}
>  
> +out:
>  	trace_mm_page_alloc_zone_locked(page, order, migratetype);
>  	return page;
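
Something along these lines on top of the quoted hunk, I mean (just a
sketch):

	if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
		page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
		if (page) {
			migratetype = MIGRATE_HIGHATOMIC;
			goto out;
		}
	}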

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-08-26 14:53   ` Michal Hocko
@ 2015-08-26 15:38     ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-26 15:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Wed, Aug 26, 2015 at 04:53:52PM +0200, Michal Hocko wrote:
> > 
> > Overall, this is a small reduction but the reserves are small relative to the
> > number of allocation requests. In early versions of the patch, the failure
> > rate reduced by a much larger amount but that required much larger reserves
> > and perversely made atomic allocations seem more reliable than regular allocations.
> 
> Have you considered a counter for vmstat/zoneinfo so that we have an overview
> about the memory consumed for this reserve?
> 

It should already be available in /proc/pagetypeinfo

> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 
> [...]
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d5ce050ebe4f..2415f882b89c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> [...]
> > @@ -1645,10 +1725,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> >   * Call me with the zone->lock already held.
> >   */
> >  static struct page *__rmqueue(struct zone *zone, unsigned int order,
> > -						int migratetype)
> > +				int migratetype, gfp_t gfp_flags)
> >  {
> >  	struct page *page;
> >  
> > +	if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
> > +		page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > +		if (page)
> > +			goto out;
> 
> I guess you want to change migratetype to MIGRATE_HIGHATOMIC in the
> successful case so the tracepoint reports this properly.
> 

Yes, thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-26 14:45     ` Mel Gorman
@ 2015-08-26 16:24       ` Vlastimil Babka
  2015-08-26 18:10         ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-26 16:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On 08/26/2015 04:45 PM, Mel Gorman wrote:
> On Tue, Aug 25, 2015 at 05:37:59PM +0200, Vlastimil Babka wrote:
>>> @@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>>>   		return false;
>>>   	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>>>   		return false;
>>> -	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
>>> +	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>>>   		return false;
>>>
>>>   	return should_fail(&fail_page_alloc.attr, 1 << order);
>>
>> IIUC ignore_gfp_wait tells it to assume that reclaimers will eventually
>> succeed (for some reason?), so they shouldn't fail. Probably to focus the
>> testing on atomic allocations. But your change makes atomic allocation never
>> fail, so that goes against the knob IMHO?
>>
>
> >Fair point, I'll remove the __GFP_ATOMIC check. I felt this was sensible,
> >but then again deliberately failing allocations makes my brain twitch a
> >bit. In retrospect, someone who cared should add an ignore_gfp_atomic knob.

Thanks.

>>> @@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>>>   		if (test_thread_flag(TIF_MEMDIE) ||
>>>   		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
>>>   			filter &= ~SHOW_MEM_FILTER_NODES;
>>> -	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
>>> +	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
>>>   		filter &= ~SHOW_MEM_FILTER_NODES;
>>>
>>>   	if (fmt) {
>>
>> This caught me previously and I convinced myself that it's OK, but now I'm
>> not anymore. IIUC this is to not filter nodes by mems_allowed during
>> printing, if the allocation itself wasn't limited? In that case it should
>> probably only look at __GFP_ATOMIC after this patch? As that's the only
>> thing that determines ALLOC_CPUSET.
>> I don't know where in_interrupt() comes from, but it was probably considered
>> in the past, as can be seen in zlc_setup()?
>>
>
> I assumed the in_interrupt() thing was simply because cpusets were the
> primary means of limiting allocations of interest to the author at the
> time.

IIUC this hunk is unrelated to the previous one - not about limiting 
allocations, but printing allocation warnings. Which includes the state 
of nodes where the allocation was allowed to try. And 
~SHOW_MEM_FILTER_NODES means it was allowed everywhere, so the printing 
won't filter by mems_allowed.

> >I guess, now that I think about it more, a more sensible check would
> be against __GFP_DIRECT_RECLAIM because that covers the interesting
> cases.

I think the most robust check would be to rely on what was already
prepared by gfp_to_alloc_flags(), instead of repeating it here. So add an
alloc_flags parameter to warn_alloc_failed(), and drop the filter (as
sketched below) when
- ALLOC_CPUSET is not set, as that disables the cpuset checks
- ALLOC_NO_WATERMARKS is set, as that allows calling
   __alloc_pages_high_priority() attempt which ignores cpusets
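
Roughly like this (warn_alloc_failed() does not take alloc_flags today, so
this is only a sketch of the idea):

	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS))
		filter &= ~SHOW_MEM_FILTER_NODES;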


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-26 16:24       ` Vlastimil Babka
@ 2015-08-26 18:10         ` Mel Gorman
  2015-08-27  9:18           ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-08-26 18:10 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Wed, Aug 26, 2015 at 06:24:34PM +0200, Vlastimil Babka wrote:
> On 08/26/2015 04:45 PM, Mel Gorman wrote:
> >On Tue, Aug 25, 2015 at 05:37:59PM +0200, Vlastimil Babka wrote:
> >>>@@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
> >>>  		return false;
> >>>  	if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
> >>>  		return false;
> >>>-	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> >>>+	if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> >>>  		return false;
> >>>
> >>>  	return should_fail(&fail_page_alloc.attr, 1 << order);
> >>
> >>IIUC ignore_gfp_wait tells it to assume that reclaimers will eventually
> >>succeed (for some reason?), so they shouldn't fail. Probably to focus the
> >>testing on atomic allocations. But your change makes atomic allocation never
> >>fail, so that goes against the knob IMHO?
> >>
> >
> >Fair point, I'll remove the __GFP_ATOMIC check. I felt this was sensible,
> >but then again deliberately failing allocations makes my brain twitch a
> >bit. In retrospect, someone who cared should add an ignore_gfp_atomic knob.
> 
> Thanks.
> 
> >>>@@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
> >>>  		if (test_thread_flag(TIF_MEMDIE) ||
> >>>  		    (current->flags & (PF_MEMALLOC | PF_EXITING)))
> >>>  			filter &= ~SHOW_MEM_FILTER_NODES;
> >>>-	if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> >>>+	if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
> >>>  		filter &= ~SHOW_MEM_FILTER_NODES;
> >>>
> >>>  	if (fmt) {
> >>
> >>This caught me previously and I convinced myself that it's OK, but now I'm
> >>not anymore. IIUC this is to not filter nodes by mems_allowed during
> >>printing, if the allocation itself wasn't limited? In that case it should
> >>probably only look at __GFP_ATOMIC after this patch? As that's the only
> >>thing that determines ALLOC_CPUSET.
> >>I don't know where in_interrupt() comes from, but it was probably considered
> >>in the past, as can be seen in zlc_setup()?
> >>
> >
> >I assumed the in_interrupt() thing was simply because cpusets were the
> >primary means of limiting allocations of interest to the author at the
> >time.
> 
> IIUC this hunk is unrelated to the previous one - not about limiting
> allocations, but printing allocation warnings. Which includes the state of
> nodes where the allocation was allowed to try. And ~SHOW_MEM_FILTER_NODES
> means it was allowed everywhere, so the printing won't filter by
> mems_allowed.
> 
> >I guess, now that I think about it more, a more sensible check would
> >be against __GFP_DIRECT_RECLAIM because that covers the interesting
> >cases.
> 
> I think the most robust check would be to rely on what was already prepared
> by gfp_to_alloc_flags(), instead of repeating it here. So add alloc_flags
> parameter to warn_alloc_failed(), and drop the filter when
> - ALLOC_CPUSET is not set, as that disables the cpuset checks
> - ALLOC_NO_WATERMARKS is set, as that allows calling
>   __alloc_pages_high_priority() attempt which ignores cpusets
> 

warn_alloc_failed is used outside of page_alloc.c in a context that does
not have alloc_flags. It could be extended to take an extra parameter
that is ALLOC_CPUSET for the other callers or else split it into
__warn_alloc_failed (takes alloc_flags parameter) and warn_alloc_failed
(calls __warn_alloc_failed with ALLOC_CPUSET) but is it really worth it?
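
For illustration, the split would amount to something like this (a sketch
only; ALLOC_CPUSET would also have to become visible outside mm/internal.h):

void __warn_alloc_failed(gfp_t gfp_mask, int order, int alloc_flags,
			const char *fmt, ...);

#define warn_alloc_failed(gfp_mask, order, fmt, ...) \
	__warn_alloc_failed(gfp_mask, order, ALLOC_CPUSET, fmt, ##__VA_ARGS__)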

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-26 18:10         ` Mel Gorman
@ 2015-08-27  9:18           ` Vlastimil Babka
  0 siblings, 0 replies; 55+ messages in thread
From: Vlastimil Babka @ 2015-08-27  9:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, David Rientjes,
	Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On 08/26/2015 08:10 PM, Mel Gorman wrote:
>>
>> I think the most robust check would be to rely on what was already prepared
>> by gfp_to_alloc_flags(), instead of repeating it here. So add alloc_flags
>> parameter to warn_alloc_failed(), and drop the filter when
>> - ALLOC_CPUSET is not set, as that disables the cpuset checks
>> - ALLOC_NO_WATERMARKS is set, as that allows calling
>>    __alloc_pages_high_priority() attempt which ignores cpusets
>>
>
> warn_alloc_failed is used outside of page_alloc.c in a context that does
> not have alloc_flags. It could be extended to take an extra parameter
> that is ALLOC_CPUSET for the other callers or else split it into
> __warn_alloc_failed (takes alloc_flags parameter) and warn_alloc_failed
> (calls __warn_alloc_failed with ALLOC_CPUSET) but is it really worth it?

Probably not. Testing lack of __GFP_DIRECT_RECLAIM is good enough until 
somebody cares more.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-08-24 12:30 ` [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  2015-08-26 13:42   ` Vlastimil Babka
@ 2015-08-28 12:10   ` Michal Hocko
  2015-08-28 14:12     ` Mel Gorman
  2015-09-08  8:26   ` Joonsoo Kim
  2 siblings, 1 reply; 55+ messages in thread
From: Michal Hocko @ 2015-08-28 12:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Mon 24-08-15 13:30:15, Mel Gorman wrote:
> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
> 
> High-order watermarks serve a different purpose. Kswapd had no high-order
> awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).

lkml.org sucks. Could you please replace it with something else, e.g.
https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au?

> This was particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
> 
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts up to the requested order must
> be examined.
> 
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.
> 
> With the patch applied, there was little difference in the allocation
> failure rates as the atomic reserves are small relative to the number of
> allocation attempts. The expected impact is that there will never be an
> allocation failure report that shows suitable pages on the free lists.
> 
> The one potential side-effect of this is that in a vanilla kernel, the
> watermark checks may have kept a free page for an atomic allocation. Now,
> we are 100% relying on the HighAtomic reserves and an early allocation to
> have allocated them.  If the first high-order atomic allocation is after
> the system is already heavily fragmented then it'll fail.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Acked-by: Michal Hocko <mhocko@suse.com>

[...]
> @@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  {
>  	long min = mark;
>  	int o;
> -	long free_cma = 0;
> +	const bool atomic = (alloc_flags & ALLOC_HARDER);

I just find the naming a bit confusing. ALLOC_HARDER != __GFP_ATOMIC. RT tasks
might get access to this reserve as well.

[...]
> +	/* Check at least one high-order page is free */
> +	for (o = order; o < MAX_ORDER; o++) {
> +		struct free_area *area = &z->free_area[o];
> +		int mt;
> +
> +		if (atomic && area->nr_free)
> +			return true;

Didn't you want
		if (atomic) {
			if (area->nr_free)
				return true;
			continue;
		}

>  
> -		if (free_pages <= min)
> -			return false;
> +		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +			if (!list_empty(&area->free_list[mt]))
> +				return true;
> +		}
>  	}
> -	return true;
> +	return false;
>  }
>  
>  bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> -- 
> 2.4.6

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-08-28 12:10   ` Michal Hocko
@ 2015-08-28 14:12     ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-08-28 14:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Linux-MM, LKML

On Fri, Aug 28, 2015 at 02:10:51PM +0200, Michal Hocko wrote:
> On Mon 24-08-15 13:30:15, Mel Gorman wrote:
> > The primary purpose of watermarks is to ensure that reclaim can always
> > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> > These assume that order-0 allocations are all that is necessary for
> > forward progress.
> > 
> > High-order watermarks serve a different purpose. Kswapd had no high-order
> > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> 
> lkml.org sucks. Could you plase replace it by something else e.g.
> https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au?
> 

Done.

> > This was particularly important when there were high-order atomic requests.
> > The watermarks both gave kswapd awareness and made a reserve for those
> > atomic requests.
> > 
> > There are two important side-effects of this. The most important is that
> > a non-atomic high-order request can fail even though free pages are available
> > and the order-0 watermarks are ok. The second is that high-order watermark
> > checks are expensive as the free list counts up to the requested order must
> > be examined.
> > 
> > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> > have high-order watermarks. Kswapd and compaction still need high-order
> > awareness which is handled by checking that at least one suitable high-order
> > page is free.
> > 
> > With the patch applied, there was little difference in the allocation
> > failure rates as the atomic reserves are small relative to the number of
> > allocation attempts. The expected impact is that there will never be an
> > allocation failure report that shows suitable pages on the free lists.
> > 
> > The one potential side-effect of this is that in a vanilla kernel, the
> > watermark checks may have kept a free page for an atomic allocation. Now,
> > we are 100% relying on the HighAtomic reserves and an early allocation to
> > have allocated them.  If the first high-order atomic allocation is after
> > the system is already heavily fragmented then it'll fail.
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> 

Thanks.

> [...]
> > @@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  {
> >  	long min = mark;
> >  	int o;
> > -	long free_cma = 0;
> > +	const bool atomic = (alloc_flags & ALLOC_HARDER);
> 
> I just find the naming a bit confusing. ALLOC_HARDER != __GFP_ATOMIC. RT tasks
> might get access to this reserve as well.
> 

I'll just call it alloc_harder then.

> [...]
> > +	/* Check at least one high-order page is free */
> > +	for (o = order; o < MAX_ORDER; o++) {
> > +		struct free_area *area = &z->free_area[o];
> > +		int mt;
> > +
> > +		if (atomic && area->nr_free)
> > +			return true;
> 
> Didn't you want
> 		if (atomic) {
> 			if (area->nr_free)
> 				return true;
> 			continue;
> 		}
> 

That is slightly more efficient so yes, I'll use it. Thanks.
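
With the alloc_harder rename and that restructure folded in (plus the CMA
check from Vlastimil's review), the loop ends up roughly as follows (sketch):

	/* Check at least one high-order page is free */
	for (o = order; o < MAX_ORDER; o++) {
		struct free_area *area = &z->free_area[o];
		int mt;

		if (alloc_harder) {
			if (area->nr_free)
				return true;
			continue;
		}

		for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
			if (!list_empty(&area->free_list[mt]))
				return true;
		}

#ifdef CONFIG_CMA
		if ((alloc_flags & ALLOC_CMA) &&
		    !list_empty(&area->free_list[MIGRATE_CMA]))
			return true;
#endif
	}
	return false;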

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
                     ` (3 preceding siblings ...)
  2015-08-26 13:05   ` Michal Hocko
@ 2015-09-08  6:49   ` Joonsoo Kim
  2015-09-09 12:22     ` Mel Gorman
  4 siblings, 1 reply; 55+ messages in thread
From: Joonsoo Kim @ 2015-09-08  6:49 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

2015-08-24 21:09 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> __GFP_WAIT has been used to identify atomic context in callers that hold
> spinlocks or are in interrupts. They are expected to be high priority and
> have access to one of two watermarks lower than "min" which can be referred
> to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> watermark and can be called the "high priority reserve".
>
> Over time, callers had a requirement to not block when fallback options
> were available. Some have abused __GFP_WAIT leading to a situation where
> an optimistic allocation with a fallback option can access atomic reserves.
>
> This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> cannot sleep and have no alternative. High priority users continue to use
> __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> as a caller that is willing to enter direct reclaim and wake kswapd for
> background reclaim.

Hello, Mel.

I think it is better to do one thing per patch.
To distinguish the truly atomic callers, we just need to introduce __GFP_ATOMIC,
make GFP_ATOMIC equal to __GFP_ATOMIC | GFP_HARDER and change the related
things. The __GFP_WAIT changes aren't needed at all for this purpose. That
would reduce the patch size and provide better bisectability.

And I don't think that introducing __GFP_KSWAPD_RECLAIM is a good thing.
Basically, kswapd reclaim should be enforced. The new flag makes life more
difficult for users who manually manipulate gfp flags. Without this change,
your second hazard would disappear, although it is almost harmless.

And I doubt that this big one-shot change is preferable. AFAIK, even if the
changes are a one-to-one mapping with no functional difference, each one is
normally made as its own patch and sent to the correct maintainer. I guess
there is some difficulty in doing that for this patch, but it could be done,
couldn't it?

Some nitpicks are below.

>
> This patch then converts a number of sites
>
> o __GFP_ATOMIC is used by callers that are high priority and have memory
>   pools for those requests. GFP_ATOMIC uses this flag.
>
> o Callers that have a limited mempool to guarantee forward progress use
>   __GFP_DIRECT_RECLAIM. bio allocations fall into this category where
>   kswapd will still be woken but atomic reserves are not used as there
>   is a one-entry mempool to guarantee progress.
>
> o Callers that are checking if they are non-blocking should use the
>   helper gfpflags_allow_blocking() where possible. This is because
>   checking for __GFP_WAIT as was done historically now can trigger false
>   positives. Some exceptions like dm-crypt.c exist where the code intent
>   is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
>   flag manipulations.
>
> o Callers that built their own GFP flags instead of starting with GFP_KERNEL
>   and friends now also need to specify __GFP_KSWAPD_RECLAIM.
>
> The first key hazard to watch out for is callers that removed __GFP_WAIT
> and were depending on access to atomic reserves for inconspicuous reasons.
> In some cases it may be appropriate for them to use __GFP_HIGH.
>
> The second key hazard is callers that assembled their own combination of
> GFP flags instead of starting with something like GFP_KERNEL. They may
> now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
> if it's missed in most cases as other activity will wake kswapd.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  Documentation/vm/balance                           | 14 ++++---
>  arch/arm/mm/dma-mapping.c                          |  4 +-
>  arch/arm/xen/mm.c                                  |  2 +-
>  arch/arm64/mm/dma-mapping.c                        |  4 +-
>  arch/x86/kernel/pci-dma.c                          |  2 +-
>  block/bio.c                                        | 26 ++++++------
>  block/blk-core.c                                   | 16 ++++----
>  block/blk-ioc.c                                    |  2 +-
>  block/blk-mq-tag.c                                 |  2 +-
>  block/blk-mq.c                                     |  8 ++--
>  block/cfq-iosched.c                                |  4 +-
>  drivers/block/drbd/drbd_receiver.c                 |  3 +-
>  drivers/block/osdblk.c                             |  2 +-
>  drivers/connector/connector.c                      |  3 +-
>  drivers/firewire/core-cdev.c                       |  2 +-
>  drivers/gpu/drm/i915/i915_gem.c                    |  2 +-
>  drivers/infiniband/core/sa_query.c                 |  2 +-
>  drivers/iommu/amd_iommu.c                          |  2 +-
>  drivers/iommu/intel-iommu.c                        |  2 +-
>  drivers/md/dm-crypt.c                              |  6 +--
>  drivers/md/dm-kcopyd.c                             |  2 +-
>  drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c     |  2 +-
>  drivers/media/pci/solo6x10/solo6x10-v4l2.c         |  2 +-
>  drivers/media/pci/tw68/tw68-video.c                |  2 +-
>  drivers/mtd/mtdcore.c                              |  3 +-
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c    |  2 +-
>  drivers/staging/android/ion/ion_system_heap.c      |  2 +-
>  .../lustre/include/linux/libcfs/libcfs_private.h   |  2 +-
>  drivers/usb/host/u132-hcd.c                        |  2 +-
>  drivers/video/fbdev/vermilion/vermilion.c          |  2 +-
>  fs/btrfs/disk-io.c                                 |  2 +-
>  fs/btrfs/extent_io.c                               | 14 +++----
>  fs/btrfs/volumes.c                                 |  4 +-
>  fs/ext3/super.c                                    |  2 +-
>  fs/ext4/super.c                                    |  2 +-
>  fs/fscache/cookie.c                                |  2 +-
>  fs/fscache/page.c                                  |  6 +--
>  fs/jbd/transaction.c                               |  4 +-
>  fs/jbd2/transaction.c                              |  4 +-
>  fs/nfs/file.c                                      |  6 +--
>  fs/xfs/xfs_qm.c                                    |  2 +-
>  include/linux/gfp.h                                | 46 ++++++++++++++++------
>  include/linux/skbuff.h                             |  6 +--
>  include/net/sock.h                                 |  2 +-
>  include/trace/events/gfpflags.h                    |  5 ++-
>  kernel/audit.c                                     |  6 +--
>  kernel/locking/lockdep.c                           |  2 +-
>  kernel/power/snapshot.c                            |  2 +-
>  kernel/smp.c                                       |  2 +-
>  lib/idr.c                                          |  4 +-
>  lib/radix-tree.c                                   | 10 ++---
>  mm/backing-dev.c                                   |  2 +-
>  mm/dmapool.c                                       |  2 +-
>  mm/memcontrol.c                                    |  8 ++--
>  mm/mempool.c                                       | 10 ++---
>  mm/migrate.c                                       |  2 +-
>  mm/page_alloc.c                                    | 43 ++++++++++++--------
>  mm/slab.c                                          | 18 ++++-----
>  mm/slub.c                                          |  6 +--
>  mm/vmalloc.c                                       |  2 +-
>  mm/vmscan.c                                        |  4 +-
>  mm/zswap.c                                         |  5 ++-
>  net/core/skbuff.c                                  |  8 ++--
>  net/core/sock.c                                    |  6 ++-
>  net/netlink/af_netlink.c                           |  2 +-
>  net/rxrpc/ar-connection.c                          |  2 +-
>  net/sctp/associola.c                               |  2 +-
>  67 files changed, 211 insertions(+), 173 deletions(-)
>
> diff --git a/Documentation/vm/balance b/Documentation/vm/balance
> index c46e68cf9344..964595481af6 100644
> --- a/Documentation/vm/balance
> +++ b/Documentation/vm/balance
> @@ -1,12 +1,14 @@
>  Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>
>
> -Memory balancing is needed for non __GFP_WAIT as well as for non
> -__GFP_IO allocations.
> +Memory balancing is needed for !__GFP_ATOMIC and !__GFP_KSWAPD_RECLAIM as
> +well as for non __GFP_IO allocations.
>
> -There are two reasons to be requesting non __GFP_WAIT allocations:
> -the caller can not sleep (typically intr context), or does not want
> -to incur cost overheads of page stealing and possible swap io for
> -whatever reasons.
> +The first reason why a caller may avoid reclaim is that the caller can not
> +sleep due to holding a spinlock or is in interrupt context. The second may
> +be that the caller is willing to fail the allocation without incurring the
> +overhead of page reclaim. This may happen for opportunistic high-order
> +allocation requests that have order-0 fallback options. In such cases,
> +the caller may also wish to avoid waking kswapd.
>
>  __GFP_IO allocation requests are made to prevent file system deadlocks.
>
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index cba12f34ff77..f999f0987a3e 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -650,7 +650,7 @@ static void *__dma_alloc(struct device *dev, size_t size, dma_addr_t *handle,
>
>         if (is_coherent || nommu())
>                 addr = __alloc_simple_buffer(dev, size, gfp, &page);
> -       else if (!(gfp & __GFP_WAIT))
> +       else if (!gfpflags_allow_blocking(gfp))
>                 addr = __alloc_from_pool(size, &page);
>         else if (!dev_get_cma_area(dev))
>                 addr = __alloc_remap_buffer(dev, size, gfp, prot, &page, caller, want_vaddr);
> @@ -1369,7 +1369,7 @@ static void *arm_iommu_alloc_attrs(struct device *dev, size_t size,
>         *handle = DMA_ERROR_CODE;
>         size = PAGE_ALIGN(size);
>
> -       if (!(gfp & __GFP_WAIT))
> +       if (!gfpflags_allow_blocking(gfp))
>                 return __iommu_alloc_atomic(dev, size, handle);
>
>         /*
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 03e75fef15b8..86809bd2026d 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -25,7 +25,7 @@
>  unsigned long xen_get_swiotlb_free_pages(unsigned int order)
>  {
>         struct memblock_region *reg;
> -       gfp_t flags = __GFP_NOWARN;
> +       gfp_t flags = __GFP_NOWARN|___GFP_KSWAPD_RECLAIM;

Please use __XXX rather than ___XXX.

>         for_each_memblock(memory, reg) {
>                 if (reg->base < (phys_addr_t)0xffffffff) {
> diff --git a/arch/arm64/mm/dma-mapping.c b/arch/arm64/mm/dma-mapping.c
> index d16a1cead23f..1f10b2503af8 100644
> --- a/arch/arm64/mm/dma-mapping.c
> +++ b/arch/arm64/mm/dma-mapping.c
> @@ -100,7 +100,7 @@ static void *__dma_alloc_coherent(struct device *dev, size_t size,
>         if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>             dev->coherent_dma_mask <= DMA_BIT_MASK(32))
>                 flags |= GFP_DMA;
> -       if (IS_ENABLED(CONFIG_DMA_CMA) && (flags & __GFP_WAIT)) {
> +       if (IS_ENABLED(CONFIG_DMA_CMA) && gfpflags_allow_blocking(flags)) {
>                 struct page *page;
>                 void *addr;
>
> @@ -147,7 +147,7 @@ static void *__dma_alloc(struct device *dev, size_t size,
>
>         size = PAGE_ALIGN(size);
>
> -       if (!coherent && !(flags & __GFP_WAIT)) {
> +       if (!coherent && !gfpflags_allow_blocking(flags)) {
>                 struct page *page = NULL;
>                 void *addr = __alloc_from_pool(size, &page, flags);
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 353972c1946c..0310e73e6b57 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -101,7 +101,7 @@ void *dma_generic_alloc_coherent(struct device *dev, size_t size,
>  again:
>         page = NULL;
>         /* CMA can be used only in the context which permits sleeping */
> -       if (flag & __GFP_WAIT) {
> +       if (gfpflags_allow_blocking(flag)) {
>                 page = dma_alloc_from_contiguous(dev, count, get_order(size));
>                 if (page && page_to_phys(page) + size > dma_mask) {
>                         dma_release_from_contiguous(dev, page, count);
> diff --git a/block/bio.c b/block/bio.c
> index d6e5ba3399f0..fbc558b50e67 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -211,7 +211,7 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
>                 bvl = mempool_alloc(pool, gfp_mask);
>         } else {
>                 struct biovec_slab *bvs = bvec_slabs + *idx;
> -               gfp_t __gfp_mask = gfp_mask & ~(__GFP_WAIT | __GFP_IO);
> +               gfp_t __gfp_mask = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_IO);
>
>                 /*
>                  * Make this allocation restricted and don't dump info on
> @@ -221,11 +221,11 @@ struct bio_vec *bvec_alloc(gfp_t gfp_mask, int nr, unsigned long *idx,
>                 __gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
>
>                 /*
> -                * Try a slab allocation. If this fails and __GFP_WAIT
> +                * Try a slab allocation. If this fails and __GFP_DIRECT_RECLAIM
>                  * is set, retry with the 1-entry mempool
>                  */
>                 bvl = kmem_cache_alloc(bvs->slab, __gfp_mask);
> -               if (unlikely(!bvl && (gfp_mask & __GFP_WAIT))) {
> +               if (unlikely(!bvl && (gfp_mask & __GFP_DIRECT_RECLAIM))) {
>                         *idx = BIOVEC_MAX_IDX;
>                         goto fallback;
>                 }
> @@ -393,12 +393,12 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
>   *   If @bs is NULL, uses kmalloc() to allocate the bio; else the allocation is
>   *   backed by the @bs's mempool.
>   *
> - *   When @bs is not NULL, if %__GFP_WAIT is set then bio_alloc will always be
> - *   able to allocate a bio. This is due to the mempool guarantees. To make this
> - *   work, callers must never allocate more than 1 bio at a time from this pool.
> - *   Callers that need to allocate more than 1 bio must always submit the
> - *   previously allocated bio for IO before attempting to allocate a new one.
> - *   Failure to do so can cause deadlocks under memory pressure.
> + *   When @bs is not NULL, if %__GFP_DIRECT_RECLAIM is set then bio_alloc will
> + *   always be able to allocate a bio. This is due to the mempool guarantees.
> + *   To make this work, callers must never allocate more than 1 bio at a time
> + *   from this pool. Callers that need to allocate more than 1 bio must always
> + *   submit the previously allocated bio for IO before attempting to allocate
> + *   a new one. Failure to do so can cause deadlocks under memory pressure.
>   *
>   *   Note that when running under generic_make_request() (i.e. any block
>   *   driver), bios are not submitted until after you return - see the code in
> @@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
>                  * We solve this, and guarantee forward progress, with a rescuer
>                  * workqueue per bio_set. If we go to allocate and there are
>                  * bios on current->bio_list, we first try the allocation
> -                * without __GFP_WAIT; if that fails, we punt those bios we
> -                * would be blocking to the rescuer workqueue before we retry
> -                * with the original gfp_flags.
> +                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> +                * bios we would be blocking to the rescuer workqueue before
> +                * we retry with the original gfp_flags.
>                  */
>
>                 if (current->bio_list && !bio_list_empty(current->bio_list))
> -                       gfp_mask &= ~__GFP_WAIT;
> +                       gfp_mask &= ~__GFP_DIRECT_RECLAIM;

How about introducing a helper function to mask out __GFP_DIRECT_RECLAIM?
It could be used in many places.
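Something along these lines, perhaps (name and placement are only a
suggestion, untested):

static inline gfp_t gfp_drop_direct_reclaim(gfp_t gfp_mask)
{
	/* Clear direct reclaim; the kswapd wakeup bit is left intact */
	return gfp_mask & ~__GFP_DIRECT_RECLAIM;
}

Then the hunk above becomes
"gfp_mask = gfp_drop_direct_reclaim(gfp_mask);" and the other
open-coded clearings in this patch could be converted the same way.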

>                 p = mempool_alloc(bs->bio_pool, gfp_mask);
>                 if (!p && gfp_mask != saved_gfp) {
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 627ed0c593fb..e3605acaaffc 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
>   * @bio: bio to allocate request for (can be %NULL)
>   * @gfp_mask: allocation mask
>   *
> - * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
> - * function keeps retrying under memory pressure and fails iff @q is dead.
> + * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> + * this function keeps retrying under memory pressure and fails iff @q is dead.
>   *
>   * Must be called with @q->queue_lock held and,
>   * Returns ERR_PTR on failure, with @q->queue_lock held.
> @@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
>         if (!IS_ERR(rq))
>                 return rq;
>
> -       if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> +       if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
>                 blk_put_rl(rl);
>                 return rq;
>         }
> @@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
>   * BUG.
>   *
>   * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> - * completion of a bio that hasn't been submitted yet, thus resulting in a
> - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> - * of bio_alloc(), as that avoids the mempool deadlock.
> + * given to how you allocate bios. In particular, you cannot use
> + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> + * thus resulting in a deadlock. Alternatively bios should be allocated using
> + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
>   * If possible a big IO should be split into smaller parts when allocation
>   * fails. Partial allocation should not be an error, or you risk a live-lock.
>   */
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 1a27f45ec776..381cb50a673c 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
>  {
>         struct io_context *ioc;
>
> -       might_sleep_if(gfp_flags & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(gfp_flags));
>
>         do {
>                 task_lock(task);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 9b6e28830b82..a8b46659ce4e 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
>         if (tag != -1)
>                 return tag;
>
> -       if (!(data->gfp & __GFP_WAIT))
> +       if (!gfpflags_allow_blocking(data->gfp))
>                 return -1;
>
>         bs = bt_wait_ptr(bt, hctx);
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 7d842db59699..7d80379d7a38 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
>                 if (percpu_ref_tryget_live(&q->mq_usage_counter))
>                         return 0;
>
> -               if (!(gfp & __GFP_WAIT))
> +               if (!gfpflags_allow_blocking(gfp))
>                         return -EBUSY;
>
>                 ret = wait_event_interruptible(q->mq_freeze_wq,
> @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
>
>         ctx = blk_mq_get_ctx(q);
>         hctx = q->mq_ops->map_queue(q, ctx->cpu);
> -       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> +       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
>                         reserved, ctx, hctx);
>
>         rq = __blk_mq_alloc_request(&alloc_data, rw);
> -       if (!rq && (gfp & __GFP_WAIT)) {
> +       if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
>                 __blk_mq_run_hw_queue(hctx);
>                 blk_mq_put_ctx(ctx);

Is there any reason not to use gfpflags_allow_blocking() here?
There are some places that do not use this helper and the reason
isn't stated.
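e.g. on top of this patch (untested):

-	if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
+	if (!rq && gfpflags_allow_blocking(gfp)) {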

Thanks.

> @@ -1221,7 +1221,7 @@ static struct request *blk_mq_map_request(struct request_queue *q,
>                 ctx = blk_mq_get_ctx(q);
>                 hctx = q->mq_ops->map_queue(q, ctx->cpu);
>                 blk_mq_set_alloc_data(&alloc_data, q,
> -                               __GFP_WAIT|GFP_ATOMIC, false, ctx, hctx);
> +                               __GFP_WAIT|__GFP_HIGH, false, ctx, hctx);
>                 rq = __blk_mq_alloc_request(&alloc_data, rw);
>                 ctx = alloc_data.ctx;
>                 hctx = alloc_data.hctx;
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index c62bb2e650b8..ecd1d1b61382 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -3674,7 +3674,7 @@ cfq_find_alloc_queue(struct cfq_data *cfqd, bool is_sync, struct cfq_io_cq *cic,
>                 if (new_cfqq) {
>                         cfqq = new_cfqq;
>                         new_cfqq = NULL;
> -               } else if (gfp_mask & __GFP_WAIT) {
> +               } else if (gfpflags_allow_blocking(gfp_mask)) {
>                         rcu_read_unlock();
>                         spin_unlock_irq(cfqd->queue->queue_lock);
>                         new_cfqq = kmem_cache_alloc_node(cfq_pool,
> @@ -4289,7 +4289,7 @@ cfq_set_request(struct request_queue *q, struct request *rq, struct bio *bio,
>         const bool is_sync = rq_is_sync(rq);
>         struct cfq_queue *cfqq;
>
> -       might_sleep_if(gfp_mask & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>
>         spin_lock_irq(q->queue_lock);
>
> diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
> index c097909c589c..b4b5680ac6ad 100644
> --- a/drivers/block/drbd/drbd_receiver.c
> +++ b/drivers/block/drbd/drbd_receiver.c
> @@ -357,7 +357,8 @@ drbd_alloc_peer_req(struct drbd_peer_device *peer_device, u64 id, sector_t secto
>         }
>
>         if (has_payload && data_size) {
> -               page = drbd_alloc_pages(peer_device, nr_pages, (gfp_mask & __GFP_WAIT));
> +               page = drbd_alloc_pages(peer_device, nr_pages,
> +                                       gfpflags_allow_blocking(gfp_mask));
>                 if (!page)
>                         goto fail;
>         }
> diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
> index e22942596207..1b709a4e3b5e 100644
> --- a/drivers/block/osdblk.c
> +++ b/drivers/block/osdblk.c
> @@ -271,7 +271,7 @@ static struct bio *bio_chain_clone(struct bio *old_chain, gfp_t gfpmask)
>                         goto err_out;
>
>                 tmp->bi_bdev = NULL;
> -               gfpmask &= ~__GFP_WAIT;
> +               gfpmask &= ~__GFP_DIRECT_RECLAIM;
>                 tmp->bi_next = NULL;
>
>                 if (!new_chain)
> diff --git a/drivers/connector/connector.c b/drivers/connector/connector.c
> index 30f522848c73..d7373ca69c99 100644
> --- a/drivers/connector/connector.c
> +++ b/drivers/connector/connector.c
> @@ -124,7 +124,8 @@ int cn_netlink_send_mult(struct cn_msg *msg, u16 len, u32 portid, u32 __group,
>         if (group)
>                 return netlink_broadcast(dev->nls, skb, portid, group,
>                                          gfp_mask);
> -       return netlink_unicast(dev->nls, skb, portid, !(gfp_mask&__GFP_WAIT));
> +       return netlink_unicast(dev->nls, skb, portid,
> +                       !gfpflags_allow_blocking(gfp_mask));
>  }
>  EXPORT_SYMBOL_GPL(cn_netlink_send_mult);
>
> diff --git a/drivers/firewire/core-cdev.c b/drivers/firewire/core-cdev.c
> index 2a3973a7c441..36a7c2d89a01 100644
> --- a/drivers/firewire/core-cdev.c
> +++ b/drivers/firewire/core-cdev.c
> @@ -486,7 +486,7 @@ static int ioctl_get_info(struct client *client, union ioctl_arg *arg)
>  static int add_client_resource(struct client *client,
>                                struct client_resource *resource, gfp_t gfp_mask)
>  {
> -       bool preload = !!(gfp_mask & __GFP_WAIT);
> +       bool preload = gfpflags_allow_blocking(gfp_mask);
>         unsigned long flags;
>         int ret;
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 52b446b27b4d..c2b45081c5ab 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -2225,7 +2225,7 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
>          */
>         mapping = file_inode(obj->base.filp)->i_mapping;
>         gfp = mapping_gfp_mask(mapping);
> -       gfp |= __GFP_NORETRY | __GFP_NOWARN | __GFP_NO_KSWAPD;
> +       gfp |= __GFP_NORETRY | __GFP_NOWARN;
>         gfp &= ~(__GFP_IO | __GFP_WAIT);
>         sg = st->sgl;
>         st->nents = 0;
> diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
> index ca919f429666..7474d79ffac0 100644
> --- a/drivers/infiniband/core/sa_query.c
> +++ b/drivers/infiniband/core/sa_query.c
> @@ -619,7 +619,7 @@ static void init_mad(struct ib_sa_mad *mad, struct ib_mad_agent *agent)
>
>  static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
>  {
> -       bool preload = !!(gfp_mask & __GFP_WAIT);
> +       bool preload = gfpflags_allow_blocking(gfp_mask);
>         unsigned long flags;
>         int ret, id;
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 658ee39e6569..95d4c70dc7b1 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -2755,7 +2755,7 @@ static void *alloc_coherent(struct device *dev, size_t size,
>
>         page = alloc_pages(flag | __GFP_NOWARN,  get_order(size));
>         if (!page) {
> -               if (!(flag & __GFP_WAIT))
> +               if (!gfpflags_allow_blocking(flag))
>                         return NULL;
>
>                 page = dma_alloc_from_contiguous(dev, size >> PAGE_SHIFT,
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 0649b94f5958..f77becf3d8d8 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3566,7 +3566,7 @@ static void *intel_alloc_coherent(struct device *dev, size_t size,
>                         flags |= GFP_DMA32;
>         }
>
> -       if (flags & __GFP_WAIT) {
> +       if (gfpflags_allow_blocking(flags)) {
>                 unsigned int count = size >> PAGE_SHIFT;
>
>                 page = dma_alloc_from_contiguous(dev, count, order);
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 0f48fed44a17..6dda08385309 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -993,7 +993,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>         struct bio_vec *bvec;
>
>  retry:
> -       if (unlikely(gfp_mask & __GFP_WAIT))
> +       if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>                 mutex_lock(&cc->bio_alloc_lock);
>
>         clone = bio_alloc_bioset(GFP_NOIO, nr_iovecs, cc->bs);
> @@ -1009,7 +1009,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>                 if (!page) {
>                         crypt_free_buffer_pages(cc, clone);
>                         bio_put(clone);
> -                       gfp_mask |= __GFP_WAIT;
> +                       gfp_mask |= __GFP_DIRECT_RECLAIM;
>                         goto retry;
>                 }
>
> @@ -1026,7 +1026,7 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned size)
>         }
>
>  return_clone:
> -       if (unlikely(gfp_mask & __GFP_WAIT))
> +       if (unlikely(gfp_mask & __GFP_DIRECT_RECLAIM))
>                 mutex_unlock(&cc->bio_alloc_lock);
>
>         return clone;
> diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
> index 3a7cade5e27d..1452ed9aacb4 100644
> --- a/drivers/md/dm-kcopyd.c
> +++ b/drivers/md/dm-kcopyd.c
> @@ -244,7 +244,7 @@ static int kcopyd_get_pages(struct dm_kcopyd_client *kc,
>         *pages = NULL;
>
>         do {
> -               pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY);
> +               pl = alloc_pl(__GFP_NOWARN | __GFP_NORETRY | __GFP_KSWAPD_RECLAIM);
>                 if (unlikely(!pl)) {
>                         /* Use reserved pages */
>                         pl = kc->pages;
> diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> index 53fff5425c13..fb2cb4bdc0c1 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-v4l2-enc.c
> @@ -1291,7 +1291,7 @@ static struct solo_enc_dev *solo_enc_alloc(struct solo_dev *solo_dev,
>         solo_enc->vidq.ops = &solo_enc_video_qops;
>         solo_enc->vidq.mem_ops = &vb2_dma_sg_memops;
>         solo_enc->vidq.drv_priv = solo_enc;
> -       solo_enc->vidq.gfp_flags = __GFP_DMA32;
> +       solo_enc->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>         solo_enc->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
>         solo_enc->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
>         solo_enc->vidq.lock = &solo_enc->lock;
> diff --git a/drivers/media/pci/solo6x10/solo6x10-v4l2.c b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> index 63ae8a61f603..bde77b22340c 100644
> --- a/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> +++ b/drivers/media/pci/solo6x10/solo6x10-v4l2.c
> @@ -675,7 +675,7 @@ int solo_v4l2_init(struct solo_dev *solo_dev, unsigned nr)
>         solo_dev->vidq.mem_ops = &vb2_dma_contig_memops;
>         solo_dev->vidq.drv_priv = solo_dev;
>         solo_dev->vidq.timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
> -       solo_dev->vidq.gfp_flags = __GFP_DMA32;
> +       solo_dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>         solo_dev->vidq.buf_struct_size = sizeof(struct solo_vb2_buf);
>         solo_dev->vidq.lock = &solo_dev->lock;
>         ret = vb2_queue_init(&solo_dev->vidq);
> diff --git a/drivers/media/pci/tw68/tw68-video.c b/drivers/media/pci/tw68/tw68-video.c
> index 8355e55b4e8e..e556f989aaab 100644
> --- a/drivers/media/pci/tw68/tw68-video.c
> +++ b/drivers/media/pci/tw68/tw68-video.c
> @@ -975,7 +975,7 @@ int tw68_video_init2(struct tw68_dev *dev, int video_nr)
>         dev->vidq.ops = &tw68_video_qops;
>         dev->vidq.mem_ops = &vb2_dma_sg_memops;
>         dev->vidq.drv_priv = dev;
> -       dev->vidq.gfp_flags = __GFP_DMA32;
> +       dev->vidq.gfp_flags = __GFP_DMA32 | __GFP_KSWAPD_RECLAIM;
>         dev->vidq.buf_struct_size = sizeof(struct tw68_buf);
>         dev->vidq.lock = &dev->lock;
>         dev->vidq.min_buffers_needed = 2;
> diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
> index 8bbbb751bf45..2dfb291a47c6 100644
> --- a/drivers/mtd/mtdcore.c
> +++ b/drivers/mtd/mtdcore.c
> @@ -1188,8 +1188,7 @@ EXPORT_SYMBOL_GPL(mtd_writev);
>   */
>  void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
>  {
> -       gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
> -                      __GFP_NORETRY | __GFP_NO_KSWAPD;
> +       gfp_t flags = __GFP_NOWARN | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;
>         size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
>         void *kbuf;
>
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index f7fbdc9d1325..3a407e59acab 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -689,7 +689,7 @@ static void *bnx2x_frag_alloc(const struct bnx2x_fastpath *fp, gfp_t gfp_mask)
>  {
>         if (fp->rx_frag_size) {
>                 /* GFP_KERNEL allocations are used only during initialization */
> -               if (unlikely(gfp_mask & __GFP_WAIT))
> +               if (unlikely(gfpflags_allow_blocking(gfp_mask)))
>                         return (void *)__get_free_page(gfp_mask);
>
>                 return netdev_alloc_frag(fp->rx_frag_size);
> diff --git a/drivers/staging/android/ion/ion_system_heap.c b/drivers/staging/android/ion/ion_system_heap.c
> index da2a63c0a9ba..2615e0ae4f0a 100644
> --- a/drivers/staging/android/ion/ion_system_heap.c
> +++ b/drivers/staging/android/ion/ion_system_heap.c
> @@ -27,7 +27,7 @@
>  #include "ion_priv.h"
>
>  static gfp_t high_order_gfp_flags = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN |
> -                                    __GFP_NORETRY) & ~__GFP_WAIT;
> +                                    __GFP_NORETRY) & ~__GFP_DIRECT_RECLAIM;
>  static gfp_t low_order_gfp_flags  = (GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN);
>  static const unsigned int orders[] = {8, 4, 0};
>  static const int num_orders = ARRAY_SIZE(orders);
> diff --git a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> index ed37d26eb20d..5b0756cb6576 100644
> --- a/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> +++ b/drivers/staging/lustre/include/linux/libcfs/libcfs_private.h
> @@ -113,7 +113,7 @@ do {                                                \
>  do {                                                                       \
>         LASSERT(!in_interrupt() ||                                          \
>                 ((size) <= LIBCFS_VMALLOC_SIZE &&                           \
> -                ((mask) & __GFP_WAIT) == 0));                              \
> +                !gfpflags_allow_blocking(mask)));                          \
>  } while (0)
>
>  #define LIBCFS_ALLOC_POST(ptr, size)                                       \
> diff --git a/drivers/usb/host/u132-hcd.c b/drivers/usb/host/u132-hcd.c
> index d51687780b61..8d4c1806e32f 100644
> --- a/drivers/usb/host/u132-hcd.c
> +++ b/drivers/usb/host/u132-hcd.c
> @@ -2247,7 +2247,7 @@ static int u132_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
>  {
>         struct u132 *u132 = hcd_to_u132(hcd);
>         if (irqs_disabled()) {
> -               if (__GFP_WAIT & mem_flags) {
> +               if (gfpflags_allow_blocking(mem_flags)) {
>                         printk(KERN_ERR "invalid context for function that migh"
>                                 "t sleep\n");
>                         return -EINVAL;
> diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c
> index 6b70d7f62b2f..1c1e95a0b8fa 100644
> --- a/drivers/video/fbdev/vermilion/vermilion.c
> +++ b/drivers/video/fbdev/vermilion/vermilion.c
> @@ -99,7 +99,7 @@ static int vmlfb_alloc_vram_area(struct vram_area *va, unsigned max_order,
>                  * below the first 16MB.
>                  */
>
> -               flags = __GFP_DMA | __GFP_HIGH;
> +               flags = __GFP_DMA | __GFP_HIGH | __GFP_KSWAPD_RECLAIM;
>                 va->logical =
>                          __get_free_pages(flags, --max_order);
>         } while (va->logical == 0 && max_order > min_order);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index f556c3732c2c..3dd4792b8099 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2566,7 +2566,7 @@ int open_ctree(struct super_block *sb,
>         fs_info->commit_interval = BTRFS_DEFAULT_COMMIT_INTERVAL;
>         fs_info->avg_delayed_ref_runtime = NSEC_PER_SEC >> 6; /* div by 64 */
>         /* readahead state */
> -       INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_WAIT);
> +       INIT_RADIX_TREE(&fs_info->reada_tree, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>         spin_lock_init(&fs_info->reada_lock);
>
>         fs_info->thread_pool_size = min_t(unsigned long,
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 02d05817cbdf..c8a6cdcbef2b 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -594,7 +594,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>         if (bits & (EXTENT_IOBITS | EXTENT_BOUNDARY))
>                 clear = 1;
>  again:
> -       if (!prealloc && (mask & __GFP_WAIT)) {
> +       if (!prealloc && gfpflags_allow_blocking(mask)) {
>                 /*
>                  * Don't care for allocation failure here because we might end
>                  * up not needing the pre-allocated extent state at all, which
> @@ -718,7 +718,7 @@ int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>         if (start > end)
>                 goto out;
>         spin_unlock(&tree->lock);
> -       if (mask & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(mask))
>                 cond_resched();
>         goto again;
>  }
> @@ -850,7 +850,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>
>         bits |= EXTENT_FIRST_DELALLOC;
>  again:
> -       if (!prealloc && (mask & __GFP_WAIT)) {
> +       if (!prealloc && gfpflags_allow_blocking(mask)) {
>                 prealloc = alloc_extent_state(mask);
>                 BUG_ON(!prealloc);
>         }
> @@ -1028,7 +1028,7 @@ __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>         if (start > end)
>                 goto out;
>         spin_unlock(&tree->lock);
> -       if (mask & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(mask))
>                 cond_resched();
>         goto again;
>  }
> @@ -1076,7 +1076,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>         btrfs_debug_check_extent_io_range(tree, start, end);
>
>  again:
> -       if (!prealloc && (mask & __GFP_WAIT)) {
> +       if (!prealloc && gfpflags_allow_blocking(mask)) {
>                 /*
>                  * Best effort, don't worry if extent state allocation fails
>                  * here for the first iteration. We might have a cached state
> @@ -1253,7 +1253,7 @@ int convert_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
>         if (start > end)
>                 goto out;
>         spin_unlock(&tree->lock);
> -       if (mask & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(mask))
>                 cond_resched();
>         first_iteration = false;
>         goto again;
> @@ -4265,7 +4265,7 @@ int try_release_extent_mapping(struct extent_map_tree *map,
>         u64 start = page_offset(page);
>         u64 end = start + PAGE_CACHE_SIZE - 1;
>
> -       if ((mask & __GFP_WAIT) &&
> +       if (gfpflags_allow_blocking(mask) &&
>             page->mapping->host->i_size > 16 * 1024 * 1024) {
>                 u64 len;
>                 while (start <= end) {
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index fbe7c104531c..b1968f36a39b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -156,8 +156,8 @@ static struct btrfs_device *__alloc_device(void)
>         spin_lock_init(&dev->reada_lock);
>         atomic_set(&dev->reada_in_flight, 0);
>         atomic_set(&dev->dev_stats_ccnt, 0);
> -       INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_WAIT);
> -       INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_WAIT);
> +       INIT_RADIX_TREE(&dev->reada_zones, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
> +       INIT_RADIX_TREE(&dev->reada_extents, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>
>         return dev;
>  }
> diff --git a/fs/ext3/super.c b/fs/ext3/super.c
> index 5ed0044fbb37..9004c786716f 100644
> --- a/fs/ext3/super.c
> +++ b/fs/ext3/super.c
> @@ -750,7 +750,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
>                 return 0;
>         if (journal)
>                 return journal_try_to_free_buffers(journal, page,
> -                                                  wait & ~__GFP_WAIT);
> +                                               wait & ~__GFP_DIRECT_RECLAIM);
>         return try_to_free_buffers(page);
>  }
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 58987b5c514b..abe76d41ef1e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1045,7 +1045,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
>                 return 0;
>         if (journal)
>                 return jbd2_journal_try_to_free_buffers(journal, page,
> -                                                       wait & ~__GFP_WAIT);
> +                                               wait & ~__GFP_DIRECT_RECLAIM);
>         return try_to_free_buffers(page);
>  }
>
> diff --git a/fs/fscache/cookie.c b/fs/fscache/cookie.c
> index d403c69bee08..4304072161aa 100644
> --- a/fs/fscache/cookie.c
> +++ b/fs/fscache/cookie.c
> @@ -111,7 +111,7 @@ struct fscache_cookie *__fscache_acquire_cookie(
>
>         /* radix tree insertion won't use the preallocation pool unless it's
>          * told it may not wait */
> -       INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_WAIT);
> +       INIT_RADIX_TREE(&cookie->stores, GFP_NOFS & ~__GFP_DIRECT_RECLAIM);
>
>         switch (cookie->def->type) {
>         case FSCACHE_COOKIE_TYPE_INDEX:
> diff --git a/fs/fscache/page.c b/fs/fscache/page.c
> index 483bbc613bf0..79483b3d8c6f 100644
> --- a/fs/fscache/page.c
> +++ b/fs/fscache/page.c
> @@ -58,7 +58,7 @@ bool release_page_wait_timeout(struct fscache_cookie *cookie, struct page *page)
>
>  /*
>   * decide whether a page can be released, possibly by cancelling a store to it
> - * - we're allowed to sleep if __GFP_WAIT is flagged
> + * - we're allowed to sleep if __GFP_DIRECT_RECLAIM is flagged
>   */
>  bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>                                   struct page *page,
> @@ -122,7 +122,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>          * allocator as the work threads writing to the cache may all end up
>          * sleeping on memory allocation, so we may need to impose a timeout
>          * too. */
> -       if (!(gfp & __GFP_WAIT) || !(gfp & __GFP_FS)) {
> +       if (!(gfp & __GFP_DIRECT_RECLAIM) || !(gfp & __GFP_FS)) {
>                 fscache_stat(&fscache_n_store_vmscan_busy);
>                 return false;
>         }
> @@ -132,7 +132,7 @@ bool __fscache_maybe_release_page(struct fscache_cookie *cookie,
>                 _debug("fscache writeout timeout page: %p{%lx}",
>                         page, page->index);
>
> -       gfp &= ~__GFP_WAIT;
> +       gfp &= ~__GFP_DIRECT_RECLAIM;
>         goto try_again;
>  }
>  EXPORT_SYMBOL(__fscache_maybe_release_page);
> diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c
> index 1695ba8334a2..f45b90ba7c5c 100644
> --- a/fs/jbd/transaction.c
> +++ b/fs/jbd/transaction.c
> @@ -1690,8 +1690,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
>   * @journal: journal for operation
>   * @page: to try and free
>   * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
>   *
>   *
>   * For all the buffers on this page,
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index f3d06174b051..06e18bcdb888 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1893,8 +1893,8 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
>   * @journal: journal for operation
>   * @page: to try and free
>   * @gfp_mask: we use the mask to detect how hard should we try to release
> - * buffers. If __GFP_WAIT and __GFP_FS is set, we wait for commit code to
> - * release the buffers.
> + * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
> + * code to release the buffers.
>   *
>   *
>   * For all the buffers on this page,
> diff --git a/fs/nfs/file.c b/fs/nfs/file.c
> index cc4fa1ed61fc..be6821967ec6 100644
> --- a/fs/nfs/file.c
> +++ b/fs/nfs/file.c
> @@ -480,8 +480,8 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>         dfprintk(PAGECACHE, "NFS: release_page(%p)\n", page);
>
>         /* Always try to initiate a 'commit' if relevant, but only
> -        * wait for it if __GFP_WAIT is set.  Even then, only wait 1
> -        * second and only if the 'bdi' is not congested.
> +        * wait for it if the caller allows blocking.  Even then,
> +        * only wait 1 second and only if the 'bdi' is not congested.
>          * Waiting indefinitely can cause deadlocks when the NFS
>          * server is on this machine, when a new TCP connection is
>          * needed and in other rare cases.  There is no particular
> @@ -491,7 +491,7 @@ static int nfs_release_page(struct page *page, gfp_t gfp)
>         if (mapping) {
>                 struct nfs_server *nfss = NFS_SERVER(mapping->host);
>                 nfs_commit_inode(mapping->host, 0);
> -               if ((gfp & __GFP_WAIT) &&
> +               if (gfpflags_allow_blocking(gfp) &&
>                     !bdi_write_congested(&nfss->backing_dev_info)) {
>                         wait_on_page_bit_killable_timeout(page, PG_private,
>                                                           HZ);
> diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c
> index eac9549efd52..587174fd4f2c 100644
> --- a/fs/xfs/xfs_qm.c
> +++ b/fs/xfs/xfs_qm.c
> @@ -525,7 +525,7 @@ xfs_qm_shrink_scan(
>         unsigned long           freed;
>         int                     error;
>
> -       if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
> +       if ((sc->gfp_mask & (__GFP_FS|__GFP_DIRECT_RECLAIM)) != (__GFP_FS|__GFP_DIRECT_RECLAIM))
>                 return 0;
>
>         INIT_LIST_HEAD(&isol.buffers);
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index a10347ca5053..bd1937977d84 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -29,12 +29,13 @@ struct vm_area_struct;
>  #define ___GFP_NOMEMALLOC      0x10000u
>  #define ___GFP_HARDWALL                0x20000u
>  #define ___GFP_THISNODE                0x40000u
> -#define ___GFP_WAIT            0x80000u
> +#define ___GFP_ATOMIC          0x80000u
>  #define ___GFP_NOACCOUNT       0x100000u
>  #define ___GFP_NOTRACK         0x200000u
> -#define ___GFP_NO_KSWAPD       0x400000u
> +#define ___GFP_DIRECT_RECLAIM  0x400000u
>  #define ___GFP_OTHER_NODE      0x800000u
>  #define ___GFP_WRITE           0x1000000u
> +#define ___GFP_KSWAPD_RECLAIM  0x2000000u
>  /* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
>  /*
> @@ -68,7 +69,7 @@ struct vm_area_struct;
>   * __GFP_MOVABLE: Flag that this page will be movable by the page migration
>   * mechanism or reclaimed
>   */
> -#define __GFP_WAIT     ((__force gfp_t)___GFP_WAIT)    /* Can wait and reschedule? */
> +#define __GFP_ATOMIC   ((__force gfp_t)___GFP_ATOMIC)  /* Caller cannot wait or reschedule */
>  #define __GFP_HIGH     ((__force gfp_t)___GFP_HIGH)    /* Should access emergency pools? */
>  #define __GFP_IO       ((__force gfp_t)___GFP_IO)      /* Can start physical IO? */
>  #define __GFP_FS       ((__force gfp_t)___GFP_FS)      /* Can call down to low-level FS? */
> @@ -91,23 +92,37 @@ struct vm_area_struct;
>  #define __GFP_NOACCOUNT        ((__force gfp_t)___GFP_NOACCOUNT) /* Don't account to kmemcg */
>  #define __GFP_NOTRACK  ((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
>
> -#define __GFP_NO_KSWAPD        ((__force gfp_t)___GFP_NO_KSWAPD)
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
>  #define __GFP_WRITE    ((__force gfp_t)___GFP_WRITE)   /* Allocator intends to dirty page */
>
>  /*
> + * A caller that is willing to wait may enter direct reclaim and will
> + * wake kswapd to reclaim pages in the background until the high
> + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> + * avoid unnecessary delays when a fallback option is available but
> + * still allow kswapd to reclaim in the background. The kswapd flag
> + * can be cleared when the reclaiming of pages would cause unnecessary
> + * disruption.
> + */
> +#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)

The convention is that combinations of GFP flags do not use the __XXX prefix.
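For instance (the name is only a suggestion, untested):

#define GFP_WAIT	(__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)

so that __GFP_XXX stays reserved for individual flag definitions.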

> +#define __GFP_DIRECT_RECLAIM   ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
> +#define __GFP_KSWAPD_RECLAIM   ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
> +
> +/*
>   * This may seem redundant, but it's a way of annotating false positives vs.
>   * allocations that simply cannot be supported (e.g. page tables).
>   */
>  #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
>
> -#define __GFP_BITS_SHIFT 25    /* Room for N __GFP_FOO bits */
> +#define __GFP_BITS_SHIFT 26    /* Room for N __GFP_FOO bits */
>  #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> -/* This equals 0, but use constants in case they ever change */
> -#define GFP_NOWAIT     (GFP_ATOMIC & ~__GFP_HIGH)
> -/* GFP_ATOMIC means both !wait (__GFP_WAIT not set) and use emergency pool */
> -#define GFP_ATOMIC     (__GFP_HIGH)
> +/*
> + * GFP_ATOMIC callers can not sleep, need the allocation to succeed.
> + * A lower watermark is applied to allow access to "atomic reserves"
> + */
> +#define GFP_ATOMIC     (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
> +#define GFP_NOWAIT     (__GFP_KSWAPD_RECLAIM)
>  #define GFP_NOIO       (__GFP_WAIT)
>  #define GFP_NOFS       (__GFP_WAIT | __GFP_IO)
>  #define GFP_KERNEL     (__GFP_WAIT | __GFP_IO | __GFP_FS)
> @@ -116,10 +131,10 @@ struct vm_area_struct;
>  #define GFP_USER       (__GFP_WAIT | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
>  #define GFP_HIGHUSER   (GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE   (GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_IOFS       (__GFP_IO | __GFP_FS)
> -#define GFP_TRANSHUGE  (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> -                        __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
> -                        __GFP_NO_KSWAPD)
> +#define GFP_IOFS       (__GFP_IO | __GFP_FS | __GFP_KSWAPD_RECLAIM)
> +#define GFP_TRANSHUGE  ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +                        __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> +                        ~__GFP_KSWAPD_RECLAIM)
>
>  /* This mask makes up all the page movable related flags */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> @@ -161,6 +176,11 @@ static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
>         return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
>  }
>
> +static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
> +{
> +       return gfp_flags & __GFP_DIRECT_RECLAIM;
> +}
> +
>  #ifdef CONFIG_HIGHMEM
>  #define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
>  #else
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 22b6d9ca1654..55c4a9175801 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1109,7 +1109,7 @@ static inline int skb_cloned(const struct sk_buff *skb)
>
>  static inline int skb_unclone(struct sk_buff *skb, gfp_t pri)
>  {
> -       might_sleep_if(pri & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(pri));
>
>         if (skb_cloned(skb))
>                 return pskb_expand_head(skb, 0, 0, pri);
> @@ -1193,7 +1193,7 @@ static inline int skb_shared(const struct sk_buff *skb)
>   */
>  static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
>  {
> -       might_sleep_if(pri & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(pri));
>         if (skb_shared(skb)) {
>                 struct sk_buff *nskb = skb_clone(skb, pri);
>
> @@ -1229,7 +1229,7 @@ static inline struct sk_buff *skb_share_check(struct sk_buff *skb, gfp_t pri)
>  static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
>                                           gfp_t pri)
>  {
> -       might_sleep_if(pri & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(pri));
>         if (skb_cloned(skb)) {
>                 struct sk_buff *nskb = skb_copy(skb, pri);
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index f21f0708ec59..cec0c4b634dc 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -2035,7 +2035,7 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
>   */
>  static inline struct page_frag *sk_page_frag(struct sock *sk)
>  {
> -       if (sk->sk_allocation & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(sk->sk_allocation))
>                 return &current->task_frag;
>
>         return &sk->sk_frag;
> diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
> index d6fd8e5b14b7..dde6bf092c8a 100644
> --- a/include/trace/events/gfpflags.h
> +++ b/include/trace/events/gfpflags.h
> @@ -20,7 +20,7 @@
>         {(unsigned long)GFP_ATOMIC,             "GFP_ATOMIC"},          \
>         {(unsigned long)GFP_NOIO,               "GFP_NOIO"},            \
>         {(unsigned long)__GFP_HIGH,             "GFP_HIGH"},            \
> -       {(unsigned long)__GFP_WAIT,             "GFP_WAIT"},            \
> +       {(unsigned long)__GFP_ATOMIC,           "GFP_ATOMIC"},          \
>         {(unsigned long)__GFP_IO,               "GFP_IO"},              \
>         {(unsigned long)__GFP_COLD,             "GFP_COLD"},            \
>         {(unsigned long)__GFP_NOWARN,           "GFP_NOWARN"},          \
> @@ -36,7 +36,8 @@
>         {(unsigned long)__GFP_RECLAIMABLE,      "GFP_RECLAIMABLE"},     \
>         {(unsigned long)__GFP_MOVABLE,          "GFP_MOVABLE"},         \
>         {(unsigned long)__GFP_NOTRACK,          "GFP_NOTRACK"},         \
> -       {(unsigned long)__GFP_NO_KSWAPD,        "GFP_NO_KSWAPD"},       \
> +       {(unsigned long)__GFP_DIRECT_RECLAIM,   "GFP_DIRECT_RECLAIM"},  \
> +       {(unsigned long)__GFP_KSWAPD_RECLAIM,   "GFP_KSWAPD_RECLAIM"},  \
>         {(unsigned long)__GFP_OTHER_NODE,       "GFP_OTHER_NODE"}       \
>         ) : "GFP_NOWAIT"
>
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f9e6065346db..6ab7a55dbdff 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -1357,16 +1357,16 @@ struct audit_buffer *audit_log_start(struct audit_context *ctx, gfp_t gfp_mask,
>         if (unlikely(audit_filter_type(type)))
>                 return NULL;
>
> -       if (gfp_mask & __GFP_WAIT) {
> +       if (gfp_mask & __GFP_DIRECT_RECLAIM) {
>                 if (audit_pid && audit_pid == current->pid)
> -                       gfp_mask &= ~__GFP_WAIT;
> +                       gfp_mask &= ~__GFP_DIRECT_RECLAIM;
>                 else
>                         reserve = 0;
>         }
>
>         while (audit_backlog_limit
>                && skb_queue_len(&audit_skb_queue) > audit_backlog_limit + reserve) {
> -               if (gfp_mask & __GFP_WAIT && audit_backlog_wait_time) {
> +               if (gfp_mask & __GFP_DIRECT_RECLAIM && audit_backlog_wait_time) {
>                         long sleep_time;
>
>                         sleep_time = timeout_start + audit_backlog_wait_time - jiffies;
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 8acfbf773e06..9aa39f20f593 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2738,7 +2738,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>                 return;
>
>         /* no reclaim without waiting on it */
> -       if (!(gfp_mask & __GFP_WAIT))
> +       if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
>                 return;
>
>         /* this guy won't enter reclaim */
> diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
> index 5235dd4e1e2f..3a970604308f 100644
> --- a/kernel/power/snapshot.c
> +++ b/kernel/power/snapshot.c
> @@ -1779,7 +1779,7 @@ alloc_highmem_pages(struct memory_bitmap *bm, unsigned int nr_highmem)
>         while (to_alloc-- > 0) {
>                 struct page *page;
>
> -               page = alloc_image_page(__GFP_HIGHMEM);
> +               page = alloc_image_page(__GFP_HIGHMEM|__GFP_KSWAPD_RECLAIM);
>                 memory_bm_set_bit(bm, page_to_pfn(page));
>         }
>         return nr_highmem;
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 07854477c164..d903c02223af 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -669,7 +669,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
>         cpumask_var_t cpus;
>         int cpu, ret;
>
> -       might_sleep_if(gfp_flags & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(gfp_flags));
>
>         if (likely(zalloc_cpumask_var(&cpus, (gfp_flags|__GFP_NOWARN)))) {
>                 preempt_disable();
> diff --git a/lib/idr.c b/lib/idr.c
> index 5335c43adf46..6098336df267 100644
> --- a/lib/idr.c
> +++ b/lib/idr.c
> @@ -399,7 +399,7 @@ void idr_preload(gfp_t gfp_mask)
>          * allocation guarantee.  Disallow usage from those contexts.
>          */
>         WARN_ON_ONCE(in_interrupt());
> -       might_sleep_if(gfp_mask & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>
>         preempt_disable();
>
> @@ -453,7 +453,7 @@ int idr_alloc(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask)
>         struct idr_layer *pa[MAX_IDR_LEVEL + 1];
>         int id;
>
> -       might_sleep_if(gfp_mask & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(gfp_mask));
>
>         /* sanity checks */
>         if (WARN_ON_ONCE(start < 0))
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index f9ebe1c82060..c3775ee46cd6 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -188,7 +188,7 @@ radix_tree_node_alloc(struct radix_tree_root *root)
>          * preloading in the interrupt anyway as all the allocations have to
>          * be atomic. So just do normal allocation when in interrupt.
>          */
> -       if (!(gfp_mask & __GFP_WAIT) && !in_interrupt()) {
> +       if (!gfpflags_allow_blocking(gfp_mask) && !in_interrupt()) {
>                 struct radix_tree_preload *rtp;
>
>                 /*
> @@ -249,7 +249,7 @@ radix_tree_node_free(struct radix_tree_node *node)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  static int __radix_tree_preload(gfp_t gfp_mask)
>  {
> @@ -286,12 +286,12 @@ static int __radix_tree_preload(gfp_t gfp_mask)
>   * with preemption not disabled.
>   *
>   * To make use of this facility, the radix tree must be initialised without
> - * __GFP_WAIT being passed to INIT_RADIX_TREE().
> + * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
>   */
>  int radix_tree_preload(gfp_t gfp_mask)
>  {
>         /* Warn on non-sensical use... */
> -       WARN_ON_ONCE(!(gfp_mask & __GFP_WAIT));
> +       WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
>         return __radix_tree_preload(gfp_mask);
>  }
>  EXPORT_SYMBOL(radix_tree_preload);
> @@ -303,7 +303,7 @@ EXPORT_SYMBOL(radix_tree_preload);
>   */
>  int radix_tree_maybe_preload(gfp_t gfp_mask)
>  {
> -       if (gfp_mask & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(gfp_mask))
>                 return __radix_tree_preload(gfp_mask);
>         /* Preloading doesn't help anything with this gfp mask, skip it */
>         preempt_disable();
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index dac5bf59309d..805ce70b72f3 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -632,7 +632,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
>  {
>         struct bdi_writeback *wb;
>
> -       might_sleep_if(gfp & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(gfp));
>
>         if (!memcg_css->parent)
>                 return &bdi->wb;
> diff --git a/mm/dmapool.c b/mm/dmapool.c
> index fd5fe4342e93..84dac666fc0c 100644
> --- a/mm/dmapool.c
> +++ b/mm/dmapool.c
> @@ -323,7 +323,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
>         size_t offset;
>         void *retval;
>
> -       might_sleep_if(mem_flags & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(mem_flags));
>
>         spin_lock_irqsave(&pool->lock, flags);
>         list_for_each_entry(page, &pool->page_list, page_list) {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index acb93c554f6e..e34f6411da8c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2268,7 +2268,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         if (unlikely(task_in_memcg_oom(current)))
>                 goto nomem;
>
> -       if (!(gfp_mask & __GFP_WAIT))
> +       if (!gfpflags_allow_blocking(gfp_mask))
>                 goto nomem;
>
>         mem_cgroup_events(mem_over_limit, MEMCG_MAX, 1);
> @@ -2327,7 +2327,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>         css_get_many(&memcg->css, batch);
>         if (batch > nr_pages)
>                 refill_stock(memcg, batch - nr_pages);
> -       if (!(gfp_mask & __GFP_WAIT))
> +       if (!gfpflags_allow_blocking(gfp_mask))
>                 goto done;
>         /*
>          * If the hierarchy is above the normal consumption range,
> @@ -4696,8 +4696,8 @@ static int mem_cgroup_do_precharge(unsigned long count)
>  {
>         int ret;
>
> -       /* Try a single bulk charge without reclaim first */
> -       ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_WAIT, count);
> +       /* Try a single bulk charge without reclaim first, kswapd may wake */
> +       ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
>         if (!ret) {
>                 mc.precharge += count;
>                 return ret;
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 2cc08de8b1db..bfd2a0dd0e18 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -317,13 +317,13 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>         gfp_t gfp_temp;
>
>         VM_WARN_ON_ONCE(gfp_mask & __GFP_ZERO);
> -       might_sleep_if(gfp_mask & __GFP_WAIT);
> +       might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
>         gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
>         gfp_mask |= __GFP_NORETRY;      /* don't loop in __alloc_pages */
>         gfp_mask |= __GFP_NOWARN;       /* failures are OK */
>
> -       gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
> +       gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>
>  repeat_alloc:
>
> @@ -346,7 +346,7 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>         }
>
>         /*
> -        * We use gfp mask w/o __GFP_WAIT or IO for the first round.  If
> +        * We use gfp mask w/o direct reclaim or IO for the first round.  If
>          * alloc failed with that and @pool was empty, retry immediately.
>          */
>         if (gfp_temp != gfp_mask) {
> @@ -355,8 +355,8 @@ void * mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>                 goto repeat_alloc;
>         }
>
> -       /* We must not sleep if !__GFP_WAIT */
> -       if (!(gfp_mask & __GFP_WAIT)) {
> +       /* We must not sleep if !__GFP_DIRECT_RECLAIM */
> +       if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>                 spin_unlock_irqrestore(&pool->lock, flags);
>                 return NULL;
>         }
> diff --git a/mm/migrate.c b/mm/migrate.c
> index eb4267107d1f..0e16c4047638 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1564,7 +1564,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
>                                          (GFP_HIGHUSER_MOVABLE |
>                                           __GFP_THISNODE | __GFP_NOMEMALLOC |
>                                           __GFP_NORETRY | __GFP_NOWARN) &
> -                                        ~GFP_IOFS, 0);
> +                                        ~(__GFP_IO | __GFP_FS), 0);
>
>         return newpage;
>  }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 32d1cec124bc..68f961bdfdf8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -151,12 +151,12 @@ void pm_restrict_gfp_mask(void)
>         WARN_ON(!mutex_is_locked(&pm_mutex));
>         WARN_ON(saved_gfp_mask);
>         saved_gfp_mask = gfp_allowed_mask;
> -       gfp_allowed_mask &= ~GFP_IOFS;
> +       gfp_allowed_mask &= ~(__GFP_IO | __GFP_FS);
>  }
>
>  bool pm_suspended_storage(void)
>  {
> -       if ((gfp_allowed_mask & GFP_IOFS) == GFP_IOFS)
> +       if ((gfp_allowed_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>                 return false;
>         return true;
>  }
> @@ -2158,7 +2158,7 @@ static bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>                 return false;
>         if (fail_page_alloc.ignore_gfp_highmem && (gfp_mask & __GFP_HIGHMEM))
>                 return false;
> -       if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & __GFP_WAIT))
> +       if (fail_page_alloc.ignore_gfp_wait && (gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>                 return false;
>
>         return should_fail(&fail_page_alloc.attr, 1 << order);
> @@ -2660,7 +2660,7 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
>                 if (test_thread_flag(TIF_MEMDIE) ||
>                     (current->flags & (PF_MEMALLOC | PF_EXITING)))
>                         filter &= ~SHOW_MEM_FILTER_NODES;
> -       if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
> +       if (in_interrupt() || !(gfp_mask & __GFP_WAIT) || (gfp_mask & __GFP_ATOMIC))
>                 filter &= ~SHOW_MEM_FILTER_NODES;
>
>         if (fmt) {
> @@ -2915,7 +2915,6 @@ static inline int
>  gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>         int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> -       const bool atomic = !(gfp_mask & (__GFP_WAIT | __GFP_NO_KSWAPD));
>
>         /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>         BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -2924,11 +2923,11 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>          * The caller may dip into page reserves a bit more if the caller
>          * cannot run direct reclaim, or if the caller has realtime scheduling
>          * policy or is asking for __GFP_HIGH memory.  GFP_ATOMIC requests will
> -        * set both ALLOC_HARDER (atomic == true) and ALLOC_HIGH (__GFP_HIGH).
> +        * set both ALLOC_HARDER (__GFP_ATOMIC) and ALLOC_HIGH (__GFP_HIGH).
>          */
>         alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>
> -       if (atomic) {
> +       if (gfp_mask & __GFP_ATOMIC) {
>                 /*
>                  * Not worth trying to allocate harder for __GFP_NOMEMALLOC even
>                  * if it can't schedule.
> @@ -2965,11 +2964,16 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>         return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
>  }
>
> +static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> +{
> +       return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
> +}
> +
>  static inline struct page *
>  __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                                                 struct alloc_context *ac)
>  {
> -       const gfp_t wait = gfp_mask & __GFP_WAIT;
> +       bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>         struct page *page = NULL;
>         int alloc_flags;
>         unsigned long pages_reclaimed = 0;
> @@ -2990,15 +2994,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>         }
>
>         /*
> +        * We also sanity check to catch abuse of atomic reserves being used by
> +        * callers that are not in atomic context.
> +        */
> +       if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> +                               (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> +               gfp_mask &= ~__GFP_ATOMIC;
> +
> +       /*
>          * If this allocation cannot block and it is for a specific node, then
>          * fail early.  There's no need to wakeup kswapd or retry for a
>          * speculative node-specific allocation.
>          */
> -       if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !wait)
> +       if (IS_ENABLED(CONFIG_NUMA) && (gfp_mask & __GFP_THISNODE) && !can_direct_reclaim)
>                 goto nopage;
>
>  retry:
> -       if (!(gfp_mask & __GFP_NO_KSWAPD))
> +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>                 wake_all_kswapds(order, ac);
>
>         /*
> @@ -3041,8 +3053,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                 }
>         }
>
> -       /* Atomic allocations - we can't balance anything */
> -       if (!wait) {
> +       /* Caller is not willing to reclaim, we can't balance anything */
> +       if (!can_direct_reclaim) {
>                 /*
>                  * All existing users of the deprecated __GFP_NOFAIL are
>                  * blockable, so warn of any new users that actually allow this
> @@ -3072,7 +3084,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>                 goto got_pg;
>
>         /* Checks for THP-specific high-order allocations */
> -       if ((gfp_mask & GFP_TRANSHUGE) == GFP_TRANSHUGE) {
> +       if (is_thp_gfp_mask(gfp_mask)) {
>                 /*
>                  * If compaction is deferred for high-order allocations, it is
>                  * because sync compaction recently failed. If this is the case
> @@ -3107,8 +3119,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>          * fault, so use asynchronous memory compaction for THP unless it is
>          * khugepaged trying to collapse.
>          */
> -       if ((gfp_mask & GFP_TRANSHUGE) != GFP_TRANSHUGE ||
> -                                               (current->flags & PF_KTHREAD))
> +       if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
>                 migration_mode = MIGRATE_SYNC_LIGHT;
>
>         /* Try direct reclaim and then allocating */
> @@ -3179,7 +3190,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>
>         lockdep_trace_alloc(gfp_mask);
>
> -       might_sleep_if(gfp_mask & __GFP_WAIT);
> +       might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
>         if (should_fail_alloc_page(gfp_mask, order))
>                 return NULL;
> diff --git a/mm/slab.c b/mm/slab.c
> index 200e22412a16..f82bdb3eb1fc 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1030,12 +1030,12 @@ static inline int cache_free_alien(struct kmem_cache *cachep, void *objp)
>  }
>
>  /*
> - * Construct gfp mask to allocate from a specific node but do not invoke reclaim
> - * or warn about failures.
> + * Construct gfp mask to allocate from a specific node but do not direct reclaim
> + * or warn about failures. kswapd may still wake to reclaim in the background.
>   */
>  static inline gfp_t gfp_exact_node(gfp_t flags)
>  {
> -       return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
> +       return (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_DIRECT_RECLAIM;
>  }
>  #endif
>
> @@ -2625,7 +2625,7 @@ static int cache_grow(struct kmem_cache *cachep,
>
>         offset *= cachep->colour_off;
>
> -       if (local_flags & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(local_flags))
>                 local_irq_enable();
>
>         /*
> @@ -2655,7 +2655,7 @@ static int cache_grow(struct kmem_cache *cachep,
>
>         cache_init_objs(cachep, page);
>
> -       if (local_flags & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(local_flags))
>                 local_irq_disable();
>         check_irq_off();
>         spin_lock(&n->list_lock);
> @@ -2669,7 +2669,7 @@ static int cache_grow(struct kmem_cache *cachep,
>  opps1:
>         kmem_freepages(cachep, page);
>  failed:
> -       if (local_flags & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(local_flags))
>                 local_irq_disable();
>         return 0;
>  }
> @@ -2861,7 +2861,7 @@ static void *cache_alloc_refill(struct kmem_cache *cachep, gfp_t flags,
>  static inline void cache_alloc_debugcheck_before(struct kmem_cache *cachep,
>                                                 gfp_t flags)
>  {
> -       might_sleep_if(flags & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(flags));
>  #if DEBUG
>         kmem_flagcheck(cachep, flags);
>  #endif
> @@ -3049,11 +3049,11 @@ static void *fallback_alloc(struct kmem_cache *cache, gfp_t flags)
>                  */
>                 struct page *page;
>
> -               if (local_flags & __GFP_WAIT)
> +               if (gfpflags_allow_blocking(local_flags))
>                         local_irq_enable();
>                 kmem_flagcheck(cache, flags);
>                 page = kmem_getpages(cache, local_flags, numa_mem_id());
> -               if (local_flags & __GFP_WAIT)
> +               if (gfpflags_allow_blocking(local_flags))
>                         local_irq_disable();
>                 if (page) {
>                         /*
> diff --git a/mm/slub.c b/mm/slub.c
> index 816df0016555..a4661c59ff54 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1263,7 +1263,7 @@ static inline struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s,
>  {
>         flags &= gfp_allowed_mask;
>         lockdep_trace_alloc(flags);
> -       might_sleep_if(flags & __GFP_WAIT);
> +       might_sleep_if(gfpflags_allow_blocking(flags));
>
>         if (should_failslab(s->object_size, flags, s->flags))
>                 return NULL;
> @@ -1339,7 +1339,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>
>         flags &= gfp_allowed_mask;
>
> -       if (flags & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(flags))
>                 local_irq_enable();
>
>         flags |= s->allocflags;
> @@ -1380,7 +1380,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>                         kmemcheck_mark_unallocated_pages(page, pages);
>         }
>
> -       if (flags & __GFP_WAIT)
> +       if (gfpflags_allow_blocking(flags))
>                 local_irq_disable();
>         if (!page)
>                 return NULL;
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 2faaa2976447..9ad4dcb0631c 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1617,7 +1617,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>                         goto fail;
>                 }
>                 area->pages[i] = page;
> -               if (gfp_mask & __GFP_WAIT)
> +               if (gfpflags_allow_blocking(gfp_mask))
>                         cond_resched();
>         }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e950134c4b9a..837c440d60a9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1465,7 +1465,7 @@ static int too_many_isolated(struct zone *zone, int file,
>          * won't get blocked by normal direct-reclaimers, forming a circular
>          * deadlock.
>          */
> -       if ((sc->gfp_mask & GFP_IOFS) == GFP_IOFS)
> +       if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
>                 inactive >>= 3;
>
>         return isolated > inactive;
> @@ -3764,7 +3764,7 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>         /*
>          * Do not scan if the allocation should not be delayed.
>          */
> -       if (!(gfp_mask & __GFP_WAIT) || (current->flags & PF_MEMALLOC))
> +       if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
>                 return ZONE_RECLAIM_NOSCAN;
>
>         /*
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2d5727baed59..26104a68c972 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -684,7 +684,8 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
>
>         /* store */
>         len = dlen + sizeof(struct zswap_header);
> -       ret = zpool_malloc(zswap_pool, len, __GFP_NORETRY | __GFP_NOWARN,
> +       ret = zpool_malloc(zswap_pool, len,
> +               __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM,
>                 &handle);
>         if (ret == -ENOSPC) {
>                 zswap_reject_compress_poor++;
> @@ -900,7 +901,7 @@ static void __exit zswap_debugfs_exit(void) { }
>  **********************************/
>  static int __init init_zswap(void)
>  {
> -       gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN;
> +       gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
>
>         pr_info("loading zswap\n");
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index b6a19ca0f99e..6f025e2544de 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -414,7 +414,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
>         len += NET_SKB_PAD;
>
>         if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -           (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> +           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>                 if (!skb)
>                         goto skb_fail;
> @@ -481,7 +481,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>         len += NET_SKB_PAD + NET_IP_ALIGN;
>
>         if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -           (gfp_mask & (__GFP_WAIT | GFP_DMA))) {
> +           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>                 if (!skb)
>                         goto skb_fail;
> @@ -4452,7 +4452,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>                 return NULL;
>
>         gfp_head = gfp_mask;
> -       if (gfp_head & __GFP_WAIT)
> +       if (gfp_head & __GFP_DIRECT_RECLAIM)
>                 gfp_head |= __GFP_REPEAT;
>
>         *errcode = -ENOBUFS;
> @@ -4467,7 +4467,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
>
>                 while (order) {
>                         if (npages >= 1 << order) {
> -                               page = alloc_pages((gfp_mask & ~__GFP_WAIT) |
> +                               page = alloc_pages((gfp_mask & ~__GFP_DIRECT_RECLAIM) |
>                                                    __GFP_COMP |
>                                                    __GFP_NOWARN |
>                                                    __GFP_NORETRY,
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 193901d09757..02b705cc9eb3 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1879,8 +1879,10 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
>
>         pfrag->offset = 0;
>         if (SKB_FRAG_PAGE_ORDER) {
> -               pfrag->page = alloc_pages((gfp & ~__GFP_WAIT) | __GFP_COMP |
> -                                         __GFP_NOWARN | __GFP_NORETRY,
> +               /* Avoid direct reclaim but allow kswapd to wake */
> +               pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> +                                         __GFP_COMP | __GFP_NOWARN |
> +                                         __GFP_NORETRY,
>                                           SKB_FRAG_PAGE_ORDER);
>                 if (likely(pfrag->page)) {
>                         pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 67d210477863..8283d90dde74 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2066,7 +2066,7 @@ int netlink_broadcast_filtered(struct sock *ssk, struct sk_buff *skb, u32 portid
>         consume_skb(info.skb2);
>
>         if (info.delivered) {
> -               if (info.congested && (allocation & __GFP_WAIT))
> +               if (info.congested && gfpflags_allow_blocking(allocation))
>                         yield();
>                 return 0;
>         }
> diff --git a/net/rxrpc/ar-connection.c b/net/rxrpc/ar-connection.c
> index 6631f4f1e39b..3b5de4b86058 100644
> --- a/net/rxrpc/ar-connection.c
> +++ b/net/rxrpc/ar-connection.c
> @@ -500,7 +500,7 @@ int rxrpc_connect_call(struct rxrpc_sock *rx,
>                 if (bundle->num_conns >= 20) {
>                         _debug("too many conns");
>
> -                       if (!(gfp & __GFP_WAIT)) {
> +                       if (!gfpflags_allow_blocking(gfp)) {
>                                 _leave(" = -EAGAIN");
>                                 return -EAGAIN;
>                         }
> diff --git a/net/sctp/associola.c b/net/sctp/associola.c
> index 197c3f59ecbf..75369ae8de1e 100644
> --- a/net/sctp/associola.c
> +++ b/net/sctp/associola.c
> @@ -1588,7 +1588,7 @@ int sctp_assoc_lookup_laddr(struct sctp_association *asoc,
>  /* Set an association id for a given association */
>  int sctp_assoc_set_id(struct sctp_association *asoc, gfp_t gfp)
>  {
> -       bool preload = !!(gfp & __GFP_WAIT);
> +       bool preload = gfpflags_allow_blocking(gfp);
>         int ret;
>
>         /* If the id is already assigned, keep it. */
> --
> 2.4.6
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-08-24 12:29 ` [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
  2015-08-26 12:44   ` Vlastimil Babka
  2015-08-26 14:53   ` Michal Hocko
@ 2015-09-08  8:01   ` Joonsoo Kim
  2015-09-09 12:32     ` Mel Gorman
  2 siblings, 1 reply; 55+ messages in thread
From: Joonsoo Kim @ 2015-09-08  8:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

2015-08-24 21:29 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> High-order watermark checking exists for two reasons --  kswapd high-order
> awareness and protection for high-order atomic requests. Historically the
> kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as high-order
> free pages for as long as possible. This patch introduces MIGRATE_HIGHATOMIC
> that reserves pageblocks for high-order atomic allocations on demand and
> avoids using those blocks for order-0 allocations. This is more flexible
> and reliable than MIGRATE_RESERVE was.
>
> A MIGRATE_HIGHATOMIC pageblock is created when a high-order allocation
> request steals a pageblock but limits the total number to 1% of the zone.
> Callers that speculatively abuse atomic allocations for long-lived
> high-order allocations to access the reserve will quickly fail. Note that
> SLUB is currently not such an abuser as it reclaims at least once.  It is
> possible that the pageblock stolen has few suitable high-order pages and
> will need to steal again in the near future but there would need to be
> strong justification to search all pageblocks for an ideal candidate.
>
> The pageblocks are unreserved if an allocation fails after a direct
> reclaim attempt.
>
> The watermark checks account for the reserved pageblocks when the allocation
> request is not a high-order atomic allocation.
>
> The reserved pageblocks can not be used for order-0 allocations. This may
> allow temporary wastage until a failed reclaim reassigns the pageblock. This
> is deliberate as the intent of the reservation is to satisfy a limited
> number of atomic high-order short-lived requests if the system requires them.
>
> The stutter benchmark was used to evaluate this but while it was running
> there was a systemtap script that randomly allocated between 1 high-order
> page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
> is much larger than the potential reserve and it does not attempt to be
> realistic.  It is intended to stress random high-order allocations from
> an unknown source and show that there is a reduction in failures without
> introducing an anomaly where atomic allocations are more reliable than
> regular allocations.  The amount of memory reserved varied throughout the
> workload as reserves were created and reclaimed under memory pressure. The
> allocation failures once the workload warmed up were as follows;
>
> 4.2-rc5-vanilla         70%
> 4.2-rc5-atomic-reserve  56%
>
> The failure rate was also measured while building multiple kernels. The
> failure rate was 14% but is 6% with this patch applied.
>
> Overall, this is a small reduction but the reserves are small relative to the
> number of allocation requests. In early versions of the patch, the failure
> rate reduced by a much larger amount but that required much larger reserves
> and perversely made atomic allocations seem more reliable than regular allocations.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/mmzone.h |   6 ++-
>  mm/page_alloc.c        | 117 ++++++++++++++++++++++++++++++++++++++++++++++---
>  mm/vmstat.c            |   1 +
>  3 files changed, 116 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index cf643539d640..a9805a85940a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -39,6 +39,8 @@ enum {
>         MIGRATE_UNMOVABLE,
>         MIGRATE_MOVABLE,
>         MIGRATE_RECLAIMABLE,
> +       MIGRATE_PCPTYPES,       /* the number of types on the pcp lists */
> +       MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
>  #ifdef CONFIG_CMA
>         /*
>          * MIGRATE_CMA migration type is designed to mimic the way
> @@ -61,8 +63,6 @@ enum {
>         MIGRATE_TYPES
>  };
>
> -#define MIGRATE_PCPTYPES (MIGRATE_RECLAIMABLE+1)
> -
>  #ifdef CONFIG_CMA
>  #  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
>  #else
> @@ -330,6 +330,8 @@ struct zone {
>         /* zone watermarks, access with *_wmark_pages(zone) macros */
>         unsigned long watermark[NR_WMARK];
>
> +       unsigned long nr_reserved_highatomic;
> +
>         /*
>          * We don't know if the memory that we're going to allocate will be freeable
>          * or/and it will be released eventually, so to avoid totally wasting several
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d5ce050ebe4f..2415f882b89c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1589,6 +1589,86 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
>         return -1;
>  }
>
> +/*
> + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> + * there are no empty page blocks that contain a page with a suitable order
> + */
> +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> +                               unsigned int alloc_order)
> +{
> +       int mt = get_pageblock_migratetype(page);
> +       unsigned long max_managed, flags;
> +
> +       if (mt == MIGRATE_HIGHATOMIC)
> +               return;
> +
> +       /*
> +        * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> +        * Check is race-prone but harmless.
> +        */
> +       max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> +       if (zone->nr_reserved_highatomic >= max_managed)
> +               return;
> +
> +       /* Yoink! */
> +       spin_lock_irqsave(&zone->lock, flags);
> +       zone->nr_reserved_highatomic += pageblock_nr_pages;
> +       set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> +       move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}

It would be better to also check whether the migratetype is MIGRATE_ISOLATE
or MIGRATE_CMA. There is a race where an isolated pageblock can be changed
to MIGRATE_HIGHATOMIC here.
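A minimal sketch of the extra guard, assuming the is_migrate_isolate() and
is_migrate_cma() helpers are visible here (the check may also need to be
repeated under zone->lock to fully close the race):

        /* Sketch: never convert isolated or CMA pageblocks */
        if (mt == MIGRATE_HIGHATOMIC || is_migrate_isolate(mt) ||
            is_migrate_cma(mt))
                return;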

> +/*
> + * Used when an allocation is about to fail under memory pressure. This
> + * potentially hurts the reliability of high-order allocations when under
> + * intense memory pressure but failed atomic allocations should be easier
> + * to recover from than an OOM.
> + */
> +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> +{
> +       struct zonelist *zonelist = ac->zonelist;
> +       unsigned long flags;
> +       struct zoneref *z;
> +       struct zone *zone;
> +       struct page *page;
> +       int order;
> +
> +       for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> +                                                               ac->nodemask) {
> +               /* Preserve at least one pageblock */
> +               if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> +                       continue;
> +
> +               spin_lock_irqsave(&zone->lock, flags);
> +               for (order = 0; order < MAX_ORDER; order++) {
> +                       struct free_area *area = &(zone->free_area[order]);
> +
> +                       if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> +                               continue;
> +
> +                       page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> +                                               struct page, lru);
> +
> +                       zone->nr_reserved_highatomic -= pageblock_nr_pages;
> +
> +                       /*
> +                        * Convert to ac->migratetype and avoid the normal
> +                        * pageblock stealing heuristics. Minimally, the caller
> +                        * is doing the work and needs the pages. More
> +                        * importantly, if the block was always converted to
> +                        * MIGRATE_UNMOVABLE or another type then the number
> +                        * of pageblocks that cannot be completely freed
> +                        * may increase.
> +                        */
> +                       set_pageblock_migratetype(page, ac->migratetype);
> +                       move_freepages_block(zone, page, ac->migratetype);
> +                       spin_unlock_irqrestore(&zone->lock, flags);
> +                       return;
> +               }
> +               spin_unlock_irqrestore(&zone->lock, flags);
> +       }
> +}
> +
>  /* Remove an element from the buddy allocator from the fallback list */
>  static inline struct page *
>  __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> @@ -1645,10 +1725,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
>   * Call me with the zone->lock already held.
>   */
>  static struct page *__rmqueue(struct zone *zone, unsigned int order,
> -                                               int migratetype)
> +                               int migratetype, gfp_t gfp_flags)
>  {
>         struct page *page;
>
> +       if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
> +               page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> +               if (page)
> +                       goto out;
> +       }

This hunk only serves high-order allocations, so it would be better to
introduce an rmqueue_highorder() helper, move this hunk into it and call it
from buffered_rmqueue. That way the order-0 request path doesn't get worse
from the new branch.

And there is a mismatch in how atomic high-order allocations are checked:
in some places you check __GFP_ATOMIC, but in others ALLOC_HARDER. It would
be better to use a single unified check; introducing a helper function may
be a good choice, something like the sketch below.
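A rough sketch of the sort of helpers meant here; the names
is_highorder_atomic() and __rmqueue_highatomic() are hypothetical and not
part of the posted patch:

        /* Hypothetical: one predicate for "high-order atomic" requests */
        static inline bool is_highorder_atomic(unsigned int order, gfp_t gfp_flags)
        {
                return order && (gfp_flags & __GFP_ATOMIC);
        }

        /* Hypothetical: try the highatomic reserve, kept out of __rmqueue() */
        static struct page *__rmqueue_highatomic(struct zone *zone,
                                unsigned int order, gfp_t gfp_flags)
        {
                if (!is_highorder_atomic(order, gfp_flags))
                        return NULL;
                return __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
        }

buffered_rmqueue() could then call __rmqueue_highatomic() under zone->lock
before falling back to __rmqueue(), keeping the order-0 path free of the
new branch.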

>         page = __rmqueue_smallest(zone, order, migratetype);
>         if (unlikely(!page)) {
>                 if (migratetype == MIGRATE_MOVABLE)
> @@ -1658,6 +1744,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
>                         page = __rmqueue_fallback(zone, order, migratetype);
>         }
>
> +out:
>         trace_mm_page_alloc_zone_locked(page, order, migratetype);
>         return page;
>  }
> @@ -1675,7 +1762,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>
>         spin_lock(&zone->lock);
>         for (i = 0; i < count; ++i) {
> -               struct page *page = __rmqueue(zone, order, migratetype);
> +               struct page *page = __rmqueue(zone, order, migratetype, 0);
>                 if (unlikely(page == NULL))
>                         break;
>
> @@ -2090,7 +2177,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
>                         WARN_ON_ONCE(order > 1);
>                 }
>                 spin_lock_irqsave(&zone->lock, flags);
> -               page = __rmqueue(zone, order, migratetype);
> +               page = __rmqueue(zone, order, migratetype, gfp_flags);
>                 spin_unlock(&zone->lock);
>                 if (!page)
>                         goto failed;
> @@ -2200,15 +2287,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>                         unsigned long mark, int classzone_idx, int alloc_flags,
>                         long free_pages)
>  {
> -       /* free_pages may go negative - that's OK */
>         long min = mark;
>         int o;
>         long free_cma = 0;
>
> +       /* free_pages may go negative - that's OK */
>         free_pages -= (1 << order) - 1;
> +
>         if (alloc_flags & ALLOC_HIGH)
>                 min -= min / 2;
> -       if (alloc_flags & ALLOC_HARDER)
> +
> +       /*
> +        * If the caller is not atomic then discount the reserves. This will
> +        * over-estimate the size of the atomic reserve but it avoids a search
> +        */
> +       if (likely(!(alloc_flags & ALLOC_HARDER)))
> +               free_pages -= z->nr_reserved_highatomic;
> +       else
>                 min -= min / 4;
>
>  #ifdef CONFIG_CMA
> @@ -2397,6 +2492,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>                 if (page) {
>                         if (prep_new_page(page, order, gfp_mask, alloc_flags))
>                                 goto try_this_zone;
> +
> +                       /*
> +                        * If this is a high-order atomic allocation then check
> +                        * if the pageblock should be reserved for the future
> +                        */
> +                       if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> +                               reserve_highatomic_pageblock(page, zone, order);
> +
>                         return page;
>                 }
>         }
> @@ -2664,9 +2767,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>
>         /*
>          * If an allocation failed after direct reclaim, it could be because
> -        * pages are pinned on the per-cpu lists. Drain them and try again
> +        * pages are pinned on the per-cpu lists or in high alloc reserves.
> +        * Shrink them and try again
>          */
>         if (!page && !drained) {
> +               unreserve_highatomic_pageblock(ac);
>                 drain_all_pages(NULL);
>                 drained = true;
>                 goto retry;

In the case of a high-order request, it can easily fail even after direct
reclaim, which can cause a ping-pong effect on the highatomic pageblocks.
Unreserving only when an order-0 request fails is one option to avoid that
problem.
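A minimal sketch of that option, reusing the retry block from the hunk
above (order is already a parameter of __alloc_pages_direct_reclaim()):

        if (!page && !drained) {
                /* Sketch: only release the highatomic reserve on order-0 failure */
                if (!order)
                        unreserve_highatomic_pageblock(ac);
                drain_all_pages(NULL);
                drained = true;
                goto retry;
        }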

Anyway, did you measure the fragmentation effect of this patch?

A high-order atomic request is usually unmovable and would be served from
an unmovable pageblock, which is then changed to highatomic. But reclaim
can be triggered by a movable request, and this unreserve turns that
pageblock into the movable type.

So the following sequence of transitions will usually happen:

unmovable -> highatomic -> movable

This can reduce the number of unmovable pageblocks, so unmovable
allocations get spread around and cause fragmentation. I'd like to see
results on fragmentation; a high-order stress benchmark could be one
candidate.

Thanks.

> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 49963aa2dff3..3427a155f85e 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -901,6 +901,7 @@ static char * const migratetype_names[MIGRATE_TYPES] = {
>         "Unmovable",
>         "Reclaimable",
>         "Movable",
> +       "HighAtomic",
>  #ifdef CONFIG_CMA
>         "CMA",
>  #endif
> --
> 2.4.6
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-08-24 12:30 ` [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
  2015-08-26 13:42   ` Vlastimil Babka
  2015-08-28 12:10   ` Michal Hocko
@ 2015-09-08  8:26   ` Joonsoo Kim
  2015-09-09 12:39     ` Mel Gorman
  2 siblings, 1 reply; 55+ messages in thread
From: Joonsoo Kim @ 2015-09-08  8:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> The primary purpose of watermarks is to ensure that reclaim can always
> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> These assume that order-0 allocations are all that is necessary for
> forward progress.
>
> High-order watermarks serve a different purpose. Kswapd had no high-order
> awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> This was particularly important when there were high-order atomic requests.
> The watermarks both gave kswapd awareness and made a reserve for those
> atomic requests.
>
> There are two important side-effects of this. The most important is that
> a non-atomic high-order request can fail even though free pages are available
> and the order-0 watermarks are ok. The second is that high-order watermark
> checks are expensive as the free list counts up to the requested order must
> be examined.
>
> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> have high-order watermarks. Kswapd and compaction still need high-order
> awareness which is handled by checking that at least one suitable high-order
> page is free.

I still don't think that one suitable high-order page is enough. If
fragmentation happens, there would be no order-2 free pages. If kswapd
prepares only one order-2 free page, then one of two successive process
forks (AFAIK, fork on x86 and ARM requires an order-2 page) must go into
direct reclaim to make an order-2 free page; kswapd cannot produce one in
that short a time. This causes latency for many high-order page requestors
in a fragmented situation.

> With the patch applied, there was little difference in the allocation
> failure rates as the atomic reserves are small relative to the number of
> allocation attempts. The expected impact is that there will never be an
> allocation failure report that shows suitable pages on the free lists.

Due to the mismatch between the highatomic pageblock reservation and the
free page counts seen by each allocation flag, an allocation failure while
suitable pages sit on the free lists is still possible.

> The one potential side-effect of this is that in a vanilla kernel, the
> watermark checks may have kept a free page for an atomic allocation. Now,
> we are 100% relying on the HighAtomic reserves and an early allocation to
> have allocated them.  If the first high-order atomic allocation is after
> the system is already heavily fragmented then it'll fail.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  mm/page_alloc.c | 38 ++++++++++++++++++++++++--------------
>  1 file changed, 24 insertions(+), 14 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2415f882b89c..35dc578730d1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2280,8 +2280,10 @@ static inline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
>  #endif /* CONFIG_FAIL_PAGE_ALLOC */
>
>  /*
> - * Return true if free pages are above 'mark'. This takes into account the order
> - * of the allocation.
> + * Return true if free base pages are above 'mark'. For high-order checks it
> + * will return true if the order-0 watermark is reached and there is at least
> + * one free page of a suitable size. Checking now avoids taking the zone lock
> + * to check in the allocation paths if no pages are free.
>   */
>  static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>                         unsigned long mark, int classzone_idx, int alloc_flags,
> @@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  {
>         long min = mark;
>         int o;
> -       long free_cma = 0;
> +       const bool atomic = (alloc_flags & ALLOC_HARDER);
>
>         /* free_pages may go negative - that's OK */
>         free_pages -= (1 << order) - 1;
> @@ -2301,7 +2303,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>          * If the caller is not atomic then discount the reserves. This will
>          * over-estimate the size of the atomic reserve but it avoids a search
>          */
> -       if (likely(!(alloc_flags & ALLOC_HARDER)))
> +       if (likely(!atomic))
>                 free_pages -= z->nr_reserved_highatomic;
>         else
>                 min -= min / 4;
> @@ -2309,22 +2311,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
>  #ifdef CONFIG_CMA
>         /* If allocation can't use CMA areas don't use free CMA pages */
>         if (!(alloc_flags & ALLOC_CMA))
> -               free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> +               free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
>  #endif
>
> -       if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> +       if (free_pages <= min + z->lowmem_reserve[classzone_idx])
>                 return false;
> -       for (o = 0; o < order; o++) {
> -               /* At the next order, this order's pages become unavailable */
> -               free_pages -= z->free_area[o].nr_free << o;
>
> -               /* Require fewer higher order pages to be free */
> -               min >>= 1;
> +       /* order-0 watermarks are ok */
> +       if (!order)
> +               return true;
> +
> +       /* Check at least one high-order page is free */
> +       for (o = order; o < MAX_ORDER; o++) {
> +               struct free_area *area = &z->free_area[o];
> +               int mt;
> +
> +               if (atomic && area->nr_free)
> +                       return true;

How about checking area->nr_free first? In both the atomic and !atomic
cases, nr_free == 0 means there are no suitable pages.

So, something like:

        if (!area->nr_free)
                continue;
        if (atomic)
                return true;
        ...
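Folded into the posted loop, that would look roughly like this (a sketch
only, reusing the names from the hunk above):

        for (o = order; o < MAX_ORDER; o++) {
                struct free_area *area = &z->free_area[o];
                int mt;

                if (!area->nr_free)
                        continue;
                if (atomic)
                        return true;
                for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                        if (!list_empty(&area->free_list[mt]))
                                return true;
                }
        }
        return false;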


> -               if (free_pages <= min)
> -                       return false;
> +               for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> +                       if (!list_empty(&area->free_list[mt]))
> +                               return true;
> +               }

I'm not sure this is really faster than before. We need to check three
lists for each order.

Think about the order-2 case. I guess order-2 pages usually sit on movable
pageblocks rather than unmovable ones, so we have to check all three lists
and the cost goes up.

And if the system is fragmented and short of order-2 free pages, we have to
walk orders 3, 4, ..., MAX_ORDER-1 just to find out that there is no
suitable free page. This would be more costly than the previous approach.

Thanks.

>         }
> -       return true;
> +       return false;
>  }
>
>  bool zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> --
> 2.4.6
>

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-08  6:49   ` Joonsoo Kim
@ 2015-09-09 12:22     ` Mel Gorman
  2015-09-18  6:25       ` Joonsoo Kim
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-09-09 12:22 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Sep 08, 2015 at 03:49:58PM +0900, Joonsoo Kim wrote:
> 2015-08-24 21:09 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > __GFP_WAIT has been used to identify atomic context in callers that hold
> > spinlocks or are in interrupts. They are expected to be high priority and
> > have access one of two watermarks lower than "min" which can be referred
> > to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> > watermark and can be called the "high priority reserve".
> >
> > Over time, callers had a requirement to not block when fallback options
> > were available. Some have abused __GFP_WAIT leading to a situation where
> > an optimisitic allocation with a fallback option can access atomic reserves.
> >
> > This patch uses __GFP_ATOMIC to identify callers that are truely atomic,
> > cannot sleep and have no alternative. High priority users continue to use
> > __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> > willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM to identify callers
> > that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> > as a caller that is willing to enter direct reclaim and wake kswapd for
> > background reclaim.
> 
> Hello, Mel.
> 
> I think that it is better to do one thing at one patch.

This was a case where the incremental change felt unnecessary. The purpose
of the patch is to "distinguish between being unable to sleep, unwilling
to sleep and avoiding waking kswapd". Splitting that up is possible but
I'm not convinced it helps.

> To distinguish real atomic, we just need to introduce __GFP_ATOMIC and
> make GFP_ATOMIC to __GFP_ATOMIC | GFP_HARDER and change related
> things. __GFP_WAIT changes isn't needed at all for this purpose. It can
> reduce patch size and provides more good bisectability.
> 
> And, I don't think that introducing __GFP_KSWAPD_RECLAIM is good thing.
> Basically, kswapd reclaim should be enforced.

Several years ago, I would have agreed. Now there are callers that want
to control kswapd and I think it made more sense to clearly state whether
RECLAIM and KSWAPD are allowed instead of having RECLAIM and NO_KSWAPD
flags -- i.e. flags that consistently allow or consistently deny.
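For example, as in the skb_page_frag_refill() hunk of this patch, a caller
with a cheap fallback clears direct reclaim but still lets kswapd be woken:

        /* Avoid direct reclaim but allow kswapd to wake */
        pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
                                  __GFP_COMP | __GFP_NOWARN |
                                  __GFP_NORETRY,
                                  SKB_FRAG_PAGE_ORDER);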

> New flag makes user who manually
> manipulate gfp flag more difficult. Without this change, your second hazard will
> be disappeared although it is almost harmless.
> 
> And, I doubt that this big one shot change is preferable. AFAIK, even if changes
> are one to one mapping and no functional difference, each one is made by
> one patch and send it to correct maintainer. I guess there is some difficulty
> in this patch to do like this, but, it could. Isn't it?
> 

Splitting this into one patch per maintainer would be a review and bisection
nightmare. If I saw someone else doing that I would wonder if they were
just trying to increase their patch count for no reason.

> Some nitpicks are below.
> 
> > <SNIP>
> >
> > diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> > index 03e75fef15b8..86809bd2026d 100644
> > --- a/arch/arm/xen/mm.c
> > +++ b/arch/arm/xen/mm.c
> > @@ -25,7 +25,7 @@
> >  unsigned long xen_get_swiotlb_free_pages(unsigned int order)
> >  {
> >         struct memblock_region *reg;
> > -       gfp_t flags = __GFP_NOWARN;
> > +       gfp_t flags = __GFP_NOWARN|___GFP_KSWAPD_RECLAIM;
> 
> Please use __XXX rather than ___XXX.
> 

Fixed.

> > <SNIP>
> >
> > @@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> >                  * We solve this, and guarantee forward progress, with a rescuer
> >                  * workqueue per bio_set. If we go to allocate and there are
> >                  * bios on current->bio_list, we first try the allocation
> > -                * without __GFP_WAIT; if that fails, we punt those bios we
> > -                * would be blocking to the rescuer workqueue before we retry
> > -                * with the original gfp_flags.
> > +                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> > +                * bios we would be blocking to the rescuer workqueue before
> > +                * we retry with the original gfp_flags.
> >                  */
> >
> >                 if (current->bio_list && !bio_list_empty(current->bio_list))
> > -                       gfp_mask &= ~__GFP_WAIT;
> > +                       gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> 
> How about introduce helper function to mask out __GFP_DIRECT_RECLAIM?
> It can be used many places.
> 

In this case, the pattern for removing a single flag is easier to recognise
than a helper whose implementation must be examined.
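For reference, the helper being suggested would amount to little more than
this (hypothetical name, not something this series adds):

        static inline gfp_t gfp_drop_direct_reclaim(gfp_t gfp_mask)
        {
                return gfp_mask & ~__GFP_DIRECT_RECLAIM;
        }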

> >                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> >                 if (!p && gfp_mask != saved_gfp) {
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 627ed0c593fb..e3605acaaffc 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
> >   * @bio: bio to allocate request for (can be %NULL)
> >   * @gfp_mask: allocation mask
> >   *
> > - * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
> > - * function keeps retrying under memory pressure and fails iff @q is dead.
> > + * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> > + * this function keeps retrying under memory pressure and fails iff @q is dead.
> >   *
> >   * Must be called with @q->queue_lock held and,
> >   * Returns ERR_PTR on failure, with @q->queue_lock held.
> > @@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> >         if (!IS_ERR(rq))
> >                 return rq;
> >
> > -       if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> > +       if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
> >                 blk_put_rl(rl);
> >                 return rq;
> >         }
> > @@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
> >   * BUG.
> >   *
> >   * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> > - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> > - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> > - * completion of a bio that hasn't been submitted yet, thus resulting in a
> > - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> > - * of bio_alloc(), as that avoids the mempool deadlock.
> > + * given to how you allocate bios. In particular, you cannot use
> > + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> > + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> > + * thus resulting in a deadlock. Alternatively bios should be allocated using
> > + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
> >   * If possible a big IO should be split into smaller parts when allocation
> >   * fails. Partial allocation should not be an error, or you risk a live-lock.
> >   */
> > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > index 1a27f45ec776..381cb50a673c 100644
> > --- a/block/blk-ioc.c
> > +++ b/block/blk-ioc.c
> > @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
> >  {
> >         struct io_context *ioc;
> >
> > -       might_sleep_if(gfp_flags & __GFP_WAIT);
> > +       might_sleep_if(gfpflags_allow_blocking(gfp_flags));
> >
> >         do {
> >                 task_lock(task);
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index 9b6e28830b82..a8b46659ce4e 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
> >         if (tag != -1)
> >                 return tag;
> >
> > -       if (!(data->gfp & __GFP_WAIT))
> > +       if (!gfpflags_allow_blocking(data->gfp))
> >                 return -1;
> >
> >         bs = bt_wait_ptr(bt, hctx);
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 7d842db59699..7d80379d7a38 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
> >                 if (percpu_ref_tryget_live(&q->mq_usage_counter))
> >                         return 0;
> >
> > -               if (!(gfp & __GFP_WAIT))
> > +               if (!gfpflags_allow_blocking(gfp))
> >                         return -EBUSY;
> >
> >                 ret = wait_event_interruptible(q->mq_freeze_wq,
> > @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
> >
> >         ctx = blk_mq_get_ctx(q);
> >         hctx = q->mq_ops->map_queue(q, ctx->cpu);
> > -       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> > +       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
> >                         reserved, ctx, hctx);
> >
> >         rq = __blk_mq_alloc_request(&alloc_data, rw);
> > -       if (!rq && (gfp & __GFP_WAIT)) {
> > +       if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
> >                 __blk_mq_run_hw_queue(hctx);
> >                 blk_mq_put_ctx(ctx);
> 
> Is there any reason not to use gfpflags_allow_nonblocking() here?
> There are some places not using this helper and reason isn't
> specified.
> 

Strictly speaking the helper could be used. However, in cases where the
same function manipulates or examines the flag in any way, I did not use
the helper. In all those cases, I thought the final result was easier to
follow.
> >
> >  /*
> > + * A caller that is willing to wait may enter direct reclaim and will
> > + * wake kswapd to reclaim pages in the background until the high
> > + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> > + * avoid unnecessary delays when a fallback option is available but
> > + * still allow kswapd to reclaim in the background. The kswapd flag
> > + * can be cleared when the reclaiming of pages would cause unnecessary
> > + * disruption.
> > + */
> > +#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> 
> Convention is that combination of gfp flags don't use __XXX.
> 

I don't understand. GFP_MOVABLE_MASK, GFP_USER and a bunch of other
combinations use __XXX.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-08  8:01   ` Joonsoo Kim
@ 2015-09-09 12:32     ` Mel Gorman
  2015-09-18  6:38       ` Joonsoo Kim
  0 siblings, 1 reply; 55+ messages in thread
From: Mel Gorman @ 2015-09-09 12:32 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Sep 08, 2015 at 05:01:06PM +0900, Joonsoo Kim wrote:
> 2015-08-24 21:29 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > <SNIP>
> >
> > +/*
> > + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > + * there are no empty page blocks that contain a page with a suitable order
> > + */
> > +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> > +                               unsigned int alloc_order)
> > +{
> > +       int mt = get_pageblock_migratetype(page);
> > +       unsigned long max_managed, flags;
> > +
> > +       if (mt == MIGRATE_HIGHATOMIC)
> > +               return;
> > +
> > +       /*
> > +        * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> > +        * Check is race-prone but harmless.
> > +        */
> > +       max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> > +       if (zone->nr_reserved_highatomic >= max_managed)
> > +               return;
> > +
> > +       /* Yoink! */
> > +       spin_lock_irqsave(&zone->lock, flags);
> > +       zone->nr_reserved_highatomic += pageblock_nr_pages;
> > +       set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> > +       move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> > +       spin_unlock_irqrestore(&zone->lock, flags);
> > +}
> 
> It is better to check if migratetype is MIGRATE_ISOLATE or MIGRATE_CMA.
> There can be race that isolated pageblock is changed to MIGRATE_HIGHATOMIC.
> 

Done.

> > +/*
> > + * Used when an allocation is about to fail under memory pressure. This
> > + * potentially hurts the reliability of high-order allocations when under
> > + * intense memory pressure but failed atomic allocations should be easier
> > + * to recover from than an OOM.
> > + */
> > +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> > +{
> > +       struct zonelist *zonelist = ac->zonelist;
> > +       unsigned long flags;
> > +       struct zoneref *z;
> > +       struct zone *zone;
> > +       struct page *page;
> > +       int order;
> > +
> > +       for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> > +                                                               ac->nodemask) {
> > +               /* Preserve at least one pageblock */
> > +               if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> > +                       continue;
> > +
> > +               spin_lock_irqsave(&zone->lock, flags);
> > +               for (order = 0; order < MAX_ORDER; order++) {
> > +                       struct free_area *area = &(zone->free_area[order]);
> > +
> > +                       if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> > +                               continue;
> > +
> > +                       page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> > +                                               struct page, lru);
> > +
> > +                       zone->nr_reserved_highatomic -= pageblock_nr_pages;
> > +
> > +                       /*
> > +                        * Convert to ac->migratetype and avoid the normal
> > +                        * pageblock stealing heuristics. Minimally, the caller
> > +                        * is doing the work and needs the pages. More
> > +                        * importantly, if the block was always converted to
> > +                        * MIGRATE_UNMOVABLE or another type then the number
> > +                        * of pageblocks that cannot be completely freed
> > +                        * may increase.
> > +                        */
> > +                       set_pageblock_migratetype(page, ac->migratetype);
> > +                       move_freepages_block(zone, page, ac->migratetype);
> > +                       spin_unlock_irqrestore(&zone->lock, flags);
> > +                       return;
> > +               }
> > +               spin_unlock_irqrestore(&zone->lock, flags);
> > +       }
> > +}
> > +
> >  /* Remove an element from the buddy allocator from the fallback list */
> >  static inline struct page *
> >  __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> > @@ -1645,10 +1725,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> >   * Call me with the zone->lock already held.
> >   */
> >  static struct page *__rmqueue(struct zone *zone, unsigned int order,
> > -                                               int migratetype)
> > +                               int migratetype, gfp_t gfp_flags)
> >  {
> >         struct page *page;
> >
> > +       if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
> > +               page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > +               if (page)
> > +                       goto out;
> > +       }
> 
> This hunk only serves for high order allocation so it is better to introduce
> rmqueue_highorder() and move this hunk to that function and call it in
> buffered_rmqueue. It makes order-0 request doesn't get worse
> by adding new branch.
> 

The helper is overkill. I can move the check to avoid the branch but it
duplicates the tracepoint handling which can be easy to miss in the
future. I'm not convinced it is an overall improvement.

> And, there is some mismatch that check atomic high-order allocation.
> In some place, you checked __GFP_ATOMIC, but some other places,
> you checked ALLOC_HARDER. It is better to use unified one.
> Introducing helper function may be a good choice.
> 

Which cases specifically? In the zone_watermark check, it's because
there are no GFP flags in that context. They could be passed in but then
every caller needs to be updated accordingly and overall it gains
nothing.

> >         page = __rmqueue_smallest(zone, order, migratetype);
> >         if (unlikely(!page)) {
> >                 if (migratetype == MIGRATE_MOVABLE)
> > @@ -1658,6 +1744,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
> >                         page = __rmqueue_fallback(zone, order, migratetype);
> >         }
> >
> > +out:
> >         trace_mm_page_alloc_zone_locked(page, order, migratetype);
> >         return page;
> >  }
> > @@ -1675,7 +1762,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> >
> >         spin_lock(&zone->lock);
> >         for (i = 0; i < count; ++i) {
> > -               struct page *page = __rmqueue(zone, order, migratetype);
> > +               struct page *page = __rmqueue(zone, order, migratetype, 0);
> >                 if (unlikely(page == NULL))
> >                         break;
> >
> > @@ -2090,7 +2177,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> >                         WARN_ON_ONCE(order > 1);
> >                 }
> >                 spin_lock_irqsave(&zone->lock, flags);
> > -               page = __rmqueue(zone, order, migratetype);
> > +               page = __rmqueue(zone, order, migratetype, gfp_flags);
> >                 spin_unlock(&zone->lock);
> >                 if (!page)
> >                         goto failed;
> > @@ -2200,15 +2287,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >                         unsigned long mark, int classzone_idx, int alloc_flags,
> >                         long free_pages)
> >  {
> > -       /* free_pages may go negative - that's OK */
> >         long min = mark;
> >         int o;
> >         long free_cma = 0;
> >
> > +       /* free_pages may go negative - that's OK */
> >         free_pages -= (1 << order) - 1;
> > +
> >         if (alloc_flags & ALLOC_HIGH)
> >                 min -= min / 2;
> > -       if (alloc_flags & ALLOC_HARDER)
> > +
> > +       /*
> > +        * If the caller is not atomic then discount the reserves. This will
> > +        * over-estimate the size of the atomic reserve but it avoids a search
> > +        */
> > +       if (likely(!(alloc_flags & ALLOC_HARDER)))
> > +               free_pages -= z->nr_reserved_highatomic;
> > +       else
> >                 min -= min / 4;
> >
> >  #ifdef CONFIG_CMA
> > @@ -2397,6 +2492,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >                 if (page) {
> >                         if (prep_new_page(page, order, gfp_mask, alloc_flags))
> >                                 goto try_this_zone;
> > +
> > +                       /*
> > +                        * If this is a high-order atomic allocation then check
> > +                        * if the pageblock should be reserved for the future
> > +                        */
> > +                       if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> > +                               reserve_highatomic_pageblock(page, zone, order);
> > +
> >                         return page;
> >                 }
> >         }
> > @@ -2664,9 +2767,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >
> >         /*
> >          * If an allocation failed after direct reclaim, it could be because
> > -        * pages are pinned on the per-cpu lists. Drain them and try again
> > +        * pages are pinned on the per-cpu lists or in high alloc reserves.
> > +        * Shrink them and try again
> >          */
> >         if (!page && !drained) {
> > +               unreserve_highatomic_pageblock(ac);
> >                 drain_all_pages(NULL);
> >                 drained = true;
> >                 goto retry;
> 
> In case of high-order request, it can easily fail even after direct reclaim.
> It can cause ping-pong effect on highatomic pageblock.
> Unreserve on order-0 request fail is one option to avoid that problem.
> 

That is potentially a modification that would be of interest to non-atomic
users of the high-atomic reserve, which I know you are interested in.
However, it is both outside the scope of the series and a hazardous change,
because a normal high-order allocation that can reclaim could unreserve a
block reserved for high-order atomic allocations and the atomic allocations
would then fail. That is a sufficiently strong side-effect that it should be
a separate patch that fixes a measurable problem.
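
For clarity, the variant Joonsoo is suggesting would look roughly like the
following (a sketch only, not something this series implements):

        if (!page && !drained) {
                /*
                 * Sketch of the suggestion: only give the highatomic
                 * reserve back when an order-0 request is about to fail,
                 * so a normal high-order request that can reclaim cannot
                 * ping-pong the reserved pageblock.
                 */
                if (!order)
                        unreserve_highatomic_pageblock(ac);
                drain_all_pages(NULL);
                drained = true;
                goto retry;
        }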

> Anyway, do you measure fragmentation effect of this patch?
> 

Nothing interesting was revealed; the fragmentation effects looked similar
before and after the series. The number of reserved pageblocks is too
small to matter.

> High-order atomic request is usually unmovable and it would be served
> by unmovable pageblock. And then, it is changed to highatomic.
> But, reclaim can be triggered by movable request and this unreserve
> makes that pageblock to movable type.
> 
> So, following sequence of transition will usually happen.
> 
> unmovable -> highatomic -> movable
> 
> It can reduce number of unmovable pageblock and unmovable
> allocation can be spread and cause fragmentation. I'd like to see
> result about fragmentation. Highorder stress benchmark can be
> one of candidates.
> 

Too few to matter. I checked high-order stress tests and they appeared fine;
external fragmentation events were fine.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-08  8:26   ` Joonsoo Kim
@ 2015-09-09 12:39     ` Mel Gorman
  2015-09-18  6:56       ` Joonsoo Kim
  2015-09-30  8:51       ` Vitaly Wool
  0 siblings, 2 replies; 55+ messages in thread
From: Mel Gorman @ 2015-09-09 12:39 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On Tue, Sep 08, 2015 at 05:26:13PM +0900, Joonsoo Kim wrote:
> 2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > The primary purpose of watermarks is to ensure that reclaim can always
> > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> > These assume that order-0 allocations are all that is necessary for
> > forward progress.
> >
> > High-order watermarks serve a different purpose. Kswapd had no high-order
> > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> > This was particularly important when there were high-order atomic requests.
> > The watermarks both gave kswapd awareness and made a reserve for those
> > atomic requests.
> >
> > There are two important side-effects of this. The most important is that
> > a non-atomic high-order request can fail even though free pages are available
> > and the order-0 watermarks are ok. The second is that high-order watermark
> > checks are expensive as the free list counts up to the requested order must
> > be examined.
> >
> > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> > have high-order watermarks. Kswapd and compaction still need high-order
> > awareness which is handled by checking that at least one suitable high-order
> > page is free.
> 
> I still don't think that this one suitable high-order page is enough.
> If fragmentation happens, there would be no order-2 freepage. If kswapd
> prepares only 1 order-2 freepage, one of two successive process forks
> (AFAIK, fork in x86 and ARM require order 2 page) must go to direct reclaim
> to make order-2 freepage. Kswapd cannot make order-2 freepage in that
> short time. It causes latency to many high-order freepage requestor
> in fragmented situation.
> 

So what do you suggest instead? A fixed number, some other heuristic?
You have pushed several times now for the series to focus on the latency
of standard high-order allocations but again I will say that it is outside
the scope of this series. If you want to take steps to reduce the latency
of ordinary high-order allocation requests that can sleep then it should
be a separate series.

> > With the patch applied, there was little difference in the allocation
> > failure rates as the atomic reserves are small relative to the number of
> > allocation attempts. The expected impact is that there will never be an
> > allocation failure report that shows suitable pages on the free lists.
> 
> Due to highatomic pageblock and freepage count mismatch per allocation
> flag, allocation failure with suitable pages can still be possible.
> 

An allocation failure of this type would be a !atomic allocation that
cannot access the reserve. If such allocation requests can access the
reserve then it defeats the whole point of the pageblock type.

> > + * Return true if free base pages are above 'mark'. For high-order checks it
> > + * will return true of the order-0 watermark is reached and there is at least
> > + * one free page of a suitable size. Checking now avoids taking the zone lock
> > + * to check in the allocation paths if no pages are free.
> >   */
> >  static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >                         unsigned long mark, int classzone_idx, int alloc_flags,
> > @@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  {
> >         long min = mark;
> >         int o;
> > -       long free_cma = 0;
> > +       const bool atomic = (alloc_flags & ALLOC_HARDER);
> >
> >         /* free_pages may go negative - that's OK */
> >         free_pages -= (1 << order) - 1;
> > @@ -2301,7 +2303,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >          * If the caller is not atomic then discount the reserves. This will
> >          * over-estimate how the atomic reserve but it avoids a search
> >          */
> > -       if (likely(!(alloc_flags & ALLOC_HARDER)))
> > +       if (likely(!atomic))
> >                 free_pages -= z->nr_reserved_highatomic;
> >         else
> >                 min -= min / 4;
> > @@ -2309,22 +2311,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> >  #ifdef CONFIG_CMA
> >         /* If allocation can't use CMA areas don't use free CMA pages */
> >         if (!(alloc_flags & ALLOC_CMA))
> > -               free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> > +               free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> >  #endif
> >
> > -       if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> > +       if (free_pages <= min + z->lowmem_reserve[classzone_idx])
> >                 return false;
> > -       for (o = 0; o < order; o++) {
> > -               /* At the next order, this order's pages become unavailable */
> > -               free_pages -= z->free_area[o].nr_free << o;
> >
> > -               /* Require fewer higher order pages to be free */
> > -               min >>= 1;
> > +       /* order-0 watermarks are ok */
> > +       if (!order)
> > +               return true;
> > +
> > +       /* Check at least one high-order page is free */
> > +       for (o = order; o < MAX_ORDER; o++) {
> > +               struct free_area *area = &z->free_area[o];
> > +               int mt;
> > +
> > +               if (atomic && area->nr_free)
> > +                       return true;
> 
> How about checking area->nr_free first?
> In both atomic and !atomic case, nr_free == 0 means
> there is no appropriate pages.
> 
> So,
> if (!area->nr_free)
>     continue;
> if (atomic)
>     return true;
> ...
> 
> 
> > -               if (free_pages <= min)
> > -                       return false;
> > +               for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> > +                       if (!list_empty(&area->free_list[mt]))
> > +                               return true;
> > +               }
> 
> I'm not sure this is really faster than previous.
> We need to check three lists on each order.
> 
> Think about order-2 case. I guess order-2 is usually on movable
> pageblock rather than unmovable pageblock. In this case,
> we need to check three lists so cost is more.
> 

Ok, the extra check makes sense. Thanks.
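
With that reordering folded in, the high-order part of the check would look
something like this (a sketch; exact naming may differ in the next version):

        /* Check that at least one suitable high-order page is free */
        for (o = order; o < MAX_ORDER; o++) {
                struct free_area *area = &z->free_area[o];
                int mt;

                /* No free pages of this order at all */
                if (!area->nr_free)
                        continue;

                /*
                 * Atomic callers may take a page from any list,
                 * including the MIGRATE_HIGHATOMIC reserve
                 */
                if (atomic)
                        return true;

                /* Everyone else needs a page on an unreserved list */
                for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
                        if (!list_empty(&area->free_list[mt]))
                                return true;
                }
        }
        return false;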

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
  2015-09-09 12:22     ` Mel Gorman
@ 2015-09-18  6:25       ` Joonsoo Kim
  0 siblings, 0 replies; 55+ messages in thread
From: Joonsoo Kim @ 2015-09-18  6:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Michal Hocko, Linux-MM, LKML

On Wed, Sep 09, 2015 at 01:22:03PM +0100, Mel Gorman wrote:
> On Tue, Sep 08, 2015 at 03:49:58PM +0900, Joonsoo Kim wrote:
> > 2015-08-24 21:09 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > > __GFP_WAIT has been used to identify atomic context in callers that hold
> > > spinlocks or are in interrupts. They are expected to be high priority and
> > > have access to one of two watermarks lower than "min" which can be referred
> > > to as the "atomic reserve". __GFP_HIGH users get access to the first lower
> > > watermark and can be called the "high priority reserve".
> > >
> > > Over time, callers had a requirement to not block when fallback options
> > > were available. Some have abused __GFP_WAIT leading to a situation where
> > > an optimistic allocation with a fallback option can access atomic reserves.
> > >
> > > This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
> > > cannot sleep and have no alternative. High priority users continue to use
> > > __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and are
> > > willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies callers
> > > that want to wake kswapd for background reclaim. __GFP_WAIT is redefined
> > > as a caller that is willing to enter direct reclaim and wake kswapd for
> > > background reclaim.
> > 
> > Hello, Mel.
> > 
> > I think that it is better to do one thing per patch.
> 
> This was a case where the incremental change felt unnecessary. The purpose
> of the patch is to "distinguish between being unable to sleep, unwilling
> to sleep and avoiding waking kswapd". Splitting that up is possible but
> I'm not convinced it helps.
> 
> > To distinguish the real atomic callers, we just need to introduce __GFP_ATOMIC,
> > make GFP_ATOMIC equal to __GFP_ATOMIC | GFP_HARDER and change the related
> > things. The __GFP_WAIT changes aren't needed at all for this purpose. It would
> > reduce the patch size and provide better bisectability.
> >
> > And, I don't think that introducing __GFP_KSWAPD_RECLAIM is a good thing.
> > Basically, kswapd reclaim should be enforced.
> 
> Several years ago, I would have agreed. Now there are callers that want
> to control kswapd and I think it made more sense to clearly state whether
> RECLAIM and KSWAPD are allowed instead of having RECLAIM and NO_KSWAPD
> flags -- i.e. flags that consistently allow or consistently deny.
> 
> > The new flag makes things more difficult for users who manually
> > manipulate gfp flags. Without this change, your second hazard would
> > disappear, although it is almost harmless.
> >
> > And, I doubt that this big one-shot change is preferable. AFAIK, even if the
> > changes are a one-to-one mapping with no functional difference, each one is
> > normally made as a separate patch and sent to the correct maintainer. I guess
> > there is some difficulty in doing that for this patch, but it could be done.
> > Couldn't it?
> > 
> 
> Splitting this into one patch per maintainer would be a review and bisection
> nightmare. If I saw someone else doing that I would wonder if they were
> just trying to increase their patch count for no reason.
> 
> > Some nitpicks are below.
> > 
> > > <SNIP>
> > >
> > > diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> > > index 03e75fef15b8..86809bd2026d 100644
> > > --- a/arch/arm/xen/mm.c
> > > +++ b/arch/arm/xen/mm.c
> > > @@ -25,7 +25,7 @@
> > >  unsigned long xen_get_swiotlb_free_pages(unsigned int order)
> > >  {
> > >         struct memblock_region *reg;
> > > -       gfp_t flags = __GFP_NOWARN;
> > > +       gfp_t flags = __GFP_NOWARN|___GFP_KSWAPD_RECLAIM;
> > 
> > Please use __XXX rather than ___XXX.
> > 
> 
> Fixed.
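
The fixed hunk would presumably then read:

        gfp_t flags = __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;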
> 
> > > <SNIP>
> > >
> > > @@ -457,13 +457,13 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, int nr_iovecs, struct bio_set *bs)
> > >                  * We solve this, and guarantee forward progress, with a rescuer
> > >                  * workqueue per bio_set. If we go to allocate and there are
> > >                  * bios on current->bio_list, we first try the allocation
> > > -                * without __GFP_WAIT; if that fails, we punt those bios we
> > > -                * would be blocking to the rescuer workqueue before we retry
> > > -                * with the original gfp_flags.
> > > +                * without __GFP_DIRECT_RECLAIM; if that fails, we punt those
> > > +                * bios we would be blocking to the rescuer workqueue before
> > > +                * we retry with the original gfp_flags.
> > >                  */
> > >
> > >                 if (current->bio_list && !bio_list_empty(current->bio_list))
> > > -                       gfp_mask &= ~__GFP_WAIT;
> > > +                       gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> > 
> > How about introducing a helper function to mask out __GFP_DIRECT_RECLAIM?
> > It could be used in many places.
> > 
> 
> In this case, the pattern for removing a single flag is easier to recognise
> than a helper whose implementation must be examined.
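
As a sketch of the two styles being weighed here (the helper name below is
purely illustrative, not an existing kernel interface):

        /* Open-coded, as the patch does it */
        gfp_mask &= ~__GFP_DIRECT_RECLAIM;

        /* Hypothetical helper along the lines suggested */
        static inline gfp_t gfp_clear_direct_reclaim(gfp_t gfp_mask)
        {
                return gfp_mask & ~__GFP_DIRECT_RECLAIM;
        }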
> 
> > >                 p = mempool_alloc(bs->bio_pool, gfp_mask);
> > >                 if (!p && gfp_mask != saved_gfp) {
> > > diff --git a/block/blk-core.c b/block/blk-core.c
> > > index 627ed0c593fb..e3605acaaffc 100644
> > > --- a/block/blk-core.c
> > > +++ b/block/blk-core.c
> > > @@ -1156,8 +1156,8 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
> > >   * @bio: bio to allocate request for (can be %NULL)
> > >   * @gfp_mask: allocation mask
> > >   *
> > > - * Get a free request from @q.  If %__GFP_WAIT is set in @gfp_mask, this
> > > - * function keeps retrying under memory pressure and fails iff @q is dead.
> > > + * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
> > > + * this function keeps retrying under memory pressure and fails iff @q is dead.
> > >   *
> > >   * Must be called with @q->queue_lock held and,
> > >   * Returns ERR_PTR on failure, with @q->queue_lock held.
> > > @@ -1177,7 +1177,7 @@ static struct request *get_request(struct request_queue *q, int rw_flags,
> > >         if (!IS_ERR(rq))
> > >                 return rq;
> > >
> > > -       if (!(gfp_mask & __GFP_WAIT) || unlikely(blk_queue_dying(q))) {
> > > +       if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) {
> > >                 blk_put_rl(rl);
> > >                 return rq;
> > >         }
> > > @@ -1255,11 +1255,11 @@ EXPORT_SYMBOL(blk_get_request);
> > >   * BUG.
> > >   *
> > >   * WARNING: When allocating/cloning a bio-chain, careful consideration should be
> > > - * given to how you allocate bios. In particular, you cannot use __GFP_WAIT for
> > > - * anything but the first bio in the chain. Otherwise you risk waiting for IO
> > > - * completion of a bio that hasn't been submitted yet, thus resulting in a
> > > - * deadlock. Alternatively bios should be allocated using bio_kmalloc() instead
> > > - * of bio_alloc(), as that avoids the mempool deadlock.
> > > + * given to how you allocate bios. In particular, you cannot use
> > > + * __GFP_DIRECT_RECLAIM for anything but the first bio in the chain. Otherwise
> > > + * you risk waiting for IO completion of a bio that hasn't been submitted yet,
> > > + * thus resulting in a deadlock. Alternatively bios should be allocated using
> > > + * bio_kmalloc() instead of bio_alloc(), as that avoids the mempool deadlock.
> > >   * If possible a big IO should be split into smaller parts when allocation
> > >   * fails. Partial allocation should not be an error, or you risk a live-lock.
> > >   */
> > > diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> > > index 1a27f45ec776..381cb50a673c 100644
> > > --- a/block/blk-ioc.c
> > > +++ b/block/blk-ioc.c
> > > @@ -289,7 +289,7 @@ struct io_context *get_task_io_context(struct task_struct *task,
> > >  {
> > >         struct io_context *ioc;
> > >
> > > -       might_sleep_if(gfp_flags & __GFP_WAIT);
> > > +       might_sleep_if(gfpflags_allow_blocking(gfp_flags));
> > >
> > >         do {
> > >                 task_lock(task);
> > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > > index 9b6e28830b82..a8b46659ce4e 100644
> > > --- a/block/blk-mq-tag.c
> > > +++ b/block/blk-mq-tag.c
> > > @@ -264,7 +264,7 @@ static int bt_get(struct blk_mq_alloc_data *data,
> > >         if (tag != -1)
> > >                 return tag;
> > >
> > > -       if (!(data->gfp & __GFP_WAIT))
> > > +       if (!gfpflags_allow_blocking(data->gfp))
> > >                 return -1;
> > >
> > >         bs = bt_wait_ptr(bt, hctx);
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 7d842db59699..7d80379d7a38 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -85,7 +85,7 @@ static int blk_mq_queue_enter(struct request_queue *q, gfp_t gfp)
> > >                 if (percpu_ref_tryget_live(&q->mq_usage_counter))
> > >                         return 0;
> > >
> > > -               if (!(gfp & __GFP_WAIT))
> > > +               if (!gfpflags_allow_blocking(gfp))
> > >                         return -EBUSY;
> > >
> > >                 ret = wait_event_interruptible(q->mq_freeze_wq,
> > > @@ -261,11 +261,11 @@ struct request *blk_mq_alloc_request(struct request_queue *q, int rw, gfp_t gfp,
> > >
> > >         ctx = blk_mq_get_ctx(q);
> > >         hctx = q->mq_ops->map_queue(q, ctx->cpu);
> > > -       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_WAIT,
> > > +       blk_mq_set_alloc_data(&alloc_data, q, gfp & ~__GFP_DIRECT_RECLAIM,
> > >                         reserved, ctx, hctx);
> > >
> > >         rq = __blk_mq_alloc_request(&alloc_data, rw);
> > > -       if (!rq && (gfp & __GFP_WAIT)) {
> > > +       if (!rq && (gfp & __GFP_DIRECT_RECLAIM)) {
> > >                 __blk_mq_run_hw_queue(hctx);
> > >                 blk_mq_put_ctx(ctx);
> > 
> > Is there any reason not to use gfpflags_allow_nonblocking() here?
> > There are some places not using this helper and the reason isn't
> > specified.
> > 
> 
> Strictly speaking the helper could be used. However, in cases where the
> same function manipulates or examines the flag in any way, I did not use
> the helper. In all those cases, I thought the final result was
> easier to follow.
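
For reference, the helper being discussed is roughly:

        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }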
> > >
> > >  /*
> > > + * A caller that is willing to wait may enter direct reclaim and will
> > > + * wake kswapd to reclaim pages in the background until the high
> > > + * watermark is met. A caller may wish to clear __GFP_DIRECT_RECLAIM to
> > > + * avoid unnecessary delays when a fallback option is available but
> > > + * still allow kswapd to reclaim in the background. The kswapd flag
> > > + * can be cleared when the reclaiming of pages would cause unnecessary
> > > + * disruption.
> > > + */
> > > +#define __GFP_WAIT (__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM)
> > 
> > The convention is that combinations of gfp flags don't use __XXX.
> > 
> 
> I don't understand. GFP_MOVABLE_MASK, GFP_USER and a bunch of other
> combinations use __XXX.

Hello, Mel.
Sorry for the late response.

Yes, GFP_XXX can consist of multiple __GFP_XXX.
But an __GFP_XXX doesn't consist of multiple __GFP_YYY.
Your __GFP_WAIT seems to be the first one.

Thanks.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-09 12:32     ` Mel Gorman
@ 2015-09-18  6:38       ` Joonsoo Kim
  2015-09-21 10:51         ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Joonsoo Kim @ 2015-09-18  6:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Michal Hocko, Linux-MM, LKML

On Wed, Sep 09, 2015 at 01:32:39PM +0100, Mel Gorman wrote:
> On Tue, Sep 08, 2015 at 05:01:06PM +0900, Joonsoo Kim wrote:
> > 2015-08-24 21:29 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > > <SNIP>
> > >
> > > +/*
> > > + * Reserve a pageblock for exclusive use of high-order atomic allocations if
> > > + * there are no empty page blocks that contain a page with a suitable order
> > > + */
> > > +static void reserve_highatomic_pageblock(struct page *page, struct zone *zone,
> > > +                               unsigned int alloc_order)
> > > +{
> > > +       int mt = get_pageblock_migratetype(page);
> > > +       unsigned long max_managed, flags;
> > > +
> > > +       if (mt == MIGRATE_HIGHATOMIC)
> > > +               return;
> > > +
> > > +       /*
> > > +        * Limit the number reserved to 1 pageblock or roughly 1% of a zone.
> > > +        * Check is race-prone but harmless.
> > > +        */
> > > +       max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
> > > +       if (zone->nr_reserved_highatomic >= max_managed)
> > > +               return;
> > > +
> > > +       /* Yoink! */
> > > +       spin_lock_irqsave(&zone->lock, flags);
> > > +       zone->nr_reserved_highatomic += pageblock_nr_pages;
> > > +       set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
> > > +       move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
> > > +       spin_unlock_irqrestore(&zone->lock, flags);
> > > +}
> > 
> > It would be better to check if the migratetype is MIGRATE_ISOLATE or MIGRATE_CMA.
> > There can be a race where an isolated pageblock is changed to MIGRATE_HIGHATOMIC.
> > 
> 
> Done.
> 
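
The adjusted entry check would then look something like this (a sketch;
is_migrate_isolate()/is_migrate_cma() are the existing migratetype helpers):

        int mt = get_pageblock_migratetype(page);

        /*
         * Only convert ordinary pageblocks; leave reserved, isolated
         * and CMA pageblocks alone
         */
        if (mt == MIGRATE_HIGHATOMIC || is_migrate_isolate(mt) ||
            is_migrate_cma(mt))
                return;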
> > > +/*
> > > + * Used when an allocation is about to fail under memory pressure. This
> > > + * potentially hurts the reliability of high-order allocations when under
> > > + * intense memory pressure but failed atomic allocations should be easier
> > > + * to recover from than an OOM.
> > > + */
> > > +static void unreserve_highatomic_pageblock(const struct alloc_context *ac)
> > > +{
> > > +       struct zonelist *zonelist = ac->zonelist;
> > > +       unsigned long flags;
> > > +       struct zoneref *z;
> > > +       struct zone *zone;
> > > +       struct page *page;
> > > +       int order;
> > > +
> > > +       for_each_zone_zonelist_nodemask(zone, z, zonelist, ac->high_zoneidx,
> > > +                                                               ac->nodemask) {
> > > +               /* Preserve at least one pageblock */
> > > +               if (zone->nr_reserved_highatomic <= pageblock_nr_pages)
> > > +                       continue;
> > > +
> > > +               spin_lock_irqsave(&zone->lock, flags);
> > > +               for (order = 0; order < MAX_ORDER; order++) {
> > > +                       struct free_area *area = &(zone->free_area[order]);
> > > +
> > > +                       if (list_empty(&area->free_list[MIGRATE_HIGHATOMIC]))
> > > +                               continue;
> > > +
> > > +                       page = list_entry(area->free_list[MIGRATE_HIGHATOMIC].next,
> > > +                                               struct page, lru);
> > > +
> > > +                       zone->nr_reserved_highatomic -= pageblock_nr_pages;
> > > +
> > > +                       /*
> > > +                        * Convert to ac->migratetype and avoid the normal
> > > +                        * pageblock stealing heuristics. Minimally, the caller
> > > +                        * is doing the work and needs the pages. More
> > > +                        * importantly, if the block was always converted to
> > > +                        * MIGRATE_UNMOVABLE or another type then the number
> > > +                        * of pageblocks that cannot be completely freed
> > > +                        * may increase.
> > > +                        */
> > > +                       set_pageblock_migratetype(page, ac->migratetype);
> > > +                       move_freepages_block(zone, page, ac->migratetype);
> > > +                       spin_unlock_irqrestore(&zone->lock, flags);
> > > +                       return;
> > > +               }
> > > +               spin_unlock_irqrestore(&zone->lock, flags);
> > > +       }
> > > +}
> > > +
> > >  /* Remove an element from the buddy allocator from the fallback list */
> > >  static inline struct page *
> > >  __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> > > @@ -1645,10 +1725,16 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
> > >   * Call me with the zone->lock already held.
> > >   */
> > >  static struct page *__rmqueue(struct zone *zone, unsigned int order,
> > > -                                               int migratetype)
> > > +                               int migratetype, gfp_t gfp_flags)
> > >  {
> > >         struct page *page;
> > >
> > > +       if (unlikely(order && (gfp_flags & __GFP_ATOMIC))) {
> > > +               page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
> > > +               if (page)
> > > +                       goto out;
> > > +       }
> > 
> > This hunk only serves high-order allocations, so it would be better to introduce
> > rmqueue_highorder(), move this hunk to that function and call it in
> > buffered_rmqueue. That way order-0 requests don't get worse
> > due to the new branch.
> > 
> 
> The helper is overkill. I can move the check to avoid the branch but it
> duplicates the tracepoint handling which can be easy to miss in the
> future. I'm not convinced it is an overall improvement.
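
For illustration, the helper being suggested might look like this (a sketch
only; note how the tracepoint would have to be duplicated):

        /* Hypothetical: keep the highatomic attempt out of the order-0 path */
        static struct page *rmqueue_highorder(struct zone *zone, unsigned int order,
                                        int migratetype, gfp_t gfp_flags)
        {
                struct page *page = NULL;

                if (gfp_flags & __GFP_ATOMIC)
                        page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);

                if (!page)
                        page = __rmqueue(zone, order, migratetype, gfp_flags);
                else
                        trace_mm_page_alloc_zone_locked(page, order, migratetype);

                return page;
        }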
> 
> > And, there is some mismatch in how an atomic high-order allocation is checked.
> > In some places you checked __GFP_ATOMIC, but in other places
> > you checked ALLOC_HARDER. It would be better to use a unified check.
> > Introducing a helper function may be a good choice.
> > 
> 
> Which cases specifically? In the zone_watermark check, it's because
> there are no GFP flags in that context. They could be passed in but then
> every caller would need to be updated accordingly and overall it gains
> nothing.

You use __GFP_ATOMIC in rmqueue() to allow use of the highatomic reserve.
ALLOC_HARDER is used in the watermark check and to reserve a highatomic
pageblock after allocation.

ALLOC_HARDER is set if (__GFP_ATOMIC && !__GFP_NOMEMALLOC) *or*
(rt_task && !in_interrupt()). So the latter case could pass the watermark
check but cannot use the HIGHATOMIC reserve. And it will reserve a
highatomic pageblock. When it tries to allocate again, it can't use
this reserved pageblock due to its GFP flags, and this could happen
repeatedly.
The first case also has a problem. If a user requests memory
with __GFP_NOMEMALLOC, the intent is not to touch reserved memory,
but, with the current patch, it can use the highatomic pageblock.

I'm not sure whether these cause real trouble, but unifying it as much as
possible is the preferable solution.

Thanks.
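
For context, ALLOC_HARDER is derived from the gfp mask roughly as follows
(a sketch of the gfp_to_alloc_flags() logic with this series applied):

        if (gfp_mask & __GFP_ATOMIC) {
                /* __GFP_NOMEMALLOC callers asked not to dip into reserves */
                if (!(gfp_mask & __GFP_NOMEMALLOC))
                        alloc_flags |= ALLOC_HARDER;
        } else if (unlikely(rt_task(current)) && !in_interrupt())
                alloc_flags |= ALLOC_HARDER;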

> 
> > >         page = __rmqueue_smallest(zone, order, migratetype);
> > >         if (unlikely(!page)) {
> > >                 if (migratetype == MIGRATE_MOVABLE)
> > > @@ -1658,6 +1744,7 @@ static struct page *__rmqueue(struct zone *zone, unsigned int order,
> > >                         page = __rmqueue_fallback(zone, order, migratetype);
> > >         }
> > >
> > > +out:
> > >         trace_mm_page_alloc_zone_locked(page, order, migratetype);
> > >         return page;
> > >  }
> > > @@ -1675,7 +1762,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> > >
> > >         spin_lock(&zone->lock);
> > >         for (i = 0; i < count; ++i) {
> > > -               struct page *page = __rmqueue(zone, order, migratetype);
> > > +               struct page *page = __rmqueue(zone, order, migratetype, 0);
> > >                 if (unlikely(page == NULL))
> > >                         break;
> > >
> > > @@ -2090,7 +2177,7 @@ struct page *buffered_rmqueue(struct zone *preferred_zone,
> > >                         WARN_ON_ONCE(order > 1);
> > >                 }
> > >                 spin_lock_irqsave(&zone->lock, flags);
> > > -               page = __rmqueue(zone, order, migratetype);
> > > +               page = __rmqueue(zone, order, migratetype, gfp_flags);
> > >                 spin_unlock(&zone->lock);
> > >                 if (!page)
> > >                         goto failed;
> > > @@ -2200,15 +2287,23 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > >                         unsigned long mark, int classzone_idx, int alloc_flags,
> > >                         long free_pages)
> > >  {
> > > -       /* free_pages may go negative - that's OK */
> > >         long min = mark;
> > >         int o;
> > >         long free_cma = 0;
> > >
> > > +       /* free_pages may go negative - that's OK */
> > >         free_pages -= (1 << order) - 1;
> > > +
> > >         if (alloc_flags & ALLOC_HIGH)
> > >                 min -= min / 2;
> > > -       if (alloc_flags & ALLOC_HARDER)
> > > +
> > > +       /*
> > > +        * If the caller is not atomic then discount the reserves. This will
> > > +        * over-estimate how the atomic reserve but it avoids a search
> > > +        */
> > > +       if (likely(!(alloc_flags & ALLOC_HARDER)))
> > > +               free_pages -= z->nr_reserved_highatomic;
> > > +       else
> > >                 min -= min / 4;
> > >
> > >  #ifdef CONFIG_CMA
> > > @@ -2397,6 +2492,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> > >                 if (page) {
> > >                         if (prep_new_page(page, order, gfp_mask, alloc_flags))
> > >                                 goto try_this_zone;
> > > +
> > > +                       /*
> > > +                        * If this is a high-order atomic allocation then check
> > > +                        * if the pageblock should be reserved for the future
> > > +                        */
> > > +                       if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> > > +                               reserve_highatomic_pageblock(page, zone, order);
> > > +
> > >                         return page;
> > >                 }
> > >         }
> > > @@ -2664,9 +2767,11 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > >
> > >         /*
> > >          * If an allocation failed after direct reclaim, it could be because
> > > -        * pages are pinned on the per-cpu lists. Drain them and try again
> > > +        * pages are pinned on the per-cpu lists or in high alloc reserves.
> > > +        * Shrink them them and try again
> > >          */
> > >         if (!page && !drained) {
> > > +               unreserve_highatomic_pageblock(ac);
> > >                 drain_all_pages(NULL);
> > >                 drained = true;
> > >                 goto retry;
> > 
> > In case of high-order request, it can easily fail even after direct reclaim.
> > It can cause ping-pong effect on highatomic pageblock.
> > Unreserve on order-0 request fail is one option to avoid that problem.
> > 
> 
> That is potentially a modification that would be of interest to non-atomic
> users of the high-atomic reserve, which I know you are interested in.
> However, it is both outside the scope of the series and a hazardous change,
> because a normal high-order allocation that can reclaim could unreserve a
> block reserved for high-order atomic allocations and the atomic allocations
> would then fail. That is a sufficiently strong side-effect that it should be
> a separate patch that fixes a measurable problem.
> 
> > Anyway, do you measure fragmentation effect of this patch?
> > 
> 
> Nothing interesting was revealed; the fragmentation effects looked similar
> before and after the series. The number of reserved pageblocks is too
> small to matter.
> 
> > High-order atomic request is usually unmovable and it would be served
> > by unmovable pageblock. And then, it is changed to highatomic.
> > But, reclaim can be triggered by movable request and this unreserve
> > makes that pageblock to movable type.
> > 
> > So, following sequence of transition will usually happen.
> > 
> > unmovable -> highatomic -> movable
> > 
> > It can reduce number of unmovable pageblock and unmovable
> > allocation can be spread and cause fragmentation. I'd like to see
> > result about fragmentation. Highorder stress benchmark can be
> > one of candidates.
> > 
> 
> Too few to matter. I checked high-order stress tests and they appeared fine;
> external fragmentation events were fine.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-09 12:39     ` Mel Gorman
@ 2015-09-18  6:56       ` Joonsoo Kim
  2015-09-21 10:51         ` Mel Gorman
  2015-09-30  8:51       ` Vitaly Wool
  1 sibling, 1 reply; 55+ messages in thread
From: Joonsoo Kim @ 2015-09-18  6:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Michal Hocko, Linux-MM, LKML

On Wed, Sep 09, 2015 at 01:39:01PM +0100, Mel Gorman wrote:
> On Tue, Sep 08, 2015 at 05:26:13PM +0900, Joonsoo Kim wrote:
> > 2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > > The primary purpose of watermarks is to ensure that reclaim can always
> > > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> > > These assume that order-0 allocations are all that is necessary for
> > > forward progress.
> > >
> > > High-order watermarks serve a different purpose. Kswapd had no high-order
> > > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> > > This was particularly important when there were high-order atomic requests.
> > > The watermarks both gave kswapd awareness and made a reserve for those
> > > atomic requests.
> > >
> > > There are two important side-effects of this. The most important is that
> > > a non-atomic high-order request can fail even though free pages are available
> > > and the order-0 watermarks are ok. The second is that high-order watermark
> > > checks are expensive as the free list counts up to the requested order must
> > > be examined.
> > >
> > > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> > > have high-order watermarks. Kswapd and compaction still need high-order
> > > awareness which is handled by checking that at least one suitable high-order
> > > page is free.
> > 
> > I still don't think that this one suitable high-order page is enough.
> > If fragmentation happens, there would be no order-2 freepage. If kswapd
> > prepares only 1 order-2 freepage, one of two successive process forks
> > (AFAIK, fork in x86 and ARM require order 2 page) must go to direct reclaim
> > to make order-2 freepage. Kswapd cannot make order-2 freepage in that
> > short time. It causes latency to many high-order freepage requestor
> > in fragmented situation.
> > 
> 
> So what do you suggest instead? A fixed number, some other heuristic?
> You have pushed several times now for the series to focus on the latency
> of standard high-order allocations but again I will say that it is outside
> the scope of this series. If you want to take steps to reduce the latency
> of ordinary high-order allocation requests that can sleep then it should
> be a separate series.

I don't understand why you think it should be a separate series.
I don't know the exact reason why the high-order watermark check was
introduced, but, based on your description, it is for high-order
allocation requests in atomic context. And, it would accidentally take care
of latency. It has been used for a long time and your patch tries to remove it
while only taking care of the success rate. That means that your patch
could cause a regression. I think that if this actually happens, it should be
handled in this patchset instead of a separate series.

In the review of the previous version, I suggested removing the watermark
check only for orders higher than PAGE_ALLOC_COSTLY_ORDER. You didn't accept
that and I still don't agree with your approach. You can show me that
my concern is wrong with some numbers.

One candidate test for this is making the system fragmented, then
running hackbench, which uses a lot of high-order allocations, and measuring
the elapsed time.

Thanks.
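
To make the proposal concrete, a hybrid along those lines might look like
this (a hypothetical sketch, not what the series implements):

        if (order <= PAGE_ALLOC_COSTLY_ORDER) {
                /*
                 * Keep the old per-order watermark enforcement for the
                 * common low orders so kswapd keeps some slack ready
                 */
                for (o = 0; o < order; o++) {
                        free_pages -= z->free_area[o].nr_free << o;
                        min >>= 1;
                        if (free_pages <= min)
                                return false;
                }
                return true;
        }

        /* Above the costly order, only require one suitable free page */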

> 
> > > With the patch applied, there was little difference in the allocation
> > > failure rates as the atomic reserves are small relative to the number of
> > > allocation attempts. The expected impact is that there will never be an
> > > allocation failure report that shows suitable pages on the free lists.
> > 
> > Due to highatomic pageblock and freepage count mismatch per allocation
> > flag, allocation failure with suitable pages can still be possible.
> > 
> 
> An allocation failure of this type would be a !atomic allocation that
> cannot access the reserve. If such allocation requests can access the
> reserve then it defeats the whole point of the pageblock type.
> 
> > > + * Return true if free base pages are above 'mark'. For high-order checks it
> > > + * will return true of the order-0 watermark is reached and there is at least
> > > + * one free page of a suitable size. Checking now avoids taking the zone lock
> > > + * to check in the allocation paths if no pages are free.
> > >   */
> > >  static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > >                         unsigned long mark, int classzone_idx, int alloc_flags,
> > > @@ -2289,7 +2291,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > >  {
> > >         long min = mark;
> > >         int o;
> > > -       long free_cma = 0;
> > > +       const bool atomic = (alloc_flags & ALLOC_HARDER);
> > >
> > >         /* free_pages may go negative - that's OK */
> > >         free_pages -= (1 << order) - 1;
> > > @@ -2301,7 +2303,7 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > >          * If the caller is not atomic then discount the reserves. This will
> > >          * over-estimate how the atomic reserve but it avoids a search
> > >          */
> > > -       if (likely(!(alloc_flags & ALLOC_HARDER)))
> > > +       if (likely(!atomic))
> > >                 free_pages -= z->nr_reserved_highatomic;
> > >         else
> > >                 min -= min / 4;
> > > @@ -2309,22 +2311,30 @@ static bool __zone_watermark_ok(struct zone *z, unsigned int order,
> > >  #ifdef CONFIG_CMA
> > >         /* If allocation can't use CMA areas don't use free CMA pages */
> > >         if (!(alloc_flags & ALLOC_CMA))
> > > -               free_cma = zone_page_state(z, NR_FREE_CMA_PAGES);
> > > +               free_pages -= zone_page_state(z, NR_FREE_CMA_PAGES);
> > >  #endif
> > >
> > > -       if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
> > > +       if (free_pages <= min + z->lowmem_reserve[classzone_idx])
> > >                 return false;
> > > -       for (o = 0; o < order; o++) {
> > > -               /* At the next order, this order's pages become unavailable */
> > > -               free_pages -= z->free_area[o].nr_free << o;
> > >
> > > -               /* Require fewer higher order pages to be free */
> > > -               min >>= 1;
> > > +       /* order-0 watermarks are ok */
> > > +       if (!order)
> > > +               return true;
> > > +
> > > +       /* Check at least one high-order page is free */
> > > +       for (o = order; o < MAX_ORDER; o++) {
> > > +               struct free_area *area = &z->free_area[o];
> > > +               int mt;
> > > +
> > > +               if (atomic && area->nr_free)
> > > +                       return true;
> > 
> > How about checking area->nr_free first?
> > In both atomic and !atomic case, nr_free == 0 means
> > there is no appropriate pages.
> > 
> > So,
> > if (!area->nr_free)
> >     continue;
> > if (atomic)
> >     return true;
> > ...
> > 
> > 
> > > -               if (free_pages <= min)
> > > -                       return false;
> > > +               for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
> > > +                       if (!list_empty(&area->free_list[mt]))
> > > +                               return true;
> > > +               }
> > 
> > I'm not sure this is really faster than previous.
> > We need to check three lists on each order.
> > 
> > Think about order-2 case. I guess order-2 is usually on movable
> > pageblock rather than unmovable pageblock. In this case,
> > we need to check three lists so cost is more.
> > 
> 
> Ok, the extra check makes sense. Thanks.
> 
> -- 
> Mel Gorman
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand
  2015-09-18  6:38       ` Joonsoo Kim
@ 2015-09-21 10:51         ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-09-21 10:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Michal Hocko, Linux-MM, LKML

On Fri, Sep 18, 2015 at 03:38:35PM +0900, Joonsoo Kim wrote:
> > > And, there is some mismatch in how an atomic high-order allocation is checked.
> > > In some places you checked __GFP_ATOMIC, but in other places
> > > you checked ALLOC_HARDER. It would be better to use a unified check.
> > > Introducing a helper function may be a good choice.
> > > 
> > 
> > Which cases specifically? In the zone_watermark check, it's because
> > there are no GFP flags in that context. They could be passed in but then
> > every caller would need to be updated accordingly and overall it gains
> > nothing.
> 
> You use __GFP_ATOMIC in rmqueue() to allow use of the highatomic reserve.
> ALLOC_HARDER is used in the watermark check and to reserve a highatomic
> pageblock after allocation.
>
> ALLOC_HARDER is set if (__GFP_ATOMIC && !__GFP_NOMEMALLOC) *or*
> (rt_task && !in_interrupt()). So the latter case could pass the watermark
> check but cannot use the HIGHATOMIC reserve. And it will reserve a
> highatomic pageblock. When it tries to allocate again, it can't use
> this reserved pageblock due to its GFP flags, and this could happen
> repeatedly.
> The first case also has a problem. If a user requests memory
> with __GFP_NOMEMALLOC, the intent is not to touch reserved memory,
> but, with the current patch, it can use the highatomic pageblock.
>
> I'm not sure whether these cause real trouble, but unifying it as much as
> possible is the preferable solution.
> 

Ok, that makes sense. Thanks
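
One possible way to unify the two checks (illustrative only; the final fix
may differ) is to pass alloc_flags down to the rmqueue path and test
ALLOC_HARDER there as well:

        /*
         * in __rmqueue(): same predicate as the watermark check and the
         * pageblock reservation
         */
        if (order && (alloc_flags & ALLOC_HARDER))
                page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);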

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-18  6:56       ` Joonsoo Kim
@ 2015-09-21 10:51         ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-09-21 10:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Johannes Weiner, Rik van Riel, Vlastimil Babka,
	David Rientjes, Michal Hocko, Linux-MM, LKML

On Fri, Sep 18, 2015 at 03:56:21PM +0900, Joonsoo Kim wrote:
> On Wed, Sep 09, 2015 at 01:39:01PM +0100, Mel Gorman wrote:
> > On Tue, Sep 08, 2015 at 05:26:13PM +0900, Joonsoo Kim wrote:
> > > 2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
> > > > The primary purpose of watermarks is to ensure that reclaim can always
> > > > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
> > > > These assume that order-0 allocations are all that is necessary for
> > > > forward progress.
> > > >
> > > > High-order watermarks serve a different purpose. Kswapd had no high-order
> > > > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
> > > > This was particularly important when there were high-order atomic requests.
> > > > The watermarks both gave kswapd awareness and made a reserve for those
> > > > atomic requests.
> > > >
> > > > There are two important side-effects of this. The most important is that
> > > > a non-atomic high-order request can fail even though free pages are available
> > > > and the order-0 watermarks are ok. The second is that high-order watermark
> > > > checks are expensive as the free list counts up to the requested order must
> > > > be examined.
> > > >
> > > > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
> > > > have high-order watermarks. Kswapd and compaction still need high-order
> > > > awareness which is handled by checking that at least one suitable high-order
> > > > page is free.
> > > 
> > > I still don't think that this one suitable high-order page is enough.
> > > If fragmentation happens, there would be no order-2 freepage. If kswapd
> > > prepares only 1 order-2 freepage, one of two successive process forks
> > > (AFAIK, fork in x86 and ARM require order 2 page) must go to direct reclaim
> > > to make order-2 freepage. Kswapd cannot make order-2 freepage in that
> > > short time. It causes latency to many high-order freepage requestor
> > > in fragmented situation.
> > > 
> > 
> > So what do you suggest instead? A fixed number, some other heuristic?
> > You have pushed several times now for the series to focus on the latency
> > of standard high-order allocations but again I will say that it is outside
> > the scope of this series. If you want to take steps to reduce the latency
> > of ordinary high-order allocation requests that can sleep then it should
> > be a separate series.
> 
> I don't understand why you think it should be a separate series.

Because atomic high-order allocation success and normal high-order
allocation stall latency are different problems. Atomic high-order
allocation successes are about reserves, normal high-order allocations
are about reclaim.

> I don't know the exact reason why the high-order watermark check was
> introduced, but, based on your description, it is for high-order
> allocation requests in atomic context.

Mostly yes, the initial motivation is described in the linked mail --
give kswapd high-order awareness because otherwise (higher-order && !wait)
allocations that fail would wake kswapd but it would go back to sleep.

> And, it would accidentally take care
> of latency.

Except all it does is defer the problem. If kswapd frees N high-order
pages then it disrupts the system to satisfy the request, potentially
reclaiming hot pages for an allocation attempt that *may* occur and that
would still stall if there are N+1 allocation requests.

Kswapd reclaiming additional pages is definite system disruption and
potentially increases thrashing *now* to help an event that *might* occur
in the future.

> It has been used for a long time and your patch tries to remove it
> while only taking care of the success rate. That means that your patch
> could cause a regression. I think that if this actually happens, it should be
> handled in this patchset instead of a separate series.
> 

Except it doesn't really.

Current situation
o A high-order watermark check might fail for a normal high-order
  allocation request. On failure, stall to reclaim more pages which may
  or may not succeed
o An atomic allocation may use a lower watermark but it can still fail
  even if there are free pages on the list

Patched situation

o A watermark check might fail for a normal high-order allocation
  request, which cannot use one of the reserved pages. On failure, stall to
  reclaim more pages which may or may not succeed.
  Functionally, this is very similar to current behaviour
o An atomic allocation may use the reserves, so if a free page exists it
  will be used.
  Functionally, this is more reliable than current behaviour, although
  there is still potential for disruption

> In the review of the previous version, I suggested removing the watermark
> check only for orders higher than PAGE_ALLOC_COSTLY_ORDER.

It increases complexity for reasons that are not quantified.

> You didn't accept
> that and I still don't agree with your approach. You can show me that
> my concern is wrong with some numbers.
>
> One candidate test for this is making the system fragmented, then
> running hackbench, which uses a lot of high-order allocations, and measuring
> the elapsed time.
> 

o There is no difference in normal high-order allocation success rates
  with this series applied
o With the series applied, such tests complete in approximately the same
  time
o For the tests with parallel high-order allocation requests, there was
  no significant difference in the elapsed times although success rates
  were slightly higher

Each time, the full set of tests takes about 4 days to complete on this
series and so far no problems of the type you describe have been found.
If such a test case is found then there would be a clear workload to
justify either having kswapd reclaim multiple pages or applying the old
watermark scheme for lower orders.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-09 12:39     ` Mel Gorman
  2015-09-18  6:56       ` Joonsoo Kim
@ 2015-09-30  8:51       ` Vitaly Wool
  2015-09-30 13:52         ` Vlastimil Babka
  1 sibling, 1 reply; 55+ messages in thread
From: Vitaly Wool @ 2015-09-30  8:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Joonsoo Kim, Andrew Morton, Johannes Weiner, Rik van Riel,
	Vlastimil Babka, David Rientjes, Joonsoo Kim, Michal Hocko,
	Linux-MM, LKML

On Wed, Sep 9, 2015 at 2:39 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Tue, Sep 08, 2015 at 05:26:13PM +0900, Joonsoo Kim wrote:
>> 2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
>> > The primary purpose of watermarks is to ensure that reclaim can always
>> > make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
>> > These assume that order-0 allocations are all that is necessary for
>> > forward progress.
>> >
>> > High-order watermarks serve a different purpose. Kswapd had no high-order
>> > awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
>> > This was particularly important when there were high-order atomic requests.
>> > The watermarks both gave kswapd awareness and made a reserve for those
>> > atomic requests.
>> >
>> > There are two important side-effects of this. The most important is that
>> > a non-atomic high-order request can fail even though free pages are available
>> > and the order-0 watermarks are ok. The second is that high-order watermark
>> > checks are expensive as the free list counts up to the requested order must
>> > be examined.
>> >
>> > With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
>> > have high-order watermarks. Kswapd and compaction still need high-order
>> > awareness which is handled by checking that at least one suitable high-order
>> > page is free.
>>
>> I still don't think that this one suitable high-order page is enough.
>> If fragmentation happens, there would be no order-2 freepage. If kswapd
>> prepares only 1 order-2 freepage, one of two successive process forks
>> (AFAIK, fork in x86 and ARM require order 2 page) must go to direct reclaim
>> to make order-2 freepage. Kswapd cannot make order-2 freepage in that
>> short time. It causes latency to many high-order freepage requestor
>> in fragmented situation.
>>
>
> So what do you suggest instead? A fixed number, some other heuristic?
> You have pushed several times now for the series to focus on the latency
> of standard high-order allocations but again I will say that it is outside
> the scope of this series. If you want to take steps to reduce the latency
> of ordinary high-order allocation requests that can sleep then it should
> be a separate series.

I do believe https://lkml.org/lkml/2015/9/9/313 does a better job
here. I have to admit the patch header is a bit misleading since
we don't actually exclude CMA pages; we just _fix_ the calculation in
the loop, which is utterly wrong otherwise.

~vitaly

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30  8:51       ` Vitaly Wool
@ 2015-09-30 13:52         ` Vlastimil Babka
  2015-09-30 14:16           ` Vitaly Wool
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-09-30 13:52 UTC (permalink / raw)
  To: Vitaly Wool, Mel Gorman
  Cc: Joonsoo Kim, Andrew Morton, Johannes Weiner, Rik van Riel,
	David Rientjes, Joonsoo Kim, Michal Hocko, Linux-MM, LKML

On 09/30/2015 10:51 AM, Vitaly Wool wrote:
> On Wed, Sep 9, 2015 at 2:39 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> On Tue, Sep 08, 2015 at 05:26:13PM +0900, Joonsoo Kim wrote:
>>> 2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
>>>> The primary purpose of watermarks is to ensure that reclaim can always
>>>> make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
>>>> These assume that order-0 allocations are all that is necessary for
>>>> forward progress.
>>>>
>>>> High-order watermarks serve a different purpose. Kswapd had no high-order
>>>> awareness before they were introduced (https://lkml.org/lkml/2004/9/5/9).
>>>> This was particularly important when there were high-order atomic requests.
>>>> The watermarks both gave kswapd awareness and made a reserve for those
>>>> atomic requests.
>>>>
>>>> There are two important side-effects of this. The most important is that
>>>> a non-atomic high-order request can fail even though free pages are available
>>>> and the order-0 watermarks are ok. The second is that high-order watermark
>>>> checks are expensive as the free list counts up to the requested order must
>>>> be examined.
>>>>
>>>> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
>>>> have high-order watermarks. Kswapd and compaction still need high-order
>>>> awareness which is handled by checking that at least one suitable high-order
>>>> page is free.
>>>
>>> I still don't think that this one suitable high-order page is enough.
>>> If fragmentation happens, there would be no order-2 freepage. If kswapd
>>> prepares only 1 order-2 freepage, one of two successive process forks
>>> (AFAIK, fork in x86 and ARM require order 2 page) must go to direct reclaim
>>> to make order-2 freepage. Kswapd cannot make order-2 freepage in that
>>> short time. It causes latency to many high-order freepage requestor
>>> in fragmented situation.
>>>
>>
>> So what do you suggest instead? A fixed number, some other heuristic?
>> You have pushed several times now for the series to focus on the latency
>> of standard high-order allocations but again I will say that it is outside
>> the scope of this series. If you want to take steps to reduce the latency
>> of ordinary high-order allocation requests that can sleep then it should
>> be a separate series.
>
> I do believe https://lkml.org/lkml/2015/9/9/313 does a better job

Does a better job regarding what exactly? It does fix the CMA-specific 
issue, but so does this patch - without affecting allocation fastpaths 
by making them update another counter. But the issues discussed here are 
not related to that CMA problem.

> here. I have to admit the patch header is a bit misleading here since
> we don't actually exclude CMA pages, we just _fix_ the calculation in
> the loop which is utterly wrong otherwise.
>
> ~vitaly
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30 13:52         ` Vlastimil Babka
@ 2015-09-30 14:16           ` Vitaly Wool
  2015-09-30 14:43             ` Vlastimil Babka
  0 siblings, 1 reply; 55+ messages in thread
From: Vitaly Wool @ 2015-09-30 14:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Joonsoo Kim, Andrew Morton, Johannes Weiner,
	Rik van Riel, David Rientjes, Joonsoo Kim, Michal Hocko,
	Linux-MM, LKML

On Wed, Sep 30, 2015 at 3:52 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
> On 09/30/2015 10:51 AM, Vitaly Wool wrote:
>>
>> On Wed, Sep 9, 2015 at 2:39 PM, Mel Gorman <mgorman@techsingularity.net>
>> wrote:
>>>
>>> On Tue, Sep 08, 2015 at 05:26:13PM +0900, Joonsoo Kim wrote:
>>>>
>>>> 2015-08-24 21:30 GMT+09:00 Mel Gorman <mgorman@techsingularity.net>:
>>>>>
>>>>> The primary purpose of watermarks is to ensure that reclaim can always
>>>>> make forward progress in PF_MEMALLOC context (kswapd and direct
>>>>> reclaim).
>>>>> These assume that order-0 allocations are all that is necessary for
>>>>> forward progress.
>>>>>
>>>>> High-order watermarks serve a different purpose. Kswapd had no
>>>>> high-order
>>>>> awareness before they were introduced
>>>>> (https://lkml.org/lkml/2004/9/5/9).
>>>>> This was particularly important when there were high-order atomic
>>>>> requests.
>>>>> The watermarks both gave kswapd awareness and made a reserve for those
>>>>> atomic requests.
>>>>>
>>>>> There are two important side-effects of this. The most important is
>>>>> that
>>>>> a non-atomic high-order request can fail even though free pages are
>>>>> available
>>>>> and the order-0 watermarks are ok. The second is that high-order
>>>>> watermark
>>>>> checks are expensive as the free list counts up to the requested order
>>>>> must
>>>>> be examined.
>>>>>
>>>>> With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary
>>>>> to
>>>>> have high-order watermarks. Kswapd and compaction still need high-order
>>>>> awareness which is handled by checking that at least one suitable
>>>>> high-order
>>>>> page is free.
>>>>
>>>>
>>>> I still don't think that this one suitable high-order page is enough.
>>>> If fragmentation happens, there would be no order-2 freepages. If kswapd
>>>> prepares only one order-2 freepage, one of two successive process forks
>>>> (AFAIK, fork on x86 and ARM requires an order-2 page) must go to direct
>>>> reclaim to make an order-2 freepage. Kswapd cannot make an order-2
>>>> freepage in that short time. This causes latency for many high-order
>>>> freepage requestors in fragmented situations.
>>>>
>>>
>>> So what do you suggest instead? A fixed number, some other heuristic?
>>> You have pushed several times now for the series to focus on the latency
>>> of standard high-order allocations but again I will say that it is
>>> outside
>>> the scope of this series. If you want to take steps to reduce the latency
>>> of ordinary high-order allocation requests that can sleep then it should
>>> be a separate series.
>>
>>
>> I do believe https://lkml.org/lkml/2015/9/9/313 does a better job
>
>
> Does a better job regarding what exactly? It does fix the CMA-specific
> issue, but so does this patch - without affecting allocation fastpaths by
> making them update another counter. But the issues discussed here are not
> related to that CMA problem.

Let me disagree. Guaranteeing one suitable high-order page is not
enough, so the suggested patch does not work that well for me.
Existing broken watermark calculation doesn't work for me either, as
opposed to the one with my patch applied. Both solutions are related
to the CMA issue but one does make compaction work harder and cause
bigger latencies -- why do you think these are not related?

~vitaly

^ permalink raw reply	[flat|nested] 55+ messages in thread
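
For context on the "calculation in the loop" being argued about, below is a
rough reconstruction of the pre-series check (approximately the 4.2-era
__zone_watermark_ok, trimmed of the ALLOC_HIGH/ALLOC_HARDER adjustments).
Free CMA pages are subtracted once for the order-0 comparison, but the
per-order loop then removes free_area[o].nr_free wholesale with no per-order
CMA accounting, which is the part the competing approaches address in
different ways. The function name and the free_cma parameter are
illustrative.

/*
 * Rough reconstruction of the pre-series watermark check (trimmed).
 * free_cma is the zone's NR_FREE_CMA_PAGES when the caller cannot use
 * CMA pageblocks, 0 otherwise.
 */
static bool old_zone_watermark_sketch(struct zone *z, unsigned int order,
				      unsigned long mark, int classzone_idx,
				      long free_pages, long free_cma)
{
	long min = mark;
	unsigned int o;

	free_pages -= (1 << order) - 1;
	if (free_pages - free_cma <= min + z->lowmem_reserve[classzone_idx])
		return false;

	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable */
		free_pages -= z->free_area[o].nr_free << o;

		/* Require progressively fewer higher-order pages to be free */
		min >>= 1;

		if (free_pages <= min)
			return false;
	}
	return true;
}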

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30 14:16           ` Vitaly Wool
@ 2015-09-30 14:43             ` Vlastimil Babka
  2015-09-30 15:18               ` Mel Gorman
  0 siblings, 1 reply; 55+ messages in thread
From: Vlastimil Babka @ 2015-09-30 14:43 UTC (permalink / raw)
  To: Vitaly Wool
  Cc: Mel Gorman, Joonsoo Kim, Andrew Morton, Johannes Weiner,
	Rik van Riel, David Rientjes, Joonsoo Kim, Michal Hocko,
	Linux-MM, LKML

On 09/30/2015 04:16 PM, Vitaly Wool wrote:
>>>>>
>>>>
>>>> So what do you suggest instead? A fixed number, some other heuristic?
>>>> You have pushed several times now for the series to focus on the latency
>>>> of standard high-order allocations but again I will say that it is
>>>> outside
>>>> the scope of this series. If you want to take steps to reduce the latency
>>>> of ordinary high-order allocation requests that can sleep then it should
>>>> be a separate series.
>>>
>>>
>>> I do believe https://lkml.org/lkml/2015/9/9/313 does a better job
>>
>>
>> Does a better job regarding what exactly? It does fix the CMA-specific
>> issue, but so does this patch - without affecting allocation fastpaths by
>> making them update another counter. But the issues discussed here are not
>> related to that CMA problem.
>
> Let me disagree. Guaranteeing one suitable high-order page is not
> enough, so the suggested patch does not work that well for me.
> Existing broken watermark calculation doesn't work for me either, as
> opposed to the one with my patch applied. Both solutions are related
> to the CMA issue but one does make compaction work harder and cause
> bigger latencies -- why do you think these are not related?

Well you didn't mention which issues you have with this patch. If you 
did measure bigger latencies and more compaction work, please post the 
numbers and details about the test.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations
  2015-09-30 14:43             ` Vlastimil Babka
@ 2015-09-30 15:18               ` Mel Gorman
  0 siblings, 0 replies; 55+ messages in thread
From: Mel Gorman @ 2015-09-30 15:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Vitaly Wool, Joonsoo Kim, Andrew Morton, Johannes Weiner,
	Rik van Riel, David Rientjes, Joonsoo Kim, Michal Hocko,
	Linux-MM, LKML

On Wed, Sep 30, 2015 at 04:43:00PM +0200, Vlastimil Babka wrote:
> >>Does a better job regarding what exactly? It does fix the CMA-specific
> >>issue, but so does this patch - without affecting allocation fastpaths by
> >>making them update another counter. But the issues discussed here are not
> >>related to that CMA problem.
> >
> >Let me disagree. Guaranteeing one suitable high-order page is not
> >enough, so the suggested patch does not work that well for me.
> >Existing broken watermark calculation doesn't work for me either, as
> >opposed to the one with my patch applied. Both solutions are related
> >to the CMA issue but one does make compaction work harder and cause
> >bigger latencies -- why do you think these are not related?
> 
> Well you didn't mention which issues you have with this patch. If you did
> measure bigger latencies and more compaction work, please post the numbers
> and details about the test.
> 

And, very broadly, watch out for decisions that force more reclaim/compaction
now to potentially reduce latency in the future. It's trading definite
overhead now, combined with potential reclaim of hot pages, to reduce the
latency of a *possible* high-order allocation request in the future. It's why
I think a series that keeps more high-order pages free to reduce future
high-order allocation latency needs to be treated with care.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 55+ messages in thread
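
The on-demand reserve referred to above (MIGRATE_HIGHATOMIC) is the
counterpoint to keeping extra high-order pages free up front: a pageblock is
only converted to the high-atomic reserve after a high-order atomic
allocation has actually succeeded, so the reserve tracks demand. A sketch of
that conversion follows; the nr_reserved_highatomic field and the
roughly-1%-of-the-zone cap are assumptions about the series rather than
guaranteed details.

/*
 * Sketch: after a high-order atomic allocation succeeds from pageblock
 * 'page', convert that pageblock to MIGRATE_HIGHATOMIC so future atomic
 * requests have somewhere to go. The cap and the nr_reserved_highatomic
 * accounting are assumptions for illustration.
 */
static void reserve_highatomic_sketch(struct page *page, struct zone *zone)
{
	unsigned long max_managed, flags;
	int mt;

	/* Cap the reserve at roughly 1% of the zone */
	max_managed = (zone->managed_pages / 100) + pageblock_nr_pages;
	if (zone->nr_reserved_highatomic >= max_managed)
		return;

	spin_lock_irqsave(&zone->lock, flags);

	/* Recheck the limit under the zone lock */
	if (zone->nr_reserved_highatomic >= max_managed)
		goto out_unlock;

	mt = get_pageblock_migratetype(page);
	if (mt != MIGRATE_HIGHATOMIC && !is_migrate_isolate(mt) &&
	    !is_migrate_cma(mt)) {
		zone->nr_reserved_highatomic += pageblock_nr_pages;
		set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC);
		move_freepages_block(zone, page, MIGRATE_HIGHATOMIC);
	}

out_unlock:
	spin_unlock_irqrestore(&zone->lock, flags);
}

Because nothing is reserved until an atomic high-order allocation has already
happened, workloads that never issue such requests pay no up-front
reclaim/compaction cost.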

end of thread, other threads:[~2015-09-30 15:19 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-24 12:09 [PATCH 00/12] Remove zonelist cache and high-order watermark checking v3 Mel Gorman
2015-08-24 12:09 ` [PATCH 01/12] mm, page_alloc: Remove unnecessary parameter from zone_watermark_ok_safe Mel Gorman
2015-08-24 12:09 ` [PATCH 02/12] mm, page_alloc: Remove unnecessary recalculations for dirty zone balancing Mel Gorman
2015-08-24 12:09 ` [PATCH 03/12] mm, page_alloc: Remove unnecessary taking of a seqlock when cpusets are disabled Mel Gorman
2015-08-26 10:25   ` Michal Hocko
2015-08-24 12:09 ` [PATCH 04/12] mm, page_alloc: Only check cpusets when one exists that can be mem-controlled Mel Gorman
2015-08-24 12:37   ` Vlastimil Babka
2015-08-24 13:16     ` Mel Gorman
2015-08-24 20:53       ` Vlastimil Babka
2015-08-25 10:33         ` Mel Gorman
2015-08-25 11:09           ` Vlastimil Babka
2015-08-26 13:41             ` Mel Gorman
2015-08-26 10:46   ` Michal Hocko
2015-08-24 12:09 ` [PATCH 05/12] mm, page_alloc: Remove unecessary recheck of nodemask Mel Gorman
2015-08-25 14:32   ` Vlastimil Babka
2015-08-24 12:09 ` [PATCH 06/12] mm, page_alloc: Use masks and shifts when converting GFP flags to migrate types Mel Gorman
2015-08-25 14:36   ` Vlastimil Babka
2015-08-24 12:09 ` [PATCH 07/12] mm, page_alloc: Distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd Mel Gorman
2015-08-24 18:29   ` Mel Gorman
2015-08-25 15:37   ` Vlastimil Babka
2015-08-26 14:45     ` Mel Gorman
2015-08-26 16:24       ` Vlastimil Babka
2015-08-26 18:10         ` Mel Gorman
2015-08-27  9:18           ` Vlastimil Babka
2015-08-25 15:48   ` Vlastimil Babka
2015-08-26 13:05   ` Michal Hocko
2015-09-08  6:49   ` Joonsoo Kim
2015-09-09 12:22     ` Mel Gorman
2015-09-18  6:25       ` Joonsoo Kim
2015-08-24 12:09 ` [PATCH 08/12] mm, page_alloc: Rename __GFP_WAIT to __GFP_RECLAIM Mel Gorman
2015-08-26 12:19   ` Vlastimil Babka
2015-08-24 12:09 ` [PATCH 09/12] mm, page_alloc: Delete the zonelist_cache Mel Gorman
2015-08-24 12:29 ` [PATCH 10/12] mm, page_alloc: Remove MIGRATE_RESERVE Mel Gorman
2015-08-24 12:29 ` [PATCH 11/12] mm, page_alloc: Reserve pageblocks for high-order atomic allocations on demand Mel Gorman
2015-08-26 12:44   ` Vlastimil Babka
2015-08-26 14:53   ` Michal Hocko
2015-08-26 15:38     ` Mel Gorman
2015-09-08  8:01   ` Joonsoo Kim
2015-09-09 12:32     ` Mel Gorman
2015-09-18  6:38       ` Joonsoo Kim
2015-09-21 10:51         ` Mel Gorman
2015-08-24 12:30 ` [PATCH 12/12] mm, page_alloc: Only enforce watermarks for order-0 allocations Mel Gorman
2015-08-26 13:42   ` Vlastimil Babka
2015-08-26 14:53     ` Mel Gorman
2015-08-28 12:10   ` Michal Hocko
2015-08-28 14:12     ` Mel Gorman
2015-09-08  8:26   ` Joonsoo Kim
2015-09-09 12:39     ` Mel Gorman
2015-09-18  6:56       ` Joonsoo Kim
2015-09-21 10:51         ` Mel Gorman
2015-09-30  8:51       ` Vitaly Wool
2015-09-30 13:52         ` Vlastimil Babka
2015-09-30 14:16           ` Vitaly Wool
2015-09-30 14:43             ` Vlastimil Babka
2015-09-30 15:18               ` Mel Gorman
