linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] mm: give __GFP_REPEAT a better semantic
@ 2017-06-23  8:53 Michal Hocko
  2017-06-23  8:53 ` [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request Michal Hocko
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Alex Belits, Christoph Hellwig, Chris Wilson,
	Darrick J. Wong, David Daney, Michal Hocko, Ralf Baechle

Hi,
this is a follow up for __GFP_REPEAT clean up merged in 4.7. The previous
version of this patch series was posted as an RFC
http://lkml.kernel.org/r/20170307154843.32516-1-mhocko@kernel.org
Since then I have updated the documentation based on feedback from Neil
Brown. It also seems that drm/i915 guys would like to use the flag as
well.  A new __GFP_REPEAT user in MIPS code has emerged so this is fixed
as well.  I have also added some more users into the core VM. We can
discuss them separately but I guess we have grown to the state where no
other alternative has been proposed while there is a demand for the new
semantic. We should simply merge the new flag and new users will emerge
over time. There is no need for a flag day to convert all of them at once.

This is based on the current linux-next (next-20170623)

The main motivation for the change is that the current implementation of
__GFP_REPEAT is not very much useful. 

The documentation says:
 * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
 *   _might_ fail.  This depends upon the particular VM implementation.

It just fails to mention that this is true only for large (costly) high
order which has been the case since the flag was introduced. A similar
semantic would be really helpful for smal orders as well, though,
because we have places where a failure with a specific fallback error
handling is preferred to a potential endless loop inside the page
allocator.

The earlier cleanup dropped __GFP_REPEAT usage for low (!costly) order
users so only those which might use larger orders have stayed. One new user
added in the meantime is addressed in patch 1.

Let's rename the flag to something more verbose and use it for existing
users. Semantic for those will not change. Then implement low (!costly)
orders failure path which is hit after the page allocator is about
to invoke the oom killer. With that we have a good counterpart for
__GFP_NORETRY and finally can tell try as hard as possible without the
OOM killer.

Xfs code already has an existing annotation for allocations which are
allowed to fail and we can trivially map them to the new gfp flag
because it will provide the semantic KM_MAYFAIL wants. Christoph didn't
consider the new flag really necessary but didn't respond to the OOM
killer aspect of the change so I have kept the patch. If this is still
seen as not really needed I can drop the patch.

kvmalloc will allow also !costly high order allocations to retry hard
before falling back to the vmalloc.

drm/i915 asked for the new semantic explicitly.

Memory migration code, especially for the memory hotplug, should back off
rather than invoking the OOM killer as well.

Diffstat (the biggest addition is the documentation)
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/mips/include/asm/pgalloc.h              |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 drivers/gpu/drm/i915/i915_gem.c              |  3 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vsock.c                        |  2 +-
 fs/xfs/kmem.h                                | 10 +++++
 include/linux/gfp.h                          | 55 +++++++++++++++++++++-------
 include/linux/migrate.h                      |  2 +-
 include/linux/slab.h                         |  3 +-
 include/trace/events/mmflags.h               |  2 +-
 mm/hugetlb.c                                 |  4 +-
 mm/internal.h                                |  2 +-
 mm/memory-failure.c                          |  3 +-
 mm/mempolicy.c                               |  3 +-
 mm/page_alloc.c                              | 14 +++++--
 mm/sparse-vmemmap.c                          |  4 +-
 mm/util.c                                    | 14 ++-----
 mm/vmalloc.c                                 |  2 +-
 mm/vmscan.c                                  |  8 ++--
 net/core/dev.c                               |  6 +--
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 29 files changed, 103 insertions(+), 58 deletions(-)

Shortlog
Michal Hocko (6):
      MIPS: do not use __GFP_REPEAT for order-0 request
      mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
      xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
      mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
      drm/i915: use __GFP_RETRY_MAYFAIL
      mm, migration: do not trigger OOM killer when migrating memory

Thanks

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request
  2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
@ 2017-06-23  8:53 ` Michal Hocko
  2017-06-23 10:27   ` Ralf Baechle
  2017-06-26  9:36   ` Vlastimil Babka
  2017-06-23  8:53 ` [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic Michal Hocko
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko, Alex Belits, David Daney, Ralf Baechle

From: Michal Hocko <mhocko@suse.com>

3377e227af44 ("MIPS: Add 48-bit VA space (and 4-level page tables) for
4K pages.") has added a new __GFP_REPEAT user but using this flag
doesn't really make any sense for order-0 request which is the case here
because PUD_ORDER is 0. __GFP_REPEAT has historically effect only on
allocation requests with order > PAGE_ALLOC_COSTLY_ORDER.

This doesn't introduce any functional change. This is a preparatory
patch for later work which renames the flag and redefines its semantic.

Cc: Alex Belits <alex.belits@cavium.com>
Cc: David Daney <david.daney@cavium.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 arch/mips/include/asm/pgalloc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index a1bdb1ea5234..39b9f311c4ef 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -116,7 +116,7 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
 	pud_t *pud;
 
-	pud = (pud_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PUD_ORDER);
+	pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_ORDER);
 	if (pud)
 		pud_init((unsigned long)pud, (unsigned long)invalid_pmd_table);
 	return pud;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
  2017-06-23  8:53 ` [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request Michal Hocko
@ 2017-06-23  8:53 ` Michal Hocko
  2017-06-26 11:45   ` Vlastimil Babka
  2017-06-26 11:53   ` Vlastimil Babka
  2017-06-23  8:53 ` [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL Michal Hocko
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

__GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
the page allocator. This has been true but only for allocations requests
larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
smaller sizes. This is a bit unfortunate because there is no way to
express the same semantic for those requests and they are considered too
important to fail so they might end up looping in the page allocator for
ever, similarly to GFP_NOFAIL requests.

Now that the whole tree has been cleaned up and accidental or misled
usage of __GFP_REPEAT flag has been removed for !costly requests we can
give the original flag a better name and more importantly a more useful
semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that
the allocator would try really hard but there is no promise of a
success. This will work independent of the order and overrides the
default allocator behavior. Page allocator users have several levels of
guarantee vs. cost options (take GFP_KERNEL as an example)
- GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
  attempt to free memory at all. The most light weight mode which even
  doesn't kick the background reclaim. Should be used carefully because
  it might deplete the memory and the next user might hit the more
  aggressive reclaim
- GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
  allocation without any attempt to free memory from the current context
  but can wake kswapd to reclaim memory if the zone is below the low
  watermark. Can be used from either atomic contexts or when the request
  is a performance optimization and there is another fallback for a slow
  path.
- (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non
  sleeping allocation with an expensive fallback so it can access some
  portion of memory reserves. Usually used from interrupt/bh context with
  an expensive slow path fallback.
- GFP_KERNEL - both background and direct reclaim are allowed and the
  _default_ page allocator behavior is used. That means that !costly
  allocation requests are basically nofail (unless the requesting task
  is killed by the OOM killer) and costly will fail early rather than
  cause disruptive reclaim.
- GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
  all allocation requests fail early rather than cause disruptive
  reclaim (one round of reclaim in this implementation). The OOM killer
  is not invoked.
- GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
  and all allocation requests try really hard. The request will fail if the
  reclaim cannot make any progress. The OOM killer won't be triggered.
- GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
  and all allocation requests will loop endlessly until they
  succeed. This might be really dangerous especially for larger orders.

Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
they already had their semantic. No new users are added.
__alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
there is no progress and we have already passed the OOM point. This
means that all the reclaim opportunities have been exhausted except the
most disruptive one (the OOM killer) and a user defined fallback
behavior is more sensible than keep retrying in the page allocator.

Changes since RFC
- udpate documentation wording as per Neil Brown

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 Documentation/DMA-ISA-LPC.txt                |  2 +-
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c          |  2 +-
 drivers/mmc/host/wbsd.c                      |  2 +-
 drivers/s390/char/vmcp.c                     |  2 +-
 drivers/target/target_core_transport.c       |  2 +-
 drivers/vhost/net.c                          |  2 +-
 drivers/vhost/scsi.c                         |  2 +-
 drivers/vhost/vsock.c                        |  2 +-
 include/linux/gfp.h                          | 55 +++++++++++++++++++++-------
 include/linux/slab.h                         |  3 +-
 include/trace/events/mmflags.h               |  2 +-
 mm/hugetlb.c                                 |  4 +-
 mm/internal.h                                |  2 +-
 mm/page_alloc.c                              | 14 +++++--
 mm/sparse-vmemmap.c                          |  4 +-
 mm/util.c                                    |  6 +--
 mm/vmalloc.c                                 |  2 +-
 mm/vmscan.c                                  |  8 ++--
 net/core/dev.c                               |  6 +--
 net/core/skbuff.c                            |  2 +-
 net/sched/sch_fq.c                           |  2 +-
 tools/perf/builtin-kmem.c                    |  2 +-
 23 files changed, 84 insertions(+), 46 deletions(-)

diff --git a/Documentation/DMA-ISA-LPC.txt b/Documentation/DMA-ISA-LPC.txt
index c41331398752..7a065ac4a9d1 100644
--- a/Documentation/DMA-ISA-LPC.txt
+++ b/Documentation/DMA-ISA-LPC.txt
@@ -42,7 +42,7 @@ requirements you pass the flag GFP_DMA to kmalloc.
 
 Unfortunately the memory available for ISA DMA is scarce so unless you
 allocate the memory during boot-up it's a good idea to also pass
-__GFP_REPEAT and __GFP_NOWARN to make the allocator try a bit harder.
+__GFP_RETRY_MAYFAIL and __GFP_NOWARN to make the allocator try a bit harder.
 
 (This scarcity also means that you should allocate the buffer as
 early as possible and not release it until the driver is unloaded.)
diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 20b1485ff1e8..e2329db9d6f4 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -56,7 +56,7 @@ static inline pgd_t *radix__pgd_alloc(struct mm_struct *mm)
 	return (pgd_t *)__get_free_page(pgtable_gfp_flags(mm, PGALLOC_GFP));
 #else
 	struct page *page;
-	page = alloc_pages(pgtable_gfp_flags(mm, PGALLOC_GFP | __GFP_REPEAT),
+	page = alloc_pages(pgtable_gfp_flags(mm, PGALLOC_GFP | __GFP_RETRY_MAYFAIL),
 				4);
 	if (!page)
 		return NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 710e491206ed..8cb0190e2a73 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -93,7 +93,7 @@ int kvmppc_allocate_hpt(struct kvm_hpt_info *info, u32 order)
 	}
 
 	if (!hpt)
-		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT
+		hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_RETRY_MAYFAIL
 				       |__GFP_NOWARN, order - PAGE_SHIFT);
 
 	if (!hpt)
diff --git a/drivers/mmc/host/wbsd.c b/drivers/mmc/host/wbsd.c
index e15a9733fcfd..9668616faf16 100644
--- a/drivers/mmc/host/wbsd.c
+++ b/drivers/mmc/host/wbsd.c
@@ -1386,7 +1386,7 @@ static void wbsd_request_dma(struct wbsd_host *host, int dma)
 	 * order for ISA to be able to DMA to it.
 	 */
 	host->dma_buffer = kmalloc(WBSD_DMA_SIZE,
-		GFP_NOIO | GFP_DMA | __GFP_REPEAT | __GFP_NOWARN);
+		GFP_NOIO | GFP_DMA | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
 	if (!host->dma_buffer)
 		goto free;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 65f5a794f26d..98749fa817da 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -98,7 +98,7 @@ vmcp_write(struct file *file, const char __user *buff, size_t count,
 	}
 	if (!session->response)
 		session->response = (char *)__get_free_pages(GFP_KERNEL
-						| __GFP_REPEAT | GFP_DMA,
+						| __GFP_RETRY_MAYFAIL | GFP_DMA,
 						get_order(session->bufsize));
 	if (!session->response) {
 		mutex_unlock(&session->mutex);
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index a5ecec8f3996..9cea1eb8f019 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -252,7 +252,7 @@ int transport_alloc_session_tags(struct se_session *se_sess,
 	int rc;
 
 	se_sess->sess_cmd_map = kzalloc(tag_num * tag_size,
-					GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+					GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!se_sess->sess_cmd_map) {
 		se_sess->sess_cmd_map = vzalloc(tag_num * tag_size);
 		if (!se_sess->sess_cmd_map) {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index e3d7ea1288c6..06d044862e58 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -897,7 +897,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
 	struct sk_buff **queue;
 	int i;
 
-	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_REPEAT);
+	n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!n)
 		return -ENOMEM;
 	vqs = kmalloc(VHOST_NET_VQ_MAX * sizeof(*vqs), GFP_KERNEL);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 679f8960db4b..046f6d280af5 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1399,7 +1399,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f)
 	struct vhost_virtqueue **vqs;
 	int r = -ENOMEM, i;
 
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT);
+	vs = kzalloc(sizeof(*vs), GFP_KERNEL | __GFP_NOWARN | __GFP_RETRY_MAYFAIL);
 	if (!vs) {
 		vs = vzalloc(sizeof(*vs));
 		if (!vs)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 3f63e03de8e8..c9de9c41aa97 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -508,7 +508,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
 	/* This struct is large and allocation could fail, fall back to vmalloc
 	 * if there is no other way.
 	 */
-	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_REPEAT);
+	vsock = kvmalloc(sizeof(*vsock), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!vsock)
 		return -ENOMEM;
 
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4c6656f1fee7..6be1f836b69e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_REPEAT		0x400u
+#define ___GFP_RETRY_MAYFAIL		0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -136,26 +136,55 @@ struct vm_area_struct;
  *
  * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
  *
- * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
- *   _might_ fail.  This depends upon the particular VM implementation.
+ * The default allocator behavior depends on the request size. We have a concept
+ * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
+ * !costly allocations are too essential to fail so they are implicitly
+ * non-failing (with some exceptions like OOM victims might fail) by default while
+ * costly requests try to be not disruptive and back off even without invoking
+ * the OOM killer. The following three modifiers might be used to override some of
+ * these implicit rules
+ *
+ * __GFP_NORETRY: The VM implementation will try only very lightweight
+ *   memory direct reclaim to get some memory under memory pressure (thus
+ *   it can sleep). It will avoid disruptive actions like OOM killer. The
+ *   caller must handle the failure which is quite likely to happen under
+ *   heavy memory pressure. The flag is suitable when failure can easily be
+ *   handled at small cost, such as reduced throughput
+ *
+ * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
+ *   procedures that have previously failed if there is some indication
+ *   that progress has been made else where.  It can wait for other
+ *   tasks to attempt high level approaches to freeing memory such as
+ *   compaction (which removes fragmentation) and page-out.
+ *   There is still a definite limit to the number of retries, but it is
+ *   a larger limit than with __GFP_NORERY.
+ *   Allocations with this flag may fail, but only when there is
+ *   genuinely little unused memory. While these allocations do not
+ *   directly trigger the OOM killer, their failure indicates that
+ *   the system is likely to need to use the OOM killer soon.  The
+ *   caller must handle failure, but can reasonably do so by failing
+ *   a higher-level request, or completing it only in a much less
+ *   efficient manner.
+ *   If the allocation does fail, and the caller is in a position to
+ *   free some non-essential memory, doing so could benefit the system
+ *   as a whole.
  *
  * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
- *   cannot handle allocation failures. New users should be evaluated carefully
- *   (and the flag should be used only when there is no reasonable failure
- *   policy) but it is definitely preferable to use the flag rather than
- *   opencode endless loop around allocator.
- *
- * __GFP_NORETRY: The VM implementation must not retry indefinitely and will
- *   return NULL when direct reclaim and memory compaction have failed to allow
- *   the allocation to succeed.  The OOM killer is not called with the current
- *   implementation.
+ *   cannot handle allocation failures. The allocation could block
+ *   indefinitely but will never return with failure. Testing for
+ *   failure is pointless.
+ *   New users should be evaluated carefully (and the flag should be
+ *   used only when there is no reasonable failure policy) but it is
+ *   definitely preferable to use the flag rather than opencode endless
+ *   loop around allocator.
+ *   Using this flag for costly allocations is _highly_ discouraged.
  */
 #define __GFP_IO	((__force gfp_t)___GFP_IO)
 #define __GFP_FS	((__force gfp_t)___GFP_FS)
 #define __GFP_DIRECT_RECLAIM	((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
 #define __GFP_KSWAPD_RECLAIM	((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
 #define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
-#define __GFP_REPEAT	((__force gfp_t)___GFP_REPEAT)
+#define __GFP_RETRY_MAYFAIL	((__force gfp_t)___GFP_RETRY_MAYFAIL)
 #define __GFP_NOFAIL	((__force gfp_t)___GFP_NOFAIL)
 #define __GFP_NORETRY	((__force gfp_t)___GFP_NORETRY)
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 04a7f7993e67..41473df6dfb0 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -471,7 +471,8 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  *
  * %__GFP_NOWARN - If allocation fails, don't issue any warnings.
  *
- * %__GFP_REPEAT - If allocation fails initially, try once more before failing.
+ * %__GFP_RETRY_MAYFAIL - Try really hard to succeed the allocation but fail
+ *   eventually.
  *
  * There are other flags available as well, but these are not intended
  * for general use, and so are not documented here. For a full list of
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 10e3663a75a6..8e50d01c645f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -34,7 +34,7 @@
 	{(unsigned long)__GFP_FS,		"__GFP_FS"},		\
 	{(unsigned long)__GFP_COLD,		"__GFP_COLD"},		\
 	{(unsigned long)__GFP_NOWARN,		"__GFP_NOWARN"},	\
-	{(unsigned long)__GFP_REPEAT,		"__GFP_REPEAT"},	\
+	{(unsigned long)__GFP_RETRY_MAYFAIL,	"__GFP_RETRY_MAYFAIL"},	\
 	{(unsigned long)__GFP_NOFAIL,		"__GFP_NOFAIL"},	\
 	{(unsigned long)__GFP_NORETRY,		"__GFP_NORETRY"},	\
 	{(unsigned long)__GFP_COMP,		"__GFP_COMP"},		\
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 907786581812..c9e1734a371f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1385,7 +1385,7 @@ static struct page *alloc_fresh_huge_page_node(struct hstate *h, int nid)
 
 	page = __alloc_pages_node(nid,
 		htlb_alloc_mask(h)|__GFP_COMP|__GFP_THISNODE|
-						__GFP_REPEAT|__GFP_NOWARN,
+						__GFP_RETRY_MAYFAIL|__GFP_NOWARN,
 		huge_page_order(h));
 	if (page) {
 		prep_new_huge_page(h, page, nid);
@@ -1534,7 +1534,7 @@ static struct page *__hugetlb_alloc_buddy_huge_page(struct hstate *h,
 		struct vm_area_struct *vma, unsigned long addr, int nid)
 {
 	int order = huge_page_order(h);
-	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_REPEAT|__GFP_NOWARN;
+	gfp_t gfp = htlb_alloc_mask(h)|__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
 	unsigned int cpuset_mems_cookie;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 0e4f558412fb..24d88f084705 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -23,7 +23,7 @@
  * hints such as HIGHMEM usage.
  */
 #define GFP_RECLAIM_MASK (__GFP_RECLAIM|__GFP_HIGH|__GFP_IO|__GFP_FS|\
-			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
+			__GFP_NOWARN|__GFP_RETRY_MAYFAIL|__GFP_NOFAIL|\
 			__GFP_NORETRY|__GFP_MEMALLOC|__GFP_NOMEMALLOC|\
 			__GFP_ATOMIC)
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b896897dcda7..b92e438046ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3281,6 +3281,14 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	/* The OOM killer will not help higher order allocs */
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		goto out;
+	/*
+	 * We have already exhausted all our reclaim opportunities without any
+	 * success so it is time to admit defeat. We will skip the OOM killer
+	 * because it is very likely that the caller has a more reasonable
+	 * fallback than shooting a random task.
+	 */
+	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+		goto out;
 	/* The OOM killer does not needlessly kill tasks for lowmem */
 	if (ac->high_zoneidx < ZONE_NORMAL)
 		goto out;
@@ -3410,7 +3418,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	}
 
 	/*
-	 * !costly requests are much more important than __GFP_REPEAT
+	 * !costly requests are much more important than __GFP_RETRY_MAYFAIL
 	 * costly ones because they are de facto nofail and invoke OOM
 	 * killer to move on while costly can fail and users are ready
 	 * to cope with that. 1/4 retries is rather arbitrary but we
@@ -3917,9 +3925,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/*
 	 * Do not retry costly high order allocations unless they are
-	 * __GFP_REPEAT
+	 * __GFP_RETRY_MAYFAIL
 	 */
-	if (costly_order && !(gfp_mask & __GFP_REPEAT))
+	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
 		goto nopage;
 
 	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index a56c3989f773..c50b1a14d55e 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -56,11 +56,11 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 
 		if (node_state(node, N_HIGH_MEMORY))
 			page = alloc_pages_node(
-				node, GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				node, GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		else
 			page = alloc_pages(
-				GFP_KERNEL | __GFP_ZERO | __GFP_REPEAT,
+				GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
 				get_order(size));
 		if (page)
 			return page_address(page);
diff --git a/mm/util.c b/mm/util.c
index 26be6407abd7..6520f2d4a226 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -339,7 +339,7 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_REPEAT
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
  * is supported only for large (>32kB) allocations, and it should be used only if
  * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
@@ -367,11 +367,11 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 		kmalloc_flags |= __GFP_NOWARN;
 
 		/*
-		 * We have to override __GFP_REPEAT by __GFP_NORETRY for !costly
+		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
 		 * requests because there is no other way to tell the allocator
 		 * that we want to fail rather than retry endlessly.
 		 */
-		if (!(kmalloc_flags & __GFP_REPEAT) ||
+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
 				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 6016ab079e2b..8698c1c86c4d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1795,7 +1795,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
  *	allocator with @gfp_mask flags.  Map them into contiguous
  *	kernel virtual space, using a pagetable protection of @prot.
  *
- *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_REPEAT
+ *	Reclaim modifiers in @gfp_mask - __GFP_NORETRY, __GFP_RETRY_MAYFAIL
  *	and __GFP_NOFAIL are not supported
  *
  *	Any use of gfp flags outside of GFP_KERNEL should be consulted
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f84cdd3751e1..efc9da21c5e6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2506,18 +2506,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 		return false;
 
 	/* Consider stopping depending on scan and reclaim activity */
-	if (sc->gfp_mask & __GFP_REPEAT) {
+	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
 		/*
-		 * For __GFP_REPEAT allocations, stop reclaiming if the
+		 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
 		 * full LRU list has been scanned and we are still failing
 		 * to reclaim pages. This full LRU scan is potentially
-		 * expensive but a __GFP_REPEAT caller really wants to succeed
+		 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
 		 */
 		if (!nr_reclaimed && !nr_scanned)
 			return false;
 	} else {
 		/*
-		 * For non-__GFP_REPEAT allocations which can presumably
+		 * For non-__GFP_RETRY_MAYFAIL allocations which can presumably
 		 * fail without consequence, stop if we failed to reclaim
 		 * any pages from the last SWAP_CLUSTER_MAX number of
 		 * pages that were scanned. This will return to the
diff --git a/net/core/dev.c b/net/core/dev.c
index df7637733e3c..550c27a2efcd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7370,7 +7370,7 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
 	BUG_ON(count < 1);
 
-	rx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	rx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!rx)
 		return -ENOMEM;
 
@@ -7410,7 +7410,7 @@ static int netif_alloc_netdev_queues(struct net_device *dev)
 	if (count < 1 || count > 0xffff)
 		return -EINVAL;
 
-	tx = kvzalloc(sz, GFP_KERNEL | __GFP_REPEAT);
+	tx = kvzalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!tx)
 		return -ENOMEM;
 
@@ -7951,7 +7951,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 	/* ensure 32-byte alignment of whole construct */
 	alloc_size += NETDEV_ALIGN - 1;
 
-	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_REPEAT);
+	p = kvzalloc(alloc_size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
 	if (!p)
 		return NULL;
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f75897a33fa4..2bff10a20bc9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4747,7 +4747,7 @@ struct sk_buff *alloc_skb_with_frags(unsigned long header_len,
 
 	gfp_head = gfp_mask;
 	if (gfp_head & __GFP_DIRECT_RECLAIM)
-		gfp_head |= __GFP_REPEAT;
+		gfp_head |= __GFP_RETRY_MAYFAIL;
 
 	*errcode = -ENOBUFS;
 	skb = alloc_skb(header_len, gfp_head);
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 147fde73a0f5..263d16e3219e 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -648,7 +648,7 @@ static int fq_resize(struct Qdisc *sch, u32 log)
 		return 0;
 
 	/* If XPS was setup, we can allocate memory on right NUMA node */
-	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_REPEAT,
+	array = kvmalloc_node(sizeof(struct rb_root) << log, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
 			      netdev_queue_numa_node_read(sch->dev_queue));
 	if (!array)
 		return -ENOMEM;
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 9409c9464667..c4222ea452e9 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -643,7 +643,7 @@ static const struct {
 	{ "__GFP_FS",			"F" },
 	{ "__GFP_COLD",			"CO" },
 	{ "__GFP_NOWARN",		"NWR" },
-	{ "__GFP_REPEAT",		"R" },
+	{ "__GFP_RETRY_MAYFAIL",	"R" },
 	{ "__GFP_NOFAIL",		"NF" },
 	{ "__GFP_NORETRY",		"NR" },
 	{ "__GFP_COMP",			"C" },
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
  2017-06-23  8:53 ` [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request Michal Hocko
  2017-06-23  8:53 ` [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic Michal Hocko
@ 2017-06-23  8:53 ` Michal Hocko
  2017-06-27  8:49   ` Michal Hocko
  2017-06-23  8:53 ` [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes Michal Hocko
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko, Christoph Hellwig, Darrick J. Wong

From: Michal Hocko <mhocko@suse.com>

KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
so it relied on the default page allocator behavior for the given set
of flags. This means that small allocations actually never failed.

Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
allocation request size we can map KM_MAYFAIL to it. The allocator will
try as hard as it can to fulfill the request but fails eventually if
the progress cannot be made. It does so without triggering the OOM
killer which can be seen as an improvement because KM_MAYFAIL users
should be able to deal with allocation failures.

Cc: Darrick J. Wong <darrick.wong@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/xfs/kmem.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d6ea520162b2..4d85992d75b2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
 			lflags &= ~__GFP_FS;
 	}
 
+	/*
+	 * Default page/slab allocator behavior is to retry for ever
+	 * for small allocations. We can override this behavior by using
+	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
+	 * as it is feasible but rather fail than retry forever for all
+	 * request sizes.
+	 */
+	if (flags & KM_MAYFAIL)
+		lflags |= __GFP_RETRY_MAYFAIL;
+
 	if (flags & KM_ZERO)
 		lflags |= __GFP_ZERO;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
  2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
                   ` (2 preceding siblings ...)
  2017-06-23  8:53 ` [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL Michal Hocko
@ 2017-06-23  8:53 ` Michal Hocko
  2017-06-26 12:00   ` Vlastimil Babka
  2017-06-23  8:53 ` [PATCH 5/6] drm/i915: use __GFP_RETRY_MAYFAIL Michal Hocko
  2017-06-23  8:53 ` [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory Michal Hocko
  5 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
request size we can drop the hackish implementation for !costly orders.
__GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
progress and backs of when we are out of memory for the requested size.
Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
to silent the oom killer anymore.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/util.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/util.c b/mm/util.c
index 6520f2d4a226..ee250e2cde34 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -339,9 +339,9 @@ EXPORT_SYMBOL(vm_mmap);
  * Uses kmalloc to get the memory but if the allocation fails then falls back
  * to the vmalloc allocator. Use kvfree for freeing the memory.
  *
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
- * is supported only for large (>32kB) allocations, and it should be used only if
- * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
+ * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
+ * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
+ * preferable to the vmalloc fallback, due to visible performance drawbacks.
  *
  * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
  */
@@ -366,13 +366,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
 	if (size > PAGE_SIZE) {
 		kmalloc_flags |= __GFP_NOWARN;
 
-		/*
-		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
-		 * requests because there is no other way to tell the allocator
-		 * that we want to fail rather than retry endlessly.
-		 */
-		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
-				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
 			kmalloc_flags |= __GFP_NORETRY;
 	}
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] drm/i915: use __GFP_RETRY_MAYFAIL
  2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
                   ` (3 preceding siblings ...)
  2017-06-23  8:53 ` [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes Michal Hocko
@ 2017-06-23  8:53 ` Michal Hocko
  2017-06-23  8:53 ` [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory Michal Hocko
  5 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko, Chris Wilson

From: Michal Hocko <mhocko@suse.com>

24f8e00a8a2e ("drm/i915: Prefer to report ENOMEM rather than incur the
oom for gfx allocations") has tried to remove disruptive OOM killer
because the userspace should be able to cope with allocation failures.
At the time only __GFP_NORETRY could achieve that and it turned out
that this would fail the allocations just too easily. So "drm/i915:
Remove __GFP_NORETRY from our buffer allocator" removed it and hoped
for a better solution. __GFP_RETRY_MAYFAIL is that solution. It will
keep retrying the allocation until there is no more progress and we
would go OOM. Instead we fail the allocation and let the caller to deal
with it.

Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 drivers/gpu/drm/i915/i915_gem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ae3ce1314bd1..eb193f27c8b7 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2434,8 +2434,9 @@ i915_gem_object_get_pages_gtt(struct drm_i915_gem_object *obj)
 				 * again with !__GFP_NORETRY. However, we still
 				 * want to fail this allocation rather than
 				 * trigger the out-of-memory killer and for
-				 * this we want the future __GFP_MAYFAIL.
+				 * this we want __GFP_RETRY_MAYFAIL.
 				 */
+				gfp |= __GFP_RETRY_MAYFAIL;
 			}
 		} while (1);
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory
  2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
                   ` (4 preceding siblings ...)
  2017-06-23  8:53 ` [PATCH 5/6] drm/i915: use __GFP_RETRY_MAYFAIL Michal Hocko
@ 2017-06-23  8:53 ` Michal Hocko
  2017-06-23 20:43   ` Andrew Morton
  2017-06-26 12:13   ` Vlastimil Babka
  5 siblings, 2 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-23  8:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko

From: Michal Hocko <mhocko@suse.com>

Page migration (for memory hotplug, soft_offline_page or mbind) needs
to allocate a new memory. This can trigger an oom killer if the target
memory is depleated. Although quite unlikely, still possible, especially
for the memory hotplug (offlining of memoery). Up to now we didn't
really have reasonable means to back off. __GFP_NORETRY can fail just
too easily and __GFP_THISNODE sticks to a single node and that is not
suitable for all callers.

But now that we have __GFP_RETRY_MAYFAIL we should use it.  It is
preferable to fail the migration than disrupt the system by killing some
processes.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/migrate.h | 2 +-
 mm/memory-failure.c     | 3 ++-
 mm/mempolicy.c          | 3 ++-
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index f80c9882403a..9f5885dae80e 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -34,7 +34,7 @@ extern char *migrate_reason_names[MR_TYPES];
 static inline struct page *new_page_nodemask(struct page *page, int preferred_nid,
 		nodemask_t *nodemask)
 {
-	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
+	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
 
 	if (PageHuge(page))
 		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e2e0cb0e1d0f..fe0c484c6fdb 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1492,7 +1492,8 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
 
 		return alloc_huge_page_node(hstate, nid);
 	} else {
-		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
+		return __alloc_pages_node(nid,
+				GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL, 0);
 	}
 }
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7d8e56214ac0..d911fa5cb2a7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1078,7 +1078,8 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
 	/*
 	 * if !vma, alloc_page_vma() will use task or system default policy
 	 */
-	return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+	return alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL,
+			vma, address);
 }
 #else
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request
  2017-06-23  8:53 ` [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request Michal Hocko
@ 2017-06-23 10:27   ` Ralf Baechle
  2017-06-26  9:36   ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Ralf Baechle @ 2017-06-23 10:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	NeilBrown, LKML, linux-mm, Michal Hocko, Alex Belits,
	David Daney

Feel free to funnel this upstream with the rest of your series.

Acked-by: Ralf Baechle <ralf@linux-mips.org>

Thanks,

  Ralf

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory
  2017-06-23  8:53 ` [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory Michal Hocko
@ 2017-06-23 20:43   ` Andrew Morton
  2017-06-26  5:28     ` Michal Hocko
  2017-06-26 12:13   ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Andrew Morton @ 2017-06-23 20:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Michal Hocko

On Fri, 23 Jun 2017 10:53:45 +0200 Michal Hocko <mhocko@kernel.org> wrote:

> From: Michal Hocko <mhocko@suse.com>
> 
> Page migration (for memory hotplug, soft_offline_page or mbind) needs
> to allocate a new memory. This can trigger an oom killer if the target
> memory is depleated. Although quite unlikely, still possible, especially
> for the memory hotplug (offlining of memoery). Up to now we didn't
> really have reasonable means to back off. __GFP_NORETRY can fail just
> too easily and __GFP_THISNODE sticks to a single node and that is not
> suitable for all callers.
> 
> But now that we have __GFP_RETRY_MAYFAIL we should use it.  It is
> preferable to fail the migration than disrupt the system by killing some
> processes.

I'm not sure which tree this is against...

> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1492,7 +1492,8 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>  
>  		return alloc_huge_page_node(hstate, nid);
>  	} else {
> -		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
> +		return __alloc_pages_node(nid,
> +				GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL, 0);
>  	}
>  }

new_page() is now

static struct page *new_page(struct page *p, unsigned long private, int **x)
{
	int nid = page_to_nid(p);

	return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
}

and new_page_nodemask() uses __GFP_RETRY_MAYFAIL so I simply dropped
the above hunk.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory
  2017-06-23 20:43   ` Andrew Morton
@ 2017-06-26  5:28     ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-26  5:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm

On Fri 23-06-17 13:43:05, Andrew Morton wrote:
> On Fri, 23 Jun 2017 10:53:45 +0200 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Page migration (for memory hotplug, soft_offline_page or mbind) needs
> > to allocate a new memory. This can trigger an oom killer if the target
> > memory is depleated. Although quite unlikely, still possible, especially
> > for the memory hotplug (offlining of memoery). Up to now we didn't
> > really have reasonable means to back off. __GFP_NORETRY can fail just
> > too easily and __GFP_THISNODE sticks to a single node and that is not
> > suitable for all callers.
> > 
> > But now that we have __GFP_RETRY_MAYFAIL we should use it.  It is
> > preferable to fail the migration than disrupt the system by killing some
> > processes.
> 
> I'm not sure which tree this is against...

next-20170623

> 
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1492,7 +1492,8 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
> >  
> >  		return alloc_huge_page_node(hstate, nid);
> >  	} else {
> > -		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
> > +		return __alloc_pages_node(nid,
> > +				GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL, 0);
> >  	}
> >  }
> 
> new_page() is now
> 
> static struct page *new_page(struct page *p, unsigned long private, int **x)
> {
> 	int nid = page_to_nid(p);
> 
> 	return new_page_nodemask(p, nid, &node_states[N_MEMORY]);
> }
> 
> and new_page_nodemask() uses __GFP_RETRY_MAYFAIL so I simply dropped
> the above hunk.

Ohh, right. This is
http://lkml.kernel.org/r/20170622193034.28972-4-mhocko@kernel.org. I've
just didn't realize it was not in mmotm yet. So yes the hunk can be
dropped, new_page_nodemask does what we need.
 
Sorry about that
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request
  2017-06-23  8:53 ` [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request Michal Hocko
  2017-06-23 10:27   ` Ralf Baechle
@ 2017-06-26  9:36   ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2017-06-26  9:36 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm,
	Michal Hocko, Alex Belits, David Daney, Ralf Baechle

On 06/23/2017 10:53 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> 3377e227af44 ("MIPS: Add 48-bit VA space (and 4-level page tables) for
> 4K pages.") has added a new __GFP_REPEAT user but using this flag
> doesn't really make any sense for order-0 request which is the case here
> because PUD_ORDER is 0. __GFP_REPEAT has historically effect only on
> allocation requests with order > PAGE_ALLOC_COSTLY_ORDER.
> 
> This doesn't introduce any functional change. This is a preparatory
> patch for later work which renames the flag and redefines its semantic.
> 
> Cc: Alex Belits <alex.belits@cavium.com>
> Cc: David Daney <david.daney@cavium.com>
> Cc: Ralf Baechle <ralf@linux-mips.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  arch/mips/include/asm/pgalloc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
> index a1bdb1ea5234..39b9f311c4ef 100644
> --- a/arch/mips/include/asm/pgalloc.h
> +++ b/arch/mips/include/asm/pgalloc.h
> @@ -116,7 +116,7 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
>  {
>  	pud_t *pud;
>  
> -	pud = (pud_t *) __get_free_pages(GFP_KERNEL|__GFP_REPEAT, PUD_ORDER);
> +	pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_ORDER);
>  	if (pud)
>  		pud_init((unsigned long)pud, (unsigned long)invalid_pmd_table);
>  	return pud;
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-23  8:53 ` [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic Michal Hocko
@ 2017-06-26 11:45   ` Vlastimil Babka
  2017-06-26 12:14     ` Michal Hocko
  2017-06-26 11:53   ` Vlastimil Babka
  1 sibling, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2017-06-26 11:45 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm, Michal Hocko

On 06/23/2017 10:53 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> __GFP_REPEAT was designed to allow retry-but-eventually-fail semantic to
> the page allocator. This has been true but only for allocations requests
> larger than PAGE_ALLOC_COSTLY_ORDER. It has been always ignored for
> smaller sizes. This is a bit unfortunate because there is no way to
> express the same semantic for those requests and they are considered too
> important to fail so they might end up looping in the page allocator for
> ever, similarly to GFP_NOFAIL requests.
> 
> Now that the whole tree has been cleaned up and accidental or misled
> usage of __GFP_REPEAT flag has been removed for !costly requests we can
> give the original flag a better name and more importantly a more useful
> semantic. Let's rename it to __GFP_RETRY_MAYFAIL which tells the user that
> the allocator would try really hard but there is no promise of a
> success. This will work independent of the order and overrides the
> default allocator behavior. Page allocator users have several levels of
> guarantee vs. cost options (take GFP_KERNEL as an example)
> - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
>   attempt to free memory at all. The most light weight mode which even
>   doesn't kick the background reclaim. Should be used carefully because
>   it might deplete the memory and the next user might hit the more
>   aggressive reclaim
> - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT)- optimistic
>   allocation without any attempt to free memory from the current context
>   but can wake kswapd to reclaim memory if the zone is below the low
>   watermark. Can be used from either atomic contexts or when the request
>   is a performance optimization and there is another fallback for a slow
>   path.
> - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) - non
>   sleeping allocation with an expensive fallback so it can access some
>   portion of memory reserves. Usually used from interrupt/bh context with
>   an expensive slow path fallback.
> - GFP_KERNEL - both background and direct reclaim are allowed and the
>   _default_ page allocator behavior is used. That means that !costly
>   allocation requests are basically nofail (unless the requesting task
>   is killed by the OOM killer)

Should we explicitly point out that failure must be handled? After lots
of talking about "too small to fail", people might get the wrong impression.

> and costly will fail early rather than
>   cause disruptive reclaim.
> - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
>   all allocation requests fail early rather than cause disruptive
>   reclaim (one round of reclaim in this implementation). The OOM killer
>   is not invoked.
> - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
>   and all allocation requests try really hard. The request will fail if the
>   reclaim cannot make any progress. The OOM killer won't be triggered.
> - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
>   and all allocation requests will loop endlessly until they
>   succeed. This might be really dangerous especially for larger orders.
> 
> Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
> they already had their semantic. No new users are added.
> __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
> there is no progress and we have already passed the OOM point. This
> means that all the reclaim opportunities have been exhausted except the
> most disruptive one (the OOM killer) and a user defined fallback
> behavior is more sensible than keep retrying in the page allocator.
> 
> Changes since RFC
> - udpate documentation wording as per Neil Brown
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Some more minor comments below:

...

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4c6656f1fee7..6be1f836b69e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -25,7 +25,7 @@ struct vm_area_struct;
>  #define ___GFP_FS		0x80u
>  #define ___GFP_COLD		0x100u
>  #define ___GFP_NOWARN		0x200u
> -#define ___GFP_REPEAT		0x400u
> +#define ___GFP_RETRY_MAYFAIL		0x400u

Seems like one tab too many, the end result is off:
(sigh, tabs are not only error prone, but also we make less money due to
them, I heard)

#define ___GFP_NOWARN           0x200u
#define ___GFP_RETRY_MAYFAIL            0x400u
#define ___GFP_NOFAIL           0x800u


>  #define ___GFP_NOFAIL		0x800u
>  #define ___GFP_NORETRY		0x1000u
>  #define ___GFP_MEMALLOC		0x2000u
> @@ -136,26 +136,55 @@ struct vm_area_struct;
>   *
>   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
>   *
> - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> - *   _might_ fail.  This depends upon the particular VM implementation.
> + * The default allocator behavior depends on the request size. We have a concept
> + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
> + * !costly allocations are too essential to fail so they are implicitly
> + * non-failing (with some exceptions like OOM victims might fail) by default while

Again, emphasize need for error handling?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-23  8:53 ` [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic Michal Hocko
  2017-06-26 11:45   ` Vlastimil Babka
@ 2017-06-26 11:53   ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2017-06-26 11:53 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm, Michal Hocko

On 06/23/2017 10:53 AM, Michal Hocko wrote:
...

> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 4c6656f1fee7..6be1f836b69e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -25,7 +25,7 @@ struct vm_area_struct;
>  #define ___GFP_FS		0x80u
>  #define ___GFP_COLD		0x100u
>  #define ___GFP_NOWARN		0x200u
> -#define ___GFP_REPEAT		0x400u
> +#define ___GFP_RETRY_MAYFAIL		0x400u
>  #define ___GFP_NOFAIL		0x800u
>  #define ___GFP_NORETRY		0x1000u
>  #define ___GFP_MEMALLOC		0x2000u
> @@ -136,26 +136,55 @@ struct vm_area_struct;
>   *
>   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
>   *
> - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> - *   _might_ fail.  This depends upon the particular VM implementation.
> + * The default allocator behavior depends on the request size. We have a concept
> + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
> + * !costly allocations are too essential to fail so they are implicitly
> + * non-failing (with some exceptions like OOM victims might fail) by default while
> + * costly requests try to be not disruptive and back off even without invoking
> + * the OOM killer. The following three modifiers might be used to override some of
> + * these implicit rules
> + *
> + * __GFP_NORETRY: The VM implementation will try only very lightweight
> + *   memory direct reclaim to get some memory under memory pressure (thus
> + *   it can sleep). It will avoid disruptive actions like OOM killer. The
> + *   caller must handle the failure which is quite likely to happen under
> + *   heavy memory pressure. The flag is suitable when failure can easily be
> + *   handled at small cost, such as reduced throughput
> + *
> + * __GFP_RETRY_MAYFAIL: The VM implementation will retry memory reclaim
> + *   procedures that have previously failed if there is some indication
> + *   that progress has been made else where.  It can wait for other
> + *   tasks to attempt high level approaches to freeing memory such as
> + *   compaction (which removes fragmentation) and page-out.
> + *   There is still a definite limit to the number of retries, but it is
> + *   a larger limit than with __GFP_NORERY.

Also, __GFP_NORETRY ^ (for grep purposes).

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
  2017-06-23  8:53 ` [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes Michal Hocko
@ 2017-06-26 12:00   ` Vlastimil Babka
  2017-06-26 12:18     ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2017-06-26 12:00 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm, Michal Hocko

On 06/23/2017 10:53 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
> request size we can drop the hackish implementation for !costly orders.
> __GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
> progress and backs of when we are out of memory for the requested size.
> Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
> to silent the oom killer anymore.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

The flag is now supported, but not for the embedded page table
allocations, so OOM is still theoretically possible, right?
That should be rare, though. Worth mentioning anywhere?

Other than that.
Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/util.c | 14 ++++----------
>  1 file changed, 4 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/util.c b/mm/util.c
> index 6520f2d4a226..ee250e2cde34 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -339,9 +339,9 @@ EXPORT_SYMBOL(vm_mmap);
>   * Uses kmalloc to get the memory but if the allocation fails then falls back
>   * to the vmalloc allocator. Use kvfree for freeing the memory.
>   *
> - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
> - * is supported only for large (>32kB) allocations, and it should be used only if
> - * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
> + * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> + * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
> + * preferable to the vmalloc fallback, due to visible performance drawbacks.
>   *
>   * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
>   */
> @@ -366,13 +366,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
>  	if (size > PAGE_SIZE) {
>  		kmalloc_flags |= __GFP_NOWARN;
>  
> -		/*
> -		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
> -		 * requests because there is no other way to tell the allocator
> -		 * that we want to fail rather than retry endlessly.
> -		 */
> -		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
> -				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> +		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
>  			kmalloc_flags |= __GFP_NORETRY;
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory
  2017-06-23  8:53 ` [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory Michal Hocko
  2017-06-23 20:43   ` Andrew Morton
@ 2017-06-26 12:13   ` Vlastimil Babka
  1 sibling, 0 replies; 24+ messages in thread
From: Vlastimil Babka @ 2017-06-26 12:13 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm, Michal Hocko

On 06/23/2017 10:53 AM, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Page migration (for memory hotplug, soft_offline_page or mbind) needs
> to allocate a new memory. This can trigger an oom killer if the target
> memory is depleated. Although quite unlikely, still possible, especially
> for the memory hotplug (offlining of memoery). Up to now we didn't
> really have reasonable means to back off. __GFP_NORETRY can fail just
> too easily and __GFP_THISNODE sticks to a single node and that is not
> suitable for all callers.
> 
> But now that we have __GFP_RETRY_MAYFAIL we should use it.  It is
> preferable to fail the migration than disrupt the system by killing some
> processes.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  include/linux/migrate.h | 2 +-
>  mm/memory-failure.c     | 3 ++-
>  mm/mempolicy.c          | 3 ++-
>  3 files changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index f80c9882403a..9f5885dae80e 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -34,7 +34,7 @@ extern char *migrate_reason_names[MR_TYPES];
>  static inline struct page *new_page_nodemask(struct page *page, int preferred_nid,
>  		nodemask_t *nodemask)
>  {
> -	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE;
> +	gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL;
>  
>  	if (PageHuge(page))
>  		return alloc_huge_page_nodemask(page_hstate(compound_head(page)),
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e2e0cb0e1d0f..fe0c484c6fdb 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1492,7 +1492,8 @@ static struct page *new_page(struct page *p, unsigned long private, int **x)
>  
>  		return alloc_huge_page_node(hstate, nid);
>  	} else {
> -		return __alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE, 0);
> +		return __alloc_pages_node(nid,
> +				GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL, 0);
>  	}
>  }
>  
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 7d8e56214ac0..d911fa5cb2a7 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1078,7 +1078,8 @@ static struct page *new_page(struct page *page, unsigned long start, int **x)
>  	/*
>  	 * if !vma, alloc_page_vma() will use task or system default policy
>  	 */
> -	return alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> +	return alloc_page_vma(GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL,
> +			vma, address);
>  }
>  #else
>  
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-26 11:45   ` Vlastimil Babka
@ 2017-06-26 12:14     ` Michal Hocko
  2017-06-26 12:17       ` Vlastimil Babka
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-06-26 12:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm

On Mon 26-06-17 13:45:19, Vlastimil Babka wrote:
> On 06/23/2017 10:53 AM, Michal Hocko wrote:
[...]
> > - GFP_KERNEL - both background and direct reclaim are allowed and the
> >   _default_ page allocator behavior is used. That means that !costly
> >   allocation requests are basically nofail (unless the requesting task
> >   is killed by the OOM killer)
> 
> Should we explicitly point out that failure must be handled? After lots
> of talking about "too small to fail", people might get the wrong impression.

OK. What about the following.
"That means that !costly allocation requests are basically nofail but
there is no guarantee of thaat behavior so failures have to be checked
properly by callers (e.g. OOM killer victim is allowed to fail
currently).

> > and costly will fail early rather than
> >   cause disruptive reclaim.
> > - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior and
> >   all allocation requests fail early rather than cause disruptive
> >   reclaim (one round of reclaim in this implementation). The OOM killer
> >   is not invoked.
> > - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator behavior
> >   and all allocation requests try really hard. The request will fail if the
> >   reclaim cannot make any progress. The OOM killer won't be triggered.
> > - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
> >   and all allocation requests will loop endlessly until they
> >   succeed. This might be really dangerous especially for larger orders.
> > 
> > Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL because
> > they already had their semantic. No new users are added.
> > __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
> > there is no progress and we have already passed the OOM point. This
> > means that all the reclaim opportunities have been exhausted except the
> > most disruptive one (the OOM killer) and a user defined fallback
> > behavior is more sensible than keep retrying in the page allocator.
> > 
> > Changes since RFC
> > - udpate documentation wording as per Neil Brown
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> Some more minor comments below:
> 
> ...
> 
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 4c6656f1fee7..6be1f836b69e 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -25,7 +25,7 @@ struct vm_area_struct;
> >  #define ___GFP_FS		0x80u
> >  #define ___GFP_COLD		0x100u
> >  #define ___GFP_NOWARN		0x200u
> > -#define ___GFP_REPEAT		0x400u
> > +#define ___GFP_RETRY_MAYFAIL		0x400u
> 
> Seems like one tab too many, the end result is off:

will fix

> (sigh, tabs are not only error prone, but also we make less money due to
> them, I heard)
> 
> #define ___GFP_NOWARN           0x200u
> #define ___GFP_RETRY_MAYFAIL            0x400u
> #define ___GFP_NOFAIL           0x800u
> 
> 
> >  #define ___GFP_NOFAIL		0x800u
> >  #define ___GFP_NORETRY		0x1000u
> >  #define ___GFP_MEMALLOC		0x2000u
> > @@ -136,26 +136,55 @@ struct vm_area_struct;
> >   *
> >   * __GFP_RECLAIM is shorthand to allow/forbid both direct and kswapd reclaim.
> >   *
> > - * __GFP_REPEAT: Try hard to allocate the memory, but the allocation attempt
> > - *   _might_ fail.  This depends upon the particular VM implementation.
> > + * The default allocator behavior depends on the request size. We have a concept
> > + * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
> > + * !costly allocations are too essential to fail so they are implicitly
> > + * non-failing (with some exceptions like OOM victims might fail) by default while
> 
> Again, emphasize need for error handling?

the same wording as above?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-26 12:14     ` Michal Hocko
@ 2017-06-26 12:17       ` Vlastimil Babka
  2017-06-26 12:38         ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2017-06-26 12:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm

On 06/26/2017 02:14 PM, Michal Hocko wrote:
> On Mon 26-06-17 13:45:19, Vlastimil Babka wrote:
>> On 06/23/2017 10:53 AM, Michal Hocko wrote:
> [...]
>>> - GFP_KERNEL - both background and direct reclaim are allowed and the
>>>   _default_ page allocator behavior is used. That means that !costly
>>>   allocation requests are basically nofail (unless the requesting task
>>>   is killed by the OOM killer)
>>
>> Should we explicitly point out that failure must be handled? After lots
>> of talking about "too small to fail", people might get the wrong impression.
> 
> OK. What about the following.
> "That means that !costly allocation requests are basically nofail but
> there is no guarantee of thaat behavior so failures have to be checked

                           that

> properly by callers (e.g. OOM killer victim is allowed to fail
> currently).

Looks good, thanks!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes
  2017-06-26 12:00   ` Vlastimil Babka
@ 2017-06-26 12:18     ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-26 12:18 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm

On Mon 26-06-17 14:00:27, Vlastimil Babka wrote:
> On 06/23/2017 10:53 AM, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> > 
> > Now that __GFP_RETRY_MAYFAIL has a reasonable semantic regardless of the
> > request size we can drop the hackish implementation for !costly orders.
> > __GFP_RETRY_MAYFAIL retries as long as the reclaim makes a forward
> > progress and backs of when we are out of memory for the requested size.
> > Therefore we do not need to enforce__GFP_NORETRY for !costly orders just
> > to silent the oom killer anymore.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> The flag is now supported, but not for the embedded page table
> allocations, so OOM is still theoretically possible, right?
> That should be rare, though. Worth mentioning anywhere?

Yes that is true. Not sure I would make it more complicated than
necessary. I can add a note in there if you insist but to me it sounds
like something that will only confuse people.
 
> Other than that.
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Thanks!

> > ---
> >  mm/util.c | 14 ++++----------
> >  1 file changed, 4 insertions(+), 10 deletions(-)
> > 
> > diff --git a/mm/util.c b/mm/util.c
> > index 6520f2d4a226..ee250e2cde34 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -339,9 +339,9 @@ EXPORT_SYMBOL(vm_mmap);
> >   * Uses kmalloc to get the memory but if the allocation fails then falls back
> >   * to the vmalloc allocator. Use kvfree for freeing the memory.
> >   *
> > - * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. __GFP_RETRY_MAYFAIL
> > - * is supported only for large (>32kB) allocations, and it should be used only if
> > - * kmalloc is preferable to the vmalloc fallback, due to visible performance drawbacks.
> > + * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
> > + * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
> > + * preferable to the vmalloc fallback, due to visible performance drawbacks.
> >   *
> >   * Any use of gfp flags outside of GFP_KERNEL should be consulted with mm people.
> >   */
> > @@ -366,13 +366,7 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
> >  	if (size > PAGE_SIZE) {
> >  		kmalloc_flags |= __GFP_NOWARN;
> >  
> > -		/*
> > -		 * We have to override __GFP_RETRY_MAYFAIL by __GFP_NORETRY for !costly
> > -		 * requests because there is no other way to tell the allocator
> > -		 * that we want to fail rather than retry endlessly.
> > -		 */
> > -		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL) ||
> > -				(size <= PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
> > +		if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
> >  			kmalloc_flags |= __GFP_NORETRY;
> >  	}
> >  
> > 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-26 12:17       ` Vlastimil Babka
@ 2017-06-26 12:38         ` Michal Hocko
  2017-06-26 12:42           ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-06-26 12:38 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm

On Mon 26-06-17 14:17:30, Vlastimil Babka wrote:
> On 06/26/2017 02:14 PM, Michal Hocko wrote:
> > On Mon 26-06-17 13:45:19, Vlastimil Babka wrote:
> >> On 06/23/2017 10:53 AM, Michal Hocko wrote:
> > [...]
> >>> - GFP_KERNEL - both background and direct reclaim are allowed and the
> >>>   _default_ page allocator behavior is used. That means that !costly
> >>>   allocation requests are basically nofail (unless the requesting task
> >>>   is killed by the OOM killer)
> >>
> >> Should we explicitly point out that failure must be handled? After lots
> >> of talking about "too small to fail", people might get the wrong impression.
> > 
> > OK. What about the following.
> > "That means that !costly allocation requests are basically nofail but
> > there is no guarantee of thaat behavior so failures have to be checked
> 
>                            that
> 
> > properly by callers (e.g. OOM killer victim is allowed to fail
> > currently).
> 
> Looks good, thanks!

Andrew, could you fold the following in and replace the GFP_KERNEL part
of the changelog with the updated text. Thanks!
---
>From 0ad39ef3d2791bfad4be555851a750fcfdfe424e Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 26 Jun 2017 14:31:19 +0200
Subject: [PATCH] 
 mm-tree-wide-replace-__gfp_repeat-by-__gfp_retry_mayfail-with-more-useful-semantic-fix.patch

- GFP_KERNEL - both background and direct reclaim are allowed and the
  _default_ page allocator behavior is used. That means that !costly
  allocation requests are basically nofail but there is no guarantee
  of that behavior so failures have to be checked properly by callers
  (e.g. OOM killer victim is allowed to fail currently).

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6be1f836b69e..56bcb147910e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_RETRY_MAYFAIL		0x400u
+#define ___GFP_RETRY_MAYFAIL	0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -139,10 +139,11 @@ struct vm_area_struct;
  * The default allocator behavior depends on the request size. We have a concept
  * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
  * !costly allocations are too essential to fail so they are implicitly
- * non-failing (with some exceptions like OOM victims might fail) by default while
- * costly requests try to be not disruptive and back off even without invoking
- * the OOM killer. The following three modifiers might be used to override some of
- * these implicit rules
+ * non-failing by default (with some exceptions like OOM victims might fail so
+ * the caller still has to check for failures) while costly requests try to be
+ * not disruptive and back off even without invoking the OOM killer.
+ * The following three modifiers might be used to override some of these
+ * implicit rules
  *
  * __GFP_NORETRY: The VM implementation will try only very lightweight
  *   memory direct reclaim to get some memory under memory pressure (thus
-- 
2.11.0


-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic
  2017-06-26 12:38         ` Michal Hocko
@ 2017-06-26 12:42           ` Michal Hocko
  0 siblings, 0 replies; 24+ messages in thread
From: Michal Hocko @ 2017-06-26 12:42 UTC (permalink / raw)
  To: Andrew Morton, Vlastimil Babka
  Cc: Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm

On Mon 26-06-17 14:38:47, Michal Hocko wrote:
> On Mon 26-06-17 14:17:30, Vlastimil Babka wrote:
> > On 06/26/2017 02:14 PM, Michal Hocko wrote:
> > > On Mon 26-06-17 13:45:19, Vlastimil Babka wrote:
> > >> On 06/23/2017 10:53 AM, Michal Hocko wrote:
> > > [...]
> > >>> - GFP_KERNEL - both background and direct reclaim are allowed and the
> > >>>   _default_ page allocator behavior is used. That means that !costly
> > >>>   allocation requests are basically nofail (unless the requesting task
> > >>>   is killed by the OOM killer)
> > >>
> > >> Should we explicitly point out that failure must be handled? After lots
> > >> of talking about "too small to fail", people might get the wrong impression.
> > > 
> > > OK. What about the following.
> > > "That means that !costly allocation requests are basically nofail but
> > > there is no guarantee of thaat behavior so failures have to be checked
> > 
> >                            that
> > 
> > > properly by callers (e.g. OOM killer victim is allowed to fail
> > > currently).
> > 
> > Looks good, thanks!
> 
> Andrew, could you fold the following in and replace the GFP_KERNEL part
> of the changelog with the updated text. Thanks!

Forgot to address other thing spotted by Vlastimil.
---
>From c7f9dba93ac3001fbafe287729c4e2bb646b25f4 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Mon, 26 Jun 2017 14:31:19 +0200
Subject: [PATCH] 
 mm-tree-wide-replace-__gfp_repeat-by-__gfp_retry_mayfail-with-more-useful-semantic-fix.patch

- GFP_KERNEL - both background and direct reclaim are allowed and the
  _default_ page allocator behavior is used. That means that !costly
  allocation requests are basically nofail but there is no guarantee
  of that behavior so failures have to be checked properly by callers
  (e.g. OOM killer victim is allowed to fail currently).

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6be1f836b69e..bcfb9f7c46f5 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -25,7 +25,7 @@ struct vm_area_struct;
 #define ___GFP_FS		0x80u
 #define ___GFP_COLD		0x100u
 #define ___GFP_NOWARN		0x200u
-#define ___GFP_RETRY_MAYFAIL		0x400u
+#define ___GFP_RETRY_MAYFAIL	0x400u
 #define ___GFP_NOFAIL		0x800u
 #define ___GFP_NORETRY		0x1000u
 #define ___GFP_MEMALLOC		0x2000u
@@ -139,10 +139,11 @@ struct vm_area_struct;
  * The default allocator behavior depends on the request size. We have a concept
  * of so called costly allocations (with order > PAGE_ALLOC_COSTLY_ORDER).
  * !costly allocations are too essential to fail so they are implicitly
- * non-failing (with some exceptions like OOM victims might fail) by default while
- * costly requests try to be not disruptive and back off even without invoking
- * the OOM killer. The following three modifiers might be used to override some of
- * these implicit rules
+ * non-failing by default (with some exceptions like OOM victims might fail so
+ * the caller still has to check for failures) while costly requests try to be
+ * not disruptive and back off even without invoking the OOM killer.
+ * The following three modifiers might be used to override some of these
+ * implicit rules
  *
  * __GFP_NORETRY: The VM implementation will try only very lightweight
  *   memory direct reclaim to get some memory under memory pressure (thus
@@ -157,7 +158,7 @@ struct vm_area_struct;
  *   tasks to attempt high level approaches to freeing memory such as
  *   compaction (which removes fragmentation) and page-out.
  *   There is still a definite limit to the number of retries, but it is
- *   a larger limit than with __GFP_NORERY.
+ *   a larger limit than with __GFP_NORETRY.
  *   Allocations with this flag may fail, but only when there is
  *   genuinely little unused memory. While these allocations do not
  *   directly trigger the OOM killer, their failure indicates that
-- 
2.11.0

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-06-23  8:53 ` [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL Michal Hocko
@ 2017-06-27  8:49   ` Michal Hocko
  2017-06-27 13:47     ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-06-27  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Johannes Weiner, Mel Gorman, NeilBrown, LKML,
	linux-mm, Christoph Hellwig, Darrick J. Wong

Christoph, Darrick
could you have a look at this patch please? Andrew has put it into mmotm
but I definitely do not want it passes your attention.

On Fri 23-06-17 10:53:42, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> KM_MAYFAIL didn't have any suitable GFP_FOO counterpart until recently
> so it relied on the default page allocator behavior for the given set
> of flags. This means that small allocations actually never failed.
> 
> Now that we have __GFP_RETRY_MAYFAIL flag which works independently on the
> allocation request size we can map KM_MAYFAIL to it. The allocator will
> try as hard as it can to fulfill the request but fails eventually if
> the progress cannot be made. It does so without triggering the OOM
> killer which can be seen as an improvement because KM_MAYFAIL users
> should be able to deal with allocation failures.
> 
> Cc: Darrick J. Wong <darrick.wong@oracle.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  fs/xfs/kmem.h | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index d6ea520162b2..4d85992d75b2 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -54,6 +54,16 @@ kmem_flags_convert(xfs_km_flags_t flags)
>  			lflags &= ~__GFP_FS;
>  	}
>  
> +	/*
> +	 * Default page/slab allocator behavior is to retry for ever
> +	 * for small allocations. We can override this behavior by using
> +	 * __GFP_RETRY_MAYFAIL which will tell the allocator to retry as long
> +	 * as it is feasible but rather fail than retry forever for all
> +	 * request sizes.
> +	 */
> +	if (flags & KM_MAYFAIL)
> +		lflags |= __GFP_RETRY_MAYFAIL;
> +
>  	if (flags & KM_ZERO)
>  		lflags |= __GFP_ZERO;
>  
> -- 
> 2.11.0
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-06-27  8:49   ` Michal Hocko
@ 2017-06-27 13:47     ` Christoph Hellwig
  2017-06-27 14:06       ` Michal Hocko
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2017-06-27 13:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	NeilBrown, LKML, linux-mm, Christoph Hellwig, Darrick J. Wong

On Tue, Jun 27, 2017 at 10:49:50AM +0200, Michal Hocko wrote:
> Christoph, Darrick
> could you have a look at this patch please? Andrew has put it into mmotm
> but I definitely do not want it passes your attention.

I don't think what we have to gain from it.  Callsite for KM_MAYFAIL
should handler failures, but the current behavior seems to be doing fine
too.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-06-27 13:47     ` Christoph Hellwig
@ 2017-06-27 14:06       ` Michal Hocko
  2017-06-28  4:12         ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2017-06-27 14:06 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Andrew Morton, Vlastimil Babka, Johannes Weiner, Mel Gorman,
	NeilBrown, LKML, linux-mm, Darrick J. Wong

On Tue 27-06-17 06:47:51, Christoph Hellwig wrote:
> On Tue, Jun 27, 2017 at 10:49:50AM +0200, Michal Hocko wrote:
> > Christoph, Darrick
> > could you have a look at this patch please? Andrew has put it into mmotm
> > but I definitely do not want it passes your attention.
> 
> I don't think what we have to gain from it.  Callsite for KM_MAYFAIL
> should handler failures, but the current behavior seems to be doing fine
> too.

Last time I've asked I didnd't get any reply so let me ask again. Some
of those allocations seem to be small (e.g. by a random look
xlog_cil_init allocates struct xfs_cil which is 576B and struct
xfs_cil_ctx 176B). Those do not fail currently under most conditions and
it will retry allocation with the OOM killer if there is no progress. As
you know that failing those is acceptable, wouldn't it be better to
simply fail them and do not disrupt the system with the oom killer?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL
  2017-06-27 14:06       ` Michal Hocko
@ 2017-06-28  4:12         ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2017-06-28  4:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Christoph Hellwig, Andrew Morton, Vlastimil Babka,
	Johannes Weiner, Mel Gorman, NeilBrown, LKML, linux-mm, xfs

[add linux-xfs to cc]

FYI this is a discussion of the patch "xfs: map KM_MAYFAIL to
__GFP_RETRY_MAYFAIL" which was last discussed on the xfs list in March
and now is in the -mm tree...

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20170627&id=43182d82c48fae80d31a9101b6bb06d75cee32c7

On Tue, Jun 27, 2017 at 04:06:54PM +0200, Michal Hocko wrote:
> On Tue 27-06-17 06:47:51, Christoph Hellwig wrote:
> > On Tue, Jun 27, 2017 at 10:49:50AM +0200, Michal Hocko wrote:
> > > Christoph, Darrick
> > > could you have a look at this patch please? Andrew has put it into mmotm
> > > but I definitely do not want it passes your attention.
> > 
> > I don't think what we have to gain from it.  Callsite for KM_MAYFAIL
> > should handler failures, but the current behavior seems to be doing fine
> > too.
> 
> Last time I've asked I didnd't get any reply so let me ask again. Some
> of those allocations seem to be small (e.g. by a random look
> xlog_cil_init allocates struct xfs_cil which is 576B and struct
> xfs_cil_ctx 176B). Those do not fail currently under most conditions and
> it will retry allocation with the OOM killer if there is no progress. As
> you know that failing those is acceptable, wouldn't it be better to
> simply fail them and do not disrupt the system with the oom killer?

I remember the first time I saw this patch, and didn't have much of an
opinion either way -- the current behavior is fine, so why mess around?
I'd just as soon XFS not have to deal with errors if it doesn't have to. :)

But, you've asked again, so I'll be less glib this time.

I took a quick glance at all the MAYFAIL users in XFS.  /Nearly/ all
them seem to be cases either where we're mounting a filesystem or are
collecting memory for some ioctl -- in either case it's not hard to just
fail back out to userspace.  The upcoming online fsck patches use it
heavily, which is fine since we can always fail out to userspace and
tell the admin to go run xfs_repair offline.

The one user that caught my eye was xfs_iflush_cluster, which seems to
want an array of pointers to a cluster's worth of struct xfs_inodes.  On
a 64k block fs with 256 byte pointers I guess that could be ~2k worth of
pointers, but otoh it looks like that's an optimization: If we're going
to flush an inode out to disk we opportunistically scan the inode tree
to see if the adjacent inodes are also ready to flush; if we can't get
the memory for this, then it just backs off to flushing the one inode.

All the callers of MAYFAIL that I found actually /do/ check the return
value and start bailing out... so, uh, I guess I'm fine with it.  At
worst it's easily reverted during -rc if it causes problems.  Anyone
have a stronger objection?

Acked-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2017-06-28  4:15 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-23  8:53 [PATCH 0/6] mm: give __GFP_REPEAT a better semantic Michal Hocko
2017-06-23  8:53 ` [PATCH 1/6] MIPS: do not use __GFP_REPEAT for order-0 request Michal Hocko
2017-06-23 10:27   ` Ralf Baechle
2017-06-26  9:36   ` Vlastimil Babka
2017-06-23  8:53 ` [PATCH 2/6] mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic Michal Hocko
2017-06-26 11:45   ` Vlastimil Babka
2017-06-26 12:14     ` Michal Hocko
2017-06-26 12:17       ` Vlastimil Babka
2017-06-26 12:38         ` Michal Hocko
2017-06-26 12:42           ` Michal Hocko
2017-06-26 11:53   ` Vlastimil Babka
2017-06-23  8:53 ` [PATCH 3/6] xfs: map KM_MAYFAIL to __GFP_RETRY_MAYFAIL Michal Hocko
2017-06-27  8:49   ` Michal Hocko
2017-06-27 13:47     ` Christoph Hellwig
2017-06-27 14:06       ` Michal Hocko
2017-06-28  4:12         ` Darrick J. Wong
2017-06-23  8:53 ` [PATCH 4/6] mm: kvmalloc support __GFP_RETRY_MAYFAIL for all sizes Michal Hocko
2017-06-26 12:00   ` Vlastimil Babka
2017-06-26 12:18     ` Michal Hocko
2017-06-23  8:53 ` [PATCH 5/6] drm/i915: use __GFP_RETRY_MAYFAIL Michal Hocko
2017-06-23  8:53 ` [PATCH 6/6] mm, migration: do not trigger OOM killer when migrating memory Michal Hocko
2017-06-23 20:43   ` Andrew Morton
2017-06-26  5:28     ` Michal Hocko
2017-06-26 12:13   ` Vlastimil Babka

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).